# Gentle introduction to Ray datasets APIs

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives

In this introductory tutorial you will learn to:
 * create, transform, read, and save Ray datasets
 * use shards for parallel processing of large datasets
 * understand dataset pipelines and their merits
 * use `DatasetPipeline` for parallel computation
 * use datasets for last-mile ML ingestion for distributed training
 * understand why and when to use datasets

### Overview

This is a brief and gentle introduction to Ray Datasets, Ray's native data library. Built atop Ray, it allows you to exchange data among Ray tasks, actors, libraries, and applications. It also allows you to read and write training data from different file sources, including CSV, Parquet, and text.

Ray Datasets, using distributed [Apache Arrow](https://arrow.apache.org/), are designed to load and preprocess data for distributed ML training pipelines. Compared to other loading solutions, Datasets are more flexible (e.g., you can express higher-quality per-epoch global shuffles) and provide higher overall performance.

Additionally, Ray Datasets provides standard and simple transformations like `map`, `filter`, and `repartition`. Ray Datasets is *not* a replacement for a full-fledged data processing library for exploratory data analysis (EDA) or extract, transform, and load (ETL), nor is it a substitute for Apache Spark, Dask, or pandas DataFrames. Its primary objective is rudimentary last-mile distributed data preprocessing and data ingestion for ML training.

Ray Datasets supports myriad [file formats and data sources](https://docs.ray.io/en/latest/data/dataset.html#datasource-compatibility), so you can read from and write to the local file system and cloud storage.

<img src="images/dataset.png" width="70%" height="35%">


### Key concepts

To work with Ray Datasets, you need to understand how Datasets and Dataset Pipelines work: how datasets are stored internally and in what format, and what benefits Dataset Pipelines offer for faster processing and execution. A quick peek into each of these will shed some light on the overall benefits of Ray Datasets.

Let's start with the internal format. 

#### Ray Datasets

A Ray Dataset implements distributed [Apache Arrow](https://arrow.apache.org/). As such, a Dataset consists of a list of Ray object references to blocks. Each block holds a set of items in either an [Arrow table](https://arrow.apache.org/docs/python/data.html#tables) or a Python list (for Arrow-incompatible objects).

<img src="images/dataset-arch.png" width="70%" height="35%">
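To make the block layout concrete, here is a minimal pure-Python sketch. It is not Ray's implementation: the helper `make_blocks` is hypothetical, and plain Python lists stand in for the object references and Arrow tables described above. It only shows how a collection of rows can be partitioned into a list of blocks.

```python
from typing import Any, List


def make_blocks(items: List[Any], num_blocks: int) -> List[List[Any]]:
    """Partition items into num_blocks contiguous blocks of near-equal size."""
    base, extra = divmod(len(items), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # The first `extra` blocks take one additional item.
        size = base + (1 if i < extra else 0)
        blocks.append(items[start:start + size])
        start += size
    return blocks


# A "dataset" here is just a list of blocks; in Ray each entry would be
# an object reference to a block held in the object store.
blocks = make_blocks(list(range(10)), num_blocks=4)
print(blocks)  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Because each block is an independent unit, separate tasks can process separate blocks in parallel.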

#### Dataset Pipelines
Datasets execute their transformations synchronously in blocking calls. However, it can be useful to overlap dataset computations with output. This can be done with a `DatasetPipeline`.

A `DatasetPipeline` is a unified iterator over a (potentially infinite) sequence of Ray Datasets, each of which represents a window over the original data. Conceptually, it is similar to a Spark `DStream`, but manages execution over a bounded amount of source data instead of an unbounded stream. Ray computes each dataset window on demand and stitches their output together into a single logical data iterator. `DatasetPipeline` implements most of the same transformation and output methods as Datasets (e.g., `map`, `filter`, `split`, `iter_rows`, `to_torch`, etc.).
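The idea of computing windows on demand and stitching them into one iterator can be sketched with plain Python generators. This is an illustration only: `windows` and `pipeline_rows` are hypothetical helpers, not part of the Ray API, and the sketch runs sequentially rather than overlapping stages.

```python
from typing import Callable, Iterator, List

Block = List[int]


def windows(blocks: List[Block], blocks_per_window: int) -> Iterator[List[Block]]:
    """Yield successive windows, each a slice of the block list."""
    for start in range(0, len(blocks), blocks_per_window):
        yield blocks[start:start + blocks_per_window]


def pipeline_rows(blocks: List[Block], blocks_per_window: int,
                  fn: Callable[[int], int]) -> Iterator[int]:
    """Transform one window at a time and stitch the rows into one iterator."""
    for window in windows(blocks, blocks_per_window):
        # Work for a window happens only when the consumer reaches it.
        for block in window:
            for row in block:
                yield fn(row)


blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]
num_windows = len(list(windows(blocks, blocks_per_window=2)))
rows = list(pipeline_rows(blocks, blocks_per_window=2, fn=lambda x: x * 10))
print(num_windows, rows)  # 2 [0, 10, 20, 30, 40, 50, 60, 70]
```

The consumer sees a single stream of rows and never handles window boundaries directly.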

### Datasets Execution Model
This section overviews the execution model of Datasets, which may be useful for understanding and tuning performance.

#### Reading Data
Datasets uses Ray tasks to read data in parallel from remote storage or another source. When reading from a file-based datasource (e.g., S3, GCS), it creates a number of parallel read tasks equal to the specified read parallelism (200 by default). One or more files will be assigned to each read task. Each read task reads its assigned files and produces one or more output blocks (Ray objects):

<img src="https://docs.ray.io/en/master/_images/dataset-read.svg" height="25%" width="50%">

In the common case, each read task produces a single output block. Read tasks may split the output into multiple blocks if the data exceeds the target max block size (2GiB by default). This automatic block splitting avoids out-of-memory errors when reading very large single files (e.g., a 100-gigabyte CSV file). All of the built-in datasources except for JSON currently support automatic block splitting.
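The splitting behavior can be pictured with a small sketch. It assumes a simple greedy rule (start a new block when the next row would push the current block past the target size); Ray's actual rule and size accounting may differ, and `split_into_blocks` is a hypothetical name.

```python
from typing import Iterator, List


def split_into_blocks(row_sizes: List[int],
                      target_max_block_size: int) -> Iterator[List[int]]:
    """Group consecutive rows into blocks without exceeding the target size."""
    block, block_size = [], 0
    for size in row_sizes:
        # Close the current block if this row would overflow it.
        if block and block_size + size > target_max_block_size:
            yield block
            block, block_size = [], 0
        block.append(size)
        block_size += size
    if block:
        yield block


# Ten rows of 300 bytes each with a 1000-byte target: three rows fit per block.
out_blocks = list(split_into_blocks([300] * 10, target_max_block_size=1000))
print([len(b) for b in out_blocks])  # [3, 3, 3, 1]
```

A single read task therefore never has to hold an arbitrarily large block in memory, which is the point of the target size.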

#### Deferred Read Task Execution

When a Dataset is created using `ray.data.read_*`, only the first read task will be executed initially. This avoids blocking Dataset creation on the reading of all data files, so inspection functions like `ds.schema()` and `ds.show()` can be used right away without incurring high read costs. Executing further transformations on the Dataset will trigger execution of all read tasks.
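A toy model of deferred reads, under the assumption that each read task is a zero-argument function returning a block. The class `LazyReads` and the helper `make_task` are hypothetical and exist only to show that inspecting the schema needs just the first block, while a full count forces every read.

```python
from typing import Callable, List, Optional


class LazyReads:
    """Hold read tasks and run them on demand; only the first runs up front."""

    def __init__(self, read_tasks: List[Callable[[], List[dict]]]):
        self._tasks = read_tasks
        self._blocks: List[Optional[List[dict]]] = [None] * len(read_tasks)
        self.executed = 0
        self._get(0)  # eager first read so the schema can be inspected

    def _get(self, i: int) -> List[dict]:
        if self._blocks[i] is None:
            self._blocks[i] = self._tasks[i]()
            self.executed += 1
        return self._blocks[i]

    def schema(self) -> List[str]:
        # Needs only the first block.
        return sorted(self._get(0)[0].keys())

    def count(self) -> int:
        # Needs every block, so all remaining read tasks run now.
        return sum(len(self._get(i)) for i in range(len(self._tasks)))


def make_task(start: int) -> Callable[[], List[dict]]:
    return lambda: [{"id": start + k, "amount": 1.5 * (start + k)} for k in range(3)]


reads = LazyReads([make_task(s) for s in (0, 3, 6, 9)])
schema = reads.schema()
executed_after_schema = reads.executed
total_rows = reads.count()
executed_after_count = reads.executed
print(schema, executed_after_schema, total_rows, executed_after_count)
# ['amount', 'id'] 1 12 4
```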

#### Dataset Transforms

Datasets use either Ray tasks or Ray actors to transform datasets (i.e., for `ds.map_batches()`, `ds.map()`, or `ds.flat_map()`). By default, tasks are used (`compute="tasks"`). Actors can be specified with `compute="actors"`, in which case an autoscaling pool of Ray actors will be used to apply transformations. Using actors allows for expensive state initialization (e.g., for GPU-based tasks) to be reused. Whichever compute strategy is used, each map task generally takes in one block and produces one or more output blocks. The output block splitting rule is the same as for file reads (blocks are split after hitting the target max block size of 2GiB):

<img src="https://docs.ray.io/en/master/_images/dataset-map.svg" height="25%" width="50%">
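The benefit of reusing state can be seen without Ray at all. The sketch below is a single-process analogy, not Ray code: `ExpensiveModel` is a hypothetical stand-in for costly setup, and the two loops mimic a task-style strategy (state rebuilt per block) and an actor-style strategy (state built once).

```python
class ExpensiveModel:
    """Stand-in for state that is costly to build, such as a model on a GPU."""
    init_count = 0

    def __init__(self):
        ExpensiveModel.init_count += 1

    def __call__(self, batch):
        return [x * 2 for x in batch]


blocks = [[1, 2], [3, 4], [5, 6]]

# Task-style: every block is handled by a fresh invocation that rebuilds the state.
ExpensiveModel.init_count = 0
task_out = [ExpensiveModel()(b) for b in blocks]
task_inits = ExpensiveModel.init_count

# Actor-style: one long-lived worker builds the state once and reuses it.
ExpensiveModel.init_count = 0
worker = ExpensiveModel()
actor_out = [worker(b) for b in blocks]
actor_inits = ExpensiveModel.init_count

print(task_inits, actor_inits)  # 3 1
```

Both strategies produce the same output; only the number of initializations differs.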

#### Shuffling Data

Certain operations like `ds.sort()` and `ds.groupby()` require data blocks to be partitioned by value. Datasets executes this in three phases. First, a wave of sampling tasks determines suitable partition boundaries based on a random sample of data. Second, map tasks divide each input block into a number of output blocks equal to the number of reduce tasks. Third, reduce tasks take assigned output blocks from each map task and combine them into one block. Overall, this strategy generates O(n^2) intermediate objects where n is the number of input blocks.
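The three phases can be traced in a small, sequential sketch of a sample-based sort. The function names (`sample_boundaries`, `map_partition`, `reduce_merge`) and the sample size of four rows per block are assumptions made for illustration; this is not Ray's code.

```python
import bisect
import random
from typing import List


def sample_boundaries(blocks: List[List[int]], num_reducers: int,
                      rng: random.Random) -> List[int]:
    """Phase 1: pick partition boundaries from a random sample of the data."""
    sample = sorted(x for b in blocks for x in rng.sample(b, min(len(b), 4)))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def map_partition(block: List[int], boundaries: List[int]) -> List[List[int]]:
    """Phase 2: split one input block into one output block per reducer."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for x in block:
        parts[bisect.bisect_right(boundaries, x)].append(x)
    return parts


def reduce_merge(parts: List[List[int]]) -> List[int]:
    """Phase 3: combine the pieces assigned to one reducer into a sorted block."""
    return sorted(x for p in parts for x in p)


rng = random.Random(0)
data = list(range(40))
rng.shuffle(data)
blocks = [data[i:i + 10] for i in range(0, 40, 10)]  # 4 input blocks

num_reducers = len(blocks)
boundaries = sample_boundaries(blocks, num_reducers, rng)
mapped = [map_partition(b, boundaries) for b in blocks]
sorted_blocks = [reduce_merge([m[r] for m in mapped]) for r in range(num_reducers)]

intermediate = sum(len(m) for m in mapped)  # 4 map tasks x 4 reducers
flat = [x for b in sorted_blocks for x in b]
print(intermediate, flat == sorted(data))  # 16 True
```

With four input blocks and four reducers, the map phase produces 4 x 4 = 16 intermediate pieces, which is where the O(n^2) figure comes from.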

You can also change the partitioning of a Dataset using `ds.random_shuffle()` or `ds.repartition()`. The former should be used if you want to randomize the order of elements in the dataset. The latter should be used if you only want to equalize the size of the Dataset blocks (e.g., after a read or transformation that may skew the distribution of block sizes). Note that `repartition` has two modes: `shuffle=False`, which performs the minimal data movement needed to equalize block sizes, and `shuffle=True`, which performs a full (non-random) distributed shuffle:

<img src="https://docs.ray.io/en/master/_images/dataset-shuffle.svg" height="25%" width="50%">

#### Fault tolerance

Datasets relies on task-based [fault tolerance](https://docs.ray.io/en/latest/ray-core/tasks/fault-tolerance.html) in Ray core. Specifically, a `Dataset` will be automatically recovered by Ray in case of failures. This works through **lineage reconstruction**: a Dataset is a collection of Ray objects stored in shared memory, and if any of these objects are lost, then Ray will recreate them by re-executing the task(s) that created them.
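Lineage reconstruction can be illustrated with a toy object store that remembers which function call produced each object. `TinyStore` and its methods are hypothetical and run in one process; the real mechanism lives in Ray core and works across nodes.

```python
from typing import Any, Callable, Dict, Tuple


class TinyStore:
    """Object store that remembers how each object was produced."""

    def __init__(self):
        self._objects: Dict[str, Any] = {}
        self._lineage: Dict[str, Tuple[Callable[..., Any], tuple]] = {}
        self.reexecutions = 0

    def put_task(self, key: str, fn: Callable[..., Any], *args: Any) -> None:
        # Record the lineage (function and arguments) alongside the result.
        self._lineage[key] = (fn, args)
        self._objects[key] = fn(*args)

    def lose(self, key: str) -> None:
        # Simulate a failure that drops the stored value.
        self._objects.pop(key, None)

    def get(self, key: str) -> Any:
        if key not in self._objects:
            fn, args = self._lineage[key]
            self._objects[key] = fn(*args)  # re-run the creating task
            self.reexecutions += 1
        return self._objects[key]


store = TinyStore()
store.put_task("block-0", lambda lo, hi: list(range(lo, hi)), 0, 5)
store.lose("block-0")
recovered = store.get("block-0")
print(recovered, store.reexecutions)  # [0, 1, 2, 3, 4] 1
```

The sketch also hints at why losing the creator is fatal: if the lineage table itself is gone, there is nothing left to re-execute.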

There are a few cases that are not currently supported:

1. If the original creator of the Dataset dies. This is because the creator stores the metadata for the objects that comprise the Dataset.
2. For a `DatasetPipeline.split()`, recovery from a consumer failure is not supported. When there are multiple consumers, they must all read the split pipeline in lockstep. To recover from this case, the pipeline and all consumers must be restarted together.
3. The `compute="actors"` option for transformations.

#### Execution and Memory Management

See [Execution and Memory Management](https://docs.ray.io/en/master/data/memory-management.html#data-advanced) for more details about how Datasets manages memory and optimizations such as lazy vs eager execution.

In [1]:
import logging, os, random, warnings
import ray

In [2]:
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

In [3]:
if ray.is_initialized():
    ray.shutdown()
ctx = ray.init(logging_level=logging.ERROR)
print(ctx)

RayContext(dashboard_url='127.0.0.1:8265', python_version='3.8.13', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-07-11_22-08-32_423805_50280/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-07-11_22-08-32_423805_50280/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2022-07-11_22-08-32_423805_50280', 'metrics_export_port': 61166, 'gcs_address': '127.0.0.1:58235', 'address': '127.0.0.1:58235', 'node_id': '1a1dc4771d109d4b868c0d9f111d11f787a55803e95168d24d784259'})


In [4]:
print(f"Dashboard url: http://{ctx.address_info['webui_url']}")

Dashboard url: http://127.0.0.1:8265


### Creating a simple Ray Dataset

Let's create a generic dataset of 100K integers and look at the schema and underlying datatype. The difference between `show` and `take` is that the former prints items one at a time, while the latter iterates over rows from the dataset, appends them to a list, and returns the list. Underneath, `ds.show()` calls `ds.take()`.

In [5]:
ds = ray.data.range(100_000)
ds.count()

100000

In [6]:
ds.schema()

int

In [7]:
ds.show(5)

0
1
2
3
4


In [8]:
ds.take(5)

[0, 1, 2, 3, 4]
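The relationship between the two calls can be mirrored in a few lines of plain Python. These `take` and `show` functions are simplified stand-ins written for this sketch, with a hypothetical default `limit`; they are not the Ray methods themselves.

```python
from typing import Iterable, List


def take(rows: Iterable[int], limit: int = 20) -> List[int]:
    """Collect up to `limit` rows into a list and return it."""
    out = []
    for row in rows:
        if len(out) >= limit:
            break
        out.append(row)
    return out


def show(rows: Iterable[int], limit: int = 20) -> None:
    """Print up to `limit` rows, one per line, by delegating to take()."""
    for row in take(rows, limit):
        print(row)


first_five = take(range(100_000), 5)
show(range(100_000), 5)  # prints 0 through 4, one per line
```

Use `take` when you need the values in your program and `show` when you only want to look at them.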

### Creating a large Ray Dataset

Let's create a synthetic dataset, *Homeowners*, of 750K Arrow records with several columns.

We'll use this generated data to illustrate some simple transformation functions.

In [9]:
STATES = ["CA", "AZ", "OR", "WA", "TX", "UT"]
M_STATUS = ["married", "single", "domestic", "divorced", "undeclared"]
GENDER = ["F", "M", "U"]
HOME_OWNER = ["condo", "house", "rental"]

items = [{"id": i,
          "ssn": None,
          "name": None,
          "amount": i * 1.5, 
          "interest": random.randint(1,5) * .1,
          "state": random.choice(STATES),
          "marital_status": random.choice(M_STATUS),
          "property": random.choice(HOME_OWNER),
          "dependents": random.randint(1, 5),
          "defaulted": random.randint(0,1),
          "gender":random.choice(GENDER) } for i in range(1,750_001)]
items[:2]

[{'id': 1,
  'ssn': None,
  'name': None,
  'amount': 1.5,
  'interest': 0.30000000000000004,
  'state': 'UT',
  'marital_status': 'single',
  'property': 'rental',
  'dependents': 1,
  'defaulted': 0,
  'gender': 'M'},
 {'id': 2,
  'ssn': None,
  'name': None,
  'amount': 3.0,
  'interest': 0.1,
  'state': 'CA',
  'marital_status': 'undeclared',
  'property': 'house',
  'dependents': 3,
  'defaulted': 1,
  'gender': 'M'}]

#### Creating a dataset from list of dictionary items

A Ray dataset can be created from a list of dictionary items.

In [10]:
arrow_ds = ray.data.from_items(items)
arrow_ds

Dataset(num_blocks=200, num_rows=750000, schema={id: int64, ssn: null, name: null, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string})

In [11]:
arrow_ds.count()

750000

In [12]:
arrow_ds.take(1)

[ArrowRow({'id': 1,
           'ssn': None,
           'name': None,
           'amount': 1.5,
           'interest': 0.30000000000000004,
           'state': 'UT',
           'marital_status': 'single',
           'property': 'rental',
           'dependents': 1,
           'defaulted': 0,
           'gender': 'M'})]

In [13]:
arrow_ds.schema()

id: int64
ssn: null
name: null
amount: double
interest: double
state: string
marital_status: string
property: string
dependents: int64
defaulted: int64
gender: string

### Saving datasets and reading as a parquet file
Ray datasets support myriad data formats and storage systems. Let's repartition this dataset into five blocks and save it as Parquet files, one file per block.

In [14]:
arrow_ds.repartition(5).write_parquet("data_homeowners/interest.parquet")

Repartition: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.73it/s]
Write Progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 28.99it/s]


In [15]:
!ls -l data_homeowners/interest.parquet

total 20856
-rw-r--r--  1 jules  staff  2139602 Jul 11 22:13 e985818c43a44346ab1b59fc6d90ba54_000000.parquet
-rw-r--r--  1 jules  staff  2121397 Jul 11 22:13 e985818c43a44346ab1b59fc6d90ba54_000001.parquet
-rw-r--r--  1 jules  staff  2119623 Jul 11 22:13 e985818c43a44346ab1b59fc6d90ba54_000002.parquet
-rw-r--r--  1 jules  staff  2119273 Jul 11 22:13 e985818c43a44346ab1b59fc6d90ba54_000003.parquet
-rw-r--r--  1 jules  staff  2169557 Jul 11 22:13 e985818c43a44346ab1b59fc6d90ba54_000004.parquet


In [17]:
arrow_ds = ray.data.read_parquet("data_homeowners/interest.parquet")

In [18]:
arrow_ds.take(1)

[ArrowRow({'id': 1,
           'ssn': None,
           'name': None,
           'amount': 1.5,
           'interest': 0.30000000000000004,
           'state': 'UT',
           'marital_status': 'single',
           'property': 'rental',
           'dependents': 1,
           'defaulted': 0,
           'gender': 'M'})]

### Transforming data with simple methods

Ray datasets support parallel transformations such as `map`. Transformations run as Ray tasks and execute eagerly, that is, synchronously. Among other [transformations](https://docs.ray.io/en/latest/data/package-ref.html#dataset-api), Datasets supports `filter`, `flat_map`, and `groupby`.

Let's try `.filter()`, `.map_batches()`, and `.groupby()` on our dataset. `map()` and `filter()` are row-based operations, which can be expensive for large datasets. Instead, you can use `map_batches(...)`, whose default `batch_size` is 4096. Ray creates one task per block, and each batch is passed to your function as a whole, so the function can use vectorized operations.

Let's first try the row-based approach:

In [19]:
%%time
arrow_ds.filter(lambda x: x['amount'] > 10000).take(3)

Read->Filter: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.25it/s]

CPU times: user 51.3 ms, sys: 20.8 ms, total: 72.2 ms
Wall time: 4.03 s





[ArrowRow({'id': 6667,
           'ssn': None,
           'name': None,
           'amount': 10000.5,
           'interest': 0.2,
           'state': 'TX',
           'marital_status': 'divorced',
           'property': 'condo',
           'dependents': 5,
           'defaulted': 1,
           'gender': 'M'}),
 ArrowRow({'id': 6668,
           'ssn': None,
           'name': None,
           'amount': 10002.0,
           'interest': 0.2,
           'state': 'AZ',
           'marital_status': 'divorced',
           'property': 'rental',
           'dependents': 5,
           'defaulted': 0,
           'gender': 'F'}),
 ArrowRow({'id': 6669,
           'ssn': None,
           'name': None,
           'amount': 10003.5,
           'interest': 0.5,
           'state': 'AZ',
           'marital_status': 'single',
           'property': 'condo',
           'dependents': 4,
           'defaulted': 0,
           'gender': 'U'})]

Let's try `.map_batches()`, which is vectorized:

In [20]:
%%time
arrow_ds.map_batches(lambda df: df[df["amount"] > 10000]).take(3)

Read->Map_Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 16.66it/s]


CPU times: user 39.4 ms, sys: 13.8 ms, total: 53.2 ms
Wall time: 329 ms


[PandasRow({'id': 6667,
            'ssn': None,
            'name': None,
            'amount': 10000.5,
            'interest': 0.2,
            'state': 'TX',
            'marital_status': 'divorced',
            'property': 'condo',
            'dependents': 5,
            'defaulted': 1,
            'gender': 'M'}),
 PandasRow({'id': 6668,
            'ssn': None,
            'name': None,
            'amount': 10002.0,
            'interest': 0.2,
            'state': 'AZ',
            'marital_status': 'divorced',
            'property': 'rental',
            'dependents': 5,
            'defaulted': 0,
            'gender': 'F'}),
 PandasRow({'id': 6669,
            'ssn': None,
            'name': None,
            'amount': 10003.5,
            'interest': 0.5,
            'state': 'AZ',
            'marital_status': 'single',
            'property': 'condo',
            'dependents': 4,
            'defaulted': 0,
            'gender': 'U'})]

You can see that `.map_batches()` is a lot faster than the row-based filter. So for large datasets, use `.map_batches()`.
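One contributor to the difference is how often the user function is invoked. The following pure-Python sketch counts calls for a per-row predicate versus a per-batch function. The helpers `filter_rows` and `map_batches` here are simplified stand-ins, not the Ray methods, and the sketch does not model the additional gain from vectorized pandas or Arrow operations.

```python
from typing import Callable, Dict, List

rows = [{"id": i, "amount": i * 1.5} for i in range(1, 10_001)]
calls: Dict[str, int] = {"row": 0, "batch": 0}


def row_pred(row: dict) -> bool:
    calls["row"] += 1
    return row["amount"] > 10_000


def batch_filter(batch: List[dict]) -> List[dict]:
    calls["batch"] += 1
    return [r for r in batch if r["amount"] > 10_000]


def filter_rows(data: List[dict], pred: Callable[[dict], bool]) -> List[dict]:
    # The predicate is invoked once per row.
    return [r for r in data if pred(r)]


def map_batches(data: List[dict], fn: Callable[[List[dict]], List[dict]],
                batch_size: int = 4096) -> List[dict]:
    # The function is invoked once per batch of up to batch_size rows.
    out: List[dict] = []
    for start in range(0, len(data), batch_size):
        out.extend(fn(data[start:start + batch_size]))
    return out


by_row = filter_rows(rows, row_pred)
by_batch = map_batches(rows, batch_filter)
print(calls, by_row == by_batch)  # {'row': 10000, 'batch': 3} True
```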

Let's try a filter with two conditions, first as a per-row operation and then as a vectorized per-batch operation:

In [21]:
%%time
arrow_ds.filter(lambda x: x['amount'] > 10000.00 and x['state'] == 'CA').take(2)

Read->Filter: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  2.97it/s]

CPU times: user 30 ms, sys: 12.1 ms, total: 42.1 ms
Wall time: 1.7 s





[ArrowRow({'id': 6670,
           'ssn': None,
           'name': None,
           'amount': 10005.0,
           'interest': 0.5,
           'state': 'CA',
           'marital_status': 'single',
           'property': 'rental',
           'dependents': 4,
           'defaulted': 1,
           'gender': 'F'}),
 ArrowRow({'id': 6674,
           'ssn': None,
           'name': None,
           'amount': 10011.0,
           'interest': 0.4,
           'state': 'CA',
           'marital_status': 'married',
           'property': 'rental',
           'dependents': 5,
           'defaulted': 1,
           'gender': 'F'})]

In [22]:
%%time
arrow_ds.map_batches(lambda df: df[(df["amount"] > 10000) & (df["state"] == "CA")]).take(2)

Read->Map_Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 36.76it/s]

CPU times: user 29.9 ms, sys: 10.2 ms, total: 40.1 ms
Wall time: 159 ms





[PandasRow({'id': 6670,
            'ssn': None,
            'name': None,
            'amount': 10005.0,
            'interest': 0.5,
            'state': 'CA',
            'marital_status': 'single',
            'property': 'rental',
            'dependents': 4,
            'defaulted': 1,
            'gender': 'F'}),
 PandasRow({'id': 6674,
            'ssn': None,
            'name': None,
            'amount': 10011.0,
            'interest': 0.4,
            'state': 'CA',
            'marital_status': 'married',
            'property': 'rental',
            'dependents': 5,
            'defaulted': 1,
            'gender': 'F'})]
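When combining two conditions in a vectorized filter, keep in mind that Python's `and` is not element-wise: it returns one of its operands based on truthiness. The sketch below uses plain lists as hypothetical stand-ins for boolean masks to show the difference. With pandas, the element-wise form is `(df["amount"] > 10000) & (df["state"] == "CA")`.

```python
amount_mask = [False, False, True, True]   # stands in for amount > 10000
state_mask = [True, False, True, False]    # stands in for state == "CA"

# Wrapping the first mask in a list makes a non-empty, truthy object,
# so `and` returns its right operand unchanged and the first condition is ignored.
result_of_and = [amount_mask] and state_mask

# Element-wise AND combines both conditions row by row.
elementwise = [a and s for a, s in zip(amount_mask, state_mask)]

print(result_of_and is state_mask)  # True
print(elementwise)                  # [False, False, True, False]
```

A quick sanity check on a filter like this is to confirm that every returned row satisfies both conditions.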

Group by `state` and compute the count per group.

Under the hood, this is a distributed, parallel group-by that is vectorized and does not use UDFs.

In [23]:
results = arrow_ds.groupby("state").count()

Read: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 208.96it/s]
Sort Sample: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1973.98it/s]
Shuffle Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  5.19it/s]
Shuffle Reduce: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1269.62it/s]


In [24]:
results.show()

{'state': 'AZ', 'count()': 125395}
{'state': 'CA', 'count()': 124339}
{'state': 'OR', 'count()': 124197}
{'state': 'TX', 'count()': 125754}
{'state': 'UT', 'count()': 125485}
{'state': 'WA', 'count()': 124830}


Compute the maximum of the `amount`, `interest`, and `dependents` columns:

In [25]:
results=arrow_ds.max(["amount", "interest", "dependents"])
results

Read: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 216.87it/s]
Shuffle Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 659.27it/s]
Shuffle Reduce: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 345.44it/s]


ArrowRow({'max(amount)': 1125000.0,
          'max(interest)': 0.5,
          'max(dependents)': 5})

### Accessing datasets using batches or iterating by rows

Datasets can be passed to Ray tasks or actors and read with `.iter_batches()` or `.iter_rows()`. This does not incur a copy, since the blocks of the Dataset are passed by reference as Ray objects. Splitting data into shards and passing each shard to an individual Ray actor for processing is a common pattern in distributed training with Ray.

Let's examine how we can process a list of shards with a `BatchWorker` actor in a distributed fashion.

<img src="images/batch_worker.jpg" width="80%" height="35%">

In [26]:
@ray.remote
class BatchWorker:
    def __init__(self, rank):
        self.rank = rank
        self.processed = 0
    
    @ray.method(num_returns=2)
    def process_shard_list(self, shard: ray.data.Dataset) -> tuple:
        for batch in shard.iter_batches(batch_size=1024):
            # Here you could do something with the batch, such as feature
            # processing or transformation, and then
            # save the result as Parquet files.
            self.processed = self.processed + len(batch)
        # return items processed, worker id
        return (self.processed, self.rank)     

#### Create batch workers as Ray actors
Each actor will get a shard, a subset of the rows, to work on. We split
our dataset `arrow_ds` into five shards, so each `BatchWorker` gets one shard.
`.split()` distributes the shards across these workers, using `locality_hints` to place each shard near its worker.

In [27]:
batch_workers = [BatchWorker.remote(i) for i in range(1, 6)]

shards = arrow_ds.split(len(batch_workers), locality_hints=batch_workers)

print(f"Shard row: {shards[0]}")
print(f"Number of shards:{len(shards)}")
print(f"Number of shard workers:{len(batch_workers)}")

Read progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 142.52it/s]

Shard row: Dataset(num_blocks=1, num_rows=150000, schema={id: int64, ssn: null, name: null, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string})
Number of shards:5
Number of shard workers:5





### Launch `BatchWorker` actors

Process each shard. Because `process_shard_list()` is declared with `num_returns=2`, each call returns a pair of `ObjectRef`s, one for each element of the returned tuple. The comprehension below therefore yields a list of `ObjectRef` pairs.

In [28]:
object_refs = [w.process_shard_list.remote(s) for w, s in zip(batch_workers, shards)]
object_refs, len(object_refs)

([[ObjectRef(5f70e045687d2f9af4ace96163e8c6fa53bbf3ae0100000001000000),
   ObjectRef(5f70e045687d2f9af4ace96163e8c6fa53bbf3ae0100000002000000)],
  [ObjectRef(a4dc031465f905f85008f36129600f8fb1f77eb10100000001000000),
   ObjectRef(a4dc031465f905f85008f36129600f8fb1f77eb10100000002000000)],
  [ObjectRef(9e7872a82e7456d9d43a11c5282a398555bdad870100000001000000),
   ObjectRef(9e7872a82e7456d9d43a11c5282a398555bdad870100000002000000)],
  [ObjectRef(cd25e647a728676b33635ef154b9167d239916360100000001000000),
   ObjectRef(cd25e647a728676b33635ef154b9167d239916360100000002000000)],
  [ObjectRef(57f023b5f2c83c933e245432417ea6404d8b57c30100000001000000),
   ObjectRef(57f023b5f2c83c933e245432417ea6404d8b57c30100000002000000)]],
 5)

Fetch the values from the returned list of `ObjectRef` pairs. Each pair resolves to (rows processed, worker rank).

In [29]:
values = [ray.get(ref) for ref in object_refs]
values

[[150000, 1], [150000, 2], [150000, 3], [150000, 4], [150000, 5]]
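The arithmetic behind these values can be reproduced sequentially in plain Python. In this sketch, `split` and `process_shard` are hypothetical single-process stand-ins for `Dataset.split()` and the actor method, assuming the row count divides evenly by the number of shards.

```python
from typing import List, Tuple


def split(rows: List[int], n: int) -> List[List[int]]:
    """Split rows into n contiguous shards (len(rows) divisible by n assumed)."""
    size = len(rows) // n
    return [rows[i * size:(i + 1) * size] for i in range(n)]


def process_shard(shard: List[int], rank: int,
                  batch_size: int = 1024) -> Tuple[int, int]:
    processed = 0
    for start in range(0, len(shard), batch_size):
        batch = shard[start:start + batch_size]
        processed += len(batch)  # the last batch may be smaller than batch_size
    return processed, rank


shards = split(list(range(750_000)), 5)
values = [process_shard(s, rank) for rank, s in enumerate(shards, start=1)]
print(values)
# [(150000, 1), (150000, 2), (150000, 3), (150000, 4), (150000, 5)]
```

Each shard of 150,000 rows is consumed in 147 batches: 146 full batches of 1,024 rows and one final batch of 496.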

### Creating and using Ray dataset pipelines

What are dataset pipelines and how are they different from Ray datasets? 

Datasets perform transformations eagerly, that is, synchronously, whereas a [`DatasetPipeline`](https://docs.ray.io/en/latest/data/package-ref.html#datasetpipeline-api) can overlap the execution of its stages. For example, if you had operations that require reading from a file, transforming data, and then doing some minor feature engineering, these operations can be executed in a pipelined fashion. This allows for the overlapped execution of data input (e.g., reading files), computation (e.g., feature preprocessing), and training (e.g., distributed ML training).

A `DatasetPipeline` can be constructed in two ways: either by pipelining the execution of an existing Dataset (via `Dataset.window`) or generating repeats of an existing Dataset (via `Dataset.repeat`). 
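The repeat-style construction can be pictured as one pass over the same data per epoch, optionally reshuffled each time. The generator below is a hypothetical pure-Python analogy, not the `Dataset.repeat` implementation, and it shuffles a small in-memory list rather than distributed blocks.

```python
import random
from typing import Iterator, List


def repeat(dataset: List[int], times: int) -> Iterator[List[int]]:
    """Yield one copy of the dataset per epoch."""
    for _ in range(times):
        yield list(dataset)  # fresh copy so each epoch can be shuffled independently


rng = random.Random(42)
epochs = []
for epoch_data in repeat([1, 2, 3, 4, 5, 6], times=3):
    rng.shuffle(epoch_data)  # per-epoch shuffle; the order varies between epochs
    epochs.append(epoch_data)

# Every epoch contains the same rows, regardless of order.
print(len(epochs), [sorted(e) for e in epochs])
```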

Let's have a go at it and see what we can do with our synthetic data from above.


### Using Dataset.window

Create some functions or operations to be executed in an overlapped manner in the pipeline. These functions
are kept simple to illustrate the point, but they can be as complex as a particular use case requires.


In [30]:
def divide_row_value(row: int, n) -> int:
    # Rows of `ray.data.range(...)` are plain integers.
    return round(row / n)

In [31]:
def double_row_value(row: int, n) -> int:
    # Multiplies the row by n (doubles it when n == 2).
    return row * n

In [32]:
def modulo_row_value(row: int, n) -> int:
    return row % random.randint(1, n)

#### Create a window based pipeline
Each window holds 50 blocks.

_Questions for clarification_:
 * _why is the number of stages 2?_

In [37]:
ds_pipe = ds.window(blocks_per_window=50)
ds_pipe

DatasetPipeline(num_windows=4, num_stages=2)

### Applying transforms to pipelines adds more pipeline stages.

In [38]:
ds_pipe = ds_pipe.map(lambda row: divide_row_value(row, 2))
ds_pipe = ds_pipe.map(lambda row: double_row_value(row, 3))
ds_pipe = ds_pipe.map(lambda row: modulo_row_value(row, 4))
print(ds_pipe)

DatasetPipeline(num_windows=4, num_stages=5)


#### Iterate our pipeline

 * _Questions for clarification_:
     * _how is this executed?_
     * _why are we iterating over rows?_
     * _what is a row comprised of? Blocks?_
     * _is the value of the row an already computed value?_
     * _if `num_stages=5`, why do I see only stages 0 and 1 in the output?_

In [39]:
results=[]
for row in ds_pipe.iter_rows():
    results.append(row)
print(f"Total value: {sum(results)}")

Stage 0:   0%|                                                                                                                          | 0/4 [00:00<?, ?it/s]
Stage 1:   0%|                                                                                                                          | 0/4 [00:00<?, ?it/s]
Stage 0:  50%|█████████████████████████████████████████████████████████                                                         | 2/4 [00:00<00:00,  2.96it/s]
Stage 0:  75%|█████████████████████████████████████████████████████████████████████████████████████▌                            | 3/4 [00:00<00:00,  4.01it/s]
Stage 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  4.25it/s]
Stage 0: 100%|█████████████████

Total value: 37036





Let's try a `DatasetPipeline` with our synthetic *Homeowners* data.

In [40]:
# return 1 if the row is in the given state and has defaulted, else 0
def count_state(row: ray.data.impl.arrow_block.ArrowRow, state) -> int:
    return 1 if row['state'] == state and row["defaulted"] else 0

In [41]:
arrow_ds_pipe = arrow_ds.window(blocks_per_window=50)
arrow_ds_pipe

DatasetPipeline(num_windows=1, num_stages=2)

In [42]:
arrow_ds_pipe = arrow_ds_pipe.map(lambda row: count_state(row, "CA"))
arrow_ds_pipe

DatasetPipeline(num_windows=1, num_stages=3)

In [43]:
results=[]
for row in arrow_ds_pipe.iter_rows():
    results.append(row)
print(f"Total rows for CA state and defaulted loans rows: {sum(results)}")

Stage 0:   0%|                                                                                                                          | 0/1 [00:00<?, ?it/s]
Stage 1:   0%|                                                                                                                          | 0/1 [00:00<?, ?it/s]
Stage 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.71s/it]
Stage 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.72s/it]

Total rows for CA state and defaulted loans rows: 62339





## Ingesting data into Model Trainers
Let's define a toy `Trainer` actor that takes our synthetic data, trains the model, and returns the loss for that trainer. This is a common pattern
in Ray for distributing data to trainers in a Ray cluster.

<img src="images/trainer_worker.jpg" width="75%" height="40%">

In [44]:
def model(input):
    return random.uniform(0, 1)

@ray.remote
class Trainer:
    def __init__(self, rank, model):
        self.rank = rank
        self.model = model
        self.loss = 0.0
        
    def train(self, shard:ray.data.Dataset) -> float:
        for batch in shard.iter_batches(batch_size=1024):
            for epoch in range(1,21):
                output = self.model(batch)
                self.loss = output 
        if epoch % 5 == 0:
            print(f'epoch {epoch}, loss: {self.loss:.3f}')
        return self.loss

#### Create five trainers, each with a copy of the model and each training on its respective shard

In [45]:
trainers = [Trainer.remote(i, model) for i in range(1, 6)]
trainers

[Actor(Trainer, e4fd45ef24f31d2a06522e6001000000),
 Actor(Trainer, e970e56818bb7578580f577501000000),
 Actor(Trainer, 43ee67fe2c6461d824cae15501000000),
 Actor(Trainer, 1fc03fbd81f81d1c1cbd1d6501000000),
 Actor(Trainer, 60917854ad08c0fd3556c6bd01000000)]

[2m[36m(_map_block_nosplit pid=51739)[0m E0711 22:21:30.234691000 6212841472 chttp2_transport.cc:1111]          Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
[2m[36m(_map_block_nosplit pid=51741)[0m E0711 22:21:30.402044000 6169604096 chttp2_transport.cc:1111]          Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"


#### Split the shards across all trainers

In [46]:
shards = arrow_ds.split(n=len(trainers), locality_hints=trainers)
shards

Read progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  6.84it/s]


[Dataset(num_blocks=1, num_rows=150000, schema={id: int64, ssn: null, name: null, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string}),
 Dataset(num_blocks=1, num_rows=150000, schema={id: int64, ssn: null, name: null, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string}),
 Dataset(num_blocks=1, num_rows=150000, schema={id: int64, ssn: null, name: null, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string}),
 Dataset(num_blocks=1, num_rows=150000, schema={id: int64, ssn: null, name: null, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string}),
 Dataset(num_blocks=1, num_rows=150000, schema={id: int64, ssn: null, name: null, amount

#### Launch our trainers in a distributed fashion

This will run across the cluster. Check the dashboard to see the five launched actors. On a cluster, they can run on five different nodes, whereas on a single node they run on five different cores.

In [47]:
object_refs = [t.train.remote(s) for t, s in zip(trainers, shards)]
ray.get(object_refs)

[0.03482623803499796,
 0.2368608264989308,
 0.9545626385599913,
 0.5875607301935346,
 0.14364284659846804]

[2m[36m(Trainer pid=52095)[0m epoch 20, loss: 0.588
[2m[36m(Trainer pid=52096)[0m epoch 20, loss: 0.144
[2m[36m(Trainer pid=52092)[0m epoch 20, loss: 0.035
[2m[36m(Trainer pid=52094)[0m epoch 20, loss: 0.955
[2m[36m(Trainer pid=52093)[0m epoch 20, loss: 0.237


In [48]:
ray.shutdown()

### Exercises
 1. Write some simple transformations, filters, and aggregations with our synthetic data. For example:
  * use [`.add_column()`](https://docs.ray.io/en/master/data/package-ref.html) to add an `age` column
  * filter by `gender == 'U'`
  * group by `property` and count each group
 2. Add a pipeline stage function `def count_tx(...)` for our synthetic data. For example, count all people in the state of `TX` who are `married` and `defaulted`.

### Homework

1. Work through the [NYC example tutorial](extra/ray_data_nyc.ipynb). It explores how to use `.map_batches()` for filtering and map operations with vectorized UDFs.
2. Peruse the user guides for advanced examples in [data transformation](https://docs.ray.io/en/master/data/transforming-datasets.html#transforming-datasets) and [ML preprocessing](https://docs.ray.io/en/master/data/dataset-ml-preprocessing.html#datasets-ml-preprocessing).
3. Read how to do large-scale [ML ingest](https://docs.ray.io/en/master/data/examples/big_data_ingestion.html).