# A Gentle Introduction to Ray Datasets APIs

© 2019-2022, Anyscale. All Rights Reserved

📖 [Back to Table of Contents](./ex_00_tutorial_overview.ipynb)<br>
⬅️ [Previous notebook](./ex_06_ray_api_calls.ipynb) <br>

### Overview

Ray Data is built primarily for Ray AIR, providing easy data ingestion and
preprocessing for machine learning training, tuning, and scoring. As a library built atop Ray, Ray Data allows you to:
 * exchange data among Ray tasks, actors, libraries, and applications
 * read and write training data from different file sources and storage systems
 * work with myriad [file formats and data sources](https://docs.ray.io/en/latest/data/dataset.html#datasource-compatibility)
 * apply transformations like `map`, `filter`, and `repartition`

**Note**: Ray Datasets is not a replacement for a full-fledged data processing library for exploratory data analysis (EDA) or extract, transform, and load (ETL) work, nor is it a substitute for Apache Spark, Dask, or Pandas DataFrames. Its primary objective is efficient, last-mile distributed data preprocessing and ingestion for ML training.

<img src="images/dataset.png" width="75%" height="45%">

### Learning objectives

In this introductory tutorial you will learn:
 * Ray datasets concepts
 * how to create, transform, read, and save Ray datasets
 * how to use shards for parallel processing of large datasets
 * how to use datasets for ML ingestion in distributed training
 
**Note**: Even though you will get an introduction to Ray Data as a library here, most of its functionality is implemented and subsumed within [Ray AIR](), which you'll hear about tomorrow in the keynotes and a few breakout sessions.

### Key concepts

To work with Ray Datasets, you need to understand how `Datasets` and `Dataset Pipelines` work. A few quick code examples will shed light on the overall merits of Ray Datasets.

Let's start with the internal format. 

#### Ray Datasets

A Ray dataset is backed by [Apache Arrow](https://arrow.apache.org/). As such, a Dataset consists of a *list of Ray object references* to blocks in the *shared memory object store*. Each block holds a set of items in either an [Arrow table](https://arrow.apache.org/docs/python/data.html#tables) (when created from or transformed to tabular or tensor data), a Pandas DataFrame (when created from or transformed to Pandas data), or a Python list (otherwise).

<img src="images/dataset-arch.png" width="60%" height="30%">

#### Dataset Pipelines
Datasets execute their transformations synchronously or eagerly in blocking calls. However, it can be useful to overlap dataset computations with output. This can be done with a `DatasetPipeline`.

A `DatasetPipeline` is a unified iterator over a sequence of Ray Datasets, each of which represents a window over the original data. Conceptually, it is similar to a `Spark DStream`, but manages execution over a bounded source of data instead of an unbounded stream. Ray computes each dataset window and stitches their output together into a single logical data iterator. 

`DatasetPipeline` offers the same transformation and output methods as Datasets (e.g., `map`, `filter`, `split`, `iter_rows`, `to_torch`, etc.).

<img src="images/ray_dataset_pipeline.jpg" width="60%" height="30%">
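
To make the idea of windows concrete, here is a minimal plain-Python sketch. It is a conceptual model only, not Ray's implementation: the helper names `windows` and `iter_rows` are hypothetical, and blocks are modeled as ordinary lists. It shows a bounded list of blocks being cut into windows whose rows are then stitched back into one logical iterator.

```python
from itertools import islice
from typing import Iterator, List

Block = List[int]


def windows(blocks: List[Block], blocks_per_window: int) -> Iterator[List[Block]]:
    """Yield consecutive groups of blocks; each group plays the role of one window."""
    it = iter(blocks)
    while True:
        window = list(islice(it, blocks_per_window))
        if not window:
            return
        yield window


def iter_rows(blocks: List[Block], blocks_per_window: int) -> Iterator[int]:
    """Stitch the rows of every window into a single logical iterator."""
    for window in windows(blocks, blocks_per_window):
        for block in window:
            yield from block


# Five toy blocks of two rows each (hypothetical data).
blocks = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
window_sizes = [len(w) for w in windows(blocks, blocks_per_window=2)]
rows = list(iter_rows(blocks, blocks_per_window=2))
print(window_sizes)  # [2, 2, 1]
print(rows)          # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The consumer sees one continuous stream of rows, while the producer only ever has to materialize one window at a time.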

#### Common use case for Dataset Pipeline: Saturate your GPU resources

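The most common use case for pipelines is keeping CPU and GPU resources busy: while the GPU trains on one window of data, the CPUs can already read and preprocess the next.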

<img src="images/gpu_saturation.png" width="70%" height="35%">

### Datasets Execution Model

Let's get a feel for the Datasets execution model.

#### Reading Data
Datasets uses Ray tasks to read data in parallel from remote storage or another source. When reading from a file-based datasource (e.g., S3, GCS), it creates a number of parallel
read tasks equal to the specified read parallelism. You can further [fine tune parallelism](https://docs.ray.io/en/master/data/performance-tips.html#tuning-read-parallelism).
One or more files will be assigned to each read task. Each read task reads its assigned files and produces one or more output blocks (Ray objects):

<img src="https://docs.ray.io/en/master/_images/dataset-read.svg" height="25%" width="50%">

Normally, each read task produces a single output block. Read tasks may split the output into multiple blocks if the data exceeds the target max block size (2GiB by default). This automatic block splitting avoids out-of-memory errors when reading very large single files (e.g., a 100-gigabyte CSV file). 
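
The two steps described above, assigning files to read tasks and splitting oversized output into blocks, can be sketched in plain Python. This is an illustration under stated assumptions, not Ray's code: `assign_files` and `split_output` are hypothetical helpers, the round-robin assignment is an assumption, and sizes are simple byte counts.

```python
from typing import List


def assign_files(files: List[str], parallelism: int) -> List[List[str]]:
    """Distribute files round-robin over at most `parallelism` read tasks (assumed policy)."""
    num_tasks = min(parallelism, len(files))
    tasks: List[List[str]] = [[] for _ in range(num_tasks)]
    for i, name in enumerate(files):
        tasks[i % num_tasks].append(name)
    return tasks


def split_output(num_bytes: int, target_max_block_size: int) -> List[int]:
    """Split one read task's output into block sizes no larger than the target."""
    sizes = []
    remaining = num_bytes
    while remaining > 0:
        size = min(remaining, target_max_block_size)
        sizes.append(size)
        remaining -= size
    return sizes


files = [f"part-{i}.parquet" for i in range(5)]  # hypothetical file names
tasks = assign_files(files, parallelism=2)
print(tasks)
# [['part-0.parquet', 'part-2.parquet', 'part-4.parquet'],
#  ['part-1.parquet', 'part-3.parquet']]

block_sizes = split_output(num_bytes=5 * 1024, target_max_block_size=2 * 1024)
print(block_sizes)  # [2048, 2048, 1024]
```

With a small output the second function returns a single block, which matches the "normally one block per read task" case.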

#### Deferred Read Task Execution

When a Dataset is created using `ray.data.read_*`, only the first read task is executed initially. This avoids blocking Dataset creation on the reading of all data files, so inspection functions like `ds.schema()` and `ds.show()` can be used right away without incurring high read costs.
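
A toy model may help here. The class below is a hypothetical sketch (`LazyBlocks` and `make_task` are invented for illustration and are not part of Ray): it runs the first read task when the object is created and postpones the rest until an operation needs all of the data.

```python
from typing import Callable, Dict, List

ReadTask = Callable[[], List[dict]]


class LazyBlocks:
    """Toy model: run the first read task on creation, the rest on demand."""

    def __init__(self, read_tasks: List[ReadTask]) -> None:
        self._read_tasks = read_tasks
        self._cache: Dict[int, List[dict]] = {}
        if read_tasks:
            self._get(0)  # eager first read, enough to answer schema()

    def _get(self, i: int) -> List[dict]:
        if i not in self._cache:
            self._cache[i] = self._read_tasks[i]()
        return self._cache[i]

    def schema(self) -> List[str]:
        """Column names, taken from the first (already read) block."""
        return sorted(self._get(0)[0].keys())

    def count(self) -> int:
        """Needs every block, so it triggers the remaining read tasks."""
        return sum(len(self._get(i)) for i in range(len(self._read_tasks)))

    @property
    def tasks_executed(self) -> int:
        return len(self._cache)


def make_task(start: int) -> ReadTask:
    """Build a fake read task that returns three rows (hypothetical data)."""
    return lambda: [{"id": start + k, "state": "CA"} for k in range(3)]


lazy = LazyBlocks([make_task(0), make_task(3), make_task(6)])
executed_after_create = lazy.tasks_executed  # 1
columns = lazy.schema()                      # ['id', 'state']
executed_after_schema = lazy.tasks_executed  # still 1
total = lazy.count()                         # 9
executed_after_count = lazy.tasks_executed   # 3
```

Inspecting the schema costs one read, while counting rows forces all three.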

#### Dataset Transforms

Datasets use either Ray tasks or Ray actors to transform datasets (i.e., for `ds.map_batches()`, `ds.map()`, or `ds.flat_map()`). By default, tasks are used (`compute="tasks"`). Actors can be specified with `compute="actors"`, in which case an autoscaling pool of Ray actors is used for transformations.

Use `ActorPoolStrategy` when a transformation needs expensive state initialization that can be reused (e.g., for GPU-based tasks). For example, it helps when workers take time to initialize GPU memory or to set up values within an actor that can be reused across batches.

<img src="https://docs.ray.io/en/master/_images/dataset-map.svg" height="25%" width="50%">
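
Why actors help with expensive setup can be shown without Ray at all. The sketch below is a plain-Python analogy, with `load_model`, `stateless_transform`, and `StatefulTransform` as hypothetical names: a task-style function repeats its setup for every batch, while an actor-style callable class performs it once and reuses the result.

```python
from typing import Callable, List

init_calls = {"stateless": 0, "stateful": 0}


def load_model(kind: str) -> Callable[[int], int]:
    """Stand-in for an expensive setup step, such as loading weights onto a GPU."""
    init_calls[kind] += 1
    return lambda x: x * 2


def stateless_transform(batch: List[int]) -> List[int]:
    # Task-style: nothing survives between calls, so setup repeats per batch.
    model = load_model("stateless")
    return [model(x) for x in batch]


class StatefulTransform:
    # Actor-style: setup runs once in __init__ and is reused across batches.
    def __init__(self) -> None:
        self.model = load_model("stateful")

    def __call__(self, batch: List[int]) -> List[int]:
        return [self.model(x) for x in batch]


batches = [[1, 2], [3, 4], [5, 6]]
out_tasks = [stateless_transform(b) for b in batches]
worker = StatefulTransform()
out_actor = [worker(b) for b in batches]

print(out_tasks == out_actor)  # True
print(init_calls)              # {'stateless': 3, 'stateful': 1}
```

Both approaches produce the same output; only the number of setup calls differs, which is the cost an actor pool is meant to amortize.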


#### Shuffling Data

Certain operations like `ds.sort()` and `ds.groupby()` require data blocks to be partitioned by value.

You can also change the partitioning of a Dataset using `ds.random_shuffle()` or `ds.repartition()`. Use the former if you want to randomize the order of elements in the dataset; use the latter if you only want to equalize the size of the Dataset blocks (e.g., after a read or transformation that may skew the distribution of block sizes).

Note that repartition has two modes, `shuffle=False`, which performs the minimal data movement needed to equalize block sizes, and `shuffle=True`, which performs a full (non-random) distributed shuffle:

<img src="https://docs.ray.io/en/master/_images/dataset-shuffle.svg" height="25%" width="50%">
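
The difference in outcome between the two operations can be sketched in plain Python. This toy version only models the result (balanced block sizes versus randomized order); it flattens everything into one list, so it does not reproduce Ray's data-movement strategy. The function names mirror the Ray methods for readability but are local, hypothetical helpers.

```python
import random
from typing import List, Optional

Block = List[int]


def repartition(blocks: List[Block], num_blocks: int) -> List[Block]:
    """Rebalance rows into `num_blocks` near-equal blocks, preserving row order."""
    rows = [row for block in blocks for row in block]
    base, extra = divmod(len(rows), num_blocks)
    out, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)
        out.append(rows[start:start + size])
        start += size
    return out


def random_shuffle(blocks: List[Block], seed: Optional[int] = None) -> List[Block]:
    """Randomize row order, then lay the rows out in the same number of blocks."""
    rows = [row for block in blocks for row in block]
    random.Random(seed).shuffle(rows)
    return repartition([rows], len(blocks))


skewed = [[0, 1, 2, 3, 4, 5, 6], [7], [8, 9]]  # uneven block sizes
balanced = repartition(skewed, 3)
shuffled = random_shuffle(skewed, seed=0)

print([len(b) for b in balanced])  # [4, 3, 3]
print(balanced)                    # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

After `repartition` the rows keep their order and only the block boundaries move; after `random_shuffle` the same rows are present but their order is randomized.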

#### Fault tolerance

Datasets relies on task-based 🤹‍♀️ [fault tolerance](https://docs.ray.io/en/latest/ray-core/tasks/fault-tolerance.html) in Ray core. Specifically, a `Dataset` will be automatically recovered by Ray in case of failures. This works through **lineage reconstruction**: a Dataset is a collection of Ray objects stored in shared memory, and if any of these objects are lost, then Ray will recreate them by re-executing the task(s) that created them.
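
Lineage reconstruction can be illustrated with a small toy object store. This is a conceptual sketch, not Ray's object store: `ToyObjectStore` and its methods are hypothetical, and it assumes the creating tasks are deterministic so that re-running them yields the same block.

```python
from typing import Callable, Dict, List

Task = Callable[[], List[int]]


class ToyObjectStore:
    """Keeps each object's value plus the task (its lineage) that produced it."""

    def __init__(self) -> None:
        self._values: Dict[str, List[int]] = {}
        self._lineage: Dict[str, Task] = {}
        self.reconstructions = 0

    def submit(self, ref: str, task: Task) -> str:
        """Run a task, store its result, and remember how it was created."""
        self._lineage[ref] = task
        self._values[ref] = task()
        return ref

    def lose(self, ref: str) -> None:
        """Simulate a failure that evicts one object from the store."""
        self._values.pop(ref, None)

    def get(self, ref: str) -> List[int]:
        if ref not in self._values:
            # Object lost: re-execute the task that created it.
            self._values[ref] = self._lineage[ref]()
            self.reconstructions += 1
        return self._values[ref]


store = ToyObjectStore()
refs = [
    store.submit(f"block-{i}", lambda i=i: list(range(i * 3, i * 3 + 3)))
    for i in range(3)
]
store.lose("block-1")
recovered = [store.get(r) for r in refs]

print(recovered)              # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(store.reconstructions)  # 1
```

Only the lost block is recomputed; the other two are served straight from the store.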

### Any Questions?

Let's move on to some code ....

In [3]:
import logging, os, random, warnings
import ray

In [4]:
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

In [5]:
if ray.is_initialized():  # call the function; the bare attribute is always truthy
    ray.shutdown()
ray.init(logging_level=logging.ERROR)

Python version: 3.8.13
Ray version: 2.0.0rc1
Dashboard: http://127.0.0.1:8265


### Creating a simple Ray Dataset

For quick illustration, let's create a generic dataset of 1K integers and look at its schema and underlying datatype. For details, see the [Ray Data API documentation](https://docs.ray.io/en/latest/data/package-ref.html).

In [6]:
ds = ray.data.range(1000)
ds.count()

1000

In [7]:
ds.schema()

int

The difference between `show` and `take` is that the former prints the items one at a time, while the latter collects row items from the dataset into a list and returns it. Underneath, `ds.show()` calls `ds.take()`.

In [8]:
ds.show(5)

0
1
2
3
4


In [9]:
ds.take(5)

[0, 1, 2, 3, 4]

### Creating a large Ray Dataset

Let's create a synthetic dataset, *Homeowners*, of 800K Arrow records with several columns. We'll use this generated data to illustrate some simple transformation functions.

In [10]:
def get_anonomized_name(num_letters=6):
    import string
    return ("anonomized-" + "".join(random.choices(string.ascii_uppercase + string.ascii_lowercase, k=num_letters)))

def get_anonomized_ssn():
    return ("anonomized-" + "".join(["{}".format(random.randint(0, 9)) for num in range(1, 10)]))

In [11]:
NUM_ROWS = 800001
STATES = ["CA", "AZ", "OR", "WA", "TX", "UT", "NV", "NM"]
M_STATUS = ["married", "single", "domestic", "divorced", "undeclared"]
GENDER = ["F", "M", "U"]
HOME_OWNER = ["condo", "house", "rental", "cottage"]

items = [{"id": i,
          "ssn": get_anonomized_ssn(),
          "name": get_anonomized_name(),
          "amount": i * 1.5 * 10, 
          "interest": random.randint(1,5) * .1,
          "state": random.choice(STATES),
          "marital_status": random.choice(M_STATUS),
          "property": random.choice(HOME_OWNER),
          "dependents": random.randint(1, 5),
          "defaulted": random.randint(0,1),
          "gender":random.choice(GENDER) } for i in range(1,NUM_ROWS)]
items[:1]

[{'id': 1,
  'ssn': 'anonomized-517026164',
  'name': 'anonomized-SNjBPM',
  'amount': 15.0,
  'interest': 0.2,
  'state': 'OR',
  'marital_status': 'married',
  'property': 'cottage',
  'dependents': 5,
  'defaulted': 0,
  'gender': 'U'}]

#### Creating a dataset from list of dictionary items

A Ray dataset can be created from a list of dictionary items. The number of blocks can be [tuned as needed](https://docs.ray.io/en/master/data/performance-tips.html#tuning-read-parallelism).

In [12]:
arrow_ds = ray.data.from_items(items)
arrow_ds

Dataset(num_blocks=200, num_rows=800000, schema={id: int64, ssn: string, name: string, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string})

In [13]:
arrow_ds.count()

800000

In [14]:
arrow_ds.take(1)

[ArrowRow({'id': 1,
           'ssn': 'anonomized-517026164',
           'name': 'anonomized-SNjBPM',
           'amount': 15.0,
           'interest': 0.2,
           'state': 'OR',
           'marital_status': 'married',
           'property': 'cottage',
           'dependents': 5,
           'defaulted': 0,
           'gender': 'U'})]

In [15]:
arrow_ds.schema()

id: int64
ssn: string
name: string
amount: double
interest: double
state: string
marital_status: string
property: string
dependents: int64
defaulted: int64
gender: string

### Saving datasets as Parquet files and reading them back 🗃
Ray datasets support myriad data formats. Let's repartition this dataset into five blocks and save it in Parquet format, which writes one file per block.

In [16]:
path = os.path.abspath("data_homeowners/interest.parquet")
arrow_ds.repartition(5).write_parquet(path)

Repartition: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.55it/s]
Write Progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 23.43it/s]


In [17]:
!ls -l data_homeowners/interest.parquet

total 53848
-rw-r--r--  1 jules  staff  5417830 Aug 18 19:52 9eb9965c1af44d4c900070fffd6252bc_000000.parquet
-rw-r--r--  1 jules  staff  5534026 Aug 18 19:52 9eb9965c1af44d4c900070fffd6252bc_000001.parquet
-rw-r--r--  1 jules  staff  5534818 Aug 18 19:52 9eb9965c1af44d4c900070fffd6252bc_000002.parquet
-rw-r--r--  1 jules  staff  5533739 Aug 18 19:52 9eb9965c1af44d4c900070fffd6252bc_000003.parquet
-rw-r--r--  1 jules  staff  5534416 Aug 18 19:52 9eb9965c1af44d4c900070fffd6252bc_000004.parquet


### Reading data from parquet files

Next, we read the files back into a dataset. This read is semi-lazy: reading of the first file is executed eagerly, but reading of all other files is delayed until the underlying data is needed by downstream operations (e.g., consuming the data with `ds.take()` or transforming it with `ds.map_batches()`).

In [18]:
%%time
arrow_ds = ray.data.read_parquet(path)

CPU times: user 6.35 ms, sys: 4.67 ms, total: 11 ms
Wall time: 29.5 ms


Wow that was fast!

We can easily inspect the schema of this dataset. For Parquet files, we don’t even have to read the actual data to get the schema or take first rows; we can read it from the lightweight Parquet metadata!

In [19]:
%%time
arrow_ds.schema()

CPU times: user 96 µs, sys: 59 µs, total: 155 µs
Wall time: 103 µs


id: int64
ssn: string
name: string
amount: double
interest: double
state: string
marital_status: string
property: string
dependents: int64
defaulted: int64
gender: string

In [20]:
%%time
arrow_ds.take(1)

CPU times: user 4.06 ms, sys: 3.65 ms, total: 7.71 ms
Wall time: 8.27 ms


[ArrowRow({'id': 1,
           'ssn': 'anonomized-517026164',
           'name': 'anonomized-SNjBPM',
           'amount': 15.0,
           'interest': 0.2,
           'state': 'OR',
           'marital_status': 'married',
           'property': 'cottage',
           'dependents': 5,
           'defaulted': 0,
           'gender': 'U'})]

### Transforming data with simple methods

Ray datasets support parallel transformations using `map`, which uses Ray tasks and executes eagerly and synchronously. Among other [transformations](https://docs.ray.io/en/latest/data/package-ref.html#dataset-api), it supports `filter`, `flat_map`, and `groupby`.

Let's try `.filter()`, `.map_batches()`, and `.groupby()` on our dataset.

`map()` and `filter()` are row-based operations, which can be expensive for large datasets. Instead, you can use `map_batches(...)`, which applies your function to whole batches of rows (`batch_size=4096` by default). Ray creates one task per block for a map operation, and the blocks are processed in parallel.
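
The cost difference comes largely from how often the user function is invoked. The plain-Python sketch below, which uses hypothetical helpers and a small synthetic list rather than a Ray dataset, counts the calls made by a row-wise filter versus a batch-wise one with the same `batch_size=4096`.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, float]
calls = {"row": 0, "batch": 0}


def iter_batches(rows: List[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive slices of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def map_batches(rows: List[Row], fn: Callable[[List[Row]], List[Row]],
                batch_size: int = 4096) -> List[Row]:
    """Apply `fn` once per batch and concatenate the results."""
    out: List[Row] = []
    for batch in iter_batches(rows, batch_size):
        out.extend(fn(batch))
    return out


def keep_large_row(row: Row) -> bool:
    calls["row"] += 1  # invoked once per row
    return row["amount"] > 10000


def keep_large_batch(batch: List[Row]) -> List[Row]:
    calls["batch"] += 1  # invoked once per batch
    return [r for r in batch if r["amount"] > 10000]


# 10,000 synthetic rows using the same amount formula as the Homeowners data.
rows = [{"amount": i * 15.0} for i in range(1, 10001)]
by_row = [r for r in rows if keep_large_row(r)]
by_batch = map_batches(rows, keep_large_batch, batch_size=4096)

print(by_row == by_batch)     # True
print(len(by_batch))          # 9334
print(by_batch[0]["amount"])  # 10005.0
print(calls)                  # {'row': 10000, 'batch': 3}
```

Both paths select the same rows, but the batch version makes 3 function calls instead of 10,000, and in a real `map_batches` call each batch can additionally be processed with vectorized DataFrame operations.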

Let's first try the row-based transformation.

In [21]:
%%time
arrow_ds.filter(lambda x: x['amount'] > 10000).take(1)

Read->Filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.07s/it]

CPU times: user 25 ms, sys: 13.5 ms, total: 38.5 ms
Wall time: 5.36 s





[ArrowRow({'id': 667,
           'ssn': 'anonomized-602910443',
           'name': 'anonomized-IMUiGX',
           'amount': 10005.0,
           'interest': 0.2,
           'state': 'CA',
           'marital_status': 'divorced',
           'property': 'cottage',
           'dependents': 3,
           'defaulted': 0,
           'gender': 'M'})]

Let's try `.map_batches()`. We should expect faster execution.

*Question: Why did `.filter()` return an `ArrowRow` while `.map_batches()` returns a `PandasRow`?*

Because `map_batches(..., batch_format="native")` promotes each batch to a Pandas DataFrame. You can
request [other formats](https://docs.ray.io/en/master/data/package-ref.html#ray.data.Dataset.map_batches) such as `pyarrow`. The default is `native`, which here means pandas.

In [22]:
%%time
arrow_ds.map_batches(lambda df: df[df["amount"] > 10000]).take(1)

Read->Map_Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 14.60it/s]

CPU times: user 75.3 ms, sys: 23.8 ms, total: 99.2 ms
Wall time: 415 ms





[PandasRow({'id': 667,
            'ssn': 'anonomized-602910443',
            'name': 'anonomized-IMUiGX',
            'amount': 10005.0,
            'interest': 0.2,
            'state': 'CA',
            'marital_status': 'divorced',
            'property': 'cottage',
            'dependents': 3,
            'defaulted': 0,
            'gender': 'M'})]

You can see that `.map_batches()` is a lot faster than the row-based operation (415 ms versus 5.36 s of wall time in the runs above). So for large datasets, prefer
`.map_batches()`.

Let's try a compound filter, both per row and per batch.

In [23]:
%%time
arrow_ds.filter(lambda x: x["amount"] > 10000.00 and x["state"] == 'CA').take(1)

Read->Filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  2.88it/s]

CPU times: user 19.5 ms, sys: 11.3 ms, total: 30.9 ms
Wall time: 1.75 s





[ArrowRow({'id': 667,
           'ssn': 'anonomized-602910443',
           'name': 'anonomized-IMUiGX',
           'amount': 10005.0,
           'interest': 0.2,
           'state': 'CA',
           'marital_status': 'divorced',
           'property': 'cottage',
           'dependents': 3,
           'defaulted': 0,
           'gender': 'M'})]

In [24]:
%%time
arrow_ds.map_batches(lambda df: df[(df["amount"] > 10000) & (df["state"] == "CA")]).take(1)

Read->Map_Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 30.95it/s]

CPU times: user 29 ms, sys: 10.7 ms, total: 39.7 ms
Wall time: 182 ms





[PandasRow({'id': 667,
            'ssn': 'anonomized-602910443',
            'name': 'anonomized-IMUiGX',
            'amount': 10005.0,
            'interest': 0.2,
            'state': 'CA',
            'marital_status': 'divorced',
            'property': 'cottage',
            'dependents': 3,
            'defaulted': 0,
            'gender': 'M'})]

Group by `state` using `groupby` and compute the count, which will include a shuffle of data.

In [25]:
results = arrow_ds.groupby("state").count()

Read: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 120.41it/s]
Sort Sample: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1623.43it/s]
Shuffle Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.69it/s]
Shuffle Reduce: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1129.75it/s]


In [26]:
results.show()

{'state': 'AZ', 'count()': 99456}
{'state': 'CA', 'count()': 100476}
{'state': 'NM', 'count()': 100045}
{'state': 'NV', 'count()': 100078}
{'state': 'OR', 'count()': 100215}
{'state': 'TX', 'count()': 99757}
{'state': 'UT', 'count()': 100112}
{'state': 'WA', 'count()': 99861}
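
The map and reduce phases visible in the progress bars can be mimicked with a small plain-Python sketch. It is a simplification under stated assumptions: it partitions rows by a CRC32 hash of the key, whereas the `Sort Sample` stage in the output suggests Ray partitions by sampled sort boundaries. The helper names are hypothetical.

```python
import zlib
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]


def partition_of(value: str, num_partitions: int) -> int:
    """Deterministically map a key value to a partition (assumed hash scheme)."""
    return zlib.crc32(value.encode("utf-8")) % num_partitions


def shuffle_map(block: List[Row], key: str, num_partitions: int) -> List[List[Row]]:
    """Map phase: split one block so that equal keys land in the same partition."""
    parts: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in block:
        parts[partition_of(row[key], num_partitions)].append(row)
    return parts


def shuffle_reduce(parts: List[List[Row]], key: str) -> Counter:
    """Reduce phase: count keys within one partition gathered from all blocks."""
    counts: Counter = Counter()
    for part in parts:
        counts.update(row[key] for row in part)
    return counts


blocks = [
    [{"state": "CA"}, {"state": "TX"}, {"state": "CA"}],
    [{"state": "OR"}, {"state": "CA"}, {"state": "TX"}],
]
num_partitions = 2
mapped = [shuffle_map(b, "state", num_partitions) for b in blocks]

result: Dict[str, int] = {}
for p in range(num_partitions):
    # Each key lives in exactly one partition, so the updates never collide.
    result.update(shuffle_reduce([m[p] for m in mapped], "state"))

print(sorted(result.items()))  # [('CA', 3), ('OR', 1), ('TX', 2)]
```

Because every row with a given state ends up in one partition, each reducer can count its keys independently of the others.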


Get the max of certain columns

In [27]:
results = arrow_ds.max(["amount", "interest", "dependents"])
results

Read: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 132.18it/s]
Shuffle Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 825.49it/s]
Shuffle Reduce: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 348.19it/s]


ArrowRow({'max(amount)': 12000000.0,
          'max(interest)': 0.5,
          'max(dependents)': 5})

### Any questions?

### Accessing datasets using batches or iterating by rows

Datasets can be passed to Ray tasks or actors and iterated over with `.iter_batches()` or `.iter_rows()`. (This does not incur a copy, since the blocks of the Dataset are passed by reference as Ray objects.) Splitting data as shards and passing to individual Ray Actors to process shards is a common Ray pattern used in distributed training with Ray actors.

Let's examine how we can process a list of shards with a `BatchWorker` actor in a distributed fashion.

<img src="images/batch_worker.jpg" width="70%" height="35%">

A Ray actor `BatchWorker` working through shards in a batch size of 1024.

In [28]:
@ray.remote
class BatchWorker:
    def __init__(self, rank):
        self.rank = rank         # could be the rank of a CPU/GPU or a worker id
        self.processed = 0       # running count of rows processed; this is the actor's state
    
    @ray.method(num_returns=2)   # return the tuple's two elements as two separate ObjectRefs
    def process_shard_list(self, shard: ray.data.Dataset) -> tuple:
        for batch in shard.iter_batches(batch_size=1024):
            # here you could do something with the batch such as feature
            # preprocessing, image processing, minor transformation and then
            # saving as a parquet file, etc
            self.processed = self.processed + len(batch)
            
        # return items processed, worker id
        return (self.processed, self.rank)     

#### Create batch workers as Ray actors
Each actor will get a shard (a `Dataset` holding a subset of the rows) to work on. We split
our dataset `arrow_ds` into five shards, and each `BatchWorker` gets one.

`.split()` distributes the shards across these batch workers, using `locality_hints` to co-locate each shard's blocks with its actor.

In [29]:
# List comprehension to create five actors, each a BatchWorker
batch_workers = [BatchWorker.remote(i) for i in range(1, 6)]

# Split into five shards, each one for an actor and co-locate 
# the blocks of the dataset with each actor to maximize data locality.
shards = arrow_ds.split(len(batch_workers), locality_hints=batch_workers)

print(f"Shard row: {shards[0]}")
print(f"Number of shards:{len(shards)}")
print(f"Number of shard workers:{len(batch_workers)}")

Read progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 131.93it/s]

Shard row: Dataset(num_blocks=1, num_rows=160000, schema={id: int64, ssn: string, name: string, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string})
Number of shards:5
Number of shard workers:5
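
The arithmetic behind the shard and batch sizes can be checked with a small plain-Python sketch. The helpers `split_equal` and `iter_batches` are hypothetical stand-ins that operate on row index ranges rather than on Ray datasets, and the contiguous split is an assumption.

```python
from typing import Iterator, List


def split_equal(num_rows: int, n: int) -> List[range]:
    """Split row indices 0..num_rows-1 into n contiguous, near-equal shards."""
    base, extra = divmod(num_rows, n)
    shards, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


def iter_batches(shard: range, batch_size: int) -> Iterator[range]:
    """Yield consecutive batches of at most `batch_size` rows from a shard."""
    for start in range(0, len(shard), batch_size):
        yield shard[start:start + batch_size]


row_shards = split_equal(800_000, 5)
batch_lengths = [len(b) for b in iter_batches(row_shards[0], 1024)]

print(len(row_shards[0]))  # 160000
print(len(batch_lengths))  # 157
print(batch_lengths[-1])   # 256
```

So a worker iterating over one 160,000-row shard in batches of 1024 sees 156 full batches followed by a final batch of 256 rows.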





### Launch `BatchWorker` actors

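Each batch worker processes its shard. Because the method is declared with `num_returns=2`, each call to `BatchWorker.process_shard_list.remote()` returns a pair of `ObjectRef`s, one for each element of the returned tuple. What we get from this comprehension is therefore a list of `ObjectRef` pairs.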

In [33]:
object_refs = [w.process_shard_list.remote(s) for w, s in zip(batch_workers, shards)]
object_refs, len(object_refs)

([[ObjectRef(57f023b5f2c83c93b2108de0ce683f38946af7ed0100000001000000),
   ObjectRef(57f023b5f2c83c93b2108de0ce683f38946af7ed0100000002000000)],
  [ObjectRef(7486c9c5cb2b345ef895f615bca930b2d7e2c48e0100000001000000),
   ObjectRef(7486c9c5cb2b345ef895f615bca930b2d7e2c48e0100000002000000)],
  [ObjectRef(9a667646e288b252c8ae50d27af20f72e99373d90100000001000000),
   ObjectRef(9a667646e288b252c8ae50d27af20f72e99373d90100000002000000)],
  [ObjectRef(058595f16dc6f278b58e73759c8c58f7ae06dbc40100000001000000),
   ObjectRef(058595f16dc6f278b58e73759c8c58f7ae06dbc40100000002000000)],
  [ObjectRef(72482135a26f4e0f3830a675c8754f0fa8a6ecd50100000001000000),
   ObjectRef(72482135a26f4e0f3830a675c8754f0fa8a6ecd50100000002000000)]],
 5)

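Fetch the values from the returned list of `ObjectRef` pairs. Each pair resolves to (rows processed, worker rank). Each shard holds 160,000 rows; the 320,000 shown below most likely reflects this cell having been run twice against the same actors, since `self.processed` accumulates across calls.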

In [34]:
values = [ray.get(ref) for ref in object_refs]
values

[[320000, 1], [320000, 2], [320000, 3], [320000, 4], [320000, 5]]

In [41]:
# Kill the BatchWorkers
[ray.kill(w) for w in batch_workers]

[None, None, None, None, None]

### Any questions?

Let's examine `DatasetPipeline`.

### Creating and using Ray dataset pipelines

What are dataset pipelines, and how are they different from Ray datasets?

Datasets perform transformations eagerly and synchronously, whereas [DatasetPipelines](https://docs.ray.io/en/latest/data/package-ref.html#datasetpipeline-api) execute operations in an overlapped, pipelined fashion. For example, if you have operations that read from a file, transform the data, and then do some minor feature engineering, these operations can run as stages of a pipeline. This allows for the overlapped execution of data input (e.g., reading files), computation (e.g., feature preprocessing), and training (e.g., distributed ML training).

<img src="images/pipeline_window.jpg" width="70%" height="35%">
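
Overlapped execution can be illustrated with nothing more than a thread and a bounded queue. This is a plain-Python analogy, not how Ray schedules pipeline stages: `producer` and `consume` are hypothetical stage functions, and the queue size of one is an arbitrary choice that keeps at most one preprocessed window waiting.

```python
import queue
import threading
from typing import List


def producer(windows: List[List[int]], out: "queue.Queue") -> None:
    """Stage 1: 'read' and preprocess each window, then hand it downstream."""
    for window in windows:
        out.put([x * 2 for x in window])
    out.put(None)  # sentinel: no more windows


def consume(inp: "queue.Queue") -> List[int]:
    """Stage 2: 'train' on each window as soon as it becomes available."""
    totals = []
    while True:
        window = inp.get()
        if window is None:
            return totals
        totals.append(sum(window))


windows = [[1, 2], [3, 4], [5, 6]]
buffer: "queue.Queue" = queue.Queue(maxsize=1)  # at most one window waiting

thread = threading.Thread(target=producer, args=(windows, buffer))
thread.start()
totals = consume(buffer)  # runs while the producer prepares later windows
thread.join()

print(totals)  # [6, 14, 22]
```

While the consumer works on one window, the producer is free to prepare the next, which is the same overlap a `DatasetPipeline` aims for between ingestion, preprocessing, and training.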

#### Two ways to create `DatasetPipeline`

A `DatasetPipeline` can be constructed in two ways: either by pipelining the execution of an existing Dataset (via `Dataset.window`) or generating repeats of an existing Dataset (via `Dataset.repeat`). 

Let's have a go at it with the first method and see what we can do with our simple and synthetic data from above using `Dataset.window`.


### Using Dataset.window

Create simple functions or operations to be executed in an overlapped manner in the pipeline. These functions are deliberately simple to illustrate the point; in practice they can be as complex as your use case requires.

In [35]:
def divide_row_value(row, n) -> int:
    return round(row / n)

In [36]:
def double_row_value(row, n) -> int:
    return row * n

In [37]:
def modulo_row_value(row , n) -> int:
    return row % random.randint(1, n)

#### Create a window based pipeline
With each window of 50 blocks. 

In [38]:
# Use our original simple dataset from above with 1K rows in integer
ds_pipe = ds.window(blocks_per_window=50)
ds_pipe

DatasetPipeline(num_windows=1, num_stages=2)

### Applying overlapping transforms to pipelines

#### Example 1:
This applies several overlapping transformations to our simple dataset from above.

In [34]:
ds_pipe = ds_pipe.map(lambda row: divide_row_value(row, 2))
ds_pipe = ds_pipe.map(lambda row: double_row_value(row, 3))
ds_pipe = ds_pipe.map(lambda row: modulo_row_value(row, 4))
print(ds_pipe)

DatasetPipeline(num_windows=1, num_stages=5)


#### Iterate our pipeline


In [35]:
results=[]
for row in ds_pipe.iter_rows():
    results.append(row)
# print(f"Results from each pipeline map function:{results}")
print(f"Total value of the results: {sum(results)}")

Stage 0: 100%|██████████| 1/1 [00:00<00:00,  1.73it/s]
Stage 1: 100%|██████████| 1/1 [00:00<00:00,  1.74it/s]

Total value of the results: 393





#### Example 2

Let's try a `DatasetPipeline` with our synthetic *Homeowners* data generated above.

In [42]:
# Define a function to count or return based on the condition
def count_state(row, state) -> int:
    return 1 if row['state'] == state and row["defaulted"] else 0

In [43]:
# Create a window-based pipeline
arrow_ds_pipe = arrow_ds.window(blocks_per_window=50)
arrow_ds_pipe

DatasetPipeline(num_windows=1, num_stages=2)

In [44]:
# Add the first stage to the pipeline
arrow_ds_pipe = arrow_ds_pipe.map(lambda row: count_state(row, "CA"))
arrow_ds_pipe

DatasetPipeline(num_windows=1, num_stages=3)

In [45]:
# Execute the pipeline 
results=[]
for row in arrow_ds_pipe.iter_rows():
    results.append(row)
print(f"Total rows for CA state and defaulted loans rows: {sum(results)}")

Stage 0: 100%|██████████| 1/1 [00:02<00:00,  2.88s/it]
Stage 1: 100%|██████████| 1/1 [00:02<00:00,  2.88s/it]

Total rows for CA state and defaulted loans rows: 50029





### Any questions?

Let's take the pattern we used above for `BatchWorker` a step further and use it
for distributed model training.

## Ingesting data into Model Trainers
Let's define a toy `Trainer` actor that takes a shard of our synthetic data, trains a model on it, and returns that trainer's loss. This is a common pattern in Ray libraries for distributing data to Trainers in a Ray cluster.

<img src="images/trainer_worker.jpg" width="60%" height="30%">

In [46]:
# Our dummy model computing loss or accuracy
import time

def model(input):
    # do some training and compute the loss
    # ....
    # time.sleep(0.005)
    return random.uniform(0, 1)

@ray.remote
class Trainer:
    def __init__(self, rank, model):
        self.rank = rank
        self.model = model # copy of the model (or ray.put(model))
        self.loss = 0.0
        
    def train(self, shard:ray.data.Dataset) -> float:
        for epoch in range(1,21):
            for batch in shard.iter_batches(batch_size=1024):
                output = self.model(batch)
                self.loss = output 
            if epoch % 5 == 0:
                print(f'rank: {self.rank} epoch: {epoch}, loss: {self.loss:.3f}')
        return self.loss

#### Create five trainers, each with a copy of the model and each training on its respective shard

In [47]:
trainers = [Trainer.remote(i, model) for i in range(1, 6)]
trainers

[Actor(Trainer, f5248240a89ca6c89f93df9201000000),
 Actor(Trainer, f2e51c72c5b5eaab8cf882db01000000),
 Actor(Trainer, e0947e0b2c3d3f6328b44fb401000000),
 Actor(Trainer, 6c444dfefdf6bd2dbe4af95901000000),
 Actor(Trainer, 29b4147167f5992c157e18b401000000)]

#### Split the shards across all trainers

In [48]:
shards = arrow_ds.split(n=len(trainers), locality_hints=trainers)
shards

Read progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.57it/s]


[Dataset(num_blocks=1, num_rows=160000, schema={id: int64, ssn: string, name: string, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string}),
 Dataset(num_blocks=1, num_rows=160000, schema={id: int64, ssn: string, name: string, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string}),
 Dataset(num_blocks=1, num_rows=160000, schema={id: int64, ssn: string, name: string, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string}),
 Dataset(num_blocks=1, num_rows=160000, schema={id: int64, ssn: string, name: string, amount: double, interest: double, state: string, marital_status: string, property: string, dependents: int64, defaulted: int64, gender: string}),
 Dataset(num_blocks=1, num_rows=160000, schema={id: int64, ssn: string, 

#### Launch our trainers in a distributed fashion

This will run across the cluster. Check the dashboard to see the five actors launched. On a multi-node cluster they may be scheduled on different nodes, whereas on a single node they run on
five different cores.

In [49]:
object_refs = [t.train.remote(s) for t, s in zip(trainers, shards)]

[2m[36m(Trainer pid=16152)[0m rank: 4 epoch: 5, loss: 0.895
[2m[36m(Trainer pid=16151)[0m rank: 3 epoch: 5, loss: 0.448
[2m[36m(Trainer pid=16149)[0m rank: 1 epoch: 5, loss: 0.518
[2m[36m(Trainer pid=16153)[0m rank: 5 epoch: 5, loss: 0.651
[2m[36m(Trainer pid=16150)[0m rank: 2 epoch: 5, loss: 0.297
[2m[36m(Trainer pid=16149)[0m rank: 1 epoch: 10, loss: 0.860
[2m[36m(Trainer pid=16153)[0m rank: 5 epoch: 10, loss: 0.158
[2m[36m(Trainer pid=16152)[0m rank: 4 epoch: 10, loss: 0.750
[2m[36m(Trainer pid=16150)[0m rank: 2 epoch: 10, loss: 0.049
[2m[36m(Trainer pid=16151)[0m rank: 3 epoch: 10, loss: 0.604
[2m[36m(Trainer pid=16149)[0m rank: 1 epoch: 15, loss: 0.569
[2m[36m(Trainer pid=16153)[0m rank: 5 epoch: 15, loss: 0.347
[2m[36m(Trainer pid=16152)[0m rank: 4 epoch: 15, loss: 0.444
[2m[36m(Trainer pid=16150)[0m rank: 2 epoch: 15, loss: 0.297
[2m[36m(Trainer pid=16151)[0m rank: 3 epoch: 15, loss: 0.116
[2m[36m(Trainer pid=16149)[0m rank: 1 epoc

In [50]:
ray.get(object_refs)

[0.8645845130210484,
 0.4362154258789336,
 0.005786463921397145,
 0.9289078395198194,
 0.9924620574567161]

In [51]:
ray.shutdown()

### Exercises

Use [Ray data API](https://docs.ray.io/en/latest/data/package-ref.html) for reference:

 1. Write some simple transformers, filters, and aggregators with our synthetic data. For example:
  * use [`.add_column()`](https://docs.ray.io/en/latest/data/package-ref.html) to add an `age` column
  * filter by gender == 'U'
  * aggregate (or groupby `property`) and count each. 
 2. Add an additional pipeline stage function, `def count_tx(...)`, for our synthetic data. For example, count all people in the state of `TX` who are `married` and `defaulted`.

### Homework

So far we have covered the basics of Ray Datasets. There are advanced topics that you can now explore since you know the basics. Below is a list of tasks you will want to work through at home.

1. Rewrite all task-based `map_batches()` transformations to use an [`ActorPoolStrategy` pool](https://docs.ray.io/en/latest/data/transforming-datasets.html#compute-strategy)
2. Work through the [NYC example tutorial](extra/ray_data_nyc.ipynb). This explores how you use `.map_batches()` for filtering and map operations and Parquet push-down predicates for projection.
3. Peruse the user guides for advanced examples in [data transformation](https://docs.ray.io/en/master/data/transforming-datasets.html#transforming-datasets) and [ML preprocessing](https://docs.ray.io/en/master/data/dataset-ml-preprocessing.html#datasets-ml-preprocessing).
4. Read how to do large scale [ML ingest](https://docs.ray.io/en/master/data/examples/big_data_ingestion.html).
5. Check out the advanced [pipeline usage](https://docs.ray.io/en/latest/data/advanced-pipelines.html#).

### References

1. [Ray Data Documentation](https://docs.ray.io/en/latest/data/dataset.html)
2. [Ray Data Webinar Talk](https://www.anyscale.com/events/2022/02/23/ray-datasets-scalable-data-preprocessing-for-distributed-ml)


Done! 🍻