# Gentle introduction to Ray datasets APIs

© 2019-2022, Anyscale. All Rights Reserved

### Overview

This is a brief introduction to Ray's native library, Ray Datasets. Built atop Ray, it allows you to exchange data among Ray tasks, actors, libraries, and applications. Additionally, Ray Datasets provides standard transformations such as `map`, `filter`, and `repartition`. Ray Datasets is *not* a replacement for a full-fledged data processing library for EDA or ETL, nor a substitute for Apache Spark, Dask, or pandas DataFrames. Its primary objective is last-mile rudimentary data preprocessing and data ingestion for ML training.

Ray Datasets supports myriad [file formats and data sources](https://docs.ray.io/en/latest/data/dataset.html#datasource-compatibility), so you can read from and write to local file systems and cloud storage.

<img src="images/dataset.png" width="70%" height="35%">

### Learning objectives

In this introductory tutorial you will learn how to:
 * create, transform, read and save Ray datasets
 * use shards for parallel processing of large datasets
 * understand dataset pipelines and their merits
 * use DatasetPipeline for last-mile ML ingestion for your distributed trainers

### Ray Datasets

A Ray dataset implements distributed [Apache Arrow](https://arrow.apache.org/). A Dataset consists of a list of Ray object references to blocks. Each block holds a set of items in either an [Arrow table](https://arrow.apache.org/docs/python/data.html#tables) or a Python list (for Arrow-incompatible objects).

<img src="images/dataset-arch.png" width="70%" height="35%">
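To make the block structure concrete, here is a minimal plain-Python sketch. It is not Ray's implementation: `ToyDataset` and `split_into_blocks` are hypothetical names, and ordinary Python lists stand in for both Arrow tables and Ray object references.

```python
from typing import Any, List


def split_into_blocks(items: List[Any], num_blocks: int) -> List[List[Any]]:
    """Split items into num_blocks contiguous, near-equal blocks."""
    base, extra = divmod(len(items), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # The first `extra` blocks carry one additional item.
        end = start + base + (1 if i < extra else 0)
        blocks.append(items[start:end])
        start = end
    return blocks


class ToyDataset:
    """Hypothetical stand-in: a dataset is just a list of blocks."""

    def __init__(self, blocks: List[List[Any]]):
        self.blocks = blocks

    def count(self) -> int:
        # Counting only needs per-block sizes, so blocks can be handled independently.
        return sum(len(block) for block in self.blocks)

    def take(self, limit: int) -> List[Any]:
        # Walk the blocks in order until `limit` items are collected.
        out: List[Any] = []
        for block in self.blocks:
            for item in block:
                if len(out) == limit:
                    return out
                out.append(item)
        return out


toy = ToyDataset(split_into_blocks(list(range(10)), num_blocks=4))
print([len(b) for b in toy.blocks])  # [3, 3, 2, 2]
print(toy.count(), toy.take(5))      # 10 [0, 1, 2, 3, 4]
```

Because every operation works block by block, the same idea extends to blocks that live on different machines, which is what the object references make possible.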

### Creating datasets

In [None]:
import logging, random
import ray

In [None]:
if ray.is_initialized():
    ray.shutdown()
ctx = ray.init(logging_level=logging.ERROR)
print(ctx)

In [None]:
print(f"Dashboard url: http://{ctx.address_info['webui_url']}")

Let's create a generic dataset of 100K integers and look at the schema and underlying data type. The difference between `show` and `take` is that the former prints items one at a time, while the latter iterates over rows in the dataset, appends them to a list, and returns the list. `ds.show()` calls `ds.take()`.

In [None]:
ds = ray.data.range(100_000)
ds.count(), ds.schema(), ds.show(5), ds.take(5)
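The `show`/`take` relationship described above can be sketched in plain Python. The functions below are illustrative stand-ins operating on any iterable, not Ray's code, and the default `limit` of 20 is an assumption.

```python
from itertools import islice
from typing import Any, Iterable, List


def take(rows: Iterable[Any], limit: int = 20) -> List[Any]:
    """Collect up to `limit` rows into a list and return it."""
    return list(islice(rows, limit))


def show(rows: Iterable[Any], limit: int = 20) -> None:
    """Print up to `limit` rows, one per line; nothing is returned."""
    for row in take(rows, limit):
        print(row)


first_five = take(range(100_000), 5)
print(first_five)  # [0, 1, 2, 3, 4]
show(range(100_000), 5)
```

Since `show` only prints, its slot in a tuple such as the one in the cell above evaluates to `None`, while the `take` slot holds the list of rows.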

Let's create a synthetic dataset of Arrow records with seven columns and data associated with it. 

In [None]:
STATES = ["CA", "AZ", "OR", "WA", "TX", "UT"]
M_STATUS = ["married", "single", "domestic", "divorced", "undeclared"]
GENDER = ["F", "M", "U"]

items = [{"id": i, 
          "amount": i * 1.5, 
          "interest": random.randint(1,5) * .1,
          "state": random.choice(STATES),
          "marital_status": random.choice(M_STATUS),
          "defaulted": random.randint(0,1),
          "gender":random.choice(GENDER) } for i in range(1,250_001)]
items[:2]

In [None]:
arrow_ds = ray.data.from_items(items)
arrow_ds

In [None]:
arrow_ds.count(), arrow_ds.take(2)

In [None]:
arrow_ds.schema()

### Saving datasets
Ray datasets support myriad data formats and public storage. Let's split this dataset into 5 partitions and save it as Parquet files.

In [None]:
arrow_ds.repartition(5).write_parquet("data/interest.parquet")

In [None]:
!ls -l data/interest.parquet

### Transformation

Ray datasets support parallel transformations using `map`, which uses Ray tasks and executes eagerly and synchronously. Among other [transformations](https://docs.ray.io/en/latest/data/package-ref.html#dataset-api), they support `filter`, `flat_map`, `groupby`, etc.

Let's try using `.map()` and `.filter()` on our dataset.

In [None]:
arrow_ds.map(lambda x: x['amount'] * 2.5).take(5)

In [None]:
arrow_ds.filter(lambda x: x['amount'] > 10000.00 and x['state'] == 'CA').take(2)
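For readers new to these transformations, the sketch below shows what each one computes, using ordinary Python lists on a three-row sample. The sample rows are made up for illustration, and nothing here runs in parallel; it only mirrors the semantics of `map`, `filter`, `flat_map`, and a group-by aggregation.

```python
from collections import defaultdict

rows = [
    {"id": 1, "amount": 1.5, "state": "CA"},
    {"id": 2, "amount": 3.0, "state": "OR"},
    {"id": 3, "amount": 4.5, "state": "CA"},
]

# map: exactly one output per input row.
mapped = [row["amount"] * 2.5 for row in rows]

# filter: keep only the rows for which the predicate is true.
filtered = [row for row in rows if row["amount"] > 2.0 and row["state"] == "CA"]

# flat_map: each row may produce several outputs, flattened into one list.
flat_mapped = [value for row in rows for value in (row["id"], row["state"])]

# group by a key, then aggregate each group (here: sum of amounts per state).
totals = defaultdict(float)
for row in rows:
    totals[row["state"]] += row["amount"]

print(mapped)        # [3.75, 7.5, 11.25]
print(filtered)      # [{'id': 3, 'amount': 4.5, 'state': 'CA'}]
print(flat_mapped)   # [1, 'CA', 2, 'OR', 3, 'CA']
print(dict(totals))  # {'CA': 6.0, 'OR': 3.0}
```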

### Accessing datasets

Datasets can be passed to Ray tasks or actors and read with `.iter_batches()` or `.iter_rows()`. This does not incur a copy, since the blocks of the Dataset are passed by reference as Ray objects. Splitting data into shards and passing each shard to an individual Ray actor for processing is a common pattern in distributed training with Ray.

Let's examine how.

In [None]:
@ray.remote
class BatchWorker:
    def __init__(self, rank):
        self.rank = rank
        self.processed = 0

    @ray.method(num_returns=2)
    def process_shard_list(self, shard) -> tuple:
        for batch in shard.iter_batches(batch_size=1024):
            # Do something with the batch here, such as feature
            # processing and transformation, then save the
            # result as Parquet files.
            self.processed = self.processed + len(batch)
        # Return the number of items processed and the worker rank.
        return (self.processed, self.rank)

#### Create batch workers as Ray actors
Each actor gets one shard, a subset of the dataset's rows, to work on. We split
our dataset `arrow_ds` into five shards, and each `BatchWorker` gets one.

In [None]:
batch_workers = [BatchWorker.remote(i) for i in range(1, 6)]
shards = arrow_ds.split(n=5, locality_hints=batch_workers)
shards

Launch the `BatchWorker` actors to process each shard. Because `process_shard_list()` is declared with `num_returns=2`, each remote call returns two ObjectRefs, one per element of the returned tuple. What we get from this comprehension is a list of these ObjectRef pairs.

In [None]:
object_refs = [w.process_shard_list.remote(s) for w, s in zip(batch_workers, shards)]
# object_refs, len(object_refs)

Fetch the values behind the returned ObjectRefs. Each entry holds the number of items processed and the worker rank.

In [None]:
values = [ray.get(ref) for ref in object_refs]
values
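As a rough check on the numbers to expect, the arithmetic behind sharding and batching can be worked out in plain Python. This is a sketch under two assumptions: that the 250,000 rows are split evenly across the five shards (Ray's `split` may balance differently), and that every batch except possibly the last contains exactly 1024 rows. The helper names `shard_sizes` and `batch_sizes` are hypothetical.

```python
from typing import List


def shard_sizes(num_rows: int, num_shards: int) -> List[int]:
    """Sizes of near-equal shards; the first shards absorb any remainder."""
    base, extra = divmod(num_rows, num_shards)
    return [base + (1 if i < extra else 0) for i in range(num_shards)]


def batch_sizes(num_rows: int, batch_size: int) -> List[int]:
    """Sizes of consecutive batches; the final batch may be smaller."""
    full, rest = divmod(num_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = shard_sizes(250_000, 5)
batches = batch_sizes(sizes[0], 1024)
print(sizes)                      # [50000, 50000, 50000, 50000, 50000]
print(len(batches), batches[-1])  # 49 848
print(sum(batches))               # 50000
```

Whatever the exact batch boundaries, the per-worker count is the sum of its batch lengths, so the counts across all workers should add up to the size of the dataset.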

### Creating and using dataset pipelines

What are dataset pipelines and how are they different from Ray datasets? Datasets perform transformations or operations eagerly and synchronously, whereas [DatasetPipelines](https://docs.ray.io/en/latest/data/package-ref.html#datasetpipeline-api) can overlap the execution of their stages. For example, if you had operations that require reading from a file, transforming data, and then doing some minor feature engineering, these operations can be executed as successive pipeline stages. This allows for the overlapped execution of data input (e.g., reading files), computation (e.g., feature preprocessing), and output (e.g., distributed ML training).

A DatasetPipeline can be constructed in two ways: either by pipelining the execution of an existing Dataset (via `Dataset.window`), or by generating repeats of an existing Dataset (via `Dataset.repeat`).
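The two construction styles can be pictured with plain Python generators. This is only a conceptual sketch: `window`, `repeat`, and `run_stage` are hypothetical helpers operating on lists of blocks, and generators run one window after another, whereas a real pipeline can work on different windows in different stages at the same time.

```python
from typing import Callable, Iterator, List, Sequence

Block = List[int]


def window(blocks: Sequence[Block], blocks_per_window: int) -> Iterator[List[Block]]:
    """Yield consecutive groups of blocks; each group is one window."""
    for start in range(0, len(blocks), blocks_per_window):
        yield list(blocks[start:start + blocks_per_window])


def repeat(blocks: Sequence[Block], times: int) -> Iterator[List[Block]]:
    """Yield the whole dataset `times` times; each pass is one window."""
    for _ in range(times):
        yield list(blocks)


def run_stage(windows: Iterator[List[Block]], fn: Callable[[int], int]) -> Iterator[List[Block]]:
    """Lazily apply `fn` to every row, one window at a time."""
    for win in windows:
        yield [[fn(row) for row in block] for block in win]


blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]
pipe = run_stage(window(blocks, blocks_per_window=2), lambda row: row + 1)
print(list(pipe))  # [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(sum(1 for _ in repeat(blocks, times=3)))  # 3
```

The key property is laziness: no window is transformed until a consumer asks for it, so only a bounded part of the data needs to be in flight at once.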

Let's have a go at it and see what we can do with our synthetic data from above.


### Using Dataset.window

Create some functions or operations to be executed in an overlapped manner in the pipeline.

In [10]:
def count_1(row: int) -> int:
    return row + 1

In [11]:
def count_2(row: int) -> int:
    return row * 2

In [12]:
def count_3(row: int) -> int:
    return row % 3

#### Create a window based pipeline

In [15]:
ds_pipe = ds.window(blocks_per_window=50)

In [16]:
# Applying transforms to pipelines adds more pipeline stages.
ds_pipe = ds_pipe.map(count_1)
ds_pipe = ds_pipe.map(count_2)
ds_pipe = ds_pipe.map(count_2)
print(ds_pipe)

DatasetPipeline(num_windows=4, num_stages=5)


#### Iterate our pipeline

In [17]:
results=[]
for row in ds_pipe.iter_rows():
    results.append(row)
print(f"Total value: {sum(results)}")

Total value: 20000200000
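The total can be checked by hand. The pipeline applies `count_1` once and `count_2` twice, so each integer `i` in `0..99999` becomes `4 * (i + 1)`. The short sketch below compares the closed-form sum with a brute-force computation in plain Python, without using Ray.

```python
# Closed form: the sum of 4 * (i + 1) for i in 0..n-1 equals 4 * n * (n + 1) / 2.
n = 100_000
closed_form = 4 * n * (n + 1) // 2

# Brute force: apply the same three steps (+1, *2, *2) to every integer.
brute_force = sum(((i + 1) * 2) * 2 for i in range(n))

print(closed_form, brute_force)  # 20000200000 20000200000
```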





Let's try the same with our synthetic data.

In [18]:
# Return 1 for a defaulted loan in CA, otherwise 0
def count_ca(row: ray.data.impl.arrow_block.ArrowRow):
    return 1 if row['state'] == "CA" and row["defaulted"] else 0

In [19]:
arrow_ds_pipe = arrow_ds.window(blocks_per_window=50)

In [20]:
arrow_ds_pipe = arrow_ds_pipe.map(count_ca)
print(arrow_ds_pipe)

DatasetPipeline(num_windows=4, num_stages=2)


In [21]:
results=[]
for row in arrow_ds_pipe.iter_rows():
    results.append(row)
print(f"Total CA state and defaulted loans rows: {sum(results)}")

Total CA state and defaulted loans rows: 20761
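Because the synthetic data is random, this count changes from run to run, but its expected value is easy to estimate: one of six states is `CA`, and `defaulted` is 1 about half of the time. The sketch below computes that expectation and repeats the draw in plain Python; the seed value is an arbitrary choice, so the simulated count will not match the run above exactly.

```python
import random

NUM_ROWS = 250_000
STATES = ["CA", "AZ", "OR", "WA", "TX", "UT"]
P_DEFAULTED = 0.5  # random.randint(0, 1) returns 1 half of the time

expected = NUM_ROWS * (1 / len(STATES)) * P_DEFAULTED
print(round(expected))  # 20833

# Simulate the same two independent draws per row with a fixed seed.
rng = random.Random(0)
simulated = sum(
    1
    for _ in range(NUM_ROWS)
    if rng.choice(STATES) == "CA" and rng.randint(0, 1)
)
print(simulated)
```

A result of 20761 is close to the expected 20833, which suggests the pipeline counted the rows correctly.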





In [22]:
ray.shutdown()