# Gentle introduction to Ray datasets APIs

© 2019-2022, Anyscale. All Rights Reserved

### Overview

This is a brief introduction to Ray Datasets, Ray's native data library (`ray.data`). Built atop Ray, it allows you to exchange data among Ray tasks, actors, libraries, and applications. Additionally, Ray Datasets provides standard transformations such as `map`, `filter`, and `repartition`. Ray Datasets is *not* a replacement for a full-fledged data processing library for EDA or ETL, nor a substitute for Apache Spark, Dask, or pandas DataFrames. Its primary objective is last-mile data preprocessing and data ingestion for ML training.

Ray Datasets supports myriad [file formats and data sources](https://docs.ray.io/en/latest/data/dataset.html#datasource-compatibility), so you can read from and write to local file systems and cloud storage.

<img src="images/dataset.png" width="70%" height="35%">

### Learning objectives

In this introductory tutorial you will:
 * create, transform, read, and save Ray datasets
 * understand data pipelines and their use
 * use Ray Datasets for last-mile ML ingestion for your distributed trainers

### Ray Datasets

A Ray dataset implements distributed [Apache Arrow](https://arrow.apache.org/). A Dataset consists of a list of Ray object references to blocks. Each block holds a set of items in either an [Arrow table](https://arrow.apache.org/docs/python/data.html#tables) or a Python list (for Arrow-incompatible objects).

<img src="images/dataset-arch.png" width="70%" height="35%">
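
To make the "list of blocks" idea concrete, here is a minimal plain-Python sketch. It is a toy model only, not Ray's implementation: the class `ToyDataset` and the parameter `num_blocks` are hypothetical names, and the blocks are ordinary in-memory lists rather than object references to Arrow tables.

```python
from typing import Any, Iterable, Iterator, List


class ToyDataset:
    """Toy stand-in for a dataset: an ordered list of blocks, each a list of rows."""

    def __init__(self, blocks: List[List[Any]]) -> None:
        self.blocks = blocks

    @classmethod
    def from_items(cls, items: Iterable[Any], num_blocks: int) -> "ToyDataset":
        # Cut the items into contiguous blocks of roughly equal size.
        items = list(items)
        size = max(1, -(-len(items) // num_blocks))  # ceiling division
        return cls([items[i:i + size] for i in range(0, len(items), size)])

    def count(self) -> int:
        # The row count is the sum of the block lengths.
        return sum(len(block) for block in self.blocks)

    def iter_rows(self) -> Iterator[Any]:
        # Rows are visited block by block, in order.
        for block in self.blocks:
            yield from block


toy = ToyDataset.from_items(range(10), num_blocks=3)
print(toy.blocks)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(toy.count())  # 10
```

In the real library each block lives in Ray's object store and the dataset holds only references to the blocks, which is why a dataset can be handed to tasks and actors cheaply.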

### Creating datasets

In [17]:
import logging, random
import ray

In [18]:
if ray.is_initialized():
    ray.shutdown()
ctx = ray.init(logging_level=logging.ERROR)
print(ctx)

RayContext(dashboard_url='127.0.0.1:8265', python_version='3.8.13', ray_version='1.12.1', ray_commit='4863e33856b54ccf8add5cbe75e41558850a1b75', address_info={'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-05-24_17-43-48_949432_96774/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-05-24_17-43-48_949432_96774/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2022-05-24_17-43-48_949432_96774', 'metrics_export_port': 64567, 'gcs_address': '127.0.0.1:64282', 'address': '127.0.0.1:64282', 'node_id': 'b08f82ca3a745634027e92818d54aa6a247ce6278c638495b9dbf87c'})


In [19]:
print(f"Dashboard url: http://{ctx.address_info['webui_url']}")

Dashboard url: http://127.0.0.1:8265


Let's create a generic dataset of 20,000 integers and look at its row count, schema, and first few items. The difference between `show` and `take` is that the former prints items one at a time and returns `None`, while the latter collects items from the dataset into a list and returns it. `ds.show()` calls `ds.take()`.

In [20]:
ds = ray.data.range(20_000)
ds.count(), ds.schema(), ds.show(5), ds.take(5)

0
1
2
3
4


(20000, int, None, [0, 1, 2, 3, 4])
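
The relationship between `show` and `take` can be sketched in plain Python. This is an illustration of the semantics described above, not Ray's source code. The functions operate on any iterable, and the default `limit=20` is an assumption made for this sketch.

```python
from itertools import islice


def take(rows, limit=20):
    """Collect up to `limit` rows into a list and return it."""
    return list(islice(rows, limit))


def show(rows, limit=20):
    """Print up to `limit` rows, one per line; returns None."""
    for row in take(rows, limit):
        print(row)


first = take(range(20_000), 5)    # a list of the first five items
result = show(range(20_000), 5)   # prints the five items, returns None
```

This explains the output of the cell above: the printed lines come from `show`, and the `None` in the returned tuple is its return value.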

Let's create a dataset of Arrow records with seven columns and generated data for each column.

In [21]:
STATES = ["CA", "AZ", "OR", "WA", "TX", "UT"]
M_STATUS = ["married", "single", "domestic", "divorced", "undeclared"]
GENDER = ["F", "M", "U"]
items = [{"id": i, 
          "amount": i * 1.5, 
          "interest": random.randint(1,5) * .1,
          "state": random.choice(STATES),
          "marital_status": random.choice(M_STATUS),
          "defaulted": random.randint(0,1),
          "gender":random.choice(GENDER) } for i in range(1,500_001)]
items[:2]

[{'id': 1,
  'amount': 1.5,
  'interest': 0.1,
  'state': 'CA',
  'marital_status': 'divorced',
  'defaulted': 0,
  'gender': 'M'},
 {'id': 2,
  'amount': 3.0,
  'interest': 0.1,
  'state': 'CA',
  'marital_status': 'divorced',
  'defaulted': 1,
  'gender': 'U'}]

In [22]:
arrow_ds = ray.data.from_items(items)

In [23]:
arrow_ds.count(), arrow_ds.take(2)

(500000,
 [{'id': 1, 'amount': 1.5, 'interest': 0.1, 'state': 'CA', 'marital_status': 'divorced', 'defaulted': 0, 'gender': 'M'},
  {'id': 2, 'amount': 3.0, 'interest': 0.1, 'state': 'CA', 'marital_status': 'divorced', 'defaulted': 1, 'gender': 'U'}])

In [24]:
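# Without parentheses this displays the bound method (its repr includes the
# dataset summary); call arrow_ds.schema() to get the schema object itself.
arrow_ds.schema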

<bound method Dataset.schema of Dataset(num_blocks=200, num_rows=500000, schema={id: int64, amount: double, interest: double, state: string, marital_status: string, defaulted: int64, gender: string})>

### Saving datasets
Ray Datasets supports myriad data formats and cloud storage. Let's repartition this dataset into 5 partitions and save it as Parquet files.

In [27]:
arrow_ds.repartition(5).write_parquet("data/interest.parquet")

Repartition: 100%|███████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 89.42it/s]
Write Progress: 100%|████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 65.13it/s]


In [28]:
!ls -l data/interest.parquet

total 13552
-rw-r--r--  1 jules  staff  1399864 May 24 17:44 a6484d528e5a49d481ba95abbfa3a934_000000.parquet
-rw-r--r--  1 jules  staff  1384054 May 24 17:44 a6484d528e5a49d481ba95abbfa3a934_000001.parquet
-rw-r--r--  1 jules  staff  1382275 May 24 17:44 a6484d528e5a49d481ba95abbfa3a934_000002.parquet
-rw-r--r--  1 jules  staff  1381706 May 24 17:44 a6484d528e5a49d481ba95abbfa3a934_000003.parquet
-rw-r--r--  1 jules  staff  1381278 May 24 17:44 a6484d528e5a49d481ba95abbfa3a934_000004.parquet


### Transformation

Ray Datasets supports parallel transformations such as `map`, which run as Ray tasks and execute eagerly.

Among other [transformations](https://docs.ray.io/en/latest/data/package-ref.html#dataset-api), it supports `filter`, `flat_map`, and `groupby`.

Let's try using `.map()` and `.filter()` on our dataset.

In [29]:
arrow_ds.map(lambda x: x['amount'] * 2.5).take(5)

Map Progress: 100%|██████████████████████████████████████████████████████████████████████| 200/200 [00:02<00:00, 95.78it/s]


[3.75, 7.5, 11.25, 15.0, 18.75]

In [30]:
arrow_ds.filter(lambda x: x['amount'] > 10000.00 and x['state'] == 'CA').take(2)

Map Progress: 100%|██████████████████████████████████████████████████████████████████████| 200/200 [00:02<00:00, 88.46it/s]


[{'id': 6668, 'amount': 10002.0, 'interest': 0.30000000000000004, 'state': 'CA', 'marital_status': 'undeclared', 'defaulted': 0, 'gender': 'U'},
 {'id': 6672, 'amount': 10008.0, 'interest': 0.4, 'state': 'CA', 'marital_status': 'undeclared', 'defaulted': 1, 'gender': 'U'}]

### Exchanging datasets

Datasets can be passed to Ray tasks or actors and read with `.iter_batches()` or `.iter_rows()`. This does not incur a copy, since the blocks of the Dataset are passed by reference as Ray objects.

Let's examine how this works.

In [31]:
@ray.remote
class BatchWorker:
    def __init__(self, rank):
        self.rank = rank
        self.processed = 0

    # num_returns=2 makes .remote() return two object refs, one per tuple element
    @ray.method(num_returns=2)
    def process_shard_list(self, shard) -> tuple:
        for batch in shard.iter_batches(batch_size=1024):
            # do something with the batch,
            # for example write it out as a parquet file
            self.processed += len(batch)
        # return (items processed, worker rank)
        return (self.processed, self.rank)

#### Create batch workers as Ray actors
Each actor gets a shard (a subset of the dataset's rows) to work on. We split
our dataset `arrow_ds` into five shards, and each `BatchWorker` gets one.

In [32]:
batch_workers = [BatchWorker.remote(i) for i in range(1, 6)]
shards = arrow_ds.split(n=5, locality_hints=batch_workers)

In [33]:
object_refs = [w.process_shard_list.remote(s) for w, s in zip(batch_workers, shards)]

In [34]:
values = [ray.get(ref) for ref in object_refs]
values

[[100000, 1], [100000, 2], [100000, 3], [100000, 4], [100000, 5]]