# Intro to Ray Data

This notebook provides an overview of Ray Data and how to use it to load and transform data in a distributed manner.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li><b>Part 1:</b> When to use Ray Data</li>
    <li><b>Part 2:</b> Loading Data</li>
    <li><b>Part 3:</b> Transforming Data</li>
    <li><b>Part 4:</b> Materializing Data</li>
    <li><b>Part 5:</b> Data Operations: Grouping, Aggregation, and Shuffling</li>
    <li><b>Part 6:</b> Persisting Data</li>
</ul>
</div>


## Imports

In [None]:
import numpy as np
import torch
from torchvision.transforms import Compose, ToTensor, Normalize

import ray

## 1. When to use Ray Data

Use Ray Data to load and preprocess data for distributed ML workloads. Compared to other loading solutions, Datasets are more flexible and provide [higher overall performance](https://www.anyscale.com/blog/why-third-generation-ml-platforms-are-more-performant). Ray Data performs especially well when you need to run preprocessing in a **streaming fashion** across a **large dataset** on a **heterogeneous cluster of CPUs and GPUs**.


Use Datasets as a last-mile bridge from storage or ETL pipeline outputs to distributed applications and libraries in Ray. 

<img src='https://docs.ray.io/en/releases-2.34.0/_images/dataset-loading-1.svg' width=60%/>


## 2. Loading Data

Datasets uses Ray tasks to read data from remote storage. When reading from a file-based datasource (e.g., S3, GCS), it creates a number of read tasks proportional to the number of CPUs in the cluster. Each read task reads its assigned files and produces an output block:
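As a rough mental model of "each read task reads its assigned files", the sketch below splits a file listing across a fixed number of read tasks. The `assign_files` helper and the round-robin policy are assumptions made for illustration; Ray Data's actual assignment logic is internal and may differ.

```python
def assign_files(files: list[str], num_read_tasks: int) -> list[list[str]]:
    """Toy model: split a file listing into one chunk per read task."""
    # Never create more tasks than files, and always create at least one.
    num_read_tasks = max(1, min(num_read_tasks, len(files)))
    return [files[i::num_read_tasks] for i in range(num_read_tasks)]


# Hypothetical file names, only for illustration.
files = [f"img_{i}.png" for i in range(10)]
assignments = assign_files(files, num_read_tasks=4)

for task_id, chunk in enumerate(assignments):
    # In Ray Data, each read task would turn its files into an output block.
    print(task_id, chunk)
```

Every file ends up in exactly one chunk, so the read tasks can run in parallel without coordinating with each other.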

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/dataset-read-cropped-v2.svg" width="500px">

Let's load some `MNIST` data from S3.

In [None]:
# Here is our dataset. It contains 50 images per class.
!aws s3 ls s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/

We will use the `read_images` function to load the image data.

In [None]:
ds = ray.data.read_images("s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/", include_paths=True)
ds

Refer to the [Input/Output docs](https://docs.ray.io/en/latest/data/api/input_output.html) for a comprehensive list of read functions.

### Dataset

A Dataset consists of a list of Ray object references to *blocks*. Having multiple blocks in a dataset allows for parallel transformation and ingest.

The following figure visualizes a tabular dataset with three blocks, each holding 1000 rows:

<img src='https://docs.ray.io/en/releases-2.6.1/_images/dataset-arch.svg' width=50%/>

Since a Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets.

## 3. Transforming Data

Ray Data can use either Ray tasks or Ray actors to transform datasets. Using actors allows for expensive state initialization (e.g., for GPU-based tasks) to be cached.

Ray Data simplifies general purpose parallel GPU and CPU compute in Ray. 

Here is a sample data pipeline for streaming image data across a classification and segmentation model on a heterogeneous cluster of CPUs and GPUs.

<img src='https://docs.ray.io/en/releases-2.6.1/_images/stream-example.png' width=60%/>

To transform data, we can use the `map_batches` API. This API allows us to apply a transformation to each batch of data.
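To make the batch contract concrete, here is a pure-Python sketch of what "apply a function to each batch" means. The `toy_map_batches` helper is hypothetical and is not Ray's implementation; plain lists stand in for the NumPy arrays that the functions in this notebook receive.

```python
from typing import Callable

Batch = dict[str, list]  # column name -> values for every row in the batch


def toy_map_batches(
    rows: list[dict], fn: Callable[[Batch], Batch], batch_size: int
) -> list[dict]:
    """Toy model: group rows into columnar batches, apply fn, flatten again."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start : start + batch_size]
        # Rows -> columnar batch: one list per column name.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Columnar batch -> rows again.
        columns = list(result)
        num_rows = len(result[columns[0]])
        out.extend({c: result[c][i] for c in columns} for i in range(num_rows))
    return out


def double_value(batch: Batch) -> Batch:
    # Add a new column computed from an existing one.
    batch["doubled"] = [v * 2 for v in batch["value"]]
    return batch


rows = [{"value": i} for i in range(5)]
result_rows = toy_map_batches(rows, double_value, batch_size=2)
print(result_rows)
```

The function sees a dictionary of columns rather than individual rows, which is why the transformations in this notebook index into `batch["image"]` or `batch["path"]`.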

In [None]:
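def normalize(
    batch: dict[str, np.ndarray], min_: float, max_: float
) -> dict[str, np.ndarray]:
    # min_ and max_ arrive through fn_kwargs; they are unused here because
    # ToTensor already rescales uint8 pixels from [0, 255] to [0, 1].
    # Normalize((0.5,), (0.5,)) then maps [0, 1] to [-1, 1].
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
    batch["image"] = [transform(image) for image in batch["image"]]
    return batch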


ds_normalized = ds.map_batches(normalize, fn_kwargs={"min_": 0, "max_": 255})
ds_normalized

### Execution mode

Most transformations are **lazy**. They don't execute until you write a dataset to storage or decide to materialize/consume the dataset.

To materialize a very small subset of the data, you can use the `take_batch` method.
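Lazy execution can be illustrated with plain Python iterators. This is only an analogy, not how Ray Data works internally: Ray Data processes whole blocks, so consuming a small batch may compute more rows than you asked for.

```python
from itertools import islice

calls = []  # records which items were actually transformed


def transform(x: int) -> int:
    calls.append(x)
    return x * 10


# Building the pipeline runs nothing yet: map() returns a lazy iterator.
pipeline = map(transform, range(1_000_000))
assert calls == []

# Consuming a small batch pulls only the items it needs.
first_batch = list(islice(pipeline, 3))
print(first_batch)  # [0, 10, 20]
print(calls)  # [0, 1, 2]
```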

In [None]:
normalized_batch = ds_normalized.take_batch(batch_size=10)

for image in normalized_batch["image"]:
    assert image.shape == (1, 28, 28) # channel, height, width
    assert image.min() >= -1 and image.max() <= 1 # normalized to [-1, 1]

<div class="alert alert-block alert-info">

### Activity: Add the ground truth label using the image path.

In this activity, you will add the ground truth label using the image path.

The image path is in the format of `s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/{label}/{image_id}.png`.

See the suggested code below:

```python
# Hint: define the add_label function

ds_labeled = ds_normalized.map_batches(add_label)
labeled_batch = ds_labeled.take_batch(10)
print(labeled_batch["ground_truth_label"])
```


</div>


In [None]:
# Write your solution here

<div class="alert alert-block alert-info">

<details>

<summary>Click to view solution</summary>

```python
def add_label(batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    batch["ground_truth_label"] = [int(path.split("/")[-2]) for path in batch["path"]]
    return batch

ds_labeled = ds_normalized.map_batches(add_label)
labeled_batch = ds_labeled.take_batch(10)
print(labeled_batch["ground_truth_label"])
```

</details>  
</div>
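The solution relies on the label being the second-to-last component of the path. A quick check of that parsing step on a single string is sketched below; the `label_from_path` helper and the file name in `example_path` are made up for illustration.

```python
def label_from_path(path: str) -> int:
    # The label is the second-to-last path component.
    return int(path.split("/")[-2])


# Hypothetical path that follows the documented layout; the image id is invented.
example_path = (
    "s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/7/example.png"
)
print(label_from_path(example_path))  # 7
```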

### Stateful transformations with actors

In cases like batch inference, you want to spin up a number of actor processes that are initialized once with your model and reused to process multiple batches.

To implement this, you can pass a callable class to the `map_batches` API. The class implements:

- `__init__`: Initialize any expensive state.
- `__call__`: Perform the stateful transformation.

For example, we can implement a `MNISTClassifier` that:
- loads a pre-trained model from a local file
- accepts a batch of images and generates the predicted label

In [None]:
class MNISTClassifier:
    def __init__(self, local_path: str):
        self.model = torch.jit.load(local_path)
        self.model.to("cuda")
        self.model.eval()

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        images = torch.tensor(batch["image"]).float().to("cuda")

        with torch.no_grad():
            logits = self.model(images).cpu().numpy()

        batch["predicted_label"] = np.argmax(logits, axis=1)
        return batch

In [None]:
# Download the model from S3 to EFS storage
!aws s3 cp s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt /mnt/cluster_storage/model.pt

We can now use the `map_batches` API to apply the transformation to each batch of data.

In [None]:
ds_preds = ds_normalized.map_batches(
    MNISTClassifier,
    fn_constructor_kwargs={"local_path": "/mnt/cluster_storage/model.pt"},
    num_gpus=0.1,
    concurrency=1,
    batch_size=100,
)

<div class="alert alert-block alert-warning">

<b>Note:</b> We pass in the Callable class uninitialized. Ray will pass in the arguments to the class constructor when the class is actually used in a transformation.

</div>
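A simplified mental model of what each actor does with the class is sketched below: construct it once, then call the same instance on every batch. `ToyClassifier` and `run_worker` are hypothetical stand-ins, not Ray APIs, and the "model" is just an integer offset.

```python
class ToyClassifier:
    init_count = 0  # class-level counter, only for demonstration

    def __init__(self, offset: int):
        # Stands in for expensive setup such as loading model weights.
        ToyClassifier.init_count += 1
        self.offset = offset

    def __call__(self, batch: dict[str, list]) -> dict[str, list]:
        batch["predicted_label"] = [x + self.offset for x in batch["image"]]
        return batch


def run_worker(cls, constructor_kwargs: dict, batches: list) -> list:
    # Construct once, then reuse the same instance for every batch.
    instance = cls(**constructor_kwargs)
    return [instance(batch) for batch in batches]


batches = [{"image": [1, 2]}, {"image": [3]}, {"image": [4, 5, 6]}]
outputs = run_worker(ToyClassifier, {"offset": 10}, batches)

print(ToyClassifier.init_count)  # 1
print([b["predicted_label"] for b in outputs])  # [[11, 12], [13], [14, 15, 16]]
```

The constructor runs once for three batches, which is the saving that actors provide when initialization is expensive.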

In [None]:
batch_preds = ds_preds.take_batch(100)

In [None]:
batch_preds

## 4. Materializing Data

You can choose to materialize the entire dataset into the Ray object store, which is distributed across the cluster. The object store holds data primarily in memory and spills to disk when memory runs out.

To materialize the dataset, we can use the `materialize()` method.

Use this **only** when you require the full dataset to compute downstream outputs.

In [None]:
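# materialize() returns a new, materialized dataset; reassign it so later cells reuse the cached blocks
ds_preds = ds_preds.materialize()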

## 5. Data Operations: Grouping, Aggregation, and Shuffling

Let's look at some more involved transformations.

### Custom batching using `groupby`

In case you want to generate batches according to a specific key, you can use `groupby` to group the data by the key and then use `map_groups` to apply the transformation.

For instance, let's compute the accuracy of the model by "ground truth label".

In [None]:
def add_label(batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    batch["ground_truth_label"] = [int(path.split("/")[-2]) for path in batch["path"]]
    return batch


def compute_accuracy(group: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    return {
        "accuracy": [np.mean(group["predicted_label"] == group["ground_truth_label"])],
        "ground_truth_label": group["ground_truth_label"][:1],
    }


ds_preds.map_batches(add_label).groupby("ground_truth_label").map_groups(compute_accuracy).to_pandas()

<div class="alert alert-block alert-warning">

<b>Note:</b> ds_preds is not re-computed given we have already materialized the dataset.

</div>
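The grouping logic can be checked by hand on a few rows. The sketch below reproduces the per-label accuracy computation in plain Python; the five rows are invented for illustration and are not real model output.

```python
from collections import defaultdict

# Invented rows, only for illustration.
rows = [
    {"ground_truth_label": 0, "predicted_label": 0},
    {"ground_truth_label": 0, "predicted_label": 0},
    {"ground_truth_label": 1, "predicted_label": 1},
    {"ground_truth_label": 1, "predicted_label": 7},
    {"ground_truth_label": 2, "predicted_label": 3},
]

# Group rows by key, as groupby("ground_truth_label") does conceptually.
groups = defaultdict(list)
for row in rows:
    groups[row["ground_truth_label"]].append(row)

# Apply one function per group, as map_groups does conceptually.
accuracy_by_label = {
    label: sum(r["predicted_label"] == label for r in members) / len(members)
    for label, members in groups.items()
}
print(accuracy_by_label)  # {0: 1.0, 1: 0.5, 2: 0.0}
```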

### Aggregations

Ray Data also supports a variety of aggregations. For instance, we can compute the mean accuracy across the entire dataset.

In [None]:
ds_preds.map_batches(add_label).map_batches(compute_accuracy).mean(on="accuracy")

As of version 2.34.0, Ray Data provides the following aggregation functions:

- `count`
- `max`
- `mean`
- `min`
- `sum`
- `std`

See the [relevant docs page](https://docs.ray.io/en/latest/data/api/grouped_data.html#ray.data.aggregate.AggregateFn).
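One caveat about averaging per-batch accuracies, as the `mean(on="accuracy")` call in this section does: the result equals the overall accuracy only when every batch has the same number of rows. The numbers below are hypothetical and show how the two can diverge with unequal batches; whether this affects the notebook's result depends on how the dataset happens to be split into batches.

```python
# (number correct, batch size) for two hypothetical batches of unequal size
batches = [(1, 2), (8, 8)]

per_batch_accuracy = [correct / size for correct, size in batches]
mean_of_batch_accuracies = sum(per_batch_accuracy) / len(per_batch_accuracy)

# Weighting by batch size gives the accuracy over all rows.
overall_accuracy = sum(c for c, _ in batches) / sum(s for _, s in batches)

print(mean_of_batch_accuracies)  # 0.75
print(overall_accuracy)  # 0.9
```

If exact overall accuracy matters, one option is to compute a per-row correctness column and take its mean instead.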

### Shuffling data 

Ray Data offers several ways to shuffle data, with varying degrees of randomness and performance.

#### File based shuffle on read

To randomly shuffle the ordering of input files before reading, call a read function that supports shuffling, such as `read_images()`, and pass the `shuffle="files"` parameter.

In [None]:
ray.data.read_images("s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/", shuffle="files")

#### Shuffling block order
This option randomizes the order of blocks in a dataset. Blocks are the basic units of data that Ray Data stores in the object store. On its own, this operation doesn't involve heavy computation or communication. However, it requires Ray Data to materialize all blocks in memory before applying the operation, so only use it when your dataset is small enough to fit into object store memory.

To perform block order shuffling, use `randomize_block_order`.

In [None]:
ds_randomized_blocks = ds_preds.randomize_block_order()
ds_randomized_blocks.materialize()

#### Shuffle all rows globally
To randomly shuffle all rows globally, call `random_shuffle()`. This is the slowest shuffle option because it requires transferring data across the network between workers, but it achieves the best randomness of all the options.
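The difference in randomness between block-order shuffling and a global row shuffle can be seen in a toy model, with nested lists standing in for blocks. This only illustrates the resulting orderings; it says nothing about how Ray Data moves data between workers.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Block-order shuffle: blocks move, but rows inside each block keep their order.
block_shuffled = blocks[:]
rng.shuffle(block_shuffled)

# Global shuffle: every row can land anywhere.
rows = [row for block in blocks for row in block]
rng.shuffle(rows)

print(block_shuffled)
print(rows)
```

After a block-order shuffle, neighboring rows are still neighbors, which is why it is cheaper but less random than a global shuffle.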


In [None]:
ds_randomized_rows = ds_preds.random_shuffle()
ds_randomized_rows.materialize()

## 6. Persisting Data

Finally, you can persist a dataset to storage using any of the "write" functions that Ray Data supports.

Let's write our predictions to a Parquet dataset.

In [None]:
ds_preds.write_parquet("/mnt/cluster_storage/mnist_preds")

Refer to the [Input/Output docs](https://docs.ray.io/en/latest/data/api/input_output.html) for a comprehensive list of write functions.

In [None]:
# cleanup
!rm -rf /mnt/cluster_storage/mnist_preds