# Intro to Ray Data

## Ray Datasets: Distributed Data Preprocessing

Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. They provide basic distributed data transformations such as maps ([`map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html#ray.data.Dataset.map_batches "ray.data.Dataset.map_batches")), global and grouped aggregations ([`GroupedDataset`](https://docs.ray.io/en/latest/data/api/doc/ray.data.grouped_dataset.GroupedDataset.html#ray.data.grouped_dataset.GroupedDataset "ray.data.grouped_dataset.GroupedDataset")), and shuffling operations ([`random_shuffle`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.random_shuffle.html#ray.data.Dataset.random_shuffle "ray.data.Dataset.random_shuffle"), [`sort`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.sort.html#ray.data.Dataset.sort "ray.data.Dataset.sort"), [`repartition`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.repartition.html#ray.data.Dataset.repartition "ray.data.Dataset.repartition")), and are compatible with a variety of file formats, data sources, and distributed frameworks.

Here's an overview of the integrations with other processing frameworks, file formats, and supported operations, as well as a glimpse at the Ray Datasets API.

Check the [Input/Output reference](https://docs.ray.io/en/latest/data/api/input_output.html#input-output) to see if your favorite format is already supported.

### Datasets for Parallel Compute

Datasets also simplify general-purpose parallel GPU and CPU compute in Ray, for instance [GPU batch inference](https://docs.ray.io/en/latest/ray-overview/use-cases.html#ref-use-cases-batch-infer). They provide a higher-level API over Ray tasks and actors for such embarrassingly parallel compute, internally handling operations like batching, pipelining, and memory management.

<img src='https://docs.ray.io/en/releases-2.6.1/_images/stream-example.png' width=60%/>

As part of the Ray ecosystem, Ray Datasets can leverage the full functionality of Ray's distributed scheduler, e.g., using actors for optimizing setup time and GPU scheduling.

In [None]:
import ray

ds = ray.data.read_csv("s3://anyscale-public-materials/ray-ai-libraries/Iris.csv")

ds

In [None]:
ds.show(3)

In [None]:
ds.write_parquet('/mnt/cluster_storage/parquet_iris')

<div class="alert alert-block alert-warning">
<b>Note</b> 

We write to `/mnt/cluster_storage`, a path that is available on all nodes in the cluster. If you instead use a path that is local to only one of the nodes in a multi-node cluster, you will see errors like `FileNotFoundError: [Errno 2] No such file or directory: '/path/to/file'`.

This is because Ray Data is designed to work with distributed storage systems like S3, HDFS, etc. If you want to write to local storage, you can add a special prefix `local://` to the path. In this case, Ray will only schedule tasks on the head node of the cluster.
</div>

In [None]:
! ls -l /mnt/cluster_storage/parquet_iris/

### Transforming Data

Datasets can use either Ray tasks or Ray actors to transform datasets. Using actors allows for expensive state initialization (e.g., for GPU-based tasks) to be cached.

In [None]:
ds = ds.repartition(5)

ds

In [None]:
ds.take_batch(5)

### Shuffling Data

Certain operations like *sort* or *groupby* require data blocks to be partitioned by value, or *shuffled*. Ray Data implements distributed shuffles with tasks in a map-reduce style: map tasks partition blocks by value, and reduce tasks then merge co-partitioned blocks together.
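
The map-reduce pattern described above can be sketched in plain Python. This is only a conceptual illustration, not Ray Data's implementation: blocks are modeled as lists, the helper names (`map_partition`, `reduce_merge`, `shuffle_sort`) are hypothetical, and the range boundaries are chosen by hand rather than sampled from the data.

```python
from bisect import bisect_right


def map_partition(block, boundaries):
    """Map task: split one block into len(boundaries) + 1 partitions by value."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for value in block:
        # bisect_right picks the range partition this value belongs to.
        parts[bisect_right(boundaries, value)].append(value)
    return parts


def reduce_merge(co_partitioned):
    """Reduce task: merge the pieces that share a partition index, then sort."""
    merged = [value for piece in co_partitioned for value in piece]
    return sorted(merged)


def shuffle_sort(blocks, boundaries):
    """Run every map task, then one reduce task per output partition."""
    mapped = [map_partition(block, boundaries) for block in blocks]
    num_partitions = len(boundaries) + 1
    return [reduce_merge([m[i] for m in mapped]) for i in range(num_partitions)]


blocks = [[5, 1, 9], [3, 8, 2], [7, 4, 6]]
sorted_blocks = shuffle_sort(blocks, boundaries=[3, 6])
print(sorted_blocks)  # [[1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Every reduce task reads a piece from every map task's output, which is why a shuffle moves far more data than a per-block transformation.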

You can also change just the number of blocks of a Dataset using [`repartition()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.repartition.html#ray.data.Dataset.repartition "ray.data.Dataset.repartition"). Repartition has two modes:

1.  `shuffle=False` - performs the minimal data movement needed to equalize block sizes __(the default)__

2.  `shuffle=True` - performs a full distributed shuffle __(not the default because it's more expensive)__

<img src='https://docs.ray.io/en/releases-2.6.1/_images/dataset-shuffle.svg' width=60%/>

Distributed shuffles in Ray Data can scale to hundreds of terabytes of data. See the [Performance Tips Guide](https://docs.ray.io/en/latest/data/performance-tips.html#shuffle-performance-tips) for an in-depth guide on shuffle performance.

In [None]:
def transform_batch(batch):
    # With the default batch format, each column is a NumPy array, so the
    # product is computed element-wise for every row in the batch at once.
    batch["approximate_petal_area"] = batch["PetalLengthCm"] * batch["PetalWidthCm"]
    return batch

ds.map_batches(transform_batch).show(5)

In [None]:
ds.map_batches(transform_batch).take_batch(5)

### More example operations: transforming Datasets

Dataset transformations take in datasets and produce new datasets. For example, *map* is a transformation that applies a user-defined function to each dataset record and returns a new dataset as the result. Transformations can be composed to express a chain of computations.

There are two main types of transformations:

-   One-to-one: each input block will contribute to only one output block, such as [`ds.map_batches()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html#ray.data.Dataset.map_batches "ray.data.Dataset.map_batches").

-   All-to-all: input blocks can contribute to multiple output blocks, such as [`ds.random_shuffle()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.random_shuffle.html#ray.data.Dataset.random_shuffle "ray.data.Dataset.random_shuffle").
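
The difference between the two types can be illustrated with a small plain-Python sketch. It is a conceptual model only, not Ray Data code: blocks are lists, each row is tagged with the index of the block it came from, and the helper names (`one_to_one`, `all_to_all`, `sources`) are hypothetical. The all-to-all step uses a deterministic round-robin redistribution instead of a random shuffle so the result is reproducible.

```python
def one_to_one(blocks, fn):
    """Each output block is computed from exactly one input block."""
    return [[(src, fn(value)) for src, value in block] for block in blocks]


def all_to_all(blocks, num_outputs):
    """Rows from any input block may land in any output block (round-robin)."""
    outputs = [[] for _ in range(num_outputs)]
    rows = [row for block in blocks for row in block]
    for i, row in enumerate(rows):
        outputs[i % num_outputs].append(row)
    return outputs


def sources(block):
    """Return the sorted indices of the input blocks that fed this block."""
    return sorted({src for src, _ in block})


# Rows are (source_block_index, value) pairs.
blocks = [[(0, 1), (0, 2)], [(1, 3), (1, 4)]]

mapped = one_to_one(blocks, lambda v: v * 10)
moved = all_to_all(blocks, num_outputs=2)

print([sources(b) for b in mapped])  # [[0], [1]]
print([sources(b) for b in moved])   # [[0, 1], [0, 1]]
```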

Here is a table listing some common transformations supported by Ray Datasets.

Common Ray Datasets transformations.

| Transformation | Type | Description |
| --- | --- | --- |
|[`ds.map_batches()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html#ray.data.Dataset.map_batches "ray.data.Dataset.map_batches")|One-to-one|Apply a given function to batches of records of this dataset.|
|[`ds.add_column()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.add_column.html#ray.data.Dataset.add_column "ray.data.Dataset.add_column")|One-to-one|Apply a given function to batches of records to create a new column.|
|[`ds.drop_columns()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.drop_columns.html#ray.data.Dataset.drop_columns "ray.data.Dataset.drop_columns")|One-to-one|Drop the given columns from the dataset.|
|[`ds.split()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.split.html#ray.data.Dataset.split "ray.data.Dataset.split")|One-to-one|Split the dataset into N disjoint pieces.|
|[`ds.repartition(shuffle=False)`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.repartition.html#ray.data.Dataset.repartition "ray.data.Dataset.repartition")|One-to-one|Repartition the dataset into N blocks, without shuffling the data.|
|[`ds.repartition(shuffle=True)`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.repartition.html#ray.data.Dataset.repartition "ray.data.Dataset.repartition")|All-to-all|Repartition the dataset into N blocks, shuffling the data during repartition.|
|[`ds.random_shuffle()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.random_shuffle.html#ray.data.Dataset.random_shuffle "ray.data.Dataset.random_shuffle")|All-to-all|Randomly shuffle the elements of this dataset.|
|[`ds.sort()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.sort.html#ray.data.Dataset.sort "ray.data.Dataset.sort")|All-to-all|Sort the dataset by a sortkey.|
|[`ds.groupby()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.groupby.html#ray.data.Dataset.groupby "ray.data.Dataset.groupby")|All-to-all|Group the dataset by a groupkey.|

> __Tip__
>
> Ray Data also provides the convenience transformation methods [`ds.map()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map.html#ray.data.Dataset.map "ray.data.Dataset.map"), [`ds.flat_map()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.flat_map.html#ray.data.Dataset.flat_map "ray.data.Dataset.flat_map"), and [`ds.filter()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.filter.html#ray.data.Dataset.filter "ray.data.Dataset.filter"). They operate on one record at a time, so they are not vectorized and are slower than [`ds.map_batches()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html#ray.data.Dataset.map_batches "ray.data.Dataset.map_batches"), but they may be useful for development.

In [None]:
# Group by species and count the rows in each group.
ds.groupby("Species").count().show()

### Execution mode

Most transformations in Ray Data are **lazy**: they don't execute until you either:
- **write a dataset to storage**
- explicitly **materialize** the data with [`Dataset.materialize()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.materialize.html#ray.data.Dataset.materialize "ray.data.Dataset.materialize")
- **iterate over the dataset** (usually when feeding data to model training).

To materialize only a very small subset of the data, for example to inspect a few rows, you can use the `take_batch` method.

Transformations execute in a streaming fashion: data is processed incrementally, and operators run in parallel. For an in-depth guide on Datasets execution, read the [Ray Data internals guide](https://docs.ray.io/en/releases-2.8.0/data/data-internals.html).
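
Lazy, streaming execution can be pictured with ordinary Python generators. The sketch below is an analogy under simplifying assumptions, not how Ray Data is implemented: it runs on a single thread, processes rows rather than blocks, and the `read` and `square` stages are hypothetical stand-ins for a data source and a transformation.

```python
from itertools import islice

executed = []  # records the work each stage actually performed


def read(n):
    """Source stage: produce rows one at a time."""
    for i in range(n):
        executed.append(("read", i))
        yield i


def square(rows):
    """Transform stage: consume rows incrementally as they arrive."""
    for row in rows:
        executed.append(("square", row))
        yield row * row


# Building the pipeline does no work yet, much like chaining lazy transformations.
pipeline = square(read(1_000_000))
work_before_consumption = len(executed)

# Consuming three rows pulls only three rows through every stage.
first_three = list(islice(pipeline, 3))

print(work_before_consumption)  # 0
print(first_three)              # [0, 1, 4]
print(len(executed))            # 6
```

Only six stage invocations run even though the source could produce a million rows, which is the same reason inspecting a small sample of a lazy dataset is cheap.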

> To force computation and cache the entire dataset in Ray's object store, you can use `materialize`. Consider the memory cost and performance impact of caching, though.

In [None]:
ds.materialize()

### Transforming data with actors

When using the actor compute strategy, per-row and per-batch UDFs can also be *callable classes*, i.e. classes that implement the `__call__` magic method. The constructor of the class can be used for stateful setup, and is invoked only once per worker actor.

<div class="alert alert-block alert-warning">
<b>Note:</b> These transformation APIs take the uninstantiated callable class as an argument, not an instance of the class.
</div>

In [None]:
class ModelExample:
    def __init__(self):
        expensive_model_weights = [ 0.3, 1.75 ]
        self.complex_model = lambda petal_width: (petal_width + expensive_model_weights[0])  ** expensive_model_weights[1]

    def __call__(self, batch):
        batch["predictions"] = self.complex_model(batch["PetalWidthCm"])
        return batch

ds.map_batches(ModelExample, concurrency=2).show(10)
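
To see why the API takes the class itself rather than an instance, consider this plain-Python sketch of a worker pool. It is a simplified model, not Ray's actor pool: workers are ordinary objects in one process, batches are assigned round-robin, and `AddOffset` and `run_with_pool` are hypothetical names used only for illustration.

```python
class AddOffset:
    instances_created = 0  # class-level counter, only for this illustration

    def __init__(self):
        # Stand-in for expensive setup such as loading model weights.
        AddOffset.instances_created += 1
        self.offset = 100

    def __call__(self, batch):
        return [x + self.offset for x in batch]


def run_with_pool(udf_class, batches, concurrency):
    """Instantiate the class once per worker, then reuse it for many batches."""
    workers = [udf_class() for _ in range(concurrency)]
    return [workers[i % concurrency](batch) for i, batch in enumerate(batches)]


batches = [[1, 2], [3, 4], [5, 6], [7, 8]]
results = run_with_pool(AddOffset, batches, concurrency=2)

print(results)                      # [[101, 102], [103, 104], [105, 106], [107, 108]]
print(AddOffset.instances_created)  # 2
```

Four batches are processed but the constructor runs only twice, once per worker, because the pool receives the class and decides when to instantiate it.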

<div class="alert alert-block alert-success">
    
__Lab activity: Stateless transformation__
    
1. Create a Ray Dataset from the iris data in `/mnt/cluster_storage/parquet_iris`
1. Create a "sum of features" transformation that calculates the sum of the Sepal Length, Sepal Width, Petal Length, and Petal Width features for the records
    1. Design this transformation to take a Ray Dataset *batch* of records
    1. Return the records without the `Id` column but with an additional column called "sum"
    1. Hint: you do not need to use NumPy, but the calculation may be easier/simpler to code using NumPy vectorized operations with the records in the batch
</div>
