# Introduction to Batch Training on Ray AIR/Data

### Learning objectives
In this tutorial, you will learn about:
 * [Ray Dataset](#dataset)
 * [AIR Trainer](#trainer)
 * [Ray Tune](#tune)

Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting of simple models on multiple data batches corresponding to locations, products, etc. This notebook showcases how to conduct batch training using [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html), [Ray AIR Trainers](https://docs.ray.io/en/master/ray-air/trainer.html#air-trainers), and [Ray Tune](https://docs.ray.io/en/master/ray-air/tuner.html).

For the data, we will use the [NYC Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). This popular tabular dataset contains historical taxi pickups by timestamp and location in NYC.
The goal is to perform batch training and tuning using scikit-learn, treating this as a simple regression problem that predicts `trip_duration`.

A separate model will be trained for each pickup location. We can use the pickup_location_id column in the dataset to group the dataset into data batches. We will then fit models for each batch and choose the best one.
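
To make the idea concrete, here is a minimal sketch of batch training in plain pandas and NumPy, without Ray. It assumes a hypothetical toy table with a `pickup_location_id` column and fits one least-squares line per location. NumPy's `lstsq` stands in for scikit-learn's `LinearRegression` only to keep the sketch dependency-light; the data values and the helper name `fit_one_batch` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two pickup locations with different linear trends.
toy_df = pd.DataFrame(
    {
        "pickup_location_id": [1, 1, 1, 1, 2, 2, 2, 2],
        "trip_distance": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
        "trip_duration": [310.0, 610.0, 910.0, 1210.0, 520.0, 1020.0, 1520.0, 2020.0],
    }
)


def fit_one_batch(batch: pd.DataFrame) -> dict:
    """Fit trip_duration ~ slope * trip_distance + intercept on one batch."""
    # Design matrix with a column of ones for the intercept.
    X = np.column_stack([batch["trip_distance"].to_numpy(), np.ones(len(batch))])
    y = batch["trip_duration"].to_numpy()
    (slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
    return {"slope": float(slope), "intercept": float(intercept)}


# One independent model per data batch (here: per pickup location).
models = {
    location_id: fit_one_batch(batch)
    for location_id, batch in toy_df.groupby("pickup_location_id")
}

for location_id, params in models.items():
    print(location_id, params)
```

Because each batch is fitted independently of the others, the per-batch fits can run in parallel, which is the part Ray takes over in the rest of this notebook.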

Let’s start by importing a few required libraries, including open-source [Ray](https://github.com/ray-project/ray) itself.

In [1]:
import sys, os
import pandas as pd
# import numpy as np
import pyarrow as pa
import ray

# Local code
# sys.path.insert( 0, os.path.abspath("../local_util") )
# import dataprep  # From this repository's SSML/local_utils folder
# import utility functions
from local_utils import dataprep

# import os, warnings
# warnings.filterwarnings("ignore")

num_available_cpus = os.cpu_count()
print(f'Number of CPUs in this system: {num_available_cpus}')

# # AWS
# import boto3              # AWS SDK for Python
# import s3fs               # AWS SDK for s3-to-pandas 

Number of CPUs in this system: 8


# Data <a class="anchor" id="dataset"></a>

Next, read some data using Ray Dataset.   This will initialize a Ray cluster.  Then we can use the [Ray Dataset](https://docs.ray.io/en/latest/data/getting-started.html#datasets-getting-started) APIs to quickly inspect the data.

In [2]:
# Read some Parquet files in parallel.
data_files = \
[
    "s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_01_data.parquet",
    # "s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_02_data.parquet"
]

ds = ray.data.read_parquet(data_files)
print(type(ds))

2022-10-08 08:28:43,023	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266


<class 'ray.data.dataset.Dataset'>


In [3]:
# Parquet stores the number of rows per file in the Parquet metadata, 
# so we can get the number of rows in ds without triggering a full data read!
print(f"Number rows: {ds.count()}")

# Display some metadata about the dataset.
print("\nMetadata: ")
print(ds)

# Fetch the schema from the underlying Parquet metadata.
print("\nSchema:")
print(ds.schema())

# Take a peek at a single row
print("\nLook at a sample row:")
ds.take(1)

Number rows: 1410617

Metadata: 
Dataset(num_blocks=1, num_rows=1410617, schema={vendor_id: string, pickup_at: timestamp[us], dropoff_at: timestamp[us], passenger_count: int8, trip_distance: float, pickup_longitude: float, pickup_latitude: float, rate_code_id: null, store_and_fwd_flag: string, dropoff_longitude: float, dropoff_latitude: float, payment_type: string, fare_amount: float, extra: float, mta_tax: float, tip_amount: float, tolls_amount: float, total_amount: float})

Schema:
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2524

Look at a sample row:


[ArrowRow({'vendor_id': 'VTS',
           'pickup_at': datetime.datetime(2009, 1, 21, 14, 58),
           'dropoff_at': datetime.datetime(2009, 1, 21, 15, 3),
           'passenger_count': 1,
           'trip_distance': 0.5299999713897705,
           'pickup_longitude': -73.99270629882812,
           'pickup_latitude': 40.7529411315918,
           'rate_code_id': None,
           'store_and_fwd_flag': None,
           'dropoff_longitude': -73.98814392089844,
           'dropoff_latitude': 40.75956344604492,
           'payment_type': 'CASH',
           'fare_amount': 4.5,
           'extra': 0.0,
           'mta_tax': None,
           'tip_amount': 0.0,
           'tolls_amount': 0.0,
           'total_amount': 4.5})]

Normally there is some data exploration to determine the cleaning steps.  Let's just assume we know the data cleaning steps are:
- Drop rows with non-positive trip distances, zero fares, zero passengers, or trip durations under 1 minute
- Drop 2 unknown zones ['264', '265']
- Calculate trip duration (in seconds in the code below) and add it as a new column
- Groupby, aggregate sum taxi rides, hourly per pickup location


In [4]:
# A Pandas DataFrame UDF for transforming the underlying blocks of a Dataset in parallel.
def transform_batch(the_df: pd.DataFrame) -> pd.DataFrame:
    df = the_df.copy()
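    # Trip duration in seconds. Note: `.dt.seconds` is only the seconds component of the
    # timedelta (0-86399); `.dt.total_seconds()` would also cover multi-day or negative durations.
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds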
    df = df[df["trip_distance"] > 0]
    df = df[df["fare_amount"] > 0]
    df = df[df["passenger_count"] > 0]
    df = df[df["trip_duration"] >= 60]
    return df

# batch_format="pandas" tells Datasets to provide the transformer with blocks
# represented as Pandas DataFrames.
print(ds.count())
ds = ds.map_batches(transform_batch, batch_format="pandas")

# verify row count
ds_rows = ds.count()
print(f"Final number rows: {ds_rows}")

# approx 20K rows were dropped with that cleaning


1410617


Read->Map_Batches: 100%|██████████████████████████| 1/1 [00:11<00:00, 11.17s/it]

Final number rows: 1390337
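
The cell above covers only the row filters. The remaining two steps from the list (dropping the unknown zones and aggregating rides hourly per pickup location) need a location column that the schema printed earlier does not show, so the following is only a sketch on hypothetical toy data with an assumed `pickup_location_id` column.

```python
import pandas as pd

# Hypothetical toy rides; `pickup_location_id` is an assumed column name.
rides = pd.DataFrame(
    {
        "pickup_at": pd.to_datetime(
            [
                "2009-01-21 14:05",
                "2009-01-21 14:40",
                "2009-01-21 15:10",
                "2009-01-21 14:20",
                "2009-01-21 14:30",
            ]
        ),
        "pickup_location_id": ["100", "100", "100", "264", "7"],
    }
)

# Drop the two unknown zones.
unknown_zones = ["264", "265"]
rides = rides[~rides["pickup_location_id"].isin(unknown_zones)]

# Truncate each timestamp to the start of its hour, then count rides
# per pickup location and hour.
hourly = (
    rides.assign(pickup_hour=rides["pickup_at"].dt.floor("60min"))
    .groupby(["pickup_location_id", "pickup_hour"])
    .size()
    .reset_index(name="num_rides")
)
print(hourly)
```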





<b>Filter on Read - Projection and Filter Pushdown</b>

Note that Ray Datasets' Parquet reader supports projection (column selection) and row filter pushdown, where we can push the above column selection and the row-based filter to the Parquet read. If we specify column selection at Parquet read time, the unselected columns won't even be read from disk!

The row-based filter is specified via [Arrow's dataset field expressions](https://arrow.apache.org/docs/6.0/python/generated/pyarrow.dataset.Expression.html#pyarrow.dataset.Expression). See the Ray Datasets feature guide on reading Parquet data for more information.


In [5]:
import pyarrow as pa
import pyarrow.dataset  # explicitly load the submodule so that pa.dataset is available
filter_expr = (
    (pa.dataset.field("passenger_count") > 0)
    & (pa.dataset.field("trip_distance") > 0)
    & (pa.dataset.field("fare_amount") > 0)
)

pushdown_ds = ray.data.read_parquet(
    data_files,
    columns=['pickup_at', 'dropoff_at',
    'passenger_count', 'trip_distance', 'fare_amount'], 
    filter=filter_expr,
)

# Force full execution of the read for every file in data_files.
pushdown_ds = pushdown_ds.fully_executed()

print(f"Number rows: {pushdown_ds.count()}")
# Display some metadata about the dataset.
print("\nMetadata: ")
print(pushdown_ds)
# Fetch the schema from the underlying Parquet metadata.
print("\nSchema:")
print(pushdown_ds.schema())
# Take a peek at a single row
print("\nLook at a sample row:")
pushdown_ds.take(1)


Read progress: 100%|██████████████████████████████| 1/1 [00:04<00:00,  4.41s/it]

Number rows: 1398850

Metadata: 
Dataset(num_blocks=1, num_rows=1398850, schema={pickup_at: timestamp[us], dropoff_at: timestamp[us], passenger_count: int8, trip_distance: float, fare_amount: float})

Schema:
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
fare_amount: float
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2524

Look at a sample row:





[ArrowRow({'pickup_at': datetime.datetime(2009, 1, 21, 14, 58),
           'dropoff_at': datetime.datetime(2009, 1, 21, 15, 3),
           'passenger_count': 1,
           'trip_distance': 0.5299999713897705,
           'fare_amount': 4.5})]

<b>Custom data transform functions</b>

Ray Datasets lets you write custom data transform functions using familiar syntax, such as Pandas. These <b>custom functions, or UDFs,</b> can be called using `ds.map_batches(my_UDF, batch_format="pandas")`. Use the `batch_format` parameter to specify the format in which batches are passed to your UDF (here, Pandas DataFrames).

TODO: Reference link for syntax supported in Datasets UDFs

In [6]:
# perform a simpler filter and compare 
# A Pandas DataFrame UDF for transforming the underlying blocks of a Dataset in parallel.
def transform_batch2(the_df: pd.DataFrame) -> pd.DataFrame:
    df = the_df.copy()
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds
    df = df[df["trip_duration"] >= 60]
    df = df.drop(columns=["dropoff_at", "pickup_at"])
    return df

# batch_format="pandas" tells Datasets to provide the transformer with blocks
# represented as Pandas DataFrames.
print(pushdown_ds.count())
pushdown_ds = pushdown_ds.map_batches(transform_batch2, batch_format="pandas")

# verify row count
pushdown_rows = pushdown_ds.count()
print(f"Final number rows: {pushdown_rows}")
assert ds_rows == pushdown_rows

# Replace ds with pushdown
ds = pushdown_ds


1398850


Map_Batches: 100%|████████████████████████████████| 1/1 [00:00<00:00,  1.90it/s]

Final number rows: 1390337





<b>Random shuffle</b>

Randomly shuffling data is an important part of training machine learning models: it decorrelates samples, which helps reduce overfitting and improve generalization. For many models, even between-epoch shuffling can drastically improve the precision gain per step/epoch. Datasets provides a scalable distributed random shuffle that lets you realize the model accuracy benefits of per-epoch shuffling without sacrificing training throughput, even at large data scales and even when doing distributed data-parallel training across multiple GPUs/nodes.

In [7]:
# do a full global random shuffle to decorrelate the data
ds = ds.random_shuffle()

Shuffle Map: 100%|████████████████████████████████| 1/1 [00:00<00:00, 12.74it/s]
Shuffle Reduce: 100%|█████████████████████████████| 1/1 [00:00<00:00, 14.71it/s]
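
As a rough illustration of per-epoch shuffling, the sketch below uses NumPy (not Ray) on a hypothetical array of row indices that stands in for rows sorted by pickup time. Without a shuffle, each mini-batch would hold only neighbouring rows; drawing a fresh permutation at the start of every epoch mixes them. The dataset size, batch size, seed, and epoch count are arbitrary assumptions.

```python
import numpy as np

num_rows = 12      # hypothetical dataset size
batch_size = 4     # hypothetical mini-batch size
rng = np.random.default_rng(seed=0)

row_ids = np.arange(num_rows)  # stands in for rows sorted by pickup time

epoch_batches = []
for epoch in range(2):
    # Draw a fresh global permutation at the start of every epoch.
    shuffled = rng.permutation(row_ids)
    batches = [shuffled[i : i + batch_size] for i in range(0, num_rows, batch_size)]
    epoch_batches.append(batches)
    print(f"epoch {epoch}: {[b.tolist() for b in batches]}")
```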


<b>Split data into train/valid/test </b> 

We are ready to split the data into train/valid/test.  For now, we will just randomly split the data 80/20 into train and validation sets, and derive the test set from the validation set by dropping the target column.


In [8]:
target = "trip_duration"

# Split data into train and validation.
train_ds, valid_ds = ds.train_test_split(test_size=0.2)

# Create a test dataset by dropping the target column.
test_ds = valid_ds.drop_columns(cols=[target])

assert train_ds.count() + valid_ds.count() == ds.count()
print("Number rows train, test: ", end="")
print(f"{train_ds.count()}, {test_ds.count()}")

Map_Batches: 100%|████████████████████████████████| 1/1 [00:00<00:00, 25.42it/s]

Number rows train, test: 1112269, 278068





<b>Tidying up</b>

To make our code more modular and easier to read, let's put all those data processing steps into a single function called `dataprep.prepare_data()`.  See [scripts](../scripts/dataprep.py) for the full code.


In [9]:
# Single function call for preparing data using Ray Dataset
train_ds, valid_ds, test_ds = dataprep.prepare_data(data_files, target)

print("Number rows train, test: ", end="")
print(f"{train_ds.count()}, {test_ds.count()}")
test_ds.take(1)


Read progress: 100%|██████████████████████████████| 1/1 [00:04<00:00,  4.45s/it]
Map_Batches: 100%|████████████████████████████████| 1/1 [00:00<00:00,  2.02it/s]
Map_Batches: 100%|████████████████████████████████| 1/1 [00:00<00:00,  4.64it/s]
Shuffle Map: 100%|████████████████████████████████| 1/1 [00:00<00:00, 14.50it/s]
Shuffle Reduce: 100%|█████████████████████████████| 1/1 [00:00<00:00, 13.57it/s]
Map_Batches: 100%|████████████████████████████████| 1/1 [00:00<00:00, 23.28it/s]

Number rows train, test: 1112269, 278068





[PandasRow({'passenger_count': 1,
            'trip_distance': 0.9700000286102295,
            'fare_amount': 5.699999809265137})]

# AIR Trainer <a class="anchor" id="trainer"></a>

Ray AI Runtime (AIR) is a scalable and unified toolkit for ML applications.  AIR builds on Ray’s best-in-class libraries for Preprocessing, Training, Tuning, Scoring, Serving, and Reinforcement Learning to bring together an ecosystem of integrations.

In [11]:
from ray.air.config import ScalingConfig
from ray.train.sklearn import SklearnTrainer, SklearnPredictor
from ray.data.preprocessors import Chain, OrdinalEncoder, StandardScaler
from ray.air.result import Result

from sklearn.linear_model import LinearRegression

print(f'Number of CPUs in this system: {num_available_cpus}')

Number of CPUs in this system: 8


<b>Scaling decisions</b>

By default, Datasets tasks use all available cluster CPU resources for execution. This can sometimes conflict with Trainer resource requests. For example, if Trainers allocate all CPU resources in the cluster, then no Datasets tasks can run.

A good rule of thumb, if you know the cluster has to do other work besides training, is to reserve a couple of CPUs for that work.

In [None]:
# decide how many processors to use for training
num_training_cpus = max(1, num_available_cpus - 2)  # never request fewer than 1 CPU

# assign training resources to AIR Trainer
trainer_resources = {"CPU": num_training_cpus}
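
The preprocessor created in the next cell standardizes `passenger_count` and `fare_amount`. For intuition, standard scaling rescales each selected column to zero mean and unit variance. The sketch below reproduces that arithmetic in plain pandas on made-up values; it assumes the population standard deviation (`ddof=0`), and Ray's `StandardScaler` may differ in such details.

```python
import pandas as pd

# Hypothetical toy values; only the first two columns are scaled.
toy = pd.DataFrame(
    {
        "passenger_count": [1, 2, 3, 4],
        "fare_amount": [4.5, 6.5, 8.5, 10.5],
        "trip_distance": [0.5, 1.0, 1.5, 2.0],  # left unscaled
    }
)

columns_to_scale = ["passenger_count", "fare_amount"]

# Statistics are computed once on the training data ("fit") ...
means = toy[columns_to_scale].mean()
stds = toy[columns_to_scale].std(ddof=0)

# ... and then applied to any batch ("transform").
scaled = toy.copy()
scaled[columns_to_scale] = (toy[columns_to_scale] - means) / stds
print(scaled)
```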

In [12]:
# Create a preprocessor to scale some columns.
from ray.data.preprocessors import StandardScaler

preprocessor = StandardScaler(
    columns=["passenger_count", 
             "fare_amount", ])

In [13]:
trainer = SklearnTrainer(
    estimator=LinearRegression(),
    label_column=target,
    datasets={"train": train_ds, "valid": valid_ds},
    preprocessor=preprocessor,
    cv=5,
    scaling_config=ScalingConfig(trainer_resources=trainer_resources),
)
result = trainer.fit()
print(result.metrics)

Trial name,status,loc,iter,total time (s),fit_time
SklearnTrainer_f7f56_00000,TERMINATED,127.0.0.1:8790,1,1.89947,0.101401


Result for SklearnTrainer_f7f56_00000:
  cv:
    fit_time: [0.09510993957519531, 0.07595109939575195, 0.07802295684814453, 0.07298398017883301,
      0.06866836547851562]
    fit_time_mean: 0.07814726829528809
    fit_time_std: 0.009045219410326244
    score_time: [0.006638050079345703, 0.003111124038696289, 0.004126071929931641, 0.0031020641326904297,
      0.0026056766510009766]
    score_time_mean: 0.003916597366333008
    score_time_std: 0.0014478224514137114
    test_score: [0.0021971192522125538, 0.0018830469547053141, 0.0021269945373810772,
      0.0023612663349836804, 0.0013532884232091424]
    test_score_mean: 0.0019843431004983535
    test_score_std: 0.0003510513246689167
  date: 2022-10-08_08-29-19
  done: false
  experiment_id: 96cd084697fc46ffa34ab1655d85e11d
  fit_time: 0.10140085220336914
  hostname: Christys-MacBook-Pro.local
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  pid: 8790
  should_checkpoint: true
  time_since_restore: 1.8994650840759277
  time_this_iter

2022-10-08 08:29:19,954	INFO tune.py:758 -- Total run time: 4.15 seconds (3.38 seconds for the tuning loop).


{'valid': {'score_time': 0.005650997161865234, 'test_score': 0.002360176574100592}, 'cv': {'fit_time': array([0.09510994, 0.0759511 , 0.07802296, 0.07298398, 0.06866837]), 'score_time': array([0.00663805, 0.00311112, 0.00412607, 0.00310206, 0.00260568]), 'test_score': array([0.00219712, 0.00188305, 0.00212699, 0.00236127, 0.00135329]), 'fit_time_mean': 0.07814726829528809, 'fit_time_std': 0.009045219410326244, 'score_time_mean': 0.003916597366333008, 'score_time_std': 0.0014478224514137114, 'test_score_mean': 0.0019843431004983535, 'test_score_std': 0.0003510513246689167}, 'fit_time': 0.10140085220336914, 'time_this_iter_s': 1.8994650840759277, 'should_checkpoint': True, 'done': True, 'timesteps_total': None, 'episodes_total': None, 'training_iteration': 1, 'trial_id': 'f7f56_00000', 'experiment_id': '96cd084697fc46ffa34ab1655d85e11d', 'date': '2022-10-08_08-29-19', 'timestamp': 1665242959, 'time_total_s': 1.8994650840759277, 'pid': 8790, 'hostname': 'Christys-MacBook-Pro.local', 'node
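
A note on reading these numbers: scikit-learn regressors score with the coefficient of determination R² by default, so, assuming the trainer falls back on the estimator's own `score` method, `test_score` here is an R² value. The NumPy sketch below computes R² by hand on made-up numbers; the helper name `r2` is hypothetical.

```python
import numpy as np


def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


y_true = np.array([300.0, 600.0, 900.0, 1200.0])  # made-up trip durations

# Predicting the mean for every row explains none of the variance.
baseline = np.full_like(y_true, y_true.mean())
print(r2(y_true, baseline))

# A perfect prediction explains all of it.
print(r2(y_true, y_true))
```

On that scale, the mean `test_score` of roughly 0.002 reported above suggests this linear model explains almost none of the variance in `trip_duration`.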

In [14]:
# def train_sklearn(num_cpus: int, use_gpu: bool = False) -> Result:
#     if use_gpu and not cuMLRandomForestClassifier:
#         raise RuntimeError("cuML must be installed for GPU enabled sklearn estimators.")

#     train_dataset, valid_dataset, _ = prepare_data()

#     # Scale some random columns
#     columns_to_scale = ["mean radius", "mean texture"]
#     preprocessor = Chain(
#         OrdinalEncoder(["categorical_column"]), StandardScaler(columns=columns_to_scale)
#     )

#     if use_gpu:
#         trainer_resources = {"CPU": 1, "GPU": 1}
#         estimator = cuMLRandomForestClassifier()
#     else:
#         trainer_resources = {"CPU": num_cpus}
#         estimator = RandomForestClassifier()

#     trainer = SklearnTrainer(
#         estimator=estimator,
#         label_column="target",
#         datasets={"train": train_dataset, "valid": valid_dataset},
#         preprocessor=preprocessor,
#         cv=5,
#         scaling_config=ScalingConfig(trainer_resources=trainer_resources),
#     )
#     result = trainer.fit()
#     print(result.metrics)

#     return result
