# Running N Independent Trials in Parallel with Ray AIR Tune

# Introduction 

Let us quickly walk through the key concepts you need to know to use Tune.

First, you define the hyperparameters you want to tune in a `search space` and pass them into a `Trainable`, here a callable function, that specifies the objective you want to optimize. Tune calls the trainable once for every configuration in the search space.

Then you select a search algorithm and, optionally, a scheduler that stops unpromising trials early to speed up your experiments. We will use simple **grid search** as our `search algorithm` in this tutorial.

Together with other configurations, your Trainable, Search algorithm, and Scheduler are passed into `Tuner`, which runs your experiments and creates parallel `trials`.

These trials can then be used in analyses to inspect your experiment results.
The following figure shows an overview of these components, which we will cover in this tutorial.

![Tune flow diagram](../images/tune_flow.png)

# Contents

In this tutorial, you will learn how to:
 1. [Load and Prepare Parquet data](#load_data)
 2. [Define a Trainable (callable) function](#define_trainable)
 3. [Define your Ray Tune Search Space](#define_search_space)
 4. [Run independent trials in Parallel using Ray AIR Tune](#run_tune_search)
 5. [Load a model from checkpoint and perform inference](#load_checkpoint)

# Walkthrough

Let us start by importing a few required libraries, including open-source [Ray](https://github.com/ray-project/ray) itself!

In [1]:
import os
print(f'Number of CPUs in this system: {os.cpu_count()}')
from typing import Tuple, List, Union, Optional, Callable
import time
import pandas as pd
import numpy as np
import pyarrow
import pyarrow.parquet as pq
import pyarrow.dataset as pds
print(f"pyarrow: {pyarrow.__version__}")

Number of CPUs in this system: 8
pyarrow: 6.0.1


In [2]:
import ray

if ray.is_initialized():
    ray.shutdown()
ray.init()

2022-11-08 01:31:58,070	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265


| | |
|---|---|
| Python version: | 3.8.13 |
| Ray version: | 2.0.1 |
| Dashboard: | http://127.0.0.1:8265 |


In [3]:
# For benchmarking purposes, we can print the times of various operations.
# In order to reduce clutter in the output, this is set to False by default.
PRINT_TIMES = False

def print_time(msg: str):
    if PRINT_TIMES:
        print(msg)
        
# To speed things up, we'll only use a small subset of the full dataset: the last
# file in the dataset listing (a single month of 2019) and three drop-off locations.
# You can choose to use the full dataset for 2018-2019 by setting the SMOKE_TEST variable to False.
SMOKE_TEST = True


## Load and Prepare Parquet data <a class="anchor" id="load_data"></a>

Apache Arrow uses a [C++ implementation of Apache Parquet](http://github.com/apache/parquet-cpp), which includes a native, multithreaded C++ adapter. As a result, **reading large Parquet files through a PyArrow dataset or table is typically faster than `pandas.read_parquet`, even with `engine="pyarrow"`**.
 
Below, we first collect the Parquet files into a PyArrow dataset. In the following cell, we filter the data on read into a PyArrow table, which is then converted to a pandas dataframe.

In [4]:
# Define some global variables.
target = "trip_duration"
s3_partitions = pds.dataset(
    "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/",
    partitioning=["year", "month"],
)
s3_files = [f"s3://{file}" for file in s3_partitions.files]

# Obtain all location IDs
all_location_ids = (
    pq.read_table(s3_files[0], columns=["dropoff_location_id"])["dropoff_location_id"]
    .unique()
    .to_pylist()
)

# Use smoke testing or not.
starting_idx = -1 if SMOKE_TEST else 0
sample_locations = [145, 166, 152] if SMOKE_TEST else all_location_ids

# Display what data will be used.
s3_files = s3_files[starting_idx:]
print(f"NYC Taxi using {len(s3_files)} file(s)!")
print(f"s3_files: {s3_files}")
print(f"Locations: {sample_locations}")


NYC Taxi using 1 file(s)!
s3_files: ['s3://air-example-data/ursa-labs-taxi-data/by_year/2019/06/data.parquet/ab5b9d2b8cc94be19346e260b543ec35_000000.parquet']
Locations: [145, 166, 152]


In [5]:
# Function to read one Parquet file into a pyarrow.Table, filtering rows on read,
# and return the result as a pandas dataframe
def read_data(file: str, sample_id: np.int32) -> pd.DataFrame:
    
    df = pq.read_table(
        file,
        filters=[
            ("passenger_count", ">", 0),
            ("trip_distance", ">", 0),
            ("fare_amount", ">", 0),
            ("pickup_location_id", "in", [264, 265]),
            ("dropoff_location_id", "not in", [264, 265]), 
            ("dropoff_location_id", "=", sample_id)
        ],
        columns=[
            "pickup_at",
            "dropoff_at",
            "pickup_location_id",
            "dropoff_location_id",
            "passenger_count",
            "trip_distance",
            "fare_amount",
        ],
    ).to_pandas()

    return df

# Function to transform a pandas dataframe
def transform_df(the_df: pd.DataFrame) -> pd.DataFrame:
    df = the_df.copy()
    
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds
    df = df[df["trip_duration"] > 60]
    df = df[df["trip_duration"] < 24 * 60 * 60] 
    df.drop(
        ["dropoff_at", "pickup_at", "pickup_location_id", "fare_amount"],
        axis=1,
        inplace=True,
    )
    df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1)
    return df
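
One detail of `transform_df` worth checking against your own data: on a timedelta column, `.dt.seconds` returns only the seconds component of each value (0 to 86399), while `.dt.total_seconds()` returns the full duration. The minimal sketch below uses two made-up durations, which are assumptions for illustration and not values from the taxi data, to show the difference for a trip longer than one day.

```python
import pandas as pd

# Two hypothetical trip durations: 10 minutes, and 1 day plus 10 minutes.
durations = pd.Series(pd.to_timedelta(["0 days 00:10:00", "1 days 00:10:00"]))

# .dt.seconds drops the day component, so both values look identical.
seconds_component = durations.dt.seconds.tolist()

# .dt.total_seconds() keeps the full duration as a float.
total_seconds = durations.dt.total_seconds().tolist()

print(seconds_component)
print(total_seconds)
```

Because `.dt.seconds` can never reach `24 * 60 * 60`, an upper-bound filter on one day only removes rows when the duration is computed with `.dt.total_seconds()`.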

## Define a Trainable (callable) function <a class="anchor" id="define_trainable"></a>

In [6]:
# import standard sklearn libraries
import sklearn
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
print(f"sklearn: {sklearn.__version__}")

# import ray AIR libraries
from ray import air, tune
from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import ScalingConfig
from ray.train.sklearn import SklearnCheckpoint
from ray.train.batch_predictor import BatchPredictor
from ray.train.sklearn import SklearnPredictor

# set global random seed for sklearn models
np.random.seed(415)

sklearn: 1.1.2


<b>Define a "Trainable" object that you can pass into a Tune run.</b> Ray Tune has two ways of defining a trainable, namely the [Function API](https://docs.ray.io/en/latest/tune/api_docs/trainable.html#trainable-docs) and the Class API. Both are valid ways of defining a trainable, but *the Function API is generally recommended*.

Let’s say we want to optimize a simple objective function like a (x ** 2) + b in which a and b are the hyperparameters we want to tune to minimize the objective. Since the objective also has a variable x, we need to test for different values of x. Given concrete choices for a, b and x we can evaluate the objective function and get a score to minimize.
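
To make this concrete, here is a minimal pure-Python sketch of that toy objective, without Ray. The candidate values for `a`, `b`, and `x`, and the choice to average the score over `x`, are assumptions made only for illustration.

```python
from itertools import product


def objective(a: float, b: float, x: float) -> float:
    """Score to minimize: a * x**2 + b."""
    return a * x**2 + b


# Hypothetical candidate values, chosen only for illustration.
candidate_a = [0.5, 1.0, 2.0]
candidate_b = [0.0, 1.0]
x_values = [-1.0, 0.0, 1.0, 2.0]


def mean_score(a: float, b: float) -> float:
    """Average the objective over all test values of x."""
    return sum(objective(a, b, x) for x in x_values) / len(x_values)


# Evaluate every (a, b) combination and keep the one with the lowest score.
scores = {(a, b): mean_score(a, b) for a, b in product(candidate_a, candidate_b)}
best_a, best_b = min(scores, key=scores.get)
print(f"best a={best_a}, b={best_b}, score={scores[(best_a, best_b)]:.2f}")
```

In this tutorial the same pattern applies, except that the "hyperparameters" are a model and a drop-off location, and the score is the mean absolute error on held-out data.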

**In the cell below, our "Trainable" function is called `train_model`**. It takes a config dictionary as its only argument and, instead of returning a value, reports a metrics dict and a model checkpoint to Tune through `session.report()`.

In [7]:
# 1. Define a custom train function
def train_model(config: dict):

    model = config['model']
    the_location = config['location']
    
    # Load data.
    df_list = [read_data(f, the_location) for f in s3_files]   
    df_raw = pd.concat(df_list, ignore_index=True)
    df = transform_df(df_raw)
    
    # Train/test split.
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True)
    train_X = train_df[["passenger_count", "trip_distance"]]
    train_y = train_df.trip_duration
    test_X = test_df[["passenger_count", "trip_distance"]]
    test_y = test_df.trip_duration

    # Train model.
    model = model.fit(train_X, train_y)
    pred_y = model.predict(test_X)
    
    # Evaluate.
    error = sklearn.metrics.mean_absolute_error(test_y, pred_y)
    
    # Save the model as a Tune Checkpoint. This is done through 
    # ray.air.session.report() API, which can take in a Checkpoint object.
    # https://docs.ray.io/en/latest/tune/tutorials/tune-checkpoints.html
    
    # Wrap the fitted model in a checkpoint object
    checkpoint = SklearnCheckpoint.from_estimator(model, path=".")

    # Report the metric and the checkpoint to Tune using session.report()
    session.report(
            {"error": error}, 
            checkpoint=checkpoint)

## Define your Ray Tune Search Space <a class="anchor" id="define_search_space"></a>

**Next, define a search space of trials to run.** Below, we define a simple grid search over 2 scikit-learn models and a list of NYC taxi drop-off locations (3 locations in smoke-test mode).

Ray Tune will generate every combination of the grid search parameters and pass each combination, as the config dictionary, to a separate call of the Trainable function.

Besides grid search, learn about other features Tune offers for defining spaces at [Working with Tune Search Spaces](https://docs.ray.io/en/master/tune/tutorials/tune-search-spaces.html#tune-search-space-tutorial).
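
To make the grid expansion concrete, the sketch below builds the same set of combinations with `itertools.product`, without Ray. The model names are plain strings standing in for the estimator objects, and nothing here claims to reproduce how Tune orders or names its trials.

```python
from itertools import product

# Stand-ins for the grid values; model names are plain strings here.
grid = {
    "model": ["LinearRegression", "DecisionTreeRegressor"],
    "location": [145, 166, 152],
}

# Cartesian product of all grid values: one config dict per trial.
keys = list(grid)
configs = [dict(zip(keys, values)) for values in product(*grid.values())]

for config in configs:
    print(config)
print(f"{len(configs)} trials")
```

Two models times three locations gives six configurations, which matches the six trials reported later in this tutorial.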

In [8]:
# 2. Define a search space.
sample_locations = [145, 166, 152] if SMOKE_TEST else all_location_ids
search_space = {
    "model": tune.grid_search([LinearRegression(fit_intercept=True), 
                                DecisionTreeRegressor(max_depth=3)]),
    "location": tune.grid_search(sample_locations),
}

## Run independent trials in Parallel using Ray AIR Tune <a class="anchor" id="run_tune_search"></a>

**Optionally, configure the resources allocated per trial.** Tune uses this resource allocation to control parallelism. For example, if each trial is configured to use 4 CPUs and the cluster has only 32 CPUs, then Tune limits the number of concurrent trials to 8 to avoid overloading the cluster.

For more information, see [A Guide To Parallelism and Resources](https://docs.ray.io/en/master/tune/tutorials/tune-resources.html#tune-parallelism).
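
The arithmetic behind that limit can be sketched as follows. This is a simplified CPU-only estimate, and the helper name `max_concurrent_trials` is hypothetical (it is not part of Ray). Actual scheduling also depends on GPUs, memory, custom resources, and other cluster details.

```python
def max_concurrent_trials(num_trials: int, cluster_cpus: int, cpus_per_trial: int) -> int:
    """CPU-only upper bound on how many trials can run at the same time."""
    if cpus_per_trial <= 0:
        raise ValueError("cpus_per_trial must be positive")
    # Integer division: a trial only starts if all of its CPUs are available.
    return min(num_trials, cluster_cpus // cpus_per_trial)


# The example from the text: 4 CPUs per trial on a 32-CPU cluster.
print(max_concurrent_trials(num_trials=100, cluster_cpus=32, cpus_per_trial=4))

# This notebook: 6 trials, 1 CPU each, on the 8-CPU machine shown earlier.
print(max_concurrent_trials(num_trials=6, cluster_cpus=8, cpus_per_trial=1))
```

Under these assumptions the first call gives 8 and the second gives 6, so all six trials in this tutorial can run at once.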

In [9]:
# 3. Can customize resources per trial, here we set 1 CPU each.
train_model = tune.with_resources(train_model, {"cpu": 1})

**Below, we introduce AIR configs and run the trials using `tuner.fit()`.** Tune reports on experiment status while it runs, and after the experiment finishes, you can inspect the results.

Notice in the AIR config, we have specified a local directory `my_Tune_logs` for logging instead of the default `~/ray_results` directory. Learn more about logging Tune results at [How to configure logging in Tune](https://docs.ray.io/en/master/tune/tutorials/tune-output.html#tune-logging).

Tune can retry failed trials automatically, as well as entire experiments; see [Stopping and Resuming a Tune Run](https://docs.ray.io/en/master/tune/tutorials/tune-stopping.html#tune-stopping-guide).

In [10]:
# Define a tuner object using Ray AIR Tuner API
stop_criteria = {
    "done": True,
    "training_iteration": 1 if SMOKE_TEST else 3,
}
tuner = tune.Tuner(
    train_model, 
    param_space=search_space,
    run_config=air.RunConfig(
        # Redirect logs to a relative path instead of the default ~/ray_results/
        local_dir="my_Tune_logs",
        name="batch_tuning",

        # No custom syncing
        sync_config=tune.SyncConfig(
            syncer=None  # Disable custom syncing (uses rsync by default)
        ),

        # Checkpoint config
        checkpoint_config=air.CheckpointConfig(
            checkpoint_score_attribute="error",
        ),

        # Stop a trial as soon as any of these criteria is met
        stop=stop_criteria,

        # Set Ray Tune verbosity. The summary table is printed only at levels 2 or 3.
        verbose=2,
    ),
)

# 4. Run the trials with Ray Tune
start = time.time()
results = tuner.fit()
total_time_taken = time.time() - start

# Assemble Tune results into a pandas dataframe
results_df = results.get_dataframe()

# Print some training stats
print(f"Total number of models: {results_df.shape[0]}")
print(f"TOTAL TIME TAKEN: {total_time_taken:.2f} seconds")
# The best trial's config holds the model and location with the lowest error
best_config = results.get_best_result(metric="error", mode="min").config
print(f"Best result: {best_config}")



| Trial name | status | loc | location | model | iter | total time (s) | error |
|---|---|---|---|---|---|---|---|
| train_model_39d58_00000 | TERMINATED | 127.0.0.1:77555 | 145 | LinearRegression() | 1 | 447.438 | 706.173 |
| train_model_39d58_00001 | TERMINATED | 127.0.0.1:77564 | 166 | LinearRegression() | 1 | 317.96 | 272.063 |
| train_model_39d58_00002 | TERMINATED | 127.0.0.1:77565 | 152 | LinearRegression() | 1 | 421.948 | 266.036 |
| train_model_39d58_00003 | TERMINATED | 127.0.0.1:77566 | 145 | DecisionTreeReg_5070 | 1 | 438.83 | 461.833 |
| train_model_39d58_00004 | TERMINATED | 127.0.0.1:77567 | 166 | DecisionTreeReg_51f0 | 1 | 452.476 | 227.662 |
| train_model_39d58_00005 | TERMINATED | 127.0.0.1:77568 | 152 | DecisionTreeReg_5370 | 1 | 437.406 | 224.3 |


2022-11-08 01:37:35,270	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': LinearRegression()}


Trial train_model_39d58_00001 reported error=272.06299675835504,should_checkpoint=True with parameters={'model': LinearRegression(), 'location': 166}. This trial completed.


2022-11-08 01:39:19,151	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': LinearRegression()}


Trial train_model_39d58_00002 reported error=266.03564453125,should_checkpoint=True with parameters={'model': LinearRegression(), 'location': 152}. This trial completed.


2022-11-08 01:39:34,677	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': DecisionTreeRegressor(max_depth=3)}


Trial train_model_39d58_00005 reported error=224.3,should_checkpoint=True with parameters={'model': DecisionTreeRegressor(max_depth=3), 'location': 152}. This trial completed.


2022-11-08 01:39:36,102	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': DecisionTreeRegressor(max_depth=3)}


Trial train_model_39d58_00003 reported error=461.8333333333333,should_checkpoint=True with parameters={'model': DecisionTreeRegressor(max_depth=3), 'location': 145}. This trial completed.


2022-11-08 01:39:42,662	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': LinearRegression()}


Trial train_model_39d58_00000 reported error=706.1730143229166,should_checkpoint=True with parameters={'model': LinearRegression(), 'location': 145}. This trial completed.


2022-11-08 01:39:49,760	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': DecisionTreeRegressor(max_depth=3)}


Trial train_model_39d58_00004 reported error=227.6619017499473,should_checkpoint=True with parameters={'model': DecisionTreeRegressor(max_depth=3), 'location': 166}. This trial completed.


2022-11-08 01:39:49,875	INFO tune.py:758 -- Total run time: 456.79 seconds (456.24 seconds for the tuning loop).


Total number of models: 6
TOTAL TIME TAKEN: 456.80 seconds
Best result: {'model': DecisionTreeRegressor(max_depth=3), 'location': 152}


In [20]:
# Assemble Tune results into a pandas dataframe
results_df = results.get_dataframe()
print(results_df.columns)
# Copy the selected columns so the in-place edits below do not act on a view
results_df = results_df[["config/location", "config/model", "error", "logdir"]].copy()
results_df.rename(
    columns={"config/location": "location_id", "config/model": "model_name"},
    inplace=True,
)
results_df.set_index(["location_id", "model_name"], inplace=True)
results_df['checkpoint'] = results_df['logdir'].apply(lambda x: x + "/checkpoint_000000/model")
results_df['checkpoint_dir'] = results_df['logdir'].apply(lambda x: x + "/checkpoint_000000")
results_df.drop("logdir", axis=1, inplace=True)
results_df.head()

Index(['error', 'time_this_iter_s', 'should_checkpoint', 'done',
       'timesteps_total', 'episodes_total', 'training_iteration', 'trial_id',
       'experiment_id', 'date', 'timestamp', 'time_total_s', 'pid', 'hostname',
       'node_ip', 'time_since_restore', 'timesteps_since_restore',
       'iterations_since_restore', 'warmup_time', 'config/location',
       'config/model', 'logdir'],
      dtype='object')


| location_id | model_name | error | checkpoint | checkpoint_dir |
|---|---|---|---|---|
| 145 | LinearRegression() | 706.173014 | /Users/christy/Documents/github_ray_temp/ray/d... | /Users/christy/Documents/github_ray_temp/ray/d... |
| 166 | LinearRegression() | 272.062997 | /Users/christy/Documents/github_ray_temp/ray/d... | /Users/christy/Documents/github_ray_temp/ray/d... |
| 152 | LinearRegression() | 266.035645 | /Users/christy/Documents/github_ray_temp/ray/d... | /Users/christy/Documents/github_ray_temp/ray/d... |
| 145 | DecisionTreeRegressor(max_depth=3) | 461.833333 | /Users/christy/Documents/github_ray_temp/ray/d... | /Users/christy/Documents/github_ray_temp/ray/d... |
| 166 | DecisionTreeRegressor(max_depth=3) | 227.661902 | /Users/christy/Documents/github_ray_temp/ray/d... | /Users/christy/Documents/github_ray_temp/ray/d... |


In [21]:
# Keep only 1 model per location_id with minimum error
final_df = results_df.reset_index()
final_df = final_df.loc[final_df.groupby('location_id')['error'].idxmin()].copy()
final_df.sort_values(by=["error"], inplace=True)
final_df.reset_index(inplace=True, drop=True)
print(final_df.dtypes)
final_df

location_id         int64
model_name         object
error             float64
checkpoint         object
checkpoint_dir     object
dtype: object


| | location_id | model_name | error | checkpoint | checkpoint_dir |
|---|---|---|---|---|---|
| 0 | 152 | DecisionTreeRegressor(max_depth=3) | 224.3 | /Users/christy/Documents/github_ray_temp/ray/d... | /Users/christy/Documents/github_ray_temp/ray/d... |
| 1 | 166 | DecisionTreeRegressor(max_depth=3) | 227.661902 | /Users/christy/Documents/github_ray_temp/ray/d... | /Users/christy/Documents/github_ray_temp/ray/d... |
| 2 | 145 | DecisionTreeRegressor(max_depth=3) | 461.833333 | /Users/christy/Documents/github_ray_temp/ray/d... | /Users/christy/Documents/github_ray_temp/ray/d... |
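
The selection step relies on `groupby(...).idxmin()`, which returns, for each group, the row label of the smallest value. The sketch below replays that step on a small stand-alone dataframe built from the six errors in the trial table earlier in this tutorial. The name `toy_df` and the plain-string model names are assumptions for illustration.

```python
import pandas as pd

# Errors copied from the trial table in this tutorial; model names are plain strings.
toy_df = pd.DataFrame(
    {
        "location_id": [145, 166, 152, 145, 166, 152],
        "model_name": ["LinearRegression"] * 3 + ["DecisionTreeRegressor"] * 3,
        "error": [706.173, 272.063, 266.036, 461.833, 227.662, 224.3],
    }
)

# For each location, find the row label with the smallest error.
best_rows = toy_df.groupby("location_id")["error"].idxmin()

# Select those rows, then order them from lowest to highest error.
best_df = toy_df.loc[best_rows].sort_values("error").reset_index(drop=True)
print(best_df)
```

Each location keeps exactly one row, and in this data the decision tree has the lower error for all three locations.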


## Load a model from checkpoint and perform inference  <a class="anchor" id="load_checkpoint"></a>

In [26]:
# Get a checkpoint_dir.
checkpoint_path = final_df.checkpoint_dir[0]
print(checkpoint_path)

# In another function or script, recover Checkpoint object from path
checkpoint = SklearnCheckpoint.from_directory(checkpoint_path)
print(type(checkpoint))

# Restore a predictor object from checkpoint
predictor = SklearnPredictor.from_checkpoint(checkpoint)
print(type(predictor))

/Users/christy/Documents/github_ray_temp/ray/doc/source/ray-air/examples/my_Tune_logs/batch_tuning/train_model_39d58_00005_5_location=152,model=DecisionTreeRegressor_max_depth_3_2022-11-08_01-32-15/checkpoint_000000
<class 'ray.train.sklearn.sklearn_checkpoint.SklearnCheckpoint'>
<class 'ray.train.sklearn.sklearn_predictor.SklearnPredictor'>


In [27]:
# Get a location
the_location = final_df.location_id[0]
the_location

152

In [50]:
# Create test data
# TODO how to save the test data so you don't have to re-create it?
df_list = [read_data(f, the_location) for f in s3_files]   
df_raw = pd.concat(df_list, ignore_index=True)
df = transform_df(df_raw)

# Train/test split.
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True)
# train_X = train_df[["passenger_count", "trip_distance"]]
# train_y = train_df.trip_duration
test_X = test_df[["passenger_count", "trip_distance"]]
test_y = test_df.trip_duration

In [51]:
# Re-index actual values
test_y = test_y.reset_index(drop=True)
test_y

0     989
1    2398
2     432
3     741
4     242
Name: trip_duration, dtype: int64

In [52]:
# Inference using restored model from checkpoint
pred_y = predictor.predict(test_X)
pd.concat([pred_y, test_y], axis=1, sort=False, join="outer") 

| | predictions | trip_duration |
|---|---|---|
| 0 | 735.4 | 989 |
| 1 | 1520.2 | 2398 |
| 2 | 432.0 | 432 |
| 3 | 735.4 | 741 |
| 4 | 239.0 | 242 |


In [55]:
# Evaluate.
error = sklearn.metrics.mean_absolute_error(test_y, pred_y)
print(f"Test error: {error}")

Test error: 228.0
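
As a sanity check, the mean absolute error can be recomputed by hand from the five prediction and actual pairs shown in the inference output. The sketch below uses plain Python instead of scikit-learn, with the values copied from that output.

```python
import math

# Predictions and actual trip durations (seconds) from the inference output.
predictions = [735.4, 1520.2, 432.0, 735.4, 239.0]
actuals = [989, 2398, 432, 741, 242]

# Mean absolute error: average of |prediction - actual| over all rows.
abs_errors = [abs(p - a) for p, a in zip(predictions, actuals)]
mae = sum(abs_errors) / len(abs_errors)
print(f"MAE: {mae:.1f}")
```

The absolute errors are roughly 253.6, 877.8, 0, 5.6, and 3.0 seconds, which sum to 1140 and average to 228.0, in agreement with the reported test error.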


In [54]:
# Compare the test error with the error reported during tuning
# (that value was computed on the trial's own held-out 20% split)
print(f"Train error: {final_df.error[0]}")

Train error: 224.3
