# Batch training & tuning on Ray Tune

Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting of simple models on multiple data batches corresponding to locations, products, etc.

In the context of this notebook, batch training is understood as creating the same model(s) for different and separate data, or subsets of a dataset. This notebook showcases how to conduct batch training using [Ray Tune](https://docs.ray.io/en/latest/tune/index.html).

![Batch training diagram](../../data/examples/images/batch-training.svg)

For the data, we will use the [NYC Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). This popular tabular dataset contains historical taxi pickups by timestamp and location in NYC. To demonstrate batch training, we will simplify the data to a regression problem to predict `trip_duration` and use scikit-learn.

To demonstrate how batch training can be parallelized, we will train a separate model for each dropoff location. This means we can use the `dropoff_location_id` column in the dataset to group the dataset into data batches. Then we will fit a separate model for each batch and evaluate it.

# Contents

In this this tutorial, you will learn about:
 1. [Introduction to Ray Tune](#intro_tune)
 2. [Define how to load and prepare Parquet data](#load_data)
 3. [Define your Ray Tune Search Space and Search Algorithm](#define_search_space)
 4. [Define a Trainable (callable) function](#define_trainable)
 5. [Run independent trials in Parallel using Ray AIR Tune](#run_tune_search)
 6. [Load a model from checkpoint and perform inference](#load_checkpoint)


# Introduction to Ray Tune <a class="anchor" id="intro_tune"></a>

**While Tune's main purpose is hyperparameter optimization, you can also use it as an execution engine to run parallel trials in any search space.**  

In this notebook, we will use [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) to run separate, parallel training jobs per dropoff location.  After your Tune experiment (all the trials) has run, we will pick the best model per dropoff location.
> An experiment in Tune is defined as a set of trials. 

Let us quickly walk through the [key concepts](https://docs.ray.io/en/latest/tune/key-concepts.html) you need to know to use Tune.

- First, you define a *Search space* and pass that into a `Trainable` or `callable` function, that specifies the objective you want to tune.  The trainable function will be called for every trial in the search space.

- Then you select a *Search algorithm* and optionally use a
scheduler to stop searches early and speed up your experiment.  We will use simple **grid search** (run every permutation as a separate trial).

- Together with other configurations, the Trainable, Search algorithm, and Scheduler are passed into `Tuner`, which runs your experiment trials in parallel.

- The output from `Tuner.fit()` can be analyzed to inspect your experiment results.  The following figure shows an overview of the Ray Tune flow, which we will use in this tutorial.

![Tune flow diagram](../../tune/images/tune_flow.png)

# Walkthrough

Let us start by importing a few required libraries, including open-source [Ray](https://github.com/ray-project/ray) itself!

In [1]:
import os
print(f'Number of CPUs in this system: {os.cpu_count()}')
from typing import Tuple, List, Union, Optional, Callable
import time
import pandas as pd
import numpy as np
import pyarrow
import pyarrow.parquet as pq
import pyarrow.dataset as pds
print(f"pyarrow: {pyarrow.__version__}")

Number of CPUs in this system: 8
pyarrow: 10.0.0


In [2]:
import ray

if ray.is_initialized():
    ray.shutdown()
ray.init()

2022-11-09 20:58:39,476	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at [1m[32mhttp://127.0.0.1:8265 [39m[22m


0,1
Python version:,3.8.13
Ray version:,2.0.1
Dashboard:,http://127.0.0.1:8265


In [3]:
print(ray.cluster_resources())

{'CPU': 8.0, 'object_store_memory': 2147483648.0, 'node:127.0.0.1': 1.0, 'memory': 7980502221.0}


In [4]:
# import standard sklearn libraries
import sklearn
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
print(f"sklearn: {sklearn.__version__}")

# import ray libraries
from ray import air, tune
from ray.air import session
from ray.air.checkpoint import Checkpoint

# set global random seed for sklearn models
np.random.seed(415)

sklearn: 1.1.2


In [5]:
# For benchmarking purposes, we can print the times of various operations.
# In order to reduce clutter in the output, this is set to False by default.
PRINT_TIMES = True

def print_time(msg: str):
    if PRINT_TIMES:
        print(msg)
        
# To speed things up, we’ll only use a small subset of the full dataset consisting of two last months of 2019.
# You can choose to use the full dataset for 2018-2019 by setting the SMOKE_TEST variable to False.
SMOKE_TEST = True


## Define how to load and prepare Parquet data <a class="anchor" id="load_data"></a>

First, we need to load some data.  Since the NYC Taxi dataset is fairly large, we will filter files first into a PyArrow dataset. And then in the next cell after, we will filter the data on read into a PyArrow table and convert that to a pandas dataframe.

```{tip}
Use PyArrow dataset and table for reading or writing large parquet files, since its native multithreaded C++ adpater is faster than pandas read_parquet, even using engine=pyarrow.
```

In [6]:
# Define some global variables.
target = "trip_duration"
s3_partitions = pds.dataset(
    "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/",
    partitioning=["year", "month"],
)
s3_files = [f"s3://{file}" for file in s3_partitions.files]

# Obtain all location IDs
all_location_ids = (
    pq.read_table(s3_files[0], columns=["dropoff_location_id"])["dropoff_location_id"]
    .unique()
    .to_pylist()
)
# drop [264, 265]
all_location_ids.remove(264)
all_location_ids.remove(265)

# Use smoke testing or not.
starting_idx = -1 if SMOKE_TEST else 0
sample_locations = [145, 152, 204, 199] if SMOKE_TEST else all_location_ids

# Display what data will be used.
s3_files = s3_files[starting_idx:]
print(f"NYC Taxi using {len(s3_files)} file(s)!")
print(f"s3_files: {s3_files}")
print(f"Locations: {sample_locations}")


NYC Taxi using 1 file(s)!
s3_files: ['s3://air-example-data/ursa-labs-taxi-data/by_year/2019/06/data.parquet/ab5b9d2b8cc94be19346e260b543ec35_000000.parquet']
Locations: [145, 152, 204, 199]


In [7]:
# Function to read a pyarrow.Table object using pyarrow parquet 
def read_data(file: str, sample_id: np.int32) -> pd.DataFrame:
    
    df = pq.read_table(
        file,
        filters=[
            ("passenger_count", ">", 0),
            ("trip_distance", ">", 0),
            ("fare_amount", ">", 0),
            ("pickup_location_id", "in", [264, 265]),
            ("dropoff_location_id", "not in", [264, 265]), 
            ("dropoff_location_id", "=", sample_id)
        ],
        columns=[
            "pickup_at",
            "dropoff_at",
            "pickup_location_id",
            "dropoff_location_id",
            "passenger_count",
            "trip_distance",
            "fare_amount",
        ],
    ).to_pandas()

    return df

# Function to transform a pandas dataframe
def transform_df(the_df: pd.DataFrame) -> pd.DataFrame:
    df = the_df.copy()
    
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds
    df = df[df["trip_duration"] > 60]
    df = df[df["trip_duration"] < 24 * 60 * 60] 
    df.drop(["dropoff_at", "pickup_at", "pickup_location_id", "fare_amount"]
            , axis=1, inplace=True)
    df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1)
    return df

In [8]:
# %%time

# # Test reading data.
# import itertools
# my_list = itertools.product(s3_files, sample_locations)

# # [print(f[0], f[1]) for f in my_list]  
# df_list = [read_data(f[0], f[1]) for f in my_list]
# df_raw = pd.concat(df_list, ignore_index=True)
# # Transform data.
# df = transform_batch(df_raw)

# # Inspect the pandas dataframe.
# df.head()

## Define your Ray Tune Search Space and Search Algorithm <a class="anchor" id="define_search_space"></a>

**First, define a search space of experiment trials to run.** 
> The search space together with a search algorithm determine which trials will be run.  

Common search algorithms include grid search, random search, Bayesian search, Hyperopt, and Optuna.  For more details, see [Working with Tune Search Spaces](https://docs.ray.io/en/master/tune/tutorials/tune-search-spaces.html#tune-search-space-tutorial).  Deciding the best combination of search space and search algorithm is part of the art of being a Data Scientist and depends on the data, algorithm, and problem being solved!  

**Below, we define our search space is:**
- 2 different Scikit-learn algorithms 
- Some or all NYC taxi drop-off locations. 

**And we define the search algorithm is:**
- Grid search.

> This means every permutation of each algorithm and each NYC Taxi drop-off location will be run as a separate trial!  

Ray Tune partitions the Search space using the specified Search algorithm and takes care of running a Tune job on each partition in parallel.  Specifically, Ray Tune will pass a config dictionary to each partition and make a Trainable function call.

In [9]:
# 1. Define a search space.
sample_locations = [145, 152, 204, 199] if SMOKE_TEST else all_location_ids
search_space = {
    "model": tune.grid_search([LinearRegression(fit_intercept=True), 
                                DecisionTreeRegressor(max_depth=3)]),
    "location": tune.grid_search(sample_locations),
}

## Define a Trainable (callable) function <a class="anchor" id="define_trainable"></a>

🧪 Typically when you are running Data Science experiments, you want to be able to keep track of summary metrics for each trial, so you can decide at the end which trials were the best.  That way, you can decide which model to deploy.

📋 Ray Tune produces an experiment Summary table with metrics which you specify how to calculate inside a "Trainable" or "callable" function.

<b>Define a "Trainable" or "callable" function that can be called by every parallel Tune trial.</b> 
>Ray Tune has two ways of defining a trainable, namely the [Function API](https://docs.ray.io/en/latest/tune/api_docs/trainable.html#trainable-docs) and the Class API. Both are valid ways of defining a trainable, but *the Function API is generally recommended*.

**In the cell below, we define a "Trainable" function called `train_model()`**.  
- It takes as input a config dictionary argument. 
- The output can be a simple dictionary of metrics which will be reported back to the Tune Summary table.  In our case, we've chosen to checkpoint save each model in addition to reporting each trial's metrics.
- `train_model()` will be called in parallel by Tune for each partition of the Tune search space.  
- Since we are using **grid search**, this means `train_model()` will be run *in parallel for every permutation* in the Tune search space!

In [10]:
# 2. Define a custom train function
def train_model(config: dict):

    model = config['model']
    the_location = config['location']
    
    # Load data.
    df_list = [read_data(f, the_location) for f in s3_files]   
    df_raw = pd.concat(df_list, ignore_index=True)
    df = transform_df(df_raw)
    
    # We need at least 4 rows to create a train / test split.
    if len(df) < 4:
        print_time(f"Data batch for LocID {location_id} is empty or smaller than 4 rows")
        return None
        
    # Train/valid split.
    train_df, valid_df = train_test_split(df, test_size=0.2, shuffle=True)
    train_X = train_df[["passenger_count", "trip_distance"]]
    train_y = train_df.trip_duration
    valid_X = valid_df[["passenger_count", "trip_distance"]]
    valid_y = valid_df.trip_duration

    # Train model.
    model = model.fit(train_X, train_y)
    pred_y = model.predict(valid_X)

    # Evaluate.
    error = sklearn.metrics.mean_absolute_error(valid_y, pred_y)

    # Save the model as a Tune Checkpoint.  
    # This is done through ray.air.session.report() API.
    # https://docs.ray.io/en/latest/tune/tutorials/tune-checkpoints.html

    # Define a model checkpoint.
    checkpoint = Checkpoint.from_dict({
        "model": model, 
        "location_id": the_location})

    # Save checkpoint and report back metrics, using ray.air.session.report()
    metrics = dict(error = error)
    session.report(
            metrics, 
            checkpoint=checkpoint)

## Run independent trials in Parallel using Ray Tune <a class="anchor" id="run_tune_search"></a>

Now we are ready to use Ray Tune to run separate, parallel training jobs, each as a different model training job, per dropoff location.

**Configure the resources allocated per trial.** Tune uses this resources allocation to control the parallelism. For example, if each trial was configured to use 4 CPUs, and the cluster had only 32 CPUs, then Tune will limit the number of concurrent trials to 8 to avoid overloading the cluster. For more information, see [A Guide To Parallelism and Resources](https://docs.ray.io/en/master/tune/tutorials/tune-resources.html#tune-parallelism).

In [11]:
# 3. Customize resources per trial, here we set 1 CPU each.
train_model = tune.with_resources(train_model, {"cpu": 1})

**Below, we introduce AIR configs and run the experiment using `tuner.fit()`.** Tune will report on experiment status, and after the experiment finishes, you can inspect the results. 

Notice in the AIR config, we have specified a local directory `my_Tune_logs` for logging instead of the default `~/ray_results` directory. Learn more about logging Tune results at [How to configure logging in Tune](https://docs.ray.io/en/master/tune/tutorials/tune-output.html#tune-logging).

Tune can [retry failed experiments automatically](https://docs.ray.io/en/master/tune/tutorials/tune-stopping.html#tune-stopping-guide), as well as entire experiments.  This is necessary in case a node on your remote cluster fails (when running on a cloud such as AWS or GCP).

In [12]:
# Define a tuner object using Ray AIR Tuner API
tuner = tune.Tuner(
    train_model, 
    param_space=search_space,
    run_config=air.RunConfig(
        
        #redirect logs to relative path instead of default ~/ray_results/
        local_dir = "my_Tune_logs",
        name = "batch_tuning",

        # Set Ray Tune verbosity.  Print summary table only with levels 2 or 3.
        verbose=2,
        ),
)

# 4. Run the experiment with Ray Tune
start = time.time()
results = tuner.fit()
total_time_taken = time.time() - start

# Print some training stats
print(f"Total number of models: {len(results)}")
print(f"TOTAL TIME TAKEN: {total_time_taken:.2f} seconds")
best_result = results.get_best_result(metric="error", mode="min").config
print(f"Best result: {best_result}")

# Total number of models: 518
# TOTAL TIME TAKEN: 1585.76 seconds
# Best result: {'model': DecisionTreeRegressor(max_depth=3), 'location': 84}



Trial name,status,loc,location,model,iter,total time (s),error
train_model_5a681_00000,TERMINATED,127.0.0.1:84627,145,LinearRegression(),1.0,115.641,785.425
train_model_5a681_00001,TERMINATED,127.0.0.1:84632,152,LinearRegression(),1.0,119.044,239.382
train_model_5a681_00004,TERMINATED,127.0.0.1:84638,145,DecisionTreeReg_9a90,1.0,117.44,564.631
train_model_5a681_00005,TERMINATED,127.0.0.1:84641,152,DecisionTreeReg_9c10,1.0,119.257,318.76
train_model_5a681_00002,ERROR,127.0.0.1:84633,204,LinearRegression(),,,
train_model_5a681_00003,ERROR,127.0.0.1:84634,199,LinearRegression(),,,
train_model_5a681_00006,ERROR,127.0.0.1:84642,204,DecisionTreeReg_9d90,,,
train_model_5a681_00007,ERROR,127.0.0.1:84644,199,DecisionTreeReg_9f10,,,

Trial name,# failures,error file
train_model_5a681_00002,1,"/Users/christy/Documents/github_ray_temp/ray/doc/source/ray-air/examples/my_Tune_logs/batch_tuning/train_model_5a681_00002_2_location=204,model=LinearRegression_2022-11-09_20-58-46/error.txt"
train_model_5a681_00003,1,"/Users/christy/Documents/github_ray_temp/ray/doc/source/ray-air/examples/my_Tune_logs/batch_tuning/train_model_5a681_00003_3_location=199,model=LinearRegression_2022-11-09_20-58-46/error.txt"
train_model_5a681_00006,1,"/Users/christy/Documents/github_ray_temp/ray/doc/source/ray-air/examples/my_Tune_logs/batch_tuning/train_model_5a681_00006_6_location=204,model=DecisionTreeRegressor_max_depth_3_2022-11-09_20-58-46/error.txt"
train_model_5a681_00007,1,"/Users/christy/Documents/github_ray_temp/ray/doc/source/ray-air/examples/my_Tune_logs/batch_tuning/train_model_5a681_00007_7_location=199,model=DecisionTreeRegressor_max_depth_3_2022-11-09_20-58-46/error.txt"


[2m[36m(train_model pid=84642)[0m 2022-11-09 21:00:41,995	ERROR function_trainable.py:298 -- Runner Thread raised error.
[2m[36m(train_model pid=84642)[0m Traceback (most recent call last):
[2m[36m(train_model pid=84642)[0m   File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 289, in run
[2m[36m(train_model pid=84642)[0m     self._entrypoint()
[2m[36m(train_model pid=84642)[0m   File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 362, in entrypoint
[2m[36m(train_model pid=84642)[0m     return self._trainable_func(
[2m[36m(train_model pid=84642)[0m   File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
[2m[36m(train_model pid=84642)[0m     return method(self, *_args, **_kwargs)
[2m[36m(train_model pid=84642)[0m   File "/Users/christy/miniforge3/env

The trial train_model_5a681_00006 errored with parameters={'model': DecisionTreeRegressor(max_depth=3), 'location': 204}. Error file: /Users/christy/Documents/github_ray_temp/ray/doc/source/ray-air/examples/my_Tune_logs/batch_tuning/train_model_5a681_00006_6_location=204,model=DecisionTreeRegressor_max_depth_3_2022-11-09_20-58-46/error.txt
Trial train_model_5a681_00000 reported error=785.4246215820312,should_checkpoint=True with parameters={'model': LinearRegression(), 'location': 145}.


2022-11-09 21:00:42,322	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': LinearRegression()}


Trial train_model_5a681_00000 completed. Last result: error=785.4246215820312,should_checkpoint=True


2022-11-09 21:00:43,564	ERROR trial_runner.py:987 -- Trial train_model_5a681_00007: Error processing event.
ray.exceptions.RayTaskError(NameError): [36mray::ImplicitFunc.train()[39m (pid=84644, ip=127.0.0.1, repr=train_model)
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 349, in train
    result = self.step()
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 417, in step
    self._report_thread_runner_error(block=True)
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 589, in _report_thread_runner_error
    raise e
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 289, in run
    self._entrypoint()
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py

The trial train_model_5a681_00007 errored with parameters={'model': DecisionTreeRegressor(max_depth=3), 'location': 199}. Error file: /Users/christy/Documents/github_ray_temp/ray/doc/source/ray-air/examples/my_Tune_logs/batch_tuning/train_model_5a681_00007_7_location=199,model=DecisionTreeRegressor_max_depth_3_2022-11-09_20-58-46/error.txt


2022-11-09 21:00:45,681	ERROR trial_runner.py:987 -- Trial train_model_5a681_00003: Error processing event.
ray.exceptions.RayTaskError(NameError): [36mray::ImplicitFunc.train()[39m (pid=84634, ip=127.0.0.1, repr=train_model)
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 349, in train
    result = self.step()
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 417, in step
    self._report_thread_runner_error(block=True)
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 589, in _report_thread_runner_error
    raise e
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 289, in run
    self._entrypoint()
  File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py

The trial train_model_5a681_00003 errored with parameters={'model': LinearRegression(), 'location': 199}. Error file: /Users/christy/Documents/github_ray_temp/ray/doc/source/ray-air/examples/my_Tune_logs/batch_tuning/train_model_5a681_00003_3_location=199,model=LinearRegression_2022-11-09_20-58-46/error.txt


[2m[36m(train_model pid=84633)[0m 2022-11-09 21:00:45,832	ERROR function_trainable.py:298 -- Runner Thread raised error.
[2m[36m(train_model pid=84633)[0m Traceback (most recent call last):
[2m[36m(train_model pid=84633)[0m   File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 289, in run
[2m[36m(train_model pid=84633)[0m     self._entrypoint()
[2m[36m(train_model pid=84633)[0m   File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 362, in entrypoint
[2m[36m(train_model pid=84633)[0m     return self._trainable_func(
[2m[36m(train_model pid=84633)[0m   File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
[2m[36m(train_model pid=84633)[0m     return method(self, *_args, **_kwargs)
[2m[36m(train_model pid=84633)[0m   File "/Users/christy/miniforge3/env

The trial train_model_5a681_00002 errored with parameters={'model': LinearRegression(), 'location': 204}. Error file: /Users/christy/Documents/github_ray_temp/ray/doc/source/ray-air/examples/my_Tune_logs/batch_tuning/train_model_5a681_00002_2_location=204,model=LinearRegression_2022-11-09_20-58-46/error.txt


2022-11-09 21:00:46,743	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': DecisionTreeRegressor(max_depth=3)}


Trial train_model_5a681_00004 reported error=564.6305555555556,should_checkpoint=True with parameters={'model': DecisionTreeRegressor(max_depth=3), 'location': 145}.
Trial train_model_5a681_00004 completed. Last result: error=564.6305555555556,should_checkpoint=True
Trial train_model_5a681_00001 reported error=239.38212890625,should_checkpoint=True with parameters={'model': LinearRegression(), 'location': 152}.


2022-11-09 21:00:48,429	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': LinearRegression()}


Trial train_model_5a681_00001 completed. Last result: error=239.38212890625,should_checkpoint=True


2022-11-09 21:00:48,661	INFO tensorboardx.py:267 -- Removed the following hyperparameter values when logging to tensorboard: {'model': DecisionTreeRegressor(max_depth=3)}


Trial train_model_5a681_00005 reported error=318.76000000000005,should_checkpoint=True with parameters={'model': DecisionTreeRegressor(max_depth=3), 'location': 152}.
Trial train_model_5a681_00005 completed. Last result: error=318.76000000000005,should_checkpoint=True


2022-11-09 21:00:48,771	ERROR tune.py:754 -- Trials did not complete: [train_model_5a681_00002, train_model_5a681_00003, train_model_5a681_00006, train_model_5a681_00007]
2022-11-09 21:00:48,771	INFO tune.py:758 -- Total run time: 124.23 seconds (123.57 seconds for the tuning loop).


Total number of models: 8
TOTAL TIME TAKEN: 124.24 seconds
Best result: {'model': LinearRegression(), 'location': 152}


<br>

<b>After the Tune experiment has run, pick the best model per dropoff location. </b>

We can assemble the Tune results ([ResultGrid object](https://docs.ray.io/en/master/tune/examples/tune_analyze_results.html)) into a pandas dataframe, then sort by minimum error, to select the best model per dropoff location.

In [14]:
# get a list of training losses errors
errors = []
[errors.append(i.metrics.get('error',10000.0)) for i in results]

# get a list of checkpoints
checkpoints = []
[checkpoints.append(i.checkpoint) for i in results] 

# get a list of locations
locations = []
[locations.append(i.config['location']) for i in results]

# get a list of models
models = []
[models.append(i.config['model']) for i in results]

# Assemble a pandas dataframe from Tune results
results_df = pd.DataFrame(zip(locations, models, errors,checkpoints),
                          columns = ['location_id', 'model', 'error', 'checkpoint']
                         )
print(results_df.dtypes)
results_df.head()

location_id      int64
model           object
error          float64
checkpoint      object
dtype: object


Unnamed: 0,location_id,model,error,checkpoint
0,145,LinearRegression(),785.424622,Checkpoint(local_path=/Users/christy/Documents...
1,152,LinearRegression(),239.382129,Checkpoint(local_path=/Users/christy/Documents...
2,204,LinearRegression(),10000.0,
3,199,LinearRegression(),10000.0,
4,145,DecisionTreeRegressor(max_depth=3),564.630556,Checkpoint(local_path=/Users/christy/Documents...


In [15]:
# Keep only 1 model per location_id with minimum error
final_df = results_df.loc[results_df.groupby('location_id')['error'].idxmin()].copy()
final_df.sort_values(by=["error"], inplace=True)
final_df.set_index('location_id', inplace=True, drop=True)
print(final_df.dtypes)
final_df

model          object
error         float64
checkpoint     object
dtype: object


Unnamed: 0_level_0,model,error,checkpoint
location_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
152,LinearRegression(),239.382129,Checkpoint(local_path=/Users/christy/Documents...
145,DecisionTreeRegressor(max_depth=3),564.630556,Checkpoint(local_path=/Users/christy/Documents...
199,LinearRegression(),10000.0,
204,LinearRegression(),10000.0,


## Load a model from checkpoint and perform inference  <a class="anchor" id="load_checkpoint"></a>

```{tip}
[Ray AIR Predictors](https://docs.ray.io/en/latest/ray-air/predictors.html) make batch inference easy since they have internal logic to parallelize the inference.
```

However, in this notebook, we will just restore a single scikit-learn model directly from checkpoint, and demonstrate it can be used for inference.  

Below, we can easily obtain AIR Checkpoint objects from the Tune results. 

In [16]:
# Get a dropoff location
the_location = final_df.index[0]
the_location

152

In [17]:
# Get a checkpoint directly from Ray Tune results
checkpoint = final_df.checkpoint[the_location]
print(type(checkpoint))

# Restore a model from checkpoint
model = checkpoint.to_dict()['model']
print(type(model))

<class 'ray.air.checkpoint.Checkpoint'>
<class 'sklearn.linear_model._base.LinearRegression'>


In [18]:
# Create some test data
df_list = [read_data(f, the_location) for f in s3_files[:1]]   
df_raw = pd.concat(df_list, ignore_index=True)
df = transform_df(df_raw)

# Train/test split.
_, test_df = train_test_split(df, test_size=0.2, shuffle=True)
test_X = test_df[["passenger_count", "trip_distance"]]
test_y = np.array(test_df.trip_duration)  #actual values

In [19]:
# Perform inference using restored model from checkpoint
pred_y = model.predict(test_X)

# Zip together predictions and actuals to visualize
pd.DataFrame(zip(pred_y, test_y), 
             columns = ["pred_y", "trip_duration"])[0:10]

Unnamed: 0,pred_y,trip_duration
0,528.874817,542
1,1422.908569,1630
2,2177.249512,3157
3,417.120636,432
4,619.644043,710


<b>Compare validation and test error.</b>

During model training we reported error on "validation" data (random sample).  Below, we will report error on a pretend "test" data set (a different random sample).

Do a quick validation that both errors are reasonably close together.

In [20]:
# Evaluate restored model on test data.
error = sklearn.metrics.mean_absolute_error(test_y, pred_y)
print(f"Test error: {error}")

Test error: 261.0404846191406


In [21]:
# Compare test error with training validation error
print(f"Validation error: {final_df.error[the_location]}")

# Validation and test errors should be reasonably close together.

Validation error: 239.38212890625
