# Introduction to Batch Training with Ray Datasets

### Learning objectives
In this this tutorial, you will learn about:
 * [Ray Dataset](#dataset)
 * [Batch training functions](#train_func)
 * [Ray Tune](#tune)

Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting of simple models on multiple data batches corresponding to locations, products, etc. This notebook showcases how to conduct batch training using [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html), [Ray AIR Trainers](https://docs.ray.io/en/master/ray-air/trainer.html#air-trainers), and [Ray Tune](https://docs.ray.io/en/master/ray-air/tuner.html).

For the data, we will use the [NYC Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).  This popular tabular dataset contains historical taxi pickups by timestamp and location in NYC.  <s>The goal is to predict future, hourly taxi demand by location in NYC.</s>  To demonstrate batch training & tuning, we will simplify the data to a linear regression problem to predict `trip_duration` and use Scikit-learn.

To demonstrate how data and training can be batch-parallelized, we will train a separate model for each pickup location. This means we can use the pickup_location_id column in the dataset to group the dataset into data batches. Then we will fit a separate model for each batch. 

Let’s start by importing a few required libraries, including open-source [Ray](https://github.com/ray-project/ray) itself!

In [1]:
import os, time
import random
from typing import Tuple, List, Union, Optional, Callable
import pandas as pd
import numpy as np
import pyarrow.dataset as pds
from pyarrow import fs
from ray.data import Dataset
from ray.data.preprocessors import Chain, OrdinalEncoder, StandardScaler

num_available_cpus = os.cpu_count()
print(f'Number of CPUs in this system: {num_available_cpus}')

# import utility functions
import local_utils.dataprep

Number of CPUs in this system: 8


In [2]:
import ray
if ray.is_initialized():
    ray.shutdown()
ray.init(ignore_reinit_error=True)

2022-10-12 12:42:54,283	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at [1m[32mhttp://127.0.0.1:8266 [39m[22m


0,1
Python version:,3.8.13
Ray version:,2.0.0
Dashboard:,http://127.0.0.1:8266


In [3]:
# For benchmarking purposes, we can print the times of various operations. 
# In order to reduce clutter in the output, this is set to False by default.
PRINT_TIMES = True

def print_time(msg: str):
    if PRINT_TIMES:
        print(msg)

In [4]:
# To speed things up, we’ll only use a small subset of the full dataset consisting of two last months of 2019. 
# You can choose to use the full dataset for 2018-2019 by setting the SMOKE_TEST variable to False.

SMOKE_TEST = True

# Data <a class="anchor" id="dataset"></a>

Next, read some data using Ray Dataset.   This will initialize a Ray cluster.  Then we can use the [Ray Dataset](https://docs.ray.io/en/latest/data/getting-started.html#datasets-getting-started) APIs to quickly inspect the data.

In [5]:
# Define some global variables.
target = "trip_duration"
s3  = fs.S3FileSystem(region="us-east-2")
s3_partitions = pds.dataset("ursa-labs-taxi-data/", filesystem=s3, partitioning=["year", "month"])

if SMOKE_TEST:
    starting_idx = -1
    sample_locations = random.sample(list(local_utils.dataprep.location_ids), 3)
else:
    starting_idx = -3
    sample_locations = list(local_utils.dataprep.location_ids)

s3_files = [f"s3://{file}" for file in s3_partitions.files][starting_idx:]
print(f"NYC Taxi using {len(s3_files)} file(s)!")   
print(f"sample locations: {sample_locations}")


NYC Taxi using 1 file(s)!
sample locations: [55, 235, 198]


In [6]:
# Read some Parquet files in parallel.
rds = ray.data.read_parquet(s3_files)
print(type(rds))



<class 'ray.data.dataset.Dataset'>


In [7]:
# Parquet stores the number of rows per file in the Parquet metadata, 
# so we can get the number of rows in rds without triggering a full data read!
print(f"Number rows: {rds.count()}")

# Parquet pulls size-in-bytes from its metadata (not triggering a data read)
# This could be significantly different than actual in-memory size!
print(f"Size bytes (from parquet metadata): {rds.size_bytes()}")
# Trigger full reading of the dataset and inspect the size in bytes.
print(f"Size bytes (from full data read): {rds.fully_executed().size_bytes()}")

# Fetch the schema from the underlying Parquet metadata.
print("\nSchema data types:")
data_types = list(zip(rds.schema().names, rds.schema().types))
[print(f"{s[0]}: {s[1]}") for s in data_types]

# Take a peek at a sample row
print("\nLook at a sample row:")
rds.take(1)

# Number rows: 6941024

Number rows: 6941024
Size bytes (from parquet metadata): 602373955


Read progress: 100%|██████████████████████████████| 1/1 [01:23<00:00, 83.47s/it]

Size bytes (from full data read): 573504109

Schema data types:
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
rate_code_id: string
store_and_fwd_flag: string
pickup_location_id: int32
dropoff_location_id: int32
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
improvement_surcharge: float
total_amount: float
congestion_surcharge: float

Look at a sample row:





[ArrowRow({'vendor_id': '1',
           'pickup_at': datetime.datetime(2019, 6, 1, 0, 55, 13),
           'dropoff_at': datetime.datetime(2019, 6, 1, 0, 56, 17),
           'passenger_count': 1,
           'trip_distance': 0.0,
           'rate_code_id': '1',
           'store_and_fwd_flag': 'N',
           'pickup_location_id': 145,
           'dropoff_location_id': 145,
           'payment_type': '2',
           'fare_amount': 3.0,
           'extra': 0.5,
           'mta_tax': 0.5,
           'tip_amount': 0.0,
           'tolls_amount': 0.0,
           'improvement_surcharge': 0.30000001192092896,
           'total_amount': 4.300000190734863,
           'congestion_surcharge': 0.0})]

In [8]:
# # Q. Is there an easier way to get count distinct?

# # Num distinct pickup location_ids
# groupby_agg = rds.groupby("pickup_location_id").mean("trip_distance").take()
# num_location_id = len(groupby_agg)
# print(f"Count distinct pickup location ids: {num_location_id}")

<b>Filter on Read - Projection and Filter Pushdown</b>

Note that Ray Datasets' Parquet reader supports projection (column selection) and row filter pushdown, where we can push the above column selection and the row-based filter to the Parquet read. If we specify column selection at Parquet read time, the unselected columns won't even be read from disk!

The row-based filter is specified via [Arrow's dataset field expressions](https://arrow.apache.org/docs/6.0/python/generated/pyarrow.dataset.Expression.html#pyarrow.dataset.Expression). 

<b>Best practice is to filter as much as you can directly in the Ray Dataset read_parquet() statement.</b>


In [9]:
def pushdown_read_data(files_list: list,
                       sample_ids: list) -> Dataset:
    filter_expr = (
        (pds.field("passenger_count") > 0)
        & (pds.field("trip_distance") > 0)
        & (pds.field("fare_amount") > 0)
        & (~pds.field("pickup_location_id").isin([264, 265]))
        & (~pds.field("dropoff_location_id").isin([264, 265]))
        & (pds.field("pickup_location_id").isin(sample_ids))
    )

    the_dataset = ray.data.read_parquet(
        files_list,
        columns=[
            'pickup_at', 'dropoff_at', 
            'pickup_location_id', 'dropoff_location_id',
            'passenger_count', 'trip_distance', 'fare_amount'], 
        filter=filter_expr,
    )

    # Force full execution of both of the file reads.
    the_dataset = the_dataset.fully_executed()
    return the_dataset

In [10]:
# Test the pushdown_read_data function
pushdown_ds = pushdown_read_data(s3_files, sample_locations)

print(f"Number rows: {pushdown_ds.count()}")
# Display some metadata about the dataset.
print("\nMetadata: ")
print(pushdown_ds)
# Fetch the schema from the underlying Parquet metadata.
print("\nSchema:")
print(pushdown_ds.schema())
# Take a peek at a single row
print("\nLook at a sample row:")
pushdown_ds.take(1)


Read progress: 100%|██████████████████████████████| 1/1 [00:40<00:00, 40.32s/it]

Number rows: 737

Metadata: 
Dataset(num_blocks=1, num_rows=737, schema={pickup_at: timestamp[us], dropoff_at: timestamp[us], pickup_location_id: int32, dropoff_location_id: int32, passenger_count: int8, trip_distance: float, fare_amount: float})

Schema:
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
pickup_location_id: int32
dropoff_location_id: int32
passenger_count: int8
trip_distance: float
fare_amount: float
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2548

Look at a sample row:





[ArrowRow({'pickup_at': datetime.datetime(2019, 6, 1, 0, 19, 59),
           'dropoff_at': datetime.datetime(2019, 6, 1, 0, 21, 53),
           'pickup_location_id': 235,
           'dropoff_location_id': 243,
           'passenger_count': 1,
           'trip_distance': 0.6399999856948853,
           'fare_amount': 4.0})]

In [11]:
# check sampling
df = pushdown_ds.to_pandas(limit=pushdown_ds.count())
print(df[["pickup_location_id", "trip_distance"]].groupby("pickup_location_id").count())

# # How many ids in all the data?
# df = rds.to_pandas(limit=rds.count())
# print("\nCount distinct location_ids in original data")
# print(df[["pickup_location_id", "trip_distance"]].groupby("pickup_location_id").count().shape[0])
# # print(df[["pickup_location_id", "trip_distance"]].groupby("pickup_location_id").count())


                    trip_distance
pickup_location_id               
55                            196
198                           253
235                           288


<b>Custom data transform functions</b>

Ray Datasets allows you to specify custom data transform functions using familiar syntax, such as Pandas.  These <b>custom functions, or UDFs,</b> can be called using `rds.map_batches(my_UDF, batch_format="pandas")`.  It is necessary to specify the language you are using the `batch_format parameter`.

TODO: Reference link for syntax supported in Datasets UDFs <br>
TODO: Mention chaining UDFs using [BatchMapper](https://docs.ray.io/en/latest/ray-air/check-ingest.html) <br>
TODO: Add standard scaler step here

Normally there is some data exploration to determine the cleaning steps.  Let's just assume we know the data cleaning steps are:
- Drop negative trip distances, 0 fares, 0 passengers, less than 1min trip durations
- Drop 2 unknown zones ['264', '265']
- Calculate trip duration in minutes and add it as a new column

In [12]:
# A Pandas DataFrame UDF for transforming the underlying blocks of a Dataset in parallel.
def transform_batch(the_df: pd.DataFrame) -> pd.DataFrame:
    df = the_df.copy()    
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds    
    df = df[df["trip_duration"] > 60]    
    df.drop(["dropoff_at", "pickup_at", "dropoff_location_id"], axis=1, inplace=True)
    df['pickup_location_id'] = df['pickup_location_id'].fillna(-1)
    return df

In [13]:
# Test the transform UDF function
print(f"Before transform number rows: {pushdown_ds.count()}")

# batch_format="pandas" tells Datasets to provide the transformer with blocks
# represented as Pandas DataFrames.
pushdown_ds = pushdown_ds.map_batches(transform_batch, batch_format="pandas")

# verify row count
pushdown_rows = pushdown_ds.count()
print(f"After transform number rows: {pushdown_rows}")

# Looks good. Replace ds with pushdown
rds = pushdown_ds


Before transform number rows: 737


Map_Batches: 100%|████████████████████████████████| 1/1 [00:00<00:00,  7.13it/s]

After transform number rows: 721





<b>Random shuffle</b>

Randomly shuffling data is an important part of training machine learning models: it decorrelates samples, preventing overfitting and improving generalization. For many models, even between-epoch shuffling can drastically improve the precision gain per step/epoch. Datasets has a hyper-scalable distributed random shuffle that allows you to realize the model accuracy benefits of per-epoch shuffling without sacrificing training throughput, even at large data scales and even when doing distributed data-parallel training across multiple GPUs/nodes.

In [14]:
# do a full global random shuffle to decorrelate the data
rds = rds.random_shuffle()

Shuffle Map: 100%|████████████████████████████████| 1/1 [00:00<00:00, 49.06it/s]
Shuffle Reduce: 100%|████████████████████████████| 1/1 [00:00<00:00, 150.67it/s]


In [15]:
# delete data to free up memory in our Ray cluster
del rds
del pushdown_ds

<b>Tidying up</b>

To make our code easier to read, let's summarize the data processing functions again here.

In [16]:
def pushdown_read_data(files_list: list,
                       sample_ids: list) -> Dataset:
    
    start = time.time()
    
    filter_expr = (
        (pds.field("passenger_count") > 0)
        & (pds.field("trip_distance") > 0)
        & (pds.field("fare_amount") > 0)
        & (~pds.field("pickup_location_id").isin([264, 265]))
        & (~pds.field("dropoff_location_id").isin([264, 265]))
        & (pds.field("pickup_location_id").isin(sample_ids))
    )

    the_dataset = ray.data.read_parquet(
        files_list,
        columns=[
            'pickup_at', 'dropoff_at', 
            'pickup_location_id', 'dropoff_location_id',
            'passenger_count', 'trip_distance', 'fare_amount'], 
        filter=filter_expr,
    )

    # Force full execution of both of the file reads.
    the_dataset = the_dataset.fully_executed()
    
    data_loading_time = time.time() - start
    print_time(f"Data loading time: {data_loading_time:.2f} seconds")
    return the_dataset

# A Pandas DataFrame UDF for transforming the underlying blocks of a Dataset in parallel.
def transform_batch(the_df: pd.DataFrame) -> pd.DataFrame:
    start = time.time()
    
    df = the_df.copy()    
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds    
    df = df[df["trip_duration"] > 60]    
    df.drop(["dropoff_at", "pickup_at", "dropoff_location_id"], axis=1, inplace=True)
    df['pickup_location_id'] = df['pickup_location_id'].fillna(-1)
    
    data_transform_time = time.time() - start
    # print_time(f"Data transform time: {data_transform_time:.2f} seconds")
    return df

# Define batch training functions <a class="anchor" id="train_func"></a>

Now that we've learned more about our data and we have cleaned our data, we now look at how we can feed this dataset into some model trainers.

In [17]:
# Q. scikit-learn is supposed to be pre-installed, but might have to run this in terminal !?
# python3 -m pip install scikit-learn

In [18]:
import sklearn
from sklearn.base import BaseEstimator 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from ray.train.sklearn import SklearnTrainer, SklearnPredictor
from ray.train.batch_predictor import BatchPredictor

<b>Define training functions</b>

- TODO: double-check Core batch example to make sure using same metrics!
- TODO: Add more explanations here for each function.

We define a `fit_and_score_sklearn` actor, where each Scikit-learn training task will consume a dataset shard in batches.


In [19]:
# Ray task to fit and score a scikit-learn model.
@ray.remote
def fit_and_score_sklearn(
    train_df: pd.DataFrame, test_df: pd.DataFrame, model: BaseEstimator
) -> Tuple[BaseEstimator, float]:
    
    # Assemble train/test pandas dfs
    train_X = train_df[["passenger_count", "trip_distance", "fare_amount"]]
    train_y = train_df.trip_duration
    test_X = test_df[["passenger_count", "trip_distance", "fare_amount"]]
    test_y = test_df.trip_duration
    
    # Start training.
    model = model.fit(train_X, train_y)
    pred_y = model.predict(test_X)
    error = sklearn.metrics.mean_absolute_error(test_y, pred_y)
    
    return str(model), error

def train_and_evaluate(
    the_df: pd.DataFrame, 
    models: List[BaseEstimator]
) -> List[Tuple[BaseEstimator, float]]:
    
    # check if input df is big enough for training
    if len(the_df) < 4:
        print(f"Dataframe for LocID: {i} is empty or smaller than 4")
        return None
    else:
        loc_id = the_df.pickup_location_id[0]
        # print(f"Processing location {loc_id}...")
    
    start = time.time()

    # Train / test split
    # Randomly split the data into 80/20 train/test.
    train_df, test_df = train_test_split(the_df, test_size=0.2)
    
    # We put the train & test dataframes into Ray object store
    # so that they can be reused by all models fitted here.
    # https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html#tip-3-avoid-passing-same-object-repeatedly-to-remote-tasks
    train_ref = ray.put(train_df)
    test_ref = ray.put(test_df)

    # Launch a fit and score task for each model.
    results = ray.get(
        [fit_and_score_sklearn.remote(train_ref, test_ref, model) for model in models]
    )
    # results.sort(key=lambda x: x[1])  # sort by error
    
    # Assemble name of model and metrics in a pandas DataFrame
    results = [loc_id] + list(results[0])
    results_return = pd.DataFrame(columns=['location_id', 'model', 'error'])
    results_return.loc[0] = results

    training_time = time.time() - start
    print_time(f"Training time for LocID {loc_id}: {training_time:.2f} seconds")
    
    return results_return

def agg_func(the_df: pd.DataFrame):
    
    models = [LinearRegression()]
    
    # transform the input pandas AND fit_and_evaluate the transformed pandas
    ret = train_and_evaluate(transform_batch(the_df), models)
    
    # print(f"agg_func returned type: {type(ret)}")
    return ret
    

<b>Main driver code</b>

During groupby, each grouped dataset can be mapped to a custom aggregation function.  Using the pattern [groupby-map_groups(agg_func, "pands")](https://docs.ray.io/en/latest/data/api/grouped_dataset.html).  This implements an accumulator-based aggregation.  Similar to Ray Datasets UDFs, which you learned about in the `Data` section earlier in this notebook, you can write custom aggregation functions using familiar syntax, such as Pandas. It is necessary to specify the language you are using the `batch_format` parameter.

See the main driver code below for an example how `map_groups` is used with Ray Dataset to batch transform-train-fit in parallel separate shards of data.  

In [21]:
# Driver code to run this.

SMOKE_TEST = True
if SMOKE_TEST:
    starting_idx = -1
    sample_locations = random.sample(list(local_utils.dataprep.location_ids), 3)
else:
    starting_idx = -3
    sample_locations = list(local_utils.dataprep.location_ids)

s3_files = [f"s3://{file}" for file in s3_partitions.files][starting_idx:]
print(f"NYC Taxi using {len(s3_files)} file(s)!")   
print(f"sample locations: {sample_locations}")

start = time.time()

# Read data into Ray Dataset
rds = pushdown_read_data(s3_files, sample_locations)

# Do a full global random shuffle to decorrelate the data
rds = rds.random_shuffle()

# This returns a Ray Datset
results = rds.groupby("pickup_location_id").map_groups(
            agg_func, batch_format="pandas")
print(f"groupby.map_groups() returned type: {type(results)}")

total_time_taken = time.time() - start
print(f"Total number of models: {len(sample_locations)}")
print_time(f"TOTAL TIME TAKEN: {total_time_taken:.2f} seconds")



NYC Taxi using 1 file(s)!
sample locations: [233, 39, 85]


Read progress: 100%|██████████████████████████████| 1/1 [00:36<00:00, 36.17s/it]


Data loading time: 41.24 seconds


Shuffle Map: 100%|████████████████████████████████| 1/1 [00:00<00:00, 18.82it/s]
Shuffle Reduce: 100%|█████████████████████████████| 1/1 [00:00<00:00, 93.93it/s]
Sort Sample: 100%|████████████████████████████████| 1/1 [00:00<00:00, 82.47it/s]
Shuffle Map: 100%|████████████████████████████████| 1/1 [00:00<00:00, 62.90it/s]
Shuffle Reduce: 100%|█████████████████████████████| 1/1 [00:00<00:00, 64.07it/s]
Map_Batches: 100%|████████████████████████████████| 1/1 [00:04<00:00,  4.94s/it]

groupby.map_groups() returned type: <class 'ray.data.dataset.Dataset'>
Total number of models: 3
TOTAL TIME TAKEN: 46.35 seconds





[2m[36m(_map_block_nosplit pid=99553)[0m Training time for LocID 39: 0.60 seconds

In [22]:
# Sort results ascending by error

print(type(results))

# sort values by ascending error
results_df = results.to_pandas(limit=results.count())
results_df.sort_values(by=["error"], ascending=True, inplace=True)
results_df


[2m[36m(_map_block_nosplit pid=99553)[0m Training time for LocID 85: 0.00 seconds
[2m[36m(_map_block_nosplit pid=99553)[0m Training time for LocID 233: 0.02 seconds
<class 'ray.data.dataset.Dataset'>


Unnamed: 0,location_id,model,error
0,39,LinearRegression(),426.493309
2,233,LinearRegression(),466.17403
1,85,LinearRegression(),529.220534


<b>Main driver code, running on all the data</b>

The Smoke test worked, so now let us run the main driver code again, to batch train every location_id in parallel, with all the data files this time!


In [23]:
# Driver code to run this.

SMOKE_TEST = False
if SMOKE_TEST:
    starting_idx = -1
    sample_locations = random.sample(list(local_utils.dataprep.location_ids), 3)
else:
    starting_idx = -3
    sample_locations = list(local_utils.dataprep.location_ids)

s3_files = [f"s3://{file}" for file in s3_partitions.files][starting_idx:]
print(f"NYC Taxi using {len(s3_files)} file(s)!")   
print(f"sample locations: {sample_locations}")

start = time.time()

# Read data into Ray Dataset
rds = pushdown_read_data(s3_files, sample_locations)

# Do a full global random shuffle to decorrelate the data
rds = rds.random_shuffle()

# This returns a Ray Datset
results = rds.groupby("pickup_location_id").map_groups(
            agg_func, batch_format="pandas")
print(f"groupby.map_groups() returned type: {type(results)}")

total_time_taken = time.time() - start
print(f"Total number of models: {len(sample_locations)}")
print_time(f"TOTAL TIME TAKEN: {total_time_taken:.2f} seconds")

NYC Taxi using 3 file(s)!
sample locations: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 105, 106, 107, 108, 109, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217

Read progress: 100%|██████████████████████████████| 3/3 [01:28<00:00, 29.55s/it]


Data loading time: 95.47 seconds


Shuffle Map: 100%|████████████████████████████████| 3/3 [00:04<00:00,  1.48s/it]
Shuffle Reduce: 100%|█████████████████████████████| 3/3 [00:11<00:00,  3.87s/it]
Sort Sample: 100%|████████████████████████████████| 3/3 [00:00<00:00,  3.62it/s]
Shuffle Map: 100%|████████████████████████████████| 3/3 [00:06<00:00,  2.19s/it]
Shuffle Reduce:  67%|███████████████████▎         | 2/3 [00:01<00:01,  1.12s/it][2m[36m(raylet)[0m Spilled 2350 MiB, 21 objects, write throughput 205 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
Shuffle Reduce: 100%|█████████████████████████████| 3/3 [00:03<00:00,  1.07s/it]
Map_Batches:  33%|██████████▋                     | 1/3 [00:20<00:40, 20.45s/it]

[2m[36m(_map_block_nosplit pid=99548)[0m Training time for LocID 263: 1.02 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 186: 0.35 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 187: 0.01 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 188: 0.02 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 189: 0.01 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 190: 0.01 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 191: 0.01 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 192: 0.01 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 193: 0.01 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 194: 0.01 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 195: 0.01 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training t

Map_Batches:  67%|████████████████████▋          | 2/3 [05:25<03:08, 188.07s/it]

[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 261: 0.04 seconds
[2m[36m(_map_block_nosplit pid=99552)[0m Training time for LocID 262: 0.03 seconds


[2m[36m(_map_block_nosplit pid=99553)[0m 2022-10-12 12:58:00,840	INFO worker.py:756 -- Task failed with retryable exception: TaskID(6efb86ef2d286c40ffffffffffffffffffffffff01000000).
[2m[36m(_map_block_nosplit pid=99553)[0m Traceback (most recent call last):
[2m[36m(_map_block_nosplit pid=99553)[0m   File "/Users/christy/miniforge3/envs/rllib/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3629, in get_loc
[2m[36m(_map_block_nosplit pid=99553)[0m     return self._engine.get_loc(casted_key)
[2m[36m(_map_block_nosplit pid=99553)[0m   File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
[2m[36m(_map_block_nosplit pid=99553)[0m   File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
[2m[36m(_map_block_nosplit pid=99553)[0m   File "pandas/_libs/hashtable_class_helper.pxi", line 2131, in pandas._libs.hashtable.Int64HashTable.get_item
[2m[36m(_map_block_nosplit pid=99553)[0m   File "pandas/_libs/

KeyboardInterrupt: 

In [24]:
# Sort results ascending by error

print(type(results))

# sort values by ascending error
results_df = results.to_pandas(limit=results.count())
results_df.sort_values(by=["error"], ascending=True, inplace=True)
results_df

<class 'ray.data.dataset.Dataset'>


Unnamed: 0,location_id,model,error
0,39,LinearRegression(),426.493309
2,233,LinearRegression(),466.17403
1,85,LinearRegression(),529.220534
