# NYC Taxi Example Playground
***
This is the notebook where Emmy tests out her NYC Taxi code before taking the cleaned-up bits for the tutorial on Intro to Ray AIR.

In [1]:
# import your packages
import ray
import pandas as pd

# is_initialized is a function; without the call the check is always truthy
if ray.is_initialized():
    ray.shutdown()

ray.init()

2022-10-26 09:24:18,252	INFO worker.py:1518 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265


Python version: 3.8.13
Ray version: 3.0.0.dev0
Dashboard: http://127.0.0.1:8265


In [7]:
df = pd.read_parquet("data/nyc_taxi_2021.parquet")
dataset = ray.data.from_pandas(df)

In [3]:
# we use the June 2021 dataset for training and the June 2022 dataset for batch inference later
dataset = ray.data.read_parquet("data/nyc_taxi_2021.parquet")

# split data into training and validation subsets
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
#valid_dataset = valid_dataset.drop_columns(["is_big_tip"])

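# repartition the dataset for maximum parallelism
# (repartition returns a new dataset, so the result has to be reassigned)
# train_dataset = train_dataset.repartition(100)
# valid_dataset = valid_dataset.repartition(100)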

Parquet Files Sample: 100%|██████████| 1/1 [00:00<00:00,  6.50it/s]
Read progress: 100%|██████████| 1/1 [00:00<00:00,  7.89it/s]


In [9]:
train_dataset.take(2)

[PandasRow({'passenger_count': 1.0,
            'trip_distance': 0.9,
            'fare_amount': 5.0,
            'trip_duration': 228,
            'hour': 0,
            'day_of_week': 1,
            'is_big_tip': True}),
 PandasRow({'passenger_count': 1.0,
            'trip_distance': 23.0,
            'fare_amount': 61.5,
            'trip_duration': 2081,
            'hour': 0,
            'day_of_week': 1,
            'is_big_tip': False})]

In [10]:
valid_dataset.take(2)

[PandasRow({'passenger_count': 1.0,
            'trip_distance': 1.2,
            'fare_amount': 8.5,
            'trip_duration': 611,
            'hour': 12,
            'day_of_week': 1,
            'is_big_tip': False}),
 PandasRow({'passenger_count': 1.0,
            'trip_distance': 1.4,
            'fare_amount': 6.5,
            'trip_duration': 351,
            'hour': 12,
            'day_of_week': 1,
            'is_big_tip': False})]
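The two samples above differ in `hour` (0 for the training rows, 12 for the validation rows), which is what a positional, unshuffled split of time-ordered data would produce. The sketch below is plain Python, not Ray's implementation: `positional_split` and `shuffled_split` are hypothetical helpers, and the toy `hours` list is a made-up stand-in for the real column. Whether `train_test_split` behaves exactly this way is an assumption worth checking against the Ray documentation, which describes a `shuffle` argument.

```python
import random

def positional_split(rows, test_size):
    """Split rows in order: the head becomes train, the tail becomes test."""
    n_test = round(len(rows) * test_size)
    split_at = len(rows) - n_test
    return rows[:split_at], rows[split_at:]

def shuffled_split(rows, test_size, seed=0):
    """Shuffle a copy of the rows first, then split positionally."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return positional_split(shuffled, test_size)

# Made-up, time-ordered "hour" values (not real taxi data).
hours = [0, 0, 1, 3, 5, 8, 9, 12, 12, 13]

train_hours, valid_hours = positional_split(hours, test_size=0.3)
print(train_hours)  # early hours only
print(valid_hours)  # late hours only

mixed_train, mixed_valid = shuffled_split(hours, test_size=0.3)
print(mixed_train, mixed_valid)
```

If the validation set only covers the end of the period, validation metrics describe late-period trips rather than the whole month, so shuffling before splitting may be worth trying.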

Something we might want to do is inspect both "trip_distance" and "trip_duration" to see whether they're approximately normal, and whether StandardScaler is the right choice.
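One quick way to do that inspection is to look at sample skewness: a value near 0 suggests a roughly symmetric column, while a large positive value points to a long right tail. This is a minimal sketch in plain Python so it runs without the dataset; `sample_skewness` is a hypothetical helper and the lists are made-up values, not real taxi data.

```python
import statistics

def sample_skewness(values):
    """Return the (biased) sample skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0  # a constant column has no skew
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])

# Made-up examples: a symmetric column and one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

On the real data the same helper could be called as `sample_skewness(df["trip_duration"].tolist())`, assuming `df` is the pandas frame loaded earlier in this notebook.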

In [11]:
# we're going to use MinMaxScaler because we aren't sure what the data looks like.
# this scales each column by its range, but maybe we want to cut off some really long trip durations and distances?

from ray.data.preprocessors import MinMaxScaler

# create a preprocessor to scale some columns
preprocessor = MinMaxScaler(columns=["trip_distance", "trip_duration"])
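To see why cutting off very long trips might matter, here is a minimal sketch of the standard min-max formula, `(x - min) / (max - min)`, with and without an upper cap. It is plain Python rather than the Ray preprocessor; `min_max_scale` and `clip_upper` are hypothetical helpers, and both the durations and the cap of 400 seconds are made-up numbers chosen only for illustration.

```python
def min_max_scale(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid dividing by zero on a constant column
    return [(v - lo) / (hi - lo) for v in values]

def clip_upper(values, cap):
    """Replace every value above cap with cap."""
    return [min(v, cap) for v in values]

# Made-up trip durations in seconds; the last one is an extreme outlier.
durations = [100, 200, 300, 10000]

scaled = min_max_scale(durations)
scaled_after_clip = min_max_scale(clip_upper(durations, cap=400))

print(scaled)             # the three ordinary trips are squeezed near 0
print(scaled_after_clip)  # the ordinary trips spread across the range
```

A single extreme value sets the range, so without clipping most rows end up almost indistinguishable after scaling.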

Something to change: the code in the snippet shows a TorchTrainer, which is maybe not what we're going for here. We've also now arrived at the code that we don't understand, so it needs to be broken down.

In [12]:
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

trainer = XGBoostTrainer(
    label_column="is_big_tip",
    params={"objective": "binary:logistic", "eval_metric": ["logloss", "error"], "tree_method": "approx"},
    scaling_config=ScalingConfig(num_workers=6),
    datasets={"train": train_dataset, "valid": valid_dataset},
    preprocessor=preprocessor,
    num_boost_round=10
)
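To break down the `params` in the cell above: with the `binary:logistic` objective, XGBoost outputs a probability for the positive class (here, `is_big_tip`), and the two `eval_metric` entries score those probabilities. `logloss` is the average negative log-likelihood, and `error` is the fraction of rows misclassified when probabilities above 0.5 count as positive. The sketch below recomputes both by hand in plain Python; `log_loss` and `error_rate` are hypothetical helpers, not XGBoost functions, and the labels and probabilities are made up.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def error_rate(labels, probs, threshold=0.5):
    """Fraction of rows where the thresholded prediction disagrees with the label."""
    wrong = sum((p > threshold) != bool(y) for y, p in zip(labels, probs))
    return wrong / len(labels)

# Made-up labels and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

print(log_loss(labels, probs))
print(error_rate(labels, probs))

# Baseline: always predicting 0.5 gives a log loss of ln(2).
print(log_loss([1, 0], [0.5, 0.5]))
```

The baseline is a useful yardstick: predicting 0.5 for every row gives a log loss of ln 2 ≈ 0.693, so the train-logloss of 0.659297 recorded in this notebook is only slightly better than that.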

In [13]:
result = trainer.fit()

Current time: 2022-10-26 07:41:06
Running for: 00:00:10.22
Memory: 25.6/64.0 GiB

Trial name,status,loc,iter,total time (s),train-logloss,train-error,valid-logloss
XGBoostTrainer_32c5c_00000,TERMINATED,127.0.0.1:45528,11,8.69959,0.659297,0.390496,0.659948


[2m[36m(_RemoteRayXGBoostActor pid=45595)[0m [07:41:03] task [xgboost.ray]:4824755216 got new rank 4
[2m[36m(_RemoteRayXGBoostActor pid=45591)[0m [07:41:03] task [xgboost.ray]:5289667792 got new rank 1
[2m[36m(_RemoteRayXGBoostActor pid=45593)[0m [07:41:03] task [xgboost.ray]:5021363408 got new rank 2
[2m[36m(_RemoteRayXGBoostActor pid=45592)[0m [07:41:03] task [xgboost.ray]:5185596528 got new rank 0
[2m[36m(_RemoteRayXGBoostActor pid=45596)[0m [07:41:03] task [xgboost.ray]:4830604400 got new rank 5
[2m[36m(_RemoteRayXGBoostActor pid=45594)[0m [07:41:03] task [xgboost.ray]:4949552432 got new rank 3


Trial name,date,done,episodes_total,experiment_id,experiment_tag,hostname,iterations_since_restore,node_ip,pid,should_checkpoint,time_since_restore,time_this_iter_s,time_total_s,timestamp,timesteps_since_restore,timesteps_total,train-error,train-logloss,training_iteration,trial_id,valid-error,valid-logloss,warmup_time
XGBoostTrainer_32c5c_00000,2022-10-26_07-41-06,True,,7513e541876c4296aa9fb35d14e49e45,0,Juless-MacBook-Pro-16,11,127.0.0.1,45528,True,8.69959,0.694017,8.69959,1666795266,0,,0.390496,0.659297,11,32c5c_00000,0.390153,0.659948,0.00540495


2022-10-26 07:41:06,296	INFO tune.py:787 -- Total run time: 10.64 seconds (10.21 seconds for the tuning loop).
