# NYC Taxi Example Playground
***
This is the notebook where Emmy tests out her NYC Taxi code before taking the cleaned-up bits for the Intro to Ray AIR tutorial.

In [8]:
# import your packages
import ray

if ray.is_initialized():
    ray.shutdown()

ray.init()

2022-10-21 11:04:01,341	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265


Python version: 3.8.13
Ray version: 2.0.0
Dashboard: http://127.0.0.1:8265


In [9]:
# we use the June 2021 dataset for training and the June 2022 dataset for batch inference later
dataset = ray.data.read_parquet("data/nyc_taxi_2021.parquet")

# split data into training and validation subsets
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
# valid_dataset = valid_dataset.drop_columns(["is_big_tip"])

# repartition the dataset for more parallelism (repartition returns a new dataset, so reassign it)
# train_dataset = train_dataset.repartition(100)
# valid_dataset = valid_dataset.repartition(100)

Read progress: 100%|██████████| 1/1 [00:00<00:00,  6.30it/s]


In [10]:
train_dataset.take(2)

[ArrowRow({'passenger_count': 1.0,
           'trip_distance': 0.9,
           'fare_amount': 5.0,
           'trip_duration': 228,
           'hour': 0,
           'day_of_week': 1,
           'is_big_tip': True,
           '__index_level_0__': 0}),
 ArrowRow({'passenger_count': 1.0,
           'trip_distance': 23.0,
           'fare_amount': 61.5,
           'trip_duration': 2081,
           'hour': 0,
           'day_of_week': 1,
           'is_big_tip': False,
           '__index_level_0__': 1})]

In [11]:
valid_dataset.take(2)

[ArrowRow({'passenger_count': 1.0,
           'trip_distance': 1.2,
           'fare_amount': 8.5,
           'trip_duration': 611,
           'hour': 12,
           'day_of_week': 1,
           'is_big_tip': False,
           '__index_level_0__': 1897262}),
 ArrowRow({'passenger_count': 1.0,
           'trip_distance': 1.4,
           'fare_amount': 6.5,
           'trip_duration': 351,
           'hour': 12,
           'day_of_week': 1,
           'is_big_tip': False,
           '__index_level_0__': 1897263})]
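
The `__index_level_0__` values above (0 and 1 in the training rows, 1897262 and 1897263 in the validation rows) suggest the split kept the original row order instead of shuffling. Here is a minimal pure-Python sketch of that kind of split. The helper `ordered_train_test_split` is hypothetical and only illustrates the idea; it is not Ray's implementation.

```python
def ordered_train_test_split(rows, test_size):
    """Split rows without shuffling: the last `test_size` fraction is held out."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be a fraction strictly between 0 and 1")
    n_test = round(len(rows) * test_size)   # number of held-out rows
    split_at = len(rows) - n_test           # index where the held-out block starts
    return rows[:split_at], rows[split_at:]


# toy stand-in for the dataset: ten row indices
rows = list(range(10))
train_rows, valid_rows = ordered_train_test_split(rows, test_size=0.3)
print(train_rows)  # [0, 1, 2, 3, 4, 5, 6]
print(valid_rows)  # [7, 8, 9]
```

If the file is sorted by pickup time, an unshuffled split means the validation rows all come from later in the month, which may or may not be what we want.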

Before picking a scaler, we might want to inspect both "trip_distance" and "trip_duration" to see whether they're approximately normal, and whether StandardScaler is the right choice.
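
One rough way to check this is to look at skewness: a value near 0 is consistent with a symmetric, bell-like shape, while a large positive value points to a long right tail. The sketch below is pure Python, and the helper `skewness` is a hypothetical name. The four durations are just the rows printed above, so this only shows the mechanics; a real check would need a much larger sample from the dataset.

```python
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0  # a constant column has no skew
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# trip_duration values from the four sample rows shown above
durations = [228, 2081, 611, 351]
duration_skew = skewness(durations)
print(duration_skew > 0)  # True: the 2081 second trip pulls the tail to the right
```

A histogram of a larger sample would tell us more, but a strongly positive skew is already a hint that plain standardization might not suit these columns.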

In [12]:
# we're going to use MinMaxScaler because we aren't sure what the data looks like.
# this scales each column by its range, but maybe we want to cut off some really long trip durations and distances?

from ray.data.preprocessors import MinMaxScaler

# create a preprocessor to scale some columns
preprocessor = MinMaxScaler(columns=["trip_distance", "trip_duration"])
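
To make the "scales each column by its range" comment concrete, here is a minimal pure-Python sketch of min-max scaling, `(x - min) / (max - min)`, applied to the four `trip_distance` values printed earlier. The helpers `min_max_scale` and `clip` are hypothetical names, and the 5.0 mile cap is an arbitrary value chosen only for illustration; none of this is Ray's own code.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using the column's own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, cap):
    """Cut off anything above `cap` before scaling."""
    return [min(v, cap) for v in values]


# trip_distance values from the four sample rows shown earlier
distances = [0.9, 23.0, 1.2, 1.4]

scaled = min_max_scale(distances)
print([round(s, 3) for s in scaled])  # roughly [0.0, 1.0, 0.014, 0.023]

# hypothetical cap of 5.0 miles, picked only for this example
scaled_clipped = min_max_scale(clip(distances, cap=5.0))
print([round(s, 3) for s in scaled_clipped])  # roughly [0.0, 1.0, 0.073, 0.122]
```

The single 23 mile trip squeezes the short trips into the bottom few percent of the range, which is the reason clipping long trips might be worth trying.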

One thing to change: the code in the snippet shows a TorchTrainer, which is maybe not what we're going for here. Also, we've now arrived at code we don't fully understand yet, so it needs to be broken down piece by piece.

In [13]:
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

trainer = XGBoostTrainer(
    label_column="is_big_tip",
    params={"objective": "binary:logistic", "eval_metric": ["logloss", "error"], "tree_method": "approx"},
    scaling_config=ScalingConfig(num_workers=6),
    datasets={"train": train_dataset, "valid": valid_dataset},
    preprocessor=preprocessor,
    num_boost_round=10
)

In [14]:
result = trainer.fit()

Trial name,status,loc,iter,total time (s),train-logloss,train-error,valid-logloss
XGBoostTrainer_cac52_00000,TERMINATED,127.0.0.1:74009,11,8.31948,0.659297,0.390496,0.659948


[2m[36m(_RemoteRayXGBoostActor pid=74046)[0m [11:04:30] task [xgboost.ray]:5034585488 got new rank 0
[2m[36m(_RemoteRayXGBoostActor pid=74050)[0m [11:04:30] task [xgboost.ray]:5002440080 got new rank 5
[2m[36m(_RemoteRayXGBoostActor pid=74049)[0m [11:04:30] task [xgboost.ray]:6098730288 got new rank 3
[2m[36m(_RemoteRayXGBoostActor pid=74048)[0m [11:04:30] task [xgboost.ray]:5441531232 got new rank 1
[2m[36m(_RemoteRayXGBoostActor pid=74051)[0m [11:04:30] task [xgboost.ray]:6006983840 got new rank 4
[2m[36m(_RemoteRayXGBoostActor pid=74047)[0m [11:04:30] task [xgboost.ray]:5178948912 got new rank 2


Result for XGBoostTrainer_cac52_00000:
  date: 2022-10-21_11-04-31
  done: false
  experiment_id: 43fce90a3459450b9f074732872d4fa2
  hostname: Juless-MacBook-Pro-16
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  pid: 74009
  time_since_restore: 7.165406942367554
  time_this_iter_s: 7.165406942367554
  time_total_s: 7.165406942367554
  timestamp: 1666375471
  timesteps_since_restore: 0
  train-error: 0.3931747254854014
  train-logloss: 0.6777983466318428
  training_iteration: 1
  trial_id: cac52_00000
  valid-error: 0.3921589407890845
  valid-logloss: 0.6778343589824566
  warmup_time: 0.002437114715576172
  
Result for XGBoostTrainer_cac52_00000:
  date: 2022-10-21_11-04-32
  done: true
  experiment_id: 43fce90a3459450b9f074732872d4fa2
  experiment_tag: '0'
  hostname: Juless-MacBook-Pro-16
  iterations_since_restore: 11
  node_ip: 127.0.0.1
  pid: 74009
  time_since_restore: 8.319478034973145
  time_this_iter_s: 0.24701213836669922
  time_total_s: 8.319478034973145
  timestamp: 1

2022-10-21 11:04:32,771	INFO tune.py:758 -- Total run time: 9.91 seconds (9.80 seconds for the tuning loop).
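
To help read the `logloss` and `error` numbers in the output above, here is a minimal pure-Python sketch of both metrics for a binary label. The helpers `log_loss` and `error_rate` are hypothetical names written for this notebook, not XGBoost functions. The labels are the `is_big_tip` values from the four sample rows, and the 0.5 predictions stand in for a model that has learned nothing yet.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0 and 1
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def error_rate(labels, probs, threshold=0.5):
    """Fraction of rows misclassified when predictions above `threshold` count as positive."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != bool(y))
    return wrong / len(labels)


# is_big_tip from the four sample rows: True, False, False, False
labels = [1, 0, 0, 0]
uninformed = [0.5, 0.5, 0.5, 0.5]  # a model that always predicts 0.5

baseline_loss = log_loss(labels, uninformed)
baseline_error = error_rate(labels, uninformed)
print(round(baseline_loss, 4))  # 0.6931, i.e. ln(2)
print(baseline_error)           # 0.25
```

For comparison, the first reported `train-logloss` of about 0.678 sits just under ln(2) ≈ 0.693, so after one boosting round the model is only slightly better than always guessing 0.5.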
