In [1]:
import ray
from ray.air.config import ScalingConfig
from ray.data.preprocessors import MinMaxScaler
from ray.train.xgboost import XGBoostTrainer

### Initialize Ray runtime

In [2]:
ray.init()

2024-05-31 13:38:45,086	INFO worker.py:1740 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


Python version: 3.10.8
Ray version: 2.23.0
Dashboard: http://127.0.0.1:8265


(raylet) [2024-05-31 13:38:55,004 E 54985 21150980] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-05-31_13-38-41_897327_54971 is over 95% full, available space: 2525532160; capacity: 245107195904. Object creation will fail if spilling is required.
(raylet) bash: /Users/brendan/Desktop: is a directory
(raylet) bash: line 0: exec: /Users/brendan/Desktop: cannot execute: Undefined error: 0
(raylet) [2024-05-31 13:39:05,101 E 54985 21150980] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-05-31_13-38-41_897327_54971 is over 95% full, available space: 2707894272; capacity: 245107195904. Object creation will fail if spilling is required.
(raylet) [2024-05-31 13:39:15,200 E 54985 21150980] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-05-31_13-38-41_897327_54971 is over 95% full, available space: 2708815872; capacity: 245107195904. Object creation will fail if spilling is required.
(raylet) [2024-05-31 

### Load and prepare data with Ray Datasets

#### Read Parquet file to Ray dataset

In [3]:
dataset = ray.data.read_parquet('yellow_tripdata_2021-01.parquet')


The returned `dataset` is a Ray Dataset, the standard way to load and exchange data in Ray AIR.
In AIR, datasets are used extensively for data loading and transformation.

#### Split data into training and validation subsets

In [4]:
train_dataset, valid_dataset = dataset.train_test_split(test_size=.3)


#### Split datasets into blocks for parallel preprocessing

In [5]:
train_dataset = train_dataset.repartition(num_blocks=3)
valid_dataset = valid_dataset.repartition(num_blocks=3)

`num_blocks` should be lower than the number of CPU cores in the cluster.
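As a rough illustration of that rule of thumb, the sketch below derives a block count from a CPU count. The helper `pick_num_blocks` and its `reserve` parameter are hypothetical names introduced here, and `os.cpu_count()` only reports the cores of the local machine; on a multi-node cluster the total would have to come from the cluster itself (for example from `ray.cluster_resources()`).

```python
import os


def pick_num_blocks(available_cpus, reserve=1):
    """Return a block count that stays below the available CPU count.

    `reserve` is a hypothetical knob: how many cores to leave free.
    The result never drops below 1, so a dataset always has a block.
    """
    return max(1, available_cpus - reserve)


# Assumption: single-machine setup, so local cores stand in for the cluster.
local_cpus = os.cpu_count() or 1
num_blocks = pick_num_blocks(local_cpus)
print(f"{local_cpus} CPUs available, using num_blocks={num_blocks}")
```

The value computed this way could then be passed as `num_blocks` to `repartition` instead of the hard-coded `3`.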

#### Define a preprocessor to normalize the columns by their range

In [6]:
preprocessor = MinMaxScaler(columns=['trip_distance','trip_duration'])

Preprocessors are primitives that transform input data into features. They operate on datasets, which makes them scalable and compatible with a variety of data sources and dataframe libraries.

Ray AI Runtime comes with a collection of built-in preprocessors.
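To make "normalize the columns by their range" concrete, here is a minimal pure-Python sketch of min-max scaling, `(x - min) / (max - min)`, applied to one column. It is not Ray's implementation: the function `min_max_scale`, the sample values, and the handling of a constant column are assumptions made for illustration.

```python
def min_max_scale(values):
    """Scale values linearly into [0, 1] using their observed min and max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: this sketch maps every value to 0.0.
        # (An assumption for the example, not necessarily what Ray does.)
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Made-up sample values standing in for the `trip_distance` column.
trip_distance = [0.5, 2.0, 3.5, 8.0]
scaled = min_max_scale(trip_distance)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, so columns with very different units end up on a comparable scale.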

### Train the model with Ray Train

#### Create XGBoost trainer

In [7]:
trainer = XGBoostTrainer(
    label_column='is_big_tip',
    num_boost_round=100,
    scaling_config=ScalingConfig(
        use_gpu=False,
    ),
    params={
        'objective': 'binary:logistic',
        'eval_metric': ['logloss', 'error'],
        'tree_method': 'approx',
    },
    datasets={'train': train_dataset, 'valid': valid_dataset},
    preprocessor=preprocessor,
)

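During training, the number of workers is set through `ScalingConfig` (its `num_workers` argument), not by `num_blocks`. Since `num_workers` is not set here, Ray uses its default; the repartitioning above only controls how the data is split into blocks for parallel processing.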

#### Invoke training

In [8]:
result = trainer.fit()

2024-05-31 13:38:50,986	INFO tune.py:614 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949


== Status ==
Current time: 2024-05-31 13:38:58 (running for 00:00:00.13)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/8 CPUs, 0/0 GPUs
Result logdir: /tmp/ray/session_2024-05-31_13-38-41_897327_54971/artifacts/2024-05-31_13-38-50/XGBoostTrainer_2024-05-31_13-38-50/driver_artifacts
Number of trials: 1/1 (1 PENDING)


== Status ==
Current time: 2024-05-31 13:39:03 (running for 00:00:05.23)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/8 CPUs, 0/0 GPUs
Result logdir: /tmp/ray/session_2024-05-31_13-38-41_897327_54971/artifacts/2024-05-31_13-38-50/XGBoostTrainer_2024-05-31_13-38-50/driver_artifacts
Number of trials: 1/1 (1 PENDING)


== Status ==
Current time: 2024-05-31 13:39:08 (running for 00:00:10.24)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/8 CPUs, 0/0 GPUs
Result logdir: /tmp/ray/session_2024-05-31_13-38-41_897327_54971/artifacts/2024-05-31_13-38-50/XGBoostTrainer_2024-05-31_13-38-50/driver_artifacts
Number of trials: 1/1 (1 PENDING)


You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).
2024-05-31 13:40:19,162	INFO tune.py:1007 -- Wrote the latest version of all result files and experiment state to '/Users/brendan/ray_results/XGBoostTrainer_2024-05-31_13-38-50' in 0.0081s.


== Status ==
Current time: 2024-05-31 13:40:19 (running for 00:01:20.88)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/8 CPUs, 0/0 GPUs
Result logdir: /tmp/ray/session_2024-05-31_13-38-41_897327_54971/artifacts/2024-05-31_13-38-50/XGBoostTrainer_2024-05-31_13-38-50/driver_artifacts
Number of trials: 1/1 (1 PENDING)


== Status ==
Current time: 2024-05-31 13:40:19 (running for 00:01:20.89)
Using FIFO scheduling algorithm.
Logical resource usage: 2.0/8 CPUs, 0/0 GPUs
Result logdir: /tmp/ray/session_2024-05-31_13-38-41_897327_54971/artifacts/2024-05-31_13-38-50/XGBoostTrainer_2024-05-31_13-38-50/driver_artifacts
Number of trials: 1/1 (1 PENDING)




2024-05-31 13:40:29,216	INFO tune.py:1039 -- Total run time: 98.23 seconds (80.88 seconds for the tuning loop).
Resume training with: <FrameworkTrainer>.restore(path="/Users/brendan/ray_results/XGBoostTrainer_2024-05-31_13-38-50", ...)
- XGBoostTrainer_a498a_00000: FileNotFoundError('Could not fetch metrics for XGBoostTrainer_a498a_00000: both result.json and progress.csv were not found at /Users/brendan/ray_results/XGBoostTrainer_2024-05-31_13-38-50/XGBoostTrainer_a498a_00000_0_2024-05-31_13-38-58')


#### Report results

#### Shutdown Ray runtime

In [9]:
ray.shutdown()