# Introduction to the Ray AI Libraries

Let's start with a quick end-to-end example to get a sense of what the Ray AI Libraries can do.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li><b>Part 1:</b> Overview of the Ray AI Libraries</li>
    <li><b>Part 2:</b> Quick end-to-end example</li>
</ul>
</div>


## 1. Overview of the Ray AI Libraries

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_AI_Libraries/Ray+AI+Libraries.png" width="70%" loading="lazy">

Built on top of Ray Core, the Ray AI Libraries inherit all the performance and scalability benefits offered by Core while providing a convenient abstraction layer for machine learning. These Python-first native libraries allow ML practitioners to distribute individual workloads and end-to-end applications, and to build custom use cases in a unified framework.

The Ray AI Libraries bring together an ever-growing ecosystem of integrations with popular machine learning frameworks to create a common interface for development.

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Introduction_to_Ray_AIR/e2e_air.png" width="100%" loading="lazy">|
|:-:|
|Ray AI Libraries enable end-to-end ML development and provide multiple options for integrating with other tools and libraries from the MLOps ecosystem.|



## 2. Quick end-to-end example

|Ray AI Library|NYC Taxi Use Case|
|:--|:--|
|Ray Data|Ingest and transform raw data; perform batch inference by mapping the checkpointed model to batches of data.|
|Ray Train|Use `XGBoostTrainer` to scale XGBoost model training.|
|Ray Tune|Use `Tuner` for hyperparameter search.|
|Ray Serve|Deploy the model for online inference.|

For this classification task, you will apply a simple [XGBoost](https://xgboost.readthedocs.io/en/stable/) model (XGBoost is a gradient-boosted trees framework) to the June 2021 [New York City Taxi & Limousine Commission's Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). This dataset contains over 2 million samples of yellow cab rides, and the goal is to predict whether a trip will result in a tip greater than 20%.

**Dataset features**
* **`passenger_count`**
    * Float (whole number) representing number of passengers.
* **`trip_distance`** 
    * Float representing trip distance in miles.
* **`fare_amount`**
    * Float representing total price including tax, tip, fees, etc.
* **`trip_duration`**
    * Integer representing seconds elapsed.
* **`hour`**
    * Hour that the trip started.
    * Integer in the range `[0, 23]`.
* **`day_of_week`**
    * Integer in the range `[1, 7]`.
* **`is_big_tip`**
    * Whether the tip amount was greater than 20%.
    * Boolean `[True, False]`.
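
The notebook loads these columns already prepared and does not show how they were built. As a rough sketch only, the snippet below derives them from one raw trip record. It assumes the raw TLC column names `tpep_pickup_datetime`, `tpep_dropoff_datetime`, and `tip_amount`, assumes the label compares the tip to `fare_amount`, and assumes `day_of_week` counts Monday as 1. The helper `derive_features` is hypothetical.

```python
from datetime import datetime


def derive_features(raw: dict) -> dict:
    """Hypothetical helper: build the tutorial's features from one raw trip record."""
    pickup = datetime.fromisoformat(raw["tpep_pickup_datetime"])
    dropoff = datetime.fromisoformat(raw["tpep_dropoff_datetime"])
    return {
        "passenger_count": float(raw["passenger_count"]),
        "trip_distance": float(raw["trip_distance"]),
        "fare_amount": float(raw["fare_amount"]),
        # Seconds elapsed between pickup and dropoff
        "trip_duration": int((dropoff - pickup).total_seconds()),
        # Hour of day in [0, 23]
        "hour": pickup.hour,
        # isoweekday() is in [1, 7] with Monday == 1 (an assumption about the dataset)
        "day_of_week": pickup.isoweekday(),
        # Assumed label rule: tip larger than 20% of the fare
        "is_big_tip": raw["tip_amount"] > 0.2 * raw["fare_amount"],
    }


# Made-up record for illustration, not taken from the dataset
raw_trip = {
    "tpep_pickup_datetime": "2021-06-01 08:15:00",
    "tpep_dropoff_datetime": "2021-06-01 08:35:30",
    "passenger_count": 1,
    "trip_distance": 3.2,
    "fare_amount": 14.0,
    "tip_amount": 3.5,
}
features = derive_features(raw_trip)
print(features)
```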

__Import libraries__

In [None]:
import json
import pandas as pd
import requests
import xgboost
from starlette.requests import Request

import ray
from ray import tune
from ray.train import ScalingConfig, RunConfig
from ray.train.xgboost import XGBoostTrainer
from ray.tune import Tuner, TuneConfig
from ray import serve

__Read, preprocess with Ray Data__

In [None]:
# Read the dataset
dataset = ray.data.read_parquet("s3://anonymous@anyscale-training-data/intro-to-ray-air/nyc_taxi_2021.parquet")

# Split the dataset into training and validation sets
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
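
With `test_size=0.3`, roughly 70% of the rows go to the training set and 30% to the validation set. To the best of my understanding, Ray Data does not shuffle rows for this call unless `shuffle=True` is passed, so the split follows the existing row order. The plain-Python stand-in below only illustrates that proportional, order-preserving behavior on a small list; it is not Ray's implementation, and the function name `split_rows` is hypothetical.

```python
def split_rows(rows: list, test_size: float) -> tuple[list, list]:
    """Order-preserving proportional split (illustrative stand-in, not Ray's code)."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be strictly between 0 and 1")
    test_count = round(len(rows) * test_size)
    split_at = len(rows) - test_count
    # The first rows become the training set, the remaining rows the validation set
    return rows[:split_at], rows[split_at:]


rows = list(range(10))
train_rows, valid_rows = split_rows(rows, test_size=0.3)
print(len(train_rows), len(valid_rows))
```

If the source data is sorted (for example, by pickup time), an unshuffled split like this puts the latest rows in the validation set, which may or may not be what you want.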

__Fit model with Ray Train__

In [None]:
# Define the trainer
trainer = XGBoostTrainer(
    label_column="is_big_tip",
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
    params={"objective": "binary:logistic"},
    datasets={"train": train_dataset, "valid": valid_dataset},
    run_config=RunConfig(storage_path="/mnt/cluster_storage/"),
)

# Fit the trainer
result = trainer.fit()

__Optimize hyperparameters with Ray Tune__

In [None]:
# Define the tuner
tuner = Tuner(
    trainer,
    param_space={"params": {"max_depth": tune.randint(2, 12)}},
    tune_config=TuneConfig(num_samples=3, metric="valid-logloss", mode="min"),
    run_config=RunConfig(storage_path="/mnt/cluster_storage/"),
)

# Fit the tuner and get the best checkpoint
checkpoint = tuner.fit().get_best_result().checkpoint
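
The tuner above runs three trials, each with a `max_depth` sampled by `tune.randint(2, 12)` (the upper bound is exclusive according to the Ray Tune documentation), and keeps the trial with the lowest `valid-logloss`. The sketch below shows what that selection amounts to: it computes binary log loss by hand and picks the minimum. The labels and per-trial predictions are made up for illustration, and XGBoost's own metric implementation may differ in details such as clipping.

```python
import math


def log_loss(labels: list[bool], probabilities: list[float]) -> float:
    """Mean binary cross-entropy; lower is better."""
    eps = 1e-15  # clip probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p) if y else -math.log(1 - p)
    return total / len(labels)


labels = [True, False, True, False]

# Hypothetical validation predictions from three trials, keyed by max_depth
trial_predictions = {
    2: [0.6, 0.4, 0.6, 0.4],
    5: [0.9, 0.1, 0.8, 0.2],
    9: [0.99, 0.6, 0.7, 0.3],
}

scores = {depth: log_loss(labels, probs) for depth, probs in trial_predictions.items()}

# Equivalent in spirit to metric="valid-logloss", mode="min"
best_depth = min(scores, key=scores.get)
print(scores, best_depth)
```

Note how the `max_depth=9` trial is penalized heavily for one confident wrong prediction (0.6 for a negative example), even though its other predictions are reasonable.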

__Batch inference with Ray Data__

In [None]:
class OfflinePredictor:
    def __init__(self):
        # Load expensive state
        self._model = xgboost.Booster()
        self._model.load_model(checkpoint.path + "/model.ubj")

    def __call__(self, batch: dict) -> dict:
        # Make prediction in batch
        dmatrix = xgboost.DMatrix(pd.DataFrame(batch))
        outputs = self._model.predict(dmatrix)
        return {"prediction": outputs}

In [None]:
# Apply the predictor to the validation dataset
valid_dataset_inputs = valid_dataset.drop_columns(['is_big_tip'])
predicted_probabilities = valid_dataset_inputs.map_batches(OfflinePredictor, concurrency=2)

In [None]:
# Materialize a batch
predicted_probabilities.take_batch()
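
Passing a class rather than a function to `map_batches` lets each worker load the model once in `__init__` and reuse it across many batches. The toy below mimics that pattern in plain Python with a single "worker"; it is a conceptual sketch, not how Ray schedules work. `ToyPredictor`, `iter_batches`, and the threshold "model" are all hypothetical, and real batches in the notebook hold array-like column values rather than lists.

```python
from typing import Iterator


def iter_batches(columns: dict[str, list], batch_size: int) -> Iterator[dict[str, list]]:
    """Yield column-oriented batches (a dict mapping column name to values)."""
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}


class ToyPredictor:
    """Stand-in for OfflinePredictor: set up state once, then score many batches."""

    instances_created = 0

    def __init__(self):
        # Expensive setup (loading a model) would happen here, once per worker
        ToyPredictor.instances_created += 1
        self._threshold = 2.0  # made-up "model": longer trips get a higher score

    def __call__(self, batch: dict[str, list]) -> dict[str, list]:
        scores = [0.8 if d > self._threshold else 0.2 for d in batch["trip_distance"]]
        return {"prediction": scores}


columns = {"trip_distance": [0.5, 3.1, 1.2, 7.4, 2.5]}
predictor = ToyPredictor()  # constructed once, reused for every batch

predictions = []
for batch in iter_batches(columns, batch_size=2):
    predictions.extend(predictor(batch)["prediction"])

print(predictions, ToyPredictor.instances_created)
```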

__Online prediction with Ray Serve__

In [None]:
@serve.deployment
class OnlinePredictor:
    def __init__(self, checkpoint):
        # Load expensive state
        self._model = xgboost.Booster()
        self._model.load_model(checkpoint.path + "/model.ubj")

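    async def __call__(self, request: Request) -> dict:
        # Handle HTTP request. The client posts a JSON-encoded string,
        # so request.json() returns a str that must be decoded once more.
        data = await request.json()
        data = json.loads(data)
        return {"prediction": self.predict(data)}

    def predict(self, data: list[dict]) -> list[float]:
        # Make prediction and convert the NumPy array to a plain list
        dmatrix = xgboost.DMatrix(pd.DataFrame(data))
        return self._model.predict(dmatrix).tolist()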

# Run the deployment
handle = serve.run(OnlinePredictor.bind(checkpoint=checkpoint))

In [None]:
# Form payload
valid_dataset_inputs = valid_dataset.drop_columns(["is_big_tip"])
sample_batch = valid_dataset_inputs.take_batch(1)
data = pd.DataFrame(sample_batch).to_json(orient="records")

# Send HTTP request
requests.post("http://localhost:8000/", json=data).json()
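
The payload here is encoded twice: `to_json(orient="records")` already returns a JSON string, and `requests.post(..., json=data)` serializes that string again. This is why `OnlinePredictor.__call__` calls `json.loads` on the result of `request.json()`. The standard-library sketch below reproduces the round trip without any HTTP traffic; the record values are made up.

```python
import json

records = [{"passenger_count": 1.0, "trip_distance": 3.2}]

# Client side: DataFrame.to_json(orient="records") produces a JSON string like this
data = json.dumps(records)

# requests.post(..., json=data) serializes its argument again,
# so the request body is a JSON-encoded string
body = json.dumps(data)

# Server side: request.json() undoes one layer and returns a str
first_pass = json.loads(body)

# The explicit json.loads in __call__ undoes the second layer
second_pass = json.loads(first_pass)

print(type(first_pass).__name__, type(second_pass).__name__)
```

An alternative design would send the records as a list (for example, via `to_dict(orient="records")`) so that a single decode suffices on the server; the notebook's client and server are consistent with each other as written.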

In [None]:
# Shutdown Ray Serve
serve.shutdown()

In [None]:
# Cleanup
!rm -rf /mnt/cluster_storage/XGBoostTrainer*