In [None]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License

# NYC Taxi Fare Prediction - Model Training and Deployment

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2Fapache%2Fbeam%2Fblob%2Fmaster%2Fsdks%2Fpython%2Fapache_beam%2Fyaml%2Fexamples%2Ftransforms%2Fml%2Ftaxi_fare%2Fcustom_nyc_taxifare_model_deployment.ipynb">
      <img alt="Google Cloud Colab Enterprise logo" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" width="32px"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/taxi_fare/custom_nyc_taxifare_model_deployment.ipynb">
      <img alt="GitHub logo" src="https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png" width="32px"><br> View on GitHub
    </a>
  </td>
</tr></tbody></table>


## Overview

This notebook demonstrates the training and deployment of a custom tabular regression model for online prediction.

We'll train a [gradient-boosted decision tree (GBDT) model](https://en.wikipedia.org/wiki/Gradient_boosting) using [XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) to predict the fare of a taxi trip in New York City from information such as the pick-up date and time, pick-up location, drop-off location, and passenger count. The dataset comes from the Kaggle competition [New York City Taxi Fare Prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction), organized by Google Cloud.

After model training and evaluation, we'll use the Vertex AI SDK for Python to upload this custom model to Vertex AI Model Registry and deploy it to serve remote inference at scale. The preferred way to run this notebook is within Colab Enterprise.

## Outline
1. Dataset

2. Training

3. Evaluation

4. Deployment

5. Reference

We first install and import the necessary libraries to run this notebook.

In [None]:
!pip3 install --quiet --upgrade \
  opendatasets \
  google-cloud-storage \
  google-cloud-aiplatform \
  scikit-learn \
  xgboost \
  pandas

In [None]:
import opendatasets as od
import pandas as pd
import random
import time
import os

from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error

import google.cloud.storage as storage
import google.cloud.aiplatform as vertex

## Dataset

We use the `opendatasets` library to programmatically download the dataset from Kaggle.

We'll first need a Kaggle account that is registered for this competition. We'll also need an API key, which is stored in the `kaggle.json` file that is downloaded automatically when you create an API token. To create one, go to your *Profile* picture -> *Settings* -> *API* -> *Create New Token*.

The dataset download will prompt you to enter your Kaggle username and key. Copy this information from `kaggle.json`.


In [None]:
dataset_url = 'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction'
od.download(dataset_url)

Among the downloaded files, we will use only `train.csv` (the training dataset, which we rely on for training and evaluating our model) and `test.csv` (the testing dataset).

In [None]:
data_dir = 'new-york-city-taxi-fare-prediction'
!dir -l {data_dir}

The training dataset contains approximately 55M rows. Reading the entire dataset into a pandas DataFrame (i.e. loading it all into memory) is slow and memory-intensive, which can affect operations in later parts of the notebook. It is also unnecessary for the purpose of experimenting with our model.

A good practice is to sample some percentage of the training dataset.

In [None]:
p = 0.01
# Keep the header (row 0), then keep each data row with probability p (about 1% of rows).
# A row is skipped when a random draw from [0, 1) is greater than p.
df_train_val = pd.read_csv(
    data_dir + "/train.csv",
    header=0,
    parse_dates = ['pickup_datetime'],
    skiprows=lambda i: i > 0 and random.random() > p
)
df_train_val.shape
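
The sampling above draws from Python's global random number generator, so each run keeps a different subset of rows. If a repeatable sample is preferred, one option is a seeded `random.Random` instance. The sketch below only illustrates the idea on a tiny in-memory CSV; the helper name `sample_csv`, the seed, and the toy data are assumptions and not part of the original notebook.

```python
import io
import random

import pandas as pd


def sample_csv(source, fraction, seed):
    """Read a CSV, keeping the header and roughly `fraction` of the data rows."""
    rng = random.Random(seed)  # private, seeded generator -> repeatable sample
    return pd.read_csv(
        source,
        header=0,
        # Row 0 is the header; any other row is skipped when the draw exceeds `fraction`.
        skiprows=lambda i: i > 0 and rng.random() > fraction,
    )


# Toy stand-in for train.csv (hypothetical data, 1000 rows).
toy_csv = "fare_amount,passenger_count\n" + "\n".join(
    f"{5 + i % 20},{1 + i % 4}" for i in range(1000)
)

sample_a = sample_csv(io.StringIO(toy_csv), fraction=0.1, seed=42)
sample_b = sample_csv(io.StringIO(toy_csv), fraction=0.1, seed=42)
print(len(sample_a), sample_a.equals(sample_b))
```

With the same seed, both reads keep the same rows, which makes later RMSE numbers easier to compare between runs.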

The training dataset, now as a DataFrame table, can be further inspected.

In [None]:
df_train_val.columns

In [None]:
df_train_val.info()

In [None]:
df_train_val

The testing dataset is a lot smaller in size and doesn't have the `fare_amount` column. Likewise, we can read the dataset as a DataFrame and inspect the data.

In [None]:
df_test = pd.read_csv(data_dir + "/test.csv", parse_dates = ['pickup_datetime'])
df_test.columns

In [None]:
df_test

We'll set aside 20% of the training data as the validation set, to evaluate the model on previously unseen data.

In [None]:
df_train, df_val = train_test_split(
    df_train_val,
    test_size=0.2,
    random_state=42 # set random_state to some constant so we always have the same training and validation data
)

print("Training dataset's shape: ", df_train.shape)
print("Validation dataset's shape: ", df_val.shape)

## Training

For a quick '0-to-1' model serving on Vertex AI, the model training process below is kept straightforward, using the simple yet very effective [tree-based gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) algorithm. We start with a simple feature engineering idea before moving on to training the model with the [XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) library.


### Simple Feature Engineering

One of the columns in the dataset is `pickup_datetime`, which has a [datetime-like](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html) type. This makes time-based analysis of the data easy. However, ML models generally don't accept feature columns with a non-numeric type like this. Some conversion is needed, and here we choose to break the datetime column into multiple numeric feature columns.


In [None]:
def add_dateparts(df, col):
    """
    Split a datetime column into separate columns for year, month, day,
    weekday, and hour. The new columns are added to `df` in place.
    :param df: DataFrame table to add the columns to
    :param col: name of the column with datetime values
    :return: None
    """
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour

In [None]:
add_dateparts(df_train, 'pickup_datetime')
add_dateparts(df_val, 'pickup_datetime')
add_dateparts(df_test, 'pickup_datetime')
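
`add_dateparts` writes the new columns into the DataFrame it is given. Because `df_train` and `df_val` come out of `train_test_split`, pandas may emit a `SettingWithCopyWarning` at this point, depending on the pandas version. A non-mutating variant built on `DataFrame.assign` is one way to sidestep that. The sketch below is an alternative, not the notebook's method; the name `with_dateparts` and the two sample timestamps are made up for illustration.

```python
import pandas as pd


def with_dateparts(df, col):
    """Return a new DataFrame with year/month/day/weekday/hour columns added."""
    parts = df[col].dt
    return df.assign(**{
        col + "_year": parts.year,
        col + "_month": parts.month,
        col + "_day": parts.day,
        col + "_weekday": parts.weekday,  # Monday=0 ... Sunday=6
        col + "_hour": parts.hour,
    })


# Two made-up pickup times, only for illustration.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2015-01-27 13:08:24", "2011-10-08 11:53:44"], utc=True
    )
})
toy_parts = with_dateparts(toy, "pickup_datetime")
print(toy_parts.iloc[0].to_dict())
```

With this variant the call sites would reassign the result, for example `df_train = with_dateparts(df_train, 'pickup_datetime')`.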

In [None]:
df_train.info()

In [None]:
df_train.head()

### Gradient Boosting

Predicting taxi fares is a supervised regression problem, and our dataset is tabular. It is well established in the literature (_[1]_, _[2]_) that [gradient-boosted decision tree (GBDT) models](https://en.wikipedia.org/wiki/Gradient_boosting) perform very well on this kind of problem and dataset.

The input columns used for training (and subsequently for inference) are the original feature columns from the dataset (pick-up and drop-off longitude/latitude and the passenger count), along with the engineered features (pick-up year, month, day, etc.) that we generated above. The target/label column for training is the `fare_amount` column.


In [None]:
input_cols = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
              'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day', 'pickup_datetime_weekday',
              'pickup_datetime_hour']

target_cols = 'fare_amount'

train_inputs = df_train[input_cols]
train_targets = df_train[target_cols]

val_inputs = df_val[input_cols]
val_targets = df_val[target_cols]

test_inputs = df_test[input_cols]

As noted before, we use the XGBoost library, which implements the GBDT machine learning algorithm in a scalable, distributed manner. Specifically,
we'll use the [XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor) API and fit the
training data with the squared-error loss function. The hyperparameters used here were chosen by trial and error.

In [None]:
xgb_model = XGBRegressor(objective='reg:squarederror',
                         n_jobs=-1,
                         random_state=42,
                         n_estimators=500,
                         max_depth=5,
                         learning_rate=0.05,
                         tree_method='hist',
                         subsample=0.8,
                         colsample_bytree=0.8)

**Note**: The model should be trained on an array-like dataset (e.g. `numpy.ndarray`) instead of a pandas DataFrame or Series object. This makes it easier to pass and serialize input data in the request for remote inference later on, and avoids a DataFrame/array-like mismatch error such as [this](https://datascience.stackexchange.com/questions/63872/lime-explainer-valueerror-training-data-did-not-have-the-following-fields).

In [None]:
xgb_model.fit(train_inputs.values, train_targets.values)

## Evaluation

A typical metric used for model evaluation is the root mean squared error (RMSE).


In [None]:
def evaluate(model):
    """
    :param model: trained model to evaluate
    :return: a tuple of training and validation RMSE results
    """
    # Predict on NumPy arrays, matching the array-like input the model was trained on
    train_preds = model.predict(train_inputs.values)
    train_rmse = root_mean_squared_error(train_targets, train_preds)
    val_preds = model.predict(val_inputs.values)
    val_rmse = root_mean_squared_error(val_targets, val_preds)

    return train_rmse, val_rmse

training_rmse, validation_rmse = evaluate(xgb_model)
print("Training RMSE: ", training_rmse)
print("Validation RMSE: ", validation_rmse)
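
For reference, RMSE is the square root of the mean squared difference between targets and predictions, so it is expressed in the same unit as `fare_amount`. The sketch below computes it by hand with NumPy on made-up numbers (the arrays are illustrative, not results from this model). It should agree with scikit-learn's `root_mean_squared_error`, which is available from scikit-learn 1.4 onward.

```python
import numpy as np


def rmse(targets, preds):
    """Root mean squared error: sqrt(mean((targets - preds)^2))."""
    targets = np.asarray(targets, dtype=float)
    preds = np.asarray(preds, dtype=float)
    return float(np.sqrt(np.mean((targets - preds) ** 2)))


# Made-up fares and predictions, only to show the arithmetic.
toy_targets = [3.0, 5.0, 2.5, 7.0]
toy_preds = [2.5, 5.0, 4.0, 8.0]
toy_rmse = rmse(toy_targets, toy_preds)
print(round(toy_rmse, 4))  # sqrt(0.875), about 0.9354
```

Because the errors are squared before averaging, a few badly mispredicted trips raise RMSE more than many small misses do.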

Finally, we run model inference on the testing dataset. The predicted label is stored in the `predicted_fare_amount` column.

In [None]:
test_preds = xgb_model.predict(test_inputs.values)
result_df = df_test.copy()
result_df['predicted_fare_amount'] = test_preds

## Deployment

Once the model has been trained and evaluated, the next step is to serve it on Vertex AI.

Initialize the Vertex AI SDK for Python for your project.


In [None]:
PROJECT_ID = "your-project-id"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}
BUCKET_URI = "gs://your-bucket-name"  # @param {type:"string"}

vertex.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

print(f"Project: {PROJECT_ID} | Region: {REGION}")

Save the trained model to a local file, then upload it to the Google Cloud Storage bucket as a model artifact.

In [None]:
FILE_NAME = "model.bst"
xgb_model.save_model(FILE_NAME)

# Upload the saved model file to GCS
BLOB_PATH = "taxifare_prediction/"

BLOB_NAME = BLOB_PATH + FILE_NAME

# BUCKET_URI[5:] strips the leading "gs://" to get the bare bucket name
bucket = storage.Client().bucket(BUCKET_URI[5:])
blob = bucket.blob(BLOB_NAME)
blob.upload_from_filename(FILE_NAME)
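
The slice `BUCKET_URI[5:]` assumes that `BUCKET_URI` is exactly `gs://` followed by a bucket name. If the URI might carry a trailing slash or an object prefix, a small parser is one way to separate the two parts. The helper `split_gcs_uri` below is a hypothetical name introduced for illustration and relies only on the standard library.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/optional/prefix' into (bucket_name, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a gs:// URI: {uri!r}")
    # Drop leading/trailing slashes so the prefix can be joined with a file name.
    return parsed.netloc, parsed.path.strip("/")


print(split_gcs_uri("gs://your-bucket-name"))
print(split_gcs_uri("gs://your-bucket-name/taxifare_prediction/"))
```

The returned bucket name could then be passed to `storage.Client().bucket(...)` in place of the slice.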

Set the machine type as well as the pre-built container image for serving inference.

In [None]:
MODEL_DISPLAY_NAME = "custom/xgb-model-nyc-taxifare"

ARTIFACT_GCS_PATH = f"{BUCKET_URI}/{BLOB_PATH}"

DEPLOY_VERSION = "xgboost-cpu.2-0"
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)

MACHINE_TYPE = "n1-standard"
VCPU = "4"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)

Upload the model artifact from the GCS bucket to Vertex AI Model Registry.

In [None]:
MODEL_OBJ = vertex.Model.upload(
    display_name = MODEL_DISPLAY_NAME,
    artifact_uri = ARTIFACT_GCS_PATH,
    serving_container_image_uri = DEPLOY_IMAGE,
    serving_container_predict_route = "/predict",
    serving_container_health_route  = "/ping",
    labels = {"framework":"xgboost","demo":"nyc_taxi"}
)

print("Model resource:", MODEL_OBJ.resource_name)

Create a Vertex AI dedicated endpoint for serving inference requests.

In [None]:
ENDPOINT = vertex.Endpoint.create(
    display_name=f"{MODEL_DISPLAY_NAME}-endpoint",
    dedicated_endpoint_enabled=True,
)

Deploy the model from the Model Registry to the dedicated endpoint.

**Note**: This is a long-running operation that will take about 20 minutes to finish.

In [None]:
MODEL_OBJ.deploy(
    endpoint = ENDPOINT,
    machine_type = DEPLOY_COMPUTE,
    deploy_request_timeout=1800,
    traffic_percentage=100
)

print("Endpoint:", ENDPOINT.resource_name)

Run online predictions on some sample inputs.

In [None]:
instances = [val_inputs.iloc[0].to_list(), val_inputs.iloc[1].to_list(), val_inputs.iloc[2].to_list()]
print(instances)

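# The endpoint was created as a dedicated endpoint, so send the request through its dedicated DNS name
predictions = ENDPOINT.predict(instances=instances, use_dedicated_endpoint=True)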
print("Predicted fares: ", predictions.predictions)
print("Actual fares: ", val_targets.iloc[0:3].to_list())
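
The endpoint receives each instance as a bare list of numbers, so the values must follow the same column order that was used for training (`input_cols`). The sketch below shows, on one made-up trip, how a named record could be turned into such a list and what the JSON request body for online prediction looks like (`{"instances": [...]}`). The helper `to_instance` and the sample values are assumptions for illustration only.

```python
import json

# Same order as `input_cols` in the training section.
feature_order = [
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude", "passenger_count",
    "pickup_datetime_year", "pickup_datetime_month", "pickup_datetime_day",
    "pickup_datetime_weekday", "pickup_datetime_hour",
]


def to_instance(record, order):
    """Turn a {feature: value} dict into a list of floats in training order."""
    missing = [name for name in order if name not in record]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [float(record[name]) for name in order]


# A made-up trip, not taken from the dataset; keys are deliberately out of order.
trip = {
    "passenger_count": 1,
    "pickup_datetime_hour": 13, "pickup_datetime_weekday": 1,
    "pickup_datetime_day": 27, "pickup_datetime_month": 1,
    "pickup_datetime_year": 2015,
    "pickup_longitude": -73.97, "pickup_latitude": 40.76,
    "dropoff_longitude": -73.99, "dropoff_latitude": 40.75,
}

request_body = json.dumps({"instances": [to_instance(trip, feature_order)]})
print(request_body)
```

Building instances from named records like this guards against silently swapping two columns, which the model would otherwise accept without complaint.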

## Reference

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). Boosting and Additive Trees. In: The Elements of Statistical Learning. Springer Series in Statistics.

[2] Murphy, K. P. (2012). Adaptive Basis Function Models. In: Machine Learning: A Probabilistic Perspective. MIT Press.

[3] Sample notebooks for Vertex AI workflows: https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/notebooks
