# Online Prediction with XGBoost on AI Platform
This notebook uses the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to create a simple XGBoost model, upload the model to AI Platform, and query it for predictions. 

# How to bring your model to AI Platform
Getting your model ready for predictions can be done in 5 steps:
1. Save your model to a file
2. Upload the saved model to [Google Cloud Storage](https://cloud.google.com/storage)
3. Create a model resource on AI Platform
4. Create a model version (linking your XGBoost model)
5. Make an online prediction

# Prerequisites
Before we begin, let’s cover some of the different tools you’ll use to get online prediction up and running on AI Platform. 

[Google Cloud Platform](https://cloud.google.com/) (GCP) lets you build and host applications and websites, store data, and analyze data on Google's scalable infrastructure.

[AI Platform](https://cloud.google.com/ml-engine/) is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.

[Google Cloud Storage](https://cloud.google.com/storage/) (GCS) is unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.

[Cloud SDK](https://cloud.google.com/sdk/) is a command line tool which allows you to interact with Google Cloud products. In order to run this notebook, make sure that Cloud SDK is [installed](https://cloud.google.com/sdk/downloads) in the same environment as your Jupyter kernel.


# Part 0: Setup
* [Create a project on GCP](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
* [Create a GCS Bucket](https://cloud.google.com/storage/docs/quickstart-console)
* [Enable AI Platform Training and Prediction and Compute Engine APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)
* [Install Cloud SDK](https://cloud.google.com/sdk/downloads)
* [Install XGBoost](http://xgboost.readthedocs.io/en/latest/build.html)
* [Install scikit-learn](http://scikit-learn.org/stable/install.html)
* [Install NumPy](https://docs.scipy.org/doc/numpy/user/install.html)
* [Install pandas](https://pandas.pydata.org/pandas-docs/stable/install.html)
* [Install Google API Python Client](https://github.com/google/google-api-python-client)


These variables will be needed for the exercise.

In the cell below, **replace** the following highlighted elements:
* `project <PROJECT_ID>` - with your project ID (e.g. ai-platform-demo)
* `bucket <BUCKET_ID>` - with your student ID (e.g. maven-student01)
* `folder <FOLDER>` - with a name for this exercise (e.g. census-income)
* `region <REGION>` - with the correct region (e.g. us-central1) (See: https://cloud.google.com/ai-platform/training/docs/regions)

In [1]:
# Replace <PROJECT_ID> and <BUCKET_ID> with your project ID and bucket ID; adjust folder and region if needed.
project = '<PROJECT_ID>'
bucket = '<BUCKET_ID>'
folder='census-income'
region='us-central1'

In [2]:
bucket_path=f'{bucket}/{folder}'
%env PROJECT_ID=$project
%env BUCKET_ID=$bucket
%env BUCKET_PATH=$bucket_path
%env REGION=$region
!gsutil mb -c standard -l {region} gs://{bucket}

env: PROJECT_ID=ai-fulcrum-admin
env: BUCKET_ID=maven-user1
env: BUCKET_PATH=maven-user1/census-income
env: REGION=us-central1
Creating gs://maven-user1/...


## Download the data
The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample
uses for training is hosted by the [UC Irvine Machine Learning
Repository](https://archive.ics.uci.edu/ml/datasets/):

 * Training file is `adult.data`
 * Evaluation file is `adult.test`


### Disclaimer
This dataset is provided by a third party. Google provides no representation,
warranty, or other guarantees about the validity or any other aspects of this dataset.

In [5]:
# Download the data from its location to your bucket
!gsutil cp gs://amazing-public-data/census_income/census_income_data_adult.data gs://${BUCKET_PATH}/adult.data
!gsutil cp gs://amazing-public-data/census_income/census_income_data_adult.test gs://${BUCKET_PATH}/adult.test

Copying gs://amazing-public-data/census_income/census_income_data_adult.data [Content-Type=application/octet-stream]...
/ [1 files][  3.8 MiB/  3.8 MiB]                                                
Operation completed over 1 objects/3.8 MiB.                                      
Copying gs://amazing-public-data/census_income/census_income_data_adult.test [Content-Type=application/octet-stream]...
/ [1 files][  1.9 MiB/  1.9 MiB]                                                
Operation completed over 1 objects/1.9 MiB.                                      


In [6]:
import os
import json
import numpy as np
import pandas as pd
from tensorflow.python.lib.io import file_io
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# categorical columns contain data that need to be turned into numerical values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country"
)

bucket_name=os.environ['BUCKET_PATH']

# load training set
with file_io.FileIO(f"gs://{bucket_name}/adult.data", "r") as train_data:
    raw_training_data = pd.read_csv(train_data)
# remove column we are trying to predict ('income') from features list
train_features = raw_training_data.drop("income", axis=1)
# create training labels list
train_labels = raw_training_data["income"] == " >50K"

# load test set
with file_io.FileIO(f"gs://{bucket_name}/adult.test", "r") as test_data:
    raw_testing_data = pd.read_csv(test_data, skiprows=[1])
# remove column we are trying to predict ('income') from features list
test_features = raw_testing_data.drop("income", axis=1)
# create test labels list
test_labels = raw_testing_data["income"] == " >50K."

# convert data in categorical columns to numerical values
encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])
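# NOTE: fit_transform refits each encoder on the test set, so the integer codes can
# differ from training when a category is missing from the test data; use
# encoders[col].transform(...) to reuse the training mapping instead.
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])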

# For use to verify results as an optional step.
data = []
for i in range(len(test_features)):
    data.append([])
    for col in train_features.columns: # ignore 'income' column as it isn't in feature set.
        # convert from numpy integers to standard integers
        data[i].append(int(np.uint64(test_features[col][i]).item()))

# write the test data to a json file
with open('test_data.json', 'w') as outfile:
    json.dump(data, outfile)
    
# get one person that makes <=50K and one that makes >50K to test our model.
print('Show a person that makes <=50K:')
print('\tFeatures: {0} --> Label: {1}\n'.format(data[0], test_labels[0]))

with open('less_than_50K.json', 'w') as outfile:
    json.dump(data[0], outfile)

print('Show a person that makes >50K:')
print('\tFeatures: {0} --> Label: {1}'.format(data[2], test_labels[2]))

with open('more_than_50K.json', 'w') as outfile:
    json.dump(data[2], outfile)

Show a person that makes <=50K:
	Features: [25, 4, 226802, 1, 7, 4, 7, 3, 2, 1, 0, 0, 40, 38] --> Label: False

Show a person that makes >50K:
	Features: [28, 2, 336951, 7, 12, 2, 11, 0, 4, 1, 0, 0, 40, 38] --> Label: True
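
One caveat about the encoding step above: calling `fit_transform` on the test features refits each `LabelEncoder`, so a category that is missing from the test set can shift the integer codes relative to training. The sketch below uses made-up toy values (not the census data) to show the difference between refitting and reusing the fitted encoder with `transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only; "b" appears in training but not in test.
train_values = ["a", "b", "c"]
test_values = ["a", "c"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_values)  # a->0, b->1, c->2

# Reusing the fitted encoder keeps the training mapping.
reused_codes = encoder.transform(test_values)

# Refitting on the test values builds a new mapping from the test set alone.
refit_codes = LabelEncoder().fit_transform(test_values)

print("reused:", reused_codes.tolist())
print("refit: ", refit_codes.tolist())
```

Note that `transform` raises a `ValueError` for a category it did not see during fitting, so unseen test categories would need separate handling.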


In [7]:
import xgboost as xgb
# load data into DMatrix object
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)

# train XGBoost model
bst = xgb.train({'objective':'reg:logistic'}, dtrain, num_boost_round=20)
bst.save_model('./model.bst')

print('model trained and saved')

model trained and saved


Now that the model has been saved locally, let's run it on the first (less than 50k) and third (greater than 50k) data elements in the array.

In [8]:
bst.predict(dtest)[[0, 2]]

array([0.00463351, 0.34860396], dtype=float32)

# Part 1: Submit a Local Job to Train/Save a Model and Make a Prediction
Local jobs are generally used for debugging purposes.

## Create your Python model file
We have created the Python model file (inside the `trainer` folder) that we'll upload to AI Platform. This is similar to your normal process for creating an XGBoost model. However, there are a few key differences:
1. Download the data from GCS at the start of your file, so that AI Platform can access it.
1. Export and save the model to GCS at the end of your file, so that you can use it for predictions.
1. Define a command-line argument in your main training module for AI Platform parameters.

The code in this file first handles the parameters passed to the file from AI Platform. Then it loads the data into a pandas DataFrame that can be used by XGBoost. Then the model is fit against the training data. Lastly, the model is saved to a file that can be uploaded to [AI Platform's prediction service](https://cloud.google.com/ml-engine/docs/scikit/getting-predictions#deploy_models_and_versions).

Note: In normal practice you would want to test your model locally on a small dataset to ensure that it works before using it with your larger dataset on AI Platform. This avoids wasted time and cost. The local run below demonstrates this.
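
As a rough illustration of the third difference listed above, a training module might parse its arguments as follows. This is a sketch, not the contents of `trainer/task.py`: only the `--bucket_name` flag comes from the commands in this notebook, and the `parse_args` helper is a hypothetical name. `parse_known_args` is used so that extra flags forwarded to the module (such as `--job-dir`) do not cause an error.

```python
import argparse


def parse_args(argv=None):
    """Parse the custom flag used in this notebook (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Census income trainer (sketch)")
    parser.add_argument(
        "--bucket_name",
        required=True,
        help="Bucket path (bucket/folder) holding adult.data and adult.test",
    )
    # Ignore flags this sketch does not define instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Example invocation with made-up values.
args = parse_args(
    ["--bucket_name", "my-bucket/census-income", "--job-dir", "gs://my-bucket/jobdir"]
)
print(args.bucket_name)
```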

### Train and Save the Model
First, the data is loaded into a pandas DataFrame. Then a simple model is created with the training set. Lastly, the model is saved to a .bst file that can then be uploaded to AI Platform.

In [9]:
!gcloud ai-platform local train \
  --package-path trainer \
  --module-name trainer.task \
  -- \
  --bucket_name $BUCKET_PATH

Copying file://model.bst [Content-Type=application/octet-stream]...
/ [1 files][ 63.0 KiB/ 63.0 KiB]                                                
Operation completed over 1 objects/63.0 KiB.                                     


### Test Data Preparation
Before you begin predicting, you'll need to prepare some of the test data so that the deployed model can use it.

To get predictions, the data needs to be converted from a NumPy array to a JSON array.
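
A minimal sketch of that conversion, using a made-up two-row array rather than the census data. NumPy integers are not serializable by the standard `json` module, so `tolist()` is used to turn them into plain Python values first.

```python
import json

import numpy as np

# Made-up feature rows for illustration only.
features = np.array([[25, 4, 40], [28, 2, 40]], dtype=np.int64)

# tolist() converts NumPy scalars to built-in Python ints.
instances = features.tolist()

# One JSON array per instance, like the single-row files written earlier.
first_instance_json = json.dumps(instances[0])
print(first_instance_json)
```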

### Make Predictions Using the Saved Model

##### This tests for "greater than 50K" on the record of a person who makes less than 50K.

In [10]:
!gcloud ai-platform local predict \
  --model-dir gs://${BUCKET_PATH}/model/ \
  --json-instances less_than_50K.json \
  --framework xgboost \
  --signature-name 'census_income_model'
#   --verbosity debug

/ [1 files][ 63.0 KiB/ 63.0 KiB]                                                
Operation completed over 1 objects/63.0 KiB.                                     
Copying gs://maven-user1/census-income/model/model.bst...
/ [1 files][ 63.0 KiB/ 63.0 KiB]                                                
Operation completed over 1 objects/63.0 KiB.                                     

[0.004633505828678608]


##### This tests for "greater than 50K" on the record of a person who makes more than 50K.

In [11]:
!gcloud ai-platform local predict \
  --model-dir gs://${BUCKET_PATH}/model/ \
  --json-instances more_than_50K.json \
  --framework xgboost \
  --signature-name 'census_income_model'
#   --verbosity debug

/ [1 files][ 63.0 KiB/ 63.0 KiB]                                                
Operation completed over 1 objects/63.0 KiB.                                     
Copying gs://maven-user1/census-income/model/model.bst...
/ [1 files][ 63.0 KiB/ 63.0 KiB]                                                
Operation completed over 1 objects/63.0 KiB.                                     

[0.3486039638519287]


# Part 2: Create a job and run training on AI Platform
Next we need to create a job for training on AI Platform. We'll use gcloud to submit the job, which takes the following flags:

* `job-name` - A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter). In this case: `census_income_job_$(date +"%Y%m%d_%H%M%S")`
* `job-dir` - The path to a Google Cloud Storage location to use for job output.
* `package-path` - The local path to your training package directory. When you use the gcloud command-line tool, packaging the application and staging it in a Google Cloud Storage location is largely automated.
* `module-name` - The name of the main module in your trainer package. The main module is the Python file you call to start the application. If you use the gcloud command to submit your job, specify the main module name in the --module-name argument. Refer to Python Packages to figure out the module name.
* `region` - The Google Cloud Compute region where you want your job to run. You should run your training job in the same region as the Cloud Storage bucket that stores your training data. Select a region from [here](https://cloud.google.com/ml-engine/docs/regions) or use the default '`us-central1`'.
* `runtime-version` - The version of AI Platform to use for the job. If you don't specify a runtime version, the training service uses the default AI Platform runtime version 1.0. See the list of runtime versions for more information.
* `python-version` - The Python version to use for the job. Python 3.5 is available with runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.
* Custom parameters used in the Python file


Note: Check to make sure gcloud is set to the current PROJECT_ID
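
The job-name rule listed above (letters, numbers, and underscores only, starting with a letter) can be checked before submitting. The sketch below is one way to do that; `make_job_name` and `JOB_NAME_PATTERN` are hypothetical names introduced here, and the regular expression is only a reading of the rule as stated in this notebook.

```python
import re
from datetime import datetime

# Pattern derived from the rule stated above: letters, digits, underscores,
# starting with a letter.
JOB_NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def make_job_name(prefix, when=None):
    """Build a timestamped job name and check it against the naming rule."""
    when = when or datetime.now()
    name = f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}"
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid job name: {name!r}")
    return name


job_name = make_job_name("census_income_job", datetime(2020, 11, 13, 16, 0, 33))
print(job_name)
```

A prefix containing hyphens, such as `census-income-job`, would be rejected by this check.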

In [12]:
%env PACKAGE_PATH=trainer
%env MODULE_NAME=trainer.task
%env RUNTIME_VERSION=2.1
%env PYTHON_VERSION=3.7

env: PACKAGE_PATH=trainer
env: MODULE_NAME=trainer.task
env: RUNTIME_VERSION=2.1
env: PYTHON_VERSION=3.7


In [13]:
import time
from datetime import datetime, timedelta

In [14]:
now=(datetime.now() + timedelta(hours=-5)).strftime("%Y%m%d_%H%M%S") # Central Time
%env JOB_NAME=census_income_job_{now}

!gcloud ai-platform jobs submit training $JOB_NAME  \
  --job-dir gs://${BUCKET_PATH}/jobdir \
  --package-path $PACKAGE_PATH \
  --module-name $MODULE_NAME \
  --region $REGION \
  --runtime-version $RUNTIME_VERSION \
  --python-version $PYTHON_VERSION \
  -- \
  --bucket_name $BUCKET_PATH

# Job should finish with state "SUCCEEDED"
cmd = 'gcloud ai-platform jobs describe $JOB_NAME --format="value(state)"'
for i in range(20):
    time.sleep(10)
    !{cmd}

env: JOB_NAME=census_income_job_20201113_160033
Job [census_income_job_20201113_160033] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe census_income_job_20201113_160033

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs census_income_job_20201113_160033
jobId: census_income_job_20201113_160033
state: QUEUED
PREPARING
PREPARING
PREPARING
RUNNING
RUNNING
RUNNING
RUNNING
RUNNING
RUNNING
RUNNING
RUNNING
RUNNING
RUNNING
RUNNING
RUNNING
RUNNING
SUCCEEDED
SUCCEEDED
SUCCEEDED
SUCCEEDED
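
The polling loop above always runs 20 iterations, even after the job has finished. A variant that stops as soon as a terminal state appears is sketched below. `wait_for_job` is a hypothetical helper, and the state source is passed in as a function so the logic can be tried without calling gcloud; here it is driven by a made-up state sequence.

```python
import time

# Job states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, max_polls=20, interval_seconds=10, sleep=time.sleep):
    """Poll get_state() until a terminal state is seen or max_polls is reached."""
    state = None
    for _ in range(max_polls):
        state = get_state()
        print(state)
        if state in TERMINAL_STATES:
            break
        sleep(interval_seconds)
    return state


# Stand-in for the gcloud call, replaying a short made-up state sequence.
fake_states = iter(["QUEUED", "PREPARING", "RUNNING", "SUCCEEDED", "SUCCEEDED"])
final_state = wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None)
```

In the notebook, `get_state` could wrap the `gcloud ai-platform jobs describe` command shown above, for example through `subprocess.run`.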


# Part 3: Upload your model to Google Cloud Storage
To use your model with AI Platform, it needs to be uploaded to Google Cloud Storage (GCS). When the job state reached "SUCCEEDED" above, the model (model.bst) was copied to *student bucket*/census-income/model. (See /trainer/task.py for details.)

Note: The exact file name of the exported model you upload to GCS is important! Your model must be named “model.joblib” for sklearn, “model.pkl” for custom prediction routines, or “model.bst” for XGBoost. This restriction ensures that the model will be safely reconstructed later by using the same technique for import as was used during export.

Now we set the project ID for the *gcloud* command-line utility for the next step.

In [16]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


# Part 4: Create a model resource
AI Platform organizes your trained models using model and version resources. An AI Platform model is a container for the versions of your machine learning model. For more information on model resources and model versions look [here](https://cloud.google.com/ml-engine/docs/deploying-models#creating_a_model_version). 

At this step, you create a container that you can use to hold several different versions of your actual model.

These variables will be needed for the remaining steps of the exercise.

In the cell below, **replace** the following highlighted elements:

* `<MODEL_NAME>` - with your model name, such as "census"

In [17]:
# model_name = '<MODEL_NAME>'
model_name = 'census_income_student_3'
%env MODEL_NAME=$model_name
%env FRAMEWORK=xgboost

env: MODEL_NAME=census_income_student_3
env: FRAMEWORK=xgboost


In [18]:
! gcloud ai-platform models create $MODEL_NAME --regions $REGION

Using endpoint [https://ml.googleapis.com/]
Created ml engine model [projects/ai-fulcrum-admin/models/census_income_student_3].


# Part 5: Create a model version

In the cell below, **replace** the following highlighted elements:

* `<YOUR_VERSION>` - with your version name, such as "v1"

In [19]:
%env VERSION_NAME=v1

env: VERSION_NAME=v1


Now it’s time to get your model online and ready for predictions. The model version requires a few components as specified [here](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models.versions#Version).

* __version_name__ - The name specified for the version when it was created. This will be the `VERSION_NAME` variable you declared in the cell above.
* __origin__ - The location of the trained model in Google Cloud Storage.
* __runtime version__ - The Google Cloud ML runtime version to use for this deployment.
* __framework__ - The framework specifies if you are using: `TENSORFLOW`, `SCIKIT_LEARN`, `XGBOOST`. This is set to `XGBOOST`
* __python version__ - Python 3.7 is the only version of Python available for training and online prediction with runtime version 2.1. (The one we are using.)

Note: Runtime version 2.1 uses XGBoost 0.90. Please refer to the [runtime version dependency list](https://cloud.google.com/ml-engine/docs/runtime-version-list).

Note: It can take several minutes for your model to be available.

In [20]:
!gcloud ai-platform versions create $VERSION_NAME \
  --model $MODEL_NAME \
  --origin gs://${BUCKET_PATH}/model/ \
  --runtime-version $RUNTIME_VERSION \
  --python-version $PYTHON_VERSION \
  --framework $FRAMEWORK

Using endpoint [https://ml.googleapis.com/]
Creating version (this might take a few minutes)......done.                    


# Part 6: Make an online prediction

It’s time to make a prediction with your newly deployed model. For the online predictions we will use the JSON arrays that we created earlier. Two ways to make online predictions are demonstrated: using gcloud and using Python.

## Use Google Cloud to make online predictions
Use the two people (shown in the tables below) selected during data preparation for the gcloud predictions.

| **Person** | age | workclass | fnlwgt | education | education-num | marital-status | occupation |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| **1** | 25 | 4 | 226802 | 1 | 7 | 4 | 7 |
| **2** | 28 | 2 | 336951 | 7 | 12 | 2 | 11 |

| **Person** | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | (Label) income |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| **1** | 3 | 2 | 1 | 0 | 0 | 40 | 38 | False (<=50K) |
| **2** | 0 | 4 | 1 | 0 | 0 | 40 | 38 | True (>50K) |


Creating a model version can take several minutes. Check the status of your model version to see if it is available.

In [21]:
! gcloud ai-platform versions list --model $MODEL_NAME

Using endpoint [https://ml.googleapis.com/]
NAME  DEPLOYMENT_URI                          STATE
v1    gs://maven-user1/census-income/model/  READY


Test the model with an online prediction using the data of a person who makes <=50K.

Note: If you see an error, the model version from Part 5 may not be ready yet, as it takes several minutes for a new model version to be created.

## Use the command line to make online predictions

In [22]:
! gcloud ai-platform predict --model $MODEL_NAME --version $VERSION_NAME --json-instances less_than_50K.json

Using endpoint [https://ml.googleapis.com/]
[0.004633505828678608]


Test the model with an online prediction using the data of a person who makes >50K.

In [23]:
! gcloud ai-platform predict --model $MODEL_NAME --version $VERSION_NAME --json-instances more_than_50K.json

Using endpoint [https://ml.googleapis.com/]
[0.3486039638519287]


Notice that the cells above return floats instead of booleans. Let's deal with that below so the output type of the predictions matches that of the test set labels. We'll set the prediction to True if it's greater than 0.5 and to False otherwise.

## Use Python to make online predictions
We'll test the model with the entire test set and print out some of the results.

Note: If you are running notebook server on Compute Engine, make sure to ["allow full access to all Cloud APIs"](https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances#changeserviceaccountandscopes).

In [24]:
import googleapiclient.discovery
import os

PROJECT_ID = os.environ['PROJECT_ID']
VERSION_NAME = os.environ['VERSION_NAME']
MODEL_NAME = os.environ['MODEL_NAME']

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL_NAME)
name += '/versions/{}'.format(VERSION_NAME)

response = service.projects().predict(
    name=name,
    body={'instances': data}
).execute()

if 'error' in response:
    print(response['error'])
else:
    online_results = response['predictions']
    # convert floats to booleans
    converted_responses = [x > 0.5 for x in online_results]
    # Print the first 5 responses
    for i, prediction in enumerate(converted_responses[:5]):
        print('Prediction: {}\tLabel: {}'.format(prediction, test_labels[i]))

Prediction: False	Label: False
Prediction: False	Label: False
Prediction: False	Label: True
Prediction: True	Label: True
Prediction: False	Label: False
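
The request above sends every test row in a single call. If a payload ever becomes too large for one request, the instances can be split into smaller batches. The sketch below shows only the splitting step; `chunk_instances` and the batch size are hypothetical, and each batch would be sent with the same `predict` call used above.

```python
def chunk_instances(instances, batch_size):
    """Yield consecutive slices of at most batch_size instances."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]


# Made-up rows for illustration only.
rows = [[i, i + 1] for i in range(5)]
batches = list(chunk_instances(rows, batch_size=2))
print([len(batch) for batch in batches])
```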


# [Optional] Part 7: Verify Results
Let's visualise our predictions with a confusion matrix.

In [25]:
actual = pd.Series(test_labels, name='actual')
online = pd.Series(converted_responses, name='online')

pd.crosstab(actual,online)

online  False  True
actual
False   11790   645
True     1451  2395


Let's compare this with the confusion matrix of our local model.

In [26]:
import xgboost as xgb

bst = xgb.Booster()  # init model
bst.load_model('./model.bst')  # load the saved model

dtest = xgb.DMatrix(test_features)
local_results = bst.predict(dtest)
converted_local = [x > 0.5 for x in local_results]
local = pd.Series(converted_local, name='local')

pd.crosstab(actual, local)

local   False  True
actual
False   11790   645
True     1451  2395
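
The counts in the confusion matrix above can be summarized as accuracy, precision, and recall for the ">50K" class. This is a small sketch that hard-codes the four counts shown in the output; the variable names are introduced here for illustration.

```python
# Counts copied from the confusion matrix above
# (rows: actual label, columns: predicted label).
true_negatives, false_positives = 11790, 645
false_negatives, true_positives = 1451, 2395

total = true_negatives + false_positives + false_negatives + true_positives
accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```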


Better still, let's compare the raw results (pre-boolean-conversion) of our local and online models.

In [27]:
identical = 0
different = 0

for i in range(len(online_results)):
    if online_results[i] == local_results[i]:
        identical += 1
    else:
        different += 1
        
print('Identical: {}, Different: {}'.format(identical, different))

Identical: 16281, Different: 0


If all results are identical, it means we've successfully uploaded our local model to AI Platform and performed online predictions correctly.
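
Exact equality works here because both sets of results come from the same model file, but floating-point values can differ in their last digits across environments. A more tolerant comparison is sketched below with NumPy's `allclose`, using two of the scores printed earlier in this notebook; the tolerances are arbitrary choices for illustration.

```python
import numpy as np

# Scores taken from the outputs above: online values as Python floats,
# local values as float32, so they differ only by rounding.
online_scores = [0.004633505828678608, 0.3486039638519287]
local_scores = np.array([0.00463351, 0.34860396], dtype=np.float32)

# allclose compares element-wise within a relative and absolute tolerance.
matches = np.allclose(online_scores, local_scores, rtol=1e-5, atol=1e-8)
print(matches)
```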