 ============================================================================== \
 Copyright 2020 Google LLC. This software is provided as-is, without warranty \
 or representation for any use or purpose. Your use of it is subject to your \
 agreement with Google. \
 ============================================================================== 
 
 Author: Elvin Zhu, Chanchal Chatterjee \
 Email: elvinzhu@google.com \
<img src="img/google-cloud-icon.jpg" alt="Drawing" style="width: 200px;"/>

Install the packages required for training, deployment, and prediction with AI Platform.

https://cloud.google.com/ai-platform/training/docs/runtime-version-list

In [None]:
%%bash
cd /home/jupyter/tuti-repo/ai-platform-xgboost
python3 -m pip install -r ./requirements.txt --user

### Create training application package

The easiest (and recommended) way to create a training application package uses gcloud to package and upload the application when you submit your training job. This method allows you to create a very simple file structure. For this tutorial, the file structure of your training application package should appear similar to the following:

```
config/
    config.yaml
    config_hpt.yaml
    
trainer/ 
    __init__.py
    train.py
    train_hpt.py
```
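
Before submitting a job, it can help to confirm that the layout above actually exists on disk. The following is a minimal sketch and not part of the original tutorial: the helper name `missing_package_files` is hypothetical, and it checks only the files listed in the tree above.

```python
from pathlib import Path

# Files the layout above expects, relative to the project root.
EXPECTED_FILES = [
    "config/config.yaml",
    "config/config_hpt.yaml",
    "trainer/__init__.py",
    "trainer/train.py",
    "trainer/train_hpt.py",
]


def missing_package_files(root="."):
    """Return the expected files that do not exist under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_FILES if not (root_path / name).is_file()]


# An empty list means every expected file is present.
print(missing_package_files("."))
```

Running this before the cells below have been executed will list the files that still need to be written.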




In [None]:
%%writefile ./setup.py

# python3
# ==============================================================================
# Copyright 2020 Google LLC. This software is provided as-is, without warranty
# or representation for any use or purpose. Your use of it is subject to your
# agreement with Google.
# ==============================================================================

from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = [
    'tensorflow==2.1.0',
    'numpy==1.18.0',
    'pandas==1.2.1',
    'scipy==1.4.1',
    'scikit-learn==0.22',
    'google-cloud-storage==1.23.0',
    'xgboost==1.3.3',
    'cloudml-hypertune',
    ]
 
setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Trainer package for XGBoost Task'
)


In [None]:
%%writefile ./trainer/__init__.py

# python3
# ==============================================================================
# Copyright 2020 Google LLC. This software is provided as-is, without warranty
# or representation for any use or purpose. Your use of it is subject to your
# agreement with Google.
# ==============================================================================



Create your training code. The example shown here uses XGBoost to classify structured mortgage data.
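
The training script reads all of its inputs from command-line flags, including `--job-dir`, which its help text describes as required by AI Platform Training. The sketch below mirrors a subset of that parser to illustrate how argparse exposes `--job-dir` as `args.job_dir` (hyphens become underscores) and converts integer flags. The flag values, including the bucket path, are made up for illustration.

```python
import argparse


def build_parser():
    """Build a parser with a subset of the flags used by trainer/train.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", type=str, default="./")
    parser.add_argument("--n_estimators", type=int)
    parser.add_argument("--max_depth", type=int)
    parser.add_argument("--booster", type=str)
    return parser


# Hypothetical flag values, for illustration only.
example_argv = [
    "--job-dir", "gs://example-bucket/jobs/run1",
    "--n_estimators", "100",
    "--max_depth", "5",
    "--booster", "gbtree",
]
args = build_parser().parse_args(example_argv)

# "--job-dir" is read back as args.job_dir; type=int converts the numeric flags.
print(args.job_dir, args.n_estimators, args.max_depth, args.booster)
```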

In [None]:
%%writefile ./trainer/train.py


# python3
# ==============================================================================
# Copyright 2020 Google LLC. This software is provided as-is, without warranty
# or representation for any use or purpose. Your use of it is subject to your
# agreement with Google.
# ==============================================================================

import argparse
import hypertune
import os
import subprocess
import sys
import pandas as pd
from sklearn import metrics
from xgboost import XGBClassifier

def train_xgboost(args):
    """ Train an XGBoost model
    Args:
        args: structure with the following fields:
            job_dir, str, local or gcs directory to store the trained model
            train_feature_name, str, name of the train feature csv
            train_label_name, str, name of train label csv
            no_classes, int, number of prediction classes in the model
            n_estimators, int, number of estimators (hypertune)
            max_depth, int, maximum depth of trees (hypertune)
            booster, str, type of boosters (hypertune)
    Return:
        xgboost model object
    
    """
    
    x_train = pd.read_csv(args.train_feature_name)
    y_train = pd.read_csv(args.train_label_name)
   
    # ---------------------------------------
    # Train model
    # ---------------------------------------

    params = {
        'n_estimators': args.n_estimators,
        'max_depth': args.max_depth,
        'booster': args.booster,
        'min_child_weight': 1,
        'learning_rate': 0.1,
        'gamma': 0,
        'subsample': 1,
        'colsample_bytree': 1,
        'reg_alpha': 0,
        'objective': 'multi:softprob',
        'num_class': args.no_classes,
        }
    xgb_model = XGBClassifier(**params, use_label_encoder=False)
    xgb_model.fit(x_train, y_train)

    # ---------------------------------------
    # Save the model to local
    # ---------------------------------------

    temp_name = './model.bst'
    bst = xgb_model.get_booster()
    bst.save_model(temp_name)
    
    # ---------------------------------------
    # Move local model to gcs
    # ---------------------------------------
    
    target_path = os.path.join(args.job_dir, 'model.bst')
    if temp_name != target_path:
        subprocess.check_call(['gsutil', 'cp', temp_name, target_path],
            stderr=sys.stdout)

    return xgb_model
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", type=str, help="Required by ai platform training", default='./')
    parser.add_argument("--train_feature_name", type=str, help="Path to training feature csv file")
    parser.add_argument("--train_label_name", type=str, help="Path to training label csv file")
    parser.add_argument("--no_classes", type=int, help="Number of target classes in the label")
    parser.add_argument("--n_estimators", type=int, help="Number of estimators in the xgboost model")
    parser.add_argument("--max_depth", type=int, help="Maximum depth of trees in xgboost")
    parser.add_argument("--booster", type=str, help="Type of booster")
    args = parser.parse_args()
    model = train_xgboost(args)

Create another version of the training script that reports a summary metric for hyperparameter tuning.
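
The script below scores the model by binarizing the true and predicted class labels and passing them to `roc_auc_score` with macro averaging. Because the inputs are hard class labels rather than probabilities, each class's ROC curve has a single interior point, and its area reduces to (TPR + TNR) / 2. The pure-Python sketch below works through that arithmetic; the function name `macro_auc_from_labels` and the toy labels are illustrative assumptions, not part of the tutorial.

```python
def macro_auc_from_labels(y_true, y_pred):
    """Macro-averaged one-vs-rest AUC when predictions are hard class labels."""
    classes = sorted(set(y_true))
    if len(classes) < 2:
        raise ValueError("AUC is undefined when y_true contains a single class")

    per_class = []
    for c in classes:
        pos = sum(1 for t in y_true if t == c)
        neg = len(y_true) - pos
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t != c and p != c)
        # With 0/1 scores the ROC curve runs (0,0) -> (FPR,TPR) -> (1,1),
        # so the area under it is (TPR + TNR) / 2.
        per_class.append((tp / pos + tn / neg) / 2)
    return sum(per_class) / len(per_class)


# Toy labels, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_auc_from_labels(y_true, y_pred))
```

One consequence of this design is that the reported metric ignores how confident the model is. Scoring predicted probabilities instead would give a finer-grained signal, at the cost of changing the script.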

In [None]:
%%writefile ./trainer/train_hpt.py

# python3
# ==============================================================================
# Copyright 2020 Google LLC. This software is provided as-is, without warranty
# or representation for any use or purpose. Your use of it is subject to your
# agreement with Google.
# ==============================================================================

import argparse
import hypertune
import os
import subprocess
import sys
import pandas as pd
from sklearn import metrics
from xgboost import XGBClassifier

from sklearn import preprocessing

def train_xgboost(args):
    """ Train an XGBoost model
    Args:
        args: structure with the following fields:
            job_dir, str, local or gcs directory to store the trained model
            train_feature_name, str, name of the train feature csv
            train_label_name, str, name of train label csv
            no_classes, int, number of prediction classes in the model
            n_estimators, int, number of estimators (hypertune)
            max_depth, int, maximum depth of trees (hypertune)
            booster, str, type of boosters (hypertune)
    Return:
        xgboost model object
    
    """
    
    x_train = pd.read_csv(args.train_feature_name)
    y_train = pd.read_csv(args.train_label_name)
   
    # ---------------------------------------
    # Train model
    # ---------------------------------------

    params = {
        'n_estimators': args.n_estimators,
        'max_depth': args.max_depth,
        'booster': args.booster,
        'min_child_weight': 1,
        'learning_rate': 0.1,
        'gamma': 0,
        'subsample': 1,
        'colsample_bytree': 1,
        'reg_alpha': 0,
        'objective': 'multi:softprob',
        'num_class': args.no_classes,
        }
    xgb_model = XGBClassifier(**params, use_label_encoder=False)
    print(x_train.shape)
    print(y_train.shape)
    xgb_model.fit(x_train, y_train)

    # ---------------------------------------
    # Save the model to local
    # ---------------------------------------

    temp_name = 'model.bst'
    bst = xgb_model.get_booster()
    bst.save_model(temp_name)
    
    # ---------------------------------------
    # Move local model to gcs
    # ---------------------------------------
    
    subprocess.check_call(['gsutil', 'cp', temp_name, os.path.join(args.job_dir, 'model.bst')],
        stderr=sys.stdout)

    return xgb_model

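def test_xgboost(xgb_model, args):
    """ Score a trained model on the validation set
    Args:
        xgb_model: trained xgboost model object
        args: structure with val_feature_name and val_label_name fields
    Return:
        macro-averaged ROC AUC computed from predicted class labels
    """

    # Load validation data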
    x_val = pd.read_csv(args.val_feature_name)
    y_val = pd.read_csv(args.val_label_name)
    
    # Perform predictions
    pred_val = xgb_model.predict(x_val)
    
    # One-hot encoding class labels
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_val)
    y_val = lb.transform(y_val)
    pred_val = lb.transform(pred_val)

    # Define the score we want to use to evaluate the classifier on
    score = metrics.roc_auc_score(y_val, pred_val, average='macro')
    return score
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", type=str, help="Required by ai platform training", default='./')
    parser.add_argument("--train_feature_name", type=str, help="Path to training feature csv file")
    parser.add_argument("--train_label_name", type=str, help="Path to training label csv file")
    parser.add_argument("--val_feature_name", type=str, help="Path to validation feature csv file")
    parser.add_argument("--val_label_name", type=str, help="Path to validation label csv file")
    parser.add_argument("--no_classes", type=int, help="Number of target classes in the label")
    parser.add_argument("--n_estimators", type=int, help="Number of estimators in the xgboost model")
    parser.add_argument("--max_depth", type=int, help="Maximum depth of trees in xgboost")
    parser.add_argument("--booster", type=str, help="Type of booster")
    args = parser.parse_args()

    xgb_model = train_xgboost(args)
    score = test_xgboost(xgb_model, args)
    
    # The default name of the metric is training/hptuning/metric. 
    # We recommend that you assign a custom name. The only functional difference is that 
    # if you use a custom name, you must set the hyperparameterMetricTag value in the 
    # HyperparameterSpec object in your job request to match your chosen name.
    # https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#HyperparameterSpec
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        metric_value=score,
        hyperparameter_metric_tag='roc_auc',
        global_step=1000
    )

### Configure for AI Platform Training
Create a config file for Cloud AI Platform Training.
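
The config file written below is typically handed to the service through the `--config` flag of `gcloud ai-platform jobs submit training`. The sketch below only assembles such a command as an argument list and does not run it. The helper `build_submit_command` is hypothetical, and the job name, bucket, region, version numbers, and CSV paths in the usage example are placeholders rather than values taken from this tutorial.

```python
def build_submit_command(job_name, job_dir, region, runtime_version,
                         python_version, config_path, user_args):
    """Assemble a `gcloud ai-platform jobs submit training` command."""
    command = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--job-dir", job_dir,
        "--package-path", "./trainer",
        "--module-name", "trainer.train",
        "--region", region,
        "--runtime-version", runtime_version,
        "--python-version", python_version,
        "--config", config_path,
        "--",  # everything after this separator is passed to trainer/train.py
    ]
    for name, value in user_args.items():
        command.append(f"--{name}={value}")
    return command


# Placeholder values, for illustration only.
command = build_submit_command(
    job_name="xgboost_train_example",
    job_dir="gs://example-bucket/jobs/xgboost_train_example",
    region="us-central1",
    runtime_version="2.3",
    python_version="3.7",
    config_path="./config/config.yaml",
    user_args={
        "train_feature_name": "gs://example-bucket/data/x_train.csv",
        "train_label_name": "gs://example-bucket/data/y_train.csv",
        "no_classes": 2,
        "n_estimators": 100,
        "max_depth": 5,
        "booster": "gbtree",
    },
)
print(" ".join(command))
```

Keeping the command as a list makes it easy to pass to `subprocess.check_call` later without shell quoting issues.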

In [None]:
%%writefile ./config/config.yaml
# python3
# ==============================================================================
# Copyright 2020 Google LLC. This software is provided as-is, without warranty
# or representation for any use or purpose. Your use of it is subject to your
# agreement with Google.
# ==============================================================================

#trainingInput:
#  scaleTier: CUSTOM
#  masterType: n1-highmem-8
#  masterConfig:
#    acceleratorConfig:
#      count: 1
#      type: NVIDIA_TESLA_T4

trainingInput:
  scaleTier: STANDARD_1


### Configure for Hyperparameter Tuning
Similarly, create a config file for Cloud AI Platform hyperparameter tuning. This file must also define the hyperparameter search space.

The supported hyperparameter types are listed in the job reference documentation. In the ParameterSpec object, you specify the type for each hyperparameter and the related value ranges as described in the following table:

|Type        |Value ranges         |Value data                        |
|------------|---------------------|----------------------------------|
|DOUBLE      |minValue & maxValue  |Floating-point values             |
|INTEGER     |minValue & maxValue  |Integer values                    |
|CATEGORICAL |categoricalValues    |List of category strings          |
|DISCRETE    |discreteValues       |List of values in ascending order |
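
The table above can be turned into a quick local check before a tuning job is submitted. The following is a minimal sketch under stated assumptions: `check_param_spec` and `REQUIRED_FIELDS` are hypothetical names, each parameter is represented as a plain dict with the same keys as the YAML file, and the checks cover only the rules listed in the table, not everything the service validates.

```python
# Fields each hyperparameter type needs, following the table above.
REQUIRED_FIELDS = {
    "DOUBLE": ("minValue", "maxValue"),
    "INTEGER": ("minValue", "maxValue"),
    "CATEGORICAL": ("categoricalValues",),
    "DISCRETE": ("discreteValues",),
}


def check_param_spec(spec):
    """Return a list of problems found in one ParameterSpec-style dict."""
    param_type = spec.get("type")
    if param_type not in REQUIRED_FIELDS:
        return [f"unknown type: {param_type!r}"]

    problems = []
    for field in REQUIRED_FIELDS[param_type]:
        if field not in spec:
            problems.append(f"{param_type} parameter is missing {field}")
    if problems:
        return problems

    if param_type in ("DOUBLE", "INTEGER"):
        if spec["minValue"] > spec["maxValue"]:
            problems.append("minValue is greater than maxValue")
    if param_type == "DISCRETE":
        values = list(spec["discreteValues"])
        if values != sorted(values):
            problems.append("discreteValues are not in ascending order")
    return problems


# The search space used in this tutorial, written as Python dicts.
params = [
    {"parameterName": "max_depth", "type": "INTEGER", "minValue": 3, "maxValue": 8},
    {"parameterName": "n_estimators", "type": "INTEGER", "minValue": 50, "maxValue": 200},
    {"parameterName": "booster", "type": "CATEGORICAL",
     "categoricalValues": ["gbtree", "gblinear", "dart"]},
]
for spec in params:
    print(spec["parameterName"], check_param_spec(spec))
```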


In [None]:
%%writefile ./config/config_hpt.yaml

# python3
# ==============================================================================
# Copyright 2020 Google LLC. This software is provided as-is, without warranty
# or representation for any use or purpose. Your use of it is subject to your
# agreement with Google.
# ==============================================================================


# config_hpt.yaml
trainingInput:
  scaleTier: STANDARD_1
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 5
    maxParallelTrials: 5
    hyperparameterMetricTag: roc_auc
    enableTrialEarlyStopping: TRUE
    params:
      - parameterName: max_depth
        type: INTEGER
        minValue: 3
        maxValue: 8
      - parameterName: n_estimators
        type: INTEGER
        minValue: 50
        maxValue: 200
      - parameterName: booster
        type: CATEGORICAL
        categoricalValues: [
          "gbtree",
          "gblinear",
          "dart"
        ]
