<span style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">An Exception was encountered at '<a href="#papermill-error-cell">In [4]</a>'.</span>

# Multiclass classification with Amazon SageMaker XGBoost algorithm
_**Single machine and distributed training for multiclass classification with Amazon SageMaker XGBoost algorithm**_

---

---
## Contents

1. [Introduction](#Introduction)
2. [Prerequisites and Preprocessing](#Prequisites-and-Preprocessing)
  1. [Permissions and environment variables](#Permissions-and-environment-variables)
  2. [Data ingestion](#Data-ingestion)
  3. [Data conversion](#Data-conversion)
3. [Training the XGBoost model](#Training-the-XGBoost-model)
  1. [Training on a single instance](#Training-on-a-single-instance)
  2. [Training on multiple instances](#Training-on-multiple-instances)
4. [Set up hosting for the model](#Set-up-hosting-for-the-model)
  1. [Import model into hosting](#Import-model-into-hosting)
  2. [Create endpoint configuration](#Create-endpoint-configuration)
  3. [Create endpoint](#Create-endpoint)
5. [Validate the model for use](#Validate-the-model-for-use)

---
## Introduction


This notebook demonstrates the use of Amazon SageMaker’s implementation of the XGBoost algorithm to train and host a multiclass classification model. The MNIST dataset is used for training. It has a training set of 60,000 examples and a test set of 10,000 examples. To illustrate the use of libsvm training data format, we download the dataset and convert it to the libsvm format before training.

To get started, we need to set up the environment with a few prerequisites for permissions and configurations.

---
## Prequisites and Preprocessing
This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

### Permissions and environment variables

Here we set up the linkage and authentication to AWS services. Two things are needed:

1. The roles used to give learning and hosting access to your data. See the documentation for how to specify these.
2. The S3 buckets that you want to use for training and model data, and where the downloaded data is located.

In [None]:
%%time
import pprint
import os
import boto3
import json
import re
import copy
import time
from time import gmtime, strftime
import sagemaker
from aws_orbit_sdk.common import get_workspace
workspace = get_workspace()
role = workspace['EksPodRoleArn']
pprint.pprint(workspace)

In [None]:
env_name = %env AWS_ORBIT_ENV
team_name = %env AWS_ORBIT_TEAM_SPACE
user_name = %env USERNAME
namespace = %env AWS_ORBIT_USER_SPACE
(env_name,team_name, user_name, namespace)

In [13]:
# Parameters
PAPERMILL_INPUT_PATH = "/tmp/e1@20210505-15:02.ipynb"
PAPERMILL_OUTPUT_PATH = "shared/regression/notebooks/H-Model-Development/Example-1-SageMaker-xgboost_mnist/e1@20210505-15:02.ipynb"
PAPERMILL_OUTPUT_DIR_PATH = (
    "shared/regression/notebooks/H-Model-Development/Example-1-SageMaker-xgboost_mnist"
)
PAPERMILL_WORKBOOK_NAME = "e1@20210505-15:02.ipynb"
PAPERMILL_WORK_DIR = "/home/jovyan/shared/samples/notebooks/H-Model-Development"


<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>

In [None]:
#env_name = %env AWS_ORBIT_ENV
env_name = workspace["env_name"]
region = workspace["region"]
ssm_parameter_name = (f"/orbit/{env_name}/demo")
ssm_parameter_name

In [None]:
env_name = workspace["env_name"]
ssm_parameter_name = (f"/orbit/{env_name}/demo")

#ssm_parameter_name = "/orbit/dev-env/demo"
ssm_client = boto3.client(service_name="ssm")
demo_json = json.loads(ssm_client.get_parameter(Name=ssm_parameter_name)["Parameter"]["Value"])
demo_json

In [16]:
# Get demo env bucket name from ssm parameter. 
# S3 bucket for saving code and model artifacts.
bucket = demo_json["LakeBucket"].split(":")[-1]
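
The `split(":")[-1]` call above suggests that `LakeBucket` holds an S3 bucket ARN rather than a bare bucket name. That is an assumption read off the code, not something the notebook states. Under that assumption, a slightly more defensive sketch could look like the following; the helper name `bucket_name_from_arn` and the sample values are hypothetical.

```python
def bucket_name_from_arn(value: str) -> str:
    """Return the bucket name from an S3 bucket ARN.

    Assumes ARNs of the form 'arn:aws:s3:::bucket-name'. A bare bucket
    name (no colons) is returned unchanged.
    """
    name = value.split(":")[-1]
    if not name:
        raise ValueError(f"Could not extract a bucket name from {value!r}")
    return name


# Hypothetical sample values, for illustration only.
print(bucket_name_from_arn("arn:aws:s3:::my-demo-lake-bucket"))
print(bucket_name_from_arn("my-demo-lake-bucket"))
```

Failing early on an empty name avoids building paths such as `s3:///landing/...` later in the notebook.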

In [None]:
!aws s3 ls s3://$bucket/landing/data/sagemaker/

In [18]:
prefix = "sagemaker/DEMO-xgboost-multiclass-classification"
# customize to your bucket where you have stored the data
bucket_path = f"s3://{bucket}"

In [None]:
bucket_path 

In [20]:
## Let's check whether MNIST is already staged. If so, we will just reuse what is staged rather than wait for the dataset to be uploaded again.

In [None]:

s3_train = f"{bucket_path}/{prefix}/train/"
s3_validation = f"{bucket_path}/{prefix}/validation/"
s3_test = f"{bucket_path}/{prefix}/test/"
skip_mnist_load = False

s3_train_chk = !aws s3 ls {s3_train}
s3_validation_chk = !aws s3 ls {s3_validation}
s3_test_chk = !aws s3 ls {s3_test}

print(len(s3_train_chk))
print(len(s3_validation_chk))
print(len(s3_test_chk))

if len(s3_train_chk)>0 and len(s3_validation_chk)>0 and len(s3_test_chk)>0:
    skip_mnist_load = True

skip_mnist_load
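
The cell above decides whether data is staged by counting the output lines of `aws s3 ls`. As an alternative, here is a minimal sketch of the same check using the S3 `list_objects_v2` call through a boto3 client. The helper name `prefix_has_objects` is hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
def prefix_has_objects(s3_client, bucket: str, key_prefix: str) -> bool:
    """Return True if at least one object exists under s3://bucket/key_prefix."""
    # MaxKeys=1 keeps the request cheap: we only need to know whether anything exists.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=key_prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


# Possible usage inside the notebook (needs AWS credentials), shown as comments:
# import boto3
# s3_client = boto3.client("s3")
# skip_mnist_load = all(
#     prefix_has_objects(s3_client, bucket, f"{prefix}/{part}/")
#     for part in ("train", "validation", "test")
# )
```

Checking the API response rather than the CLI output means the result does not depend on what the command happens to print.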



### Data ingestion

Next, we read the MNIST dataset [1] from an existing repository into memory for preprocessing prior to training. It was downloaded from this [link](http://deeplearning.net/data/mnist/mnist.pkl.gz) and staged in S3; in this notebook it is read from `mnist_data_path`, under the `landing/data/sagemaker/` prefix of the demo bucket. Processing could instead be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location, after which the data would be transferred to S3 for use in training. For small datasets such as this one, reading into memory isn't onerous, though it would be for larger datasets.

> [1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.

In [None]:
# Import Data from toolkit bucket. 
mnist_data_path = f"s3://{bucket}/landing/data/sagemaker/mnist.pkl.gz"
mnist_data_path

In [None]:
!aws s3 ls $mnist_data_path

In [24]:
if not skip_mnist_load:
    !aws s3 cp $mnist_data_path ./

In [None]:
%%time
import pickle, gzip, numpy, urllib.request, json
if not skip_mnist_load:
    # The context manager closes the file even if unpickling fails.
    with gzip.open('mnist.pkl.gz', 'rb') as f:
        train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

### Data conversion

Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this case, the data is converted from pickled NumPy arrays to the libsvm format before being uploaded to S3. The hosted implementation of XGBoost consumes the libsvm-converted data from S3 for training. The helper functions in this section handle the data conversion, the upload to S3, and the download from S3.
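
For reference, each libsvm line has the form `<label> <index>:<value> ...`. The short sketch below formats a single example the same way the notebook's `to_libsvm` does: 1-based feature indices, with every feature written out, including zeros. The helper name `to_libsvm_line` and the toy vector are illustrative only.

```python
def to_libsvm_line(label, vector):
    """Format one example as '<label> <index>:<value> ...' with 1-based feature indices."""
    features = " ".join(f"{i}:{value}" for i, value in enumerate(vector, start=1))
    return f"{label} {features}"


line = to_libsvm_line(5, [0.0, 0.5, 1.0])
print(line)  # 5 1:0.0 2:0.5 3:1.0
```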

In [None]:
%%time

import struct
import io
import boto3

 
def to_libsvm(f, labels, values):
    # Write one "<label> <index>:<value> ..." line per example, with 1-based feature indices.
    lines = [
        "{} {}".format(label, " ".join("{}:{}".format(i + 1, el) for i, el in enumerate(vec)))
        for label, vec in zip(labels, values)
    ]
    f.write(bytes("\n".join(lines), "utf-8"))
    return f


def write_to_s3(fobj, bucket, key):
    return (
        boto3.Session(region_name=region).resource("s3").Bucket(bucket).Object(key).upload_fileobj(fobj)
    )


def get_dataset():
    import pickle
    import gzip

    with gzip.open("mnist.pkl.gz", "rb") as f:
        # Use the public pickle.load API with the latin1 encoding
        # instead of the private pickle._Unpickler class.
        return pickle.load(f, encoding="latin1")


def upload_to_s3(partition_name, partition):
    labels = [t.tolist() for t in partition[1]]
    vectors = [t.tolist() for t in partition[0]]
    num_partition = 5  # partition file into 5 parts
    partition_bound = int(len(labels) / num_partition)
    for i in range(num_partition):
        f = io.BytesIO()
        to_libsvm(
            f,
            labels[i * partition_bound : (i + 1) * partition_bound],
            vectors[i * partition_bound : (i + 1) * partition_bound],
        )
        f.seek(0)
        key = f"{prefix}/{partition_name}/examples{str(i)}"
        url = f"s3://{bucket}/{key}"
        print(f"Writing to {url}")
        write_to_s3(f, bucket, key)
        print(f"Done writing to {url}")


def download_from_s3(partition_name, number, filename):
    import botocore.exceptions

    key = f"{prefix}/{partition_name}/examples{number}"
    url = f"s3://{bucket}/{key}"
    print(f"Reading from {url}")
    s3 = boto3.resource("s3", region_name=region)
    try:
        # Download once, to the caller-supplied filename.
        s3.Bucket(bucket).download_file(key, filename)
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "404":
            print(f"The object does not exist at {url}.")
        else:
            raise


def convert_data():
    train_set, valid_set, test_set = get_dataset()
    partitions = [("train", train_set), ("validation", valid_set), ("test", test_set)]
    for partition_name, partition in partitions:
        print(f"{partition_name}: {partition[0].shape} {partition[1].shape}")
        upload_to_s3(partition_name, partition)

In [None]:
%%time
if not skip_mnist_load:
    convert_data()
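
`upload_to_s3` slices each partition into five parts of width `int(len(labels) / num_partition)`. This is exact when the partition length is a multiple of five; otherwise the trailing rows would be left out of every slice. A minimal sketch of a remainder-safe split follows. The helper name `split_into_parts` is hypothetical and is not used by the notebook.

```python
def split_into_parts(items, num_parts):
    """Split a sequence into num_parts contiguous slices whose sizes differ by at most one."""
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    base, extra = divmod(len(items), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        # The first `extra` slices each take one additional item.
        end = start + base + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


parts = split_into_parts(list(range(11)), 5)
print([len(p) for p in parts])  # [3, 2, 2, 2, 2]
```

Applying the same boundaries to both `labels` and `vectors` would keep the two lists aligned.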

In [28]:
import random, string
unique_id = ''.join(random.choices(string.ascii_lowercase + string.digits, k=4))
job_yaml = f"xgboost-mnist-sm-{unique_id}.yaml"
job_name = f"xgboost-mnist-sm-operator-{unique_id}"


s3_train = f"{bucket_path}/{prefix}/train/"
s3_validation = f"{bucket_path}/{prefix}/validation/"
s3_models = f"{bucket_path}/{prefix}/models/"


output_bucket = s3_models
training_data_bucket = s3_train
validation_data_bucket = s3_validation


In [29]:
job_definition= f'''apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: {job_name}
spec:
  roleArn: {role}  
  region: {region}
  algorithmSpecification:
    trainingImage: 433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest
    trainingInputMode: File
  outputDataConfig:
    s3OutputPath: {output_bucket}
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: {training_data_bucket}
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None
    - channelName: validation
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: {validation_data_bucket}
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m4.xlarge
    volumeSizeInGB: 5
  hyperParameters:
    - name: max_depth
      value: "5"
    - name: eta
      value: "0.2"
    - name: gamma
      value: "4"
    - name: min_child_weight
      value: "6"
    - name: silent
      value: "0"
    - name: objective
      value: multi:softmax
    - name: num_class
      value: "10"
    - name: num_round
      value: "10"
  stoppingCondition:
    maxRuntimeInSeconds: 86400'''
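
In the manifest above, hyperparameters are listed as `name`/`value` pairs with string values. If the values live in a Python dict, a small helper could render that block at the manifest's indentation. This is only a sketch; the helper name `render_hyperparameters_yaml` is hypothetical, and it quotes every value, which YAML still reads as a string.

```python
def render_hyperparameters_yaml(params, indent=4):
    """Render a dict as YAML '- name: ... / value: "..."' entries at the given indentation."""
    pad = " " * indent
    lines = []
    for name, value in params.items():
        lines.append(f"{pad}- name: {name}")
        lines.append(f'{pad}  value: "{value}"')
    return "\n".join(lines)


# A subset of the hyperparameters used in the job definition above.
print(render_hyperparameters_yaml({"max_depth": 5, "eta": 0.2, "num_class": 10}))
```

The returned text could then be interpolated into the f-string under `hyperParameters:` instead of writing each entry by hand.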

In [30]:
with open(job_yaml, "w") as f:
    f.write(job_definition)

In [None]:
current_job = !kubectl apply -f {job_yaml} -n {team_name}
current_job

In [None]:
import time

# Statuses that mean the job has not reached a terminal state yet.
pending_states = ("InProgress", "SynchronizingK8sJobWithSageMaker", "ReconcilingTrainingJob")

print(job_name)
while True:
    job_status = !kubectl describe trainingjob {job_name} -n {team_name}
    j_s = job_status.grep("Training Job Status")[0]
    if not any(state in j_s for state in pending_states):
        break
    print(job_status.grep("Training Job Status"))
    time.sleep(10)

f_state = !kubectl describe trainingjob {job_name} -n {team_name}
f_state_s = f_state.grep("Training Job Status")[0]
print(f"Final State ---> {f_state_s}")
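
The polling cell above runs until the job leaves its pending states, with no upper bound on the wait. A minimal sketch of the same pattern with a timeout is shown below. The function name `wait_for_terminal_status` and its parameters are hypothetical, and the status source is passed in as a callable so the logic can be tried without a cluster.

```python
import time


def wait_for_terminal_status(get_status, pending_states, poll_seconds=10,
                             timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until its result contains none of pending_states, or time out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if not any(state in status for state in pending_states):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still pending after {timeout_seconds}s: {status}")
        sleep(poll_seconds)


# Demo with a canned sequence of statuses instead of a real kubectl call.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_terminal_status(
    lambda: next(fake_statuses),
    pending_states=("InProgress", "SynchronizingK8sJobWithSageMaker", "ReconcilingTrainingJob"),
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(final)  # Completed
```

In the notebook, `get_status` would wrap the `kubectl describe trainingjob` call and return the `Training Job Status` line.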


In [33]:
# Assert that the final training job status contains 'Completed'.
j_s_final = job_status.grep("Training Job Status")[0]
assert 'Completed' in j_s_final, f"Training job did not complete: {j_s_final}"
print('Successful MNIST Training')
    

In [None]:
# Let's look at the SMLogs

!kubectl smlogs trainingjobs {job_name} -n {team_name}