## XGBoost Complete Project Workflow in Amazon SageMaker
### Model Training
    
1. [Local Mode training](#LocalModeTraining)
2. [SageMaker hosted training](#SageMakerHostedTraining)
3. [Automatic model tuning](#AutomaticModelTuning)

## Local Mode training <a class="anchor" id="LocalModeTraining"></a>
Local Mode in Amazon SageMaker is a convenient way to confirm that your code works as expected before moving on to full-scale hosted training on a separate, more powerful SageMaker-managed cluster. Local Mode training requires docker-compose, or nvidia-docker-compose for GPU instances. The setup commands run later in this section install whichever one is needed and configure the notebook environment for you.


Amazon SageMaker supports two ways to use the XGBoost algorithm:

- XGBoost built-in algorithm
- XGBoost open source algorithm

The XGBoost open source algorithm provides the following benefits over the built-in algorithm:

- Latest version - The open source XGBoost algorithm typically supports a more recent version of XGBoost. To see the XGBoost version that is currently supported, see XGBoost SageMaker Estimators and Models.
- Flexibility - Take advantage of the full range of XGBoost functionality, such as cross-validation support. You can add custom pre- and post-processing logic and run additional code after training.
- Scalability - The XGBoost open source algorithm has a more efficient implementation of distributed training, which enables it to scale out to more instances and reduce out-of-memory errors.
- Extensibility - Because the XGBoost container is open source, you can extend it to install additional libraries and change the version of XGBoost that the container uses. For an example notebook that shows how to extend SageMaker containers, see Extending our PyTorch containers.

First, we'll import the variables stored from previous notebooks.

In [None]:
import numpy as np
import sagemaker
from parameter_store import ParameterStore

ps = ParameterStore()
parameters = ps.read()

bucket = parameters['bucket']
s3_prefix = parameters['s3_prefix']
raw_s3 = parameters['raw_s3']
train_dir = parameters['train_dir']
test_dir = parameters['test_dir']
train_dir_csv = parameters['train_dir_csv']
test_dir_csv = parameters['test_dir_csv']
role = parameters['role']
sess = sagemaker.Session()

In [None]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/daemon.json    
!/bin/bash ./local_mode_setup.sh

Next, we'll set up an XGBoost Estimator for Local Mode training. Key parameters for the Estimator include:

- `train_instance_type`: the kind of hardware on which training will run. In the case of Local Mode, we simply set this parameter to `local` to invoke Local Mode training on the CPU, or to `local_gpu` if the instance has a GPU.  
- The algorithm’s hyperparameters, which are passed in as a dictionary. 

Recall that we are using Local Mode here mainly to make sure our code is working. Accordingly, instead of performing a full training run with many boosting rounds (the `num_round` hyperparameter), we'll train for only a small number of rounds, just enough to confirm that the code works without spending full-scale training time.
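
The choice between `local` and `local_gpu` can also be made programmatically. The sketch below is one possible approach and is not part of the original workflow: the helper name `pick_local_instance_type` is hypothetical, and it assumes that finding an `nvidia-smi` executable on the PATH is a reasonable proxy for a usable GPU.

```python
import shutil


def pick_local_instance_type(gpu_available=None):
    """Return 'local_gpu' when a GPU appears to be present, else 'local'.

    gpu_available: optional explicit override. When None, the presence of
    the nvidia-smi executable on the PATH is used as a rough heuristic
    (an assumption, not a SageMaker API).
    """
    if gpu_available is None:
        gpu_available = shutil.which("nvidia-smi") is not None
    return "local_gpu" if gpu_available else "local"


print(pick_local_instance_type())
```

Passing `gpu_available=False` forces CPU Local Mode even on a GPU instance, which can be useful for a quick smoke test.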

In [None]:
import pandas as pd

x_train = np.load(f'{train_dir}/x_train.npy')
y_train = np.load(f'{train_dir}/y_train.npy')
x_test = np.load(f'{test_dir}/x_test.npy')
y_test = np.load(f'{test_dir}/y_test.npy')

train_df = pd.DataFrame(data=x_train)
train_df['target'] = y_train
first_col = train_df.pop('target')
train_df.insert(0, 'target', first_col)

test_df = pd.DataFrame(data=x_test)
test_df['target'] = y_test
first_col = test_df.pop('target')
test_df.insert(0, 'target', first_col)

train_df.to_csv('./data/train_csv/train.csv', header=False, index=False)
test_df.to_csv('./data/test_csv/test.csv', header=False, index=False)
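
The cell above moves the target into the first column and writes the files without a header row, which is the CSV layout the built-in XGBoost algorithm expects for training input. A minimal standard-library sketch of the same reshaping follows; the helper `to_label_first_csv` is hypothetical and the values are toy numbers, not taken from the dataset.

```python
import csv
import io


def to_label_first_csv(features, targets):
    """Serialize rows as headerless CSV text with the target in column 0."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row, target in zip(features, targets):
        # Target first, then the feature values in their original order.
        writer.writerow([target, *row])
    return buffer.getvalue()


# Toy values for illustration only.
features = [[0.1, 2.0], [0.3, 4.0]]
targets = [24.0, 21.6]
csv_text = to_label_first_csv(features, targets)
print(csv_text)
```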

In [None]:
from sagemaker.xgboost import XGBoost

train_instance_type = 'local'
hyperparameters = {'num_round': 6}

local_estimator_parameters = {'entry_point':'train_deploy.py',
                              'train_instance_type' : train_instance_type,
                              'train_instance_count': 1,
                              'hyperparameters': hyperparameters,
                              'role' : role,
                              'base_job_name':'xgboost-local-model',
                              'framework_version':'1.0-1',
                              'py_version':'py3'}

local_estimator = XGBoost(**local_estimator_parameters)
inputs = {'train': f'file://{train_dir_csv}',
          'test': f'file://{test_dir_csv}'}
local_estimator.fit(inputs)

In [None]:
local_model_data = local_estimator.model_data

## Experiment tracking <a class="anchor" id="Experiment"></a>
SageMaker Experiments can track all of your model training iterations and is a good way to organize your data science work. You can create an experiment to group the model development work for: [1] a business use case you are addressing (e.g. create an experiment named “customer churn prediction”), or [2] a data science team that owns the experiment (e.g. create an experiment named “marketing analytics experiment”), or [3] a specific data science and ML project. Think of it as a “folder” for organizing your “files”.

In [None]:
import sys

!{sys.executable} -m pip install sagemaker-experiments
!pip install --upgrade boto3==1.16.27
!pip install sagemaker==1.72.0

In [None]:
from smexperiments.experiment import Experiment
from time import gmtime, strftime
import boto3

sm = boto3.client('sagemaker')

xgboost_experiment = Experiment.create(
    experiment_name="boston-housing-regression-{}".format(strftime("%d-%H-%M-%S", gmtime())), 
    description="Boston housing price estimation.", 
    sagemaker_boto_client=sm)
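
The experiment name above is made unique by appending a UTC timestamp. The sketch below isolates that naming step; the helper `timestamped_name` is hypothetical, and the character check is a conservative assumption rather than the exact rule enforced by the service.

```python
import re
from time import gmtime, strftime


def timestamped_name(prefix, when=None):
    """Append a UTC day-hour-minute-second suffix to prefix."""
    when = gmtime() if when is None else when
    return "{}-{}".format(prefix, strftime("%d-%H-%M-%S", when))


name = timestamped_name("boston-housing-regression")
# Letters, digits, and hyphens only: a conservative sanity check.
assert re.fullmatch(r"[A-Za-z0-9-]+", name)
print(name)
```

Because the suffix contains only the day of the month and the time, two names generated at the same moment in different months would be identical. Adding `%Y-%m` to the format string is one way to avoid that.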

## SageMaker hosted training <a class="anchor" id="SageMakerHostedTraining"></a>

Now that we've confirmed our code is working locally, we can move on to SageMaker's hosted training functionality. Hosted training is preferred for actual training, especially large-scale, distributed training. Unlike Local Mode training, hosted training does not run on the notebook instance but on a separate cluster of machines managed by SageMaker. Before starting hosted training, the data must be in S3, or in an EFS or FSx for Lustre file system. We uploaded data to S3 in the previous notebook; the cell below also uploads the CSV files created above so that they can be used as training and validation inputs.

In [None]:
from sagemaker.session import s3_input
import os

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(s3_prefix, 'data/train/train.csv')).upload_file('./data/train_csv/train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(s3_prefix, 'data/test/test.csv')).upload_file('./data/test_csv/test.csv')
s3_input_train_uri = f's3://{bucket}/{s3_prefix}/data/train/train.csv'
s3_input_test_uri = f's3://{bucket}/{s3_prefix}/data/test/test.csv'
s3_input_train = s3_input(s3_input_train_uri, content_type='csv')
s3_input_test = s3_input(s3_input_test_uri, content_type='csv')
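
The cell above builds S3 keys with `os.path.join`, which uses the separator of the host operating system. That produces `/` on a Linux notebook instance, but S3 keys should use forward slashes regardless of where the code runs. A minimal sketch using `posixpath` follows; the helpers `s3_key` and `s3_uri` are hypothetical, and the bucket and prefix are placeholders.

```python
import posixpath


def s3_key(*parts):
    """Join key components with forward slashes, independent of the host OS."""
    return posixpath.join(*[p.strip("/") for p in parts if p])


def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and key components."""
    return "s3://{}/{}".format(bucket, s3_key(*parts))


# Placeholder bucket and prefix, used here for illustration only.
print(s3_uri("example-bucket", "example-prefix", "data/train/train.csv"))
```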

Now we create the XGBoost model and pass in the required hyperparameters. Details of the hyperparameters can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).

We're now ready to set up an Estimator object for hosted training. It is similar to the Local Mode Estimator, except that `train_instance_type` is set to a SageMaker ML instance type instead of `local`. Also, since we know our code is working now, we'll train for a larger number of boosting rounds, with the expectation that training will converge to a lower validation loss.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator

container = get_image_uri(sess.boto_region_name, 'xgboost')
train_instance_type = 'ml.m4.xlarge'
hyperparameters = {'num_round': 8}
estimator = Estimator(container,
                      role=role,
                      train_instance_type=train_instance_type,
                      train_instance_count=1,
                      hyperparameters=hyperparameters,
                      base_job_name='xgboost-hosted-model')

inputs = {'train': s3_input_train, 'validation': s3_input_test}

Once the hosted training job is started with the `fit` method call below, you should see training converge over the larger number of boosting rounds to a validation loss lower than the one achieved in the shorter Local Mode training job. Can we do better? We'll look into a way to do so in the **Automatic Model Tuning** section below.

We will add the experiment name to the `fit()` call to group this training job as part of our experiment.

In [None]:
experiment_config={
    "ExperimentName": xgboost_experiment.experiment_name,
    "TrialComponentDisplayName": "Training",
}

estimator.fit(inputs, experiment_config=experiment_config)

Now that training has finished, we can use SageMaker Experiments to examine the results and see how they compare to other training jobs within the experiment. Right now this is the only job captured in Experiments, but let's take a look anyway to see what data it stores.

In [None]:
from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=xgboost_experiment.experiment_name,
    sort_order="Descending",
    metric_names=['validation:rmse']
)

columns = ['TrialComponentName', 'validation:rmse - Last', 'num_round']
trial_component_analytics.dataframe()[columns]

Another way to look at your training job analytics is by using the `sagemaker.analytics.TrainingJobAnalytics` class in the SageMaker Python SDK.

In [None]:
from sagemaker.analytics import TrainingJobAnalytics

metric_names = ['train:rmse', 'validation:rmse']
TrainingJobAnalytics(estimator.latest_training_job.name, metric_names=metric_names).dataframe()
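
The `train:rmse` and `validation:rmse` metrics report root mean squared error. As a reminder of what that number measures, here is a minimal sketch of the formula in plain Python; the function `rmse` and its inputs are illustrative and do not reproduce how the container computes the metric internally.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy numbers for illustration only: the errors are 3 and 4,
# so the result is sqrt((9 + 16) / 2).
print(rmse([3.0, 5.0], [0.0, 1.0]))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.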

In [None]:
training_job_name = estimator.latest_training_job.name
remote_model_data = estimator.model_data

As with Local Mode training, hosted training produces a model saved in S3 that we can retrieve. This is an example of the modularity of SageMaker: having trained the model in SageMaker, you can now take the model out of SageMaker and run it anywhere else. Alternatively, you can deploy the model into a production-ready environment using SageMaker's hosted endpoints functionality, which is covered in the downstream deployment notebooks.

Retrieving the model from S3 is very easy:  the hosted training estimator you created above stores a reference to the model's location in S3.  You simply copy the model from S3 using the estimator's `model_data` property and unzip it to inspect the contents.

In [None]:
estimator.model_data

In [None]:
!aws s3 cp {estimator.model_data} ./model/model.tar.gz

The unzipped archive should include the assets required by XGBoost to load the model and serve it:

In [None]:
!tar -xvzf ./model/model.tar.gz -C ./model
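
The same extraction can be done from Python with the standard-library `tarfile` module, which also makes it easy to list the archive contents. This is a minimal sketch; the helper `extract_model_archive` is hypothetical, and the commented call assumes the archive was copied to `./model/model.tar.gz` as in the cell above.

```python
import os
import tarfile


def extract_model_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


# Example call (requires the archive to exist locally):
# print(extract_model_archive("./model/model.tar.gz", "./model"))
```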

## Managed Spot Training <a class="anchor" id="ManagedSpotTraining"></a>
    
In this next example we will create a Hosted Training job with Managed Spot Training and Checkpointing enabled.
    
The Managed Spot Training Estimator is similar to the Hosted Training estimator, except that we must add the following arguments:
   * `train_use_spot_instances: True` (Boolean)
   * `train_max_run: 1200` (Integer - Seconds)
   * `train_max_wait: 2400` (Integer - Seconds)
   * `checkpoint_s3_uri: "s3://{}/{}/checkpoint".format(bucket, s3_prefix)` (must be a valid S3 URI)

**Note:** `train_max_wait` must be greater than or equal to `train_max_run`, or you will get an argument exception.
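
The constraint between the two time limits can be checked before the Estimator is created. The sketch below is a plain-Python illustration with a hypothetical helper, `check_spot_limits`; it does not call SageMaker and only mirrors the rule stated in the note.

```python
def check_spot_limits(max_run, max_wait):
    """Raise ValueError unless max_wait is at least max_run (both in seconds).

    Returns the slack, in seconds, left for waiting on Spot capacity
    if the job uses its full run limit.
    """
    if max_run <= 0:
        raise ValueError("max_run must be positive")
    if max_wait < max_run:
        raise ValueError(
            "max_wait ({}s) must be >= max_run ({}s)".format(max_wait, max_run)
        )
    return max_wait - max_run


print(check_spot_limits(max_run=1200, max_wait=2400))
```

With the values used in this notebook, the job may run for up to 20 minutes and the total of waiting plus running is capped at 40 minutes.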

In [None]:
hyperparameters = {'num_round': 8}
estimator = Estimator(container,
                      role=role,
                      train_instance_type=train_instance_type,
                      train_instance_count=1,
                      hyperparameters=hyperparameters,
                      base_job_name='xgboost-hosted-model',
                      train_use_spot_instances=True,
                      train_max_run=1200,
                      train_max_wait=2400,
                      checkpoint_s3_uri=f's3://{bucket}/{s3_prefix}/checkpoint')

inputs = {'train': s3_input_train, 'validation': s3_input_test}

In [None]:
experiment_config={
    "ExperimentName": xgboost_experiment.experiment_name,
    "TrialComponentDisplayName": "Training",
}

In [None]:
estimator.fit(inputs=inputs,
              experiment_config=experiment_config)

In [None]:
from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=xgboost_experiment.experiment_name,
    sort_order="Descending",
    metric_names=['validation:rmse']
)

trial_component_analytics.dataframe()

## Automatic Model Tuning <a class="anchor" id="AutomaticModelTuning"></a>

So far we have simply run one Local Mode training job, one Hosted Training job and one Managed Spot Training job without any real attempt to tune hyperparameters to produce a better model, other than increasing the number of boosting rounds. Selecting the right hyperparameter values to train your model can be difficult, and is typically very time consuming if done manually. The right combination of hyperparameters depends on your data and algorithm: some algorithms have many different hyperparameters that can be tweaked, some are very sensitive to the hyperparameter values selected, and most have a non-linear relationship between model fit and hyperparameter values. SageMaker Automatic Model Tuning helps automate the hyperparameter tuning process: it runs multiple training jobs with different hyperparameter combinations to find the set with the best model performance.

We begin by specifying the hyperparameters we wish to tune, and the range of values over which to tune each one.  We also must specify an objective metric to be optimized:  in this use case, we'd like to minimize the validation loss.

In [None]:
inputs = {'train': s3_input_train, 'validation': s3_input_test}

In [None]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
  'num_round': IntegerParameter(2, 10),
  'alpha': ContinuousParameter(0, 2)
}

objective_metric_name = 'validation:rmse'  # other options include validation:mae, validation:map, validation:auc, validation:error
objective_type = 'Minimize'

Next we specify a HyperparameterTuner object that takes the above definitions as parameters.  Each tuning job must be given a budget:  a maximum number of training jobs.  A tuning job will complete after that many training jobs have been executed.  

We can also specify how much parallelism to employ, in this case two jobs, meaning that the tuning job will complete after two series of two parallel jobs have completed. For the default Bayesian Optimization tuning strategy used here, the tuning search is informed by the results of previous groups of training jobs, so we don't run all of the jobs in parallel, but rather divide the jobs into groups of parallel jobs. There is a trade-off: using more parallel jobs will finish tuning sooner, but will likely sacrifice tuning search accuracy.
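
The relationship between the job budget and the degree of parallelism is simple arithmetic. The sketch below uses a hypothetical helper, `tuning_batches`, and treats the tuning job as strictly sequential rounds of parallel jobs, which is a simplification of how jobs are actually scheduled.

```python
import math


def tuning_batches(max_jobs, max_parallel_jobs):
    """Return how many sequential rounds of parallel jobs a budget implies."""
    if max_jobs < 1 or max_parallel_jobs < 1:
        raise ValueError("both values must be at least 1")
    return math.ceil(max_jobs / max_parallel_jobs)


# (budget, parallelism) pairs chosen for illustration.
for budget, parallel in [(4, 2), (15, 5), (10, 4)]:
    print(budget, parallel, tuning_batches(budget, parallel))
```

Fewer jobs per round means more rounds, so the search has more completed results to learn from before choosing the next hyperparameter values, at the cost of a longer wall-clock time.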

Now we can launch a hyperparameter tuning job by calling the `fit` method of the HyperparameterTuner object. The tuning job may take around 10 minutes to finish. While you're waiting, the status of the tuning job, including metadata and results for individual training jobs within the tuning job, can be checked in the SageMaker console in the **Hyperparameter tuning jobs** panel.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(sess.boto_region_name, 'xgboost')
estimator = sagemaker.estimator.Estimator(container,
                              role=role,
                              train_instance_type='ml.m4.xlarge',
                              train_instance_count=1,
                              hyperparameters=hyperparameters)

In [None]:
tuner_parameters = {'estimator':estimator,
                    'objective_metric_name':objective_metric_name,
                    'hyperparameter_ranges':hyperparameter_ranges,
                    #'metric_definitions':metric_definitions,
                    'objective_type': objective_type,
                    'max_jobs':4,
                    'max_parallel_jobs':2}

tuner = HyperparameterTuner(**tuner_parameters)

tuning_job_name = "xgboost-tuning-{}".format(strftime("%d-%H-%M-%S", gmtime()))
tuner.fit(inputs, job_name=tuning_job_name)
tuner.wait()

After the tuning job is finished, we can use the `HyperparameterTuningJobAnalytics` object from the SageMaker Python SDK to list the best-performing training jobs (up to five). Although the results vary from tuning job to tuning job, the best validation loss from the tuning job (under the FinalObjectiveValue column) will likely be lower than the validation loss from the hosted training job above, where we did not perform any tuning other than manually increasing the number of boosting rounds once.

In [None]:
tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
tuner_metrics.dataframe().sort_values(['FinalObjectiveValue'], ascending=True).head(5)

The total training time and the status of the training jobs can be checked with the following lines of code. Because automatic early stopping is off by default, all the training jobs should complete normally. For an example of a more in-depth analysis of a tuning job, see the SageMaker official sample [HPO_Analyze_TuningJob_Results.ipynb](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb) notebook.

In [None]:
total_time = tuner_metrics.dataframe()['TrainingElapsedTimeSeconds'].sum() / 3600
print("The total training time is {:.2f} hours".format(total_time))
tuner_metrics.dataframe()['TrainingJobStatus'].value_counts()

We'll use the training artifacts created in this notebook in downstream notebooks for model deployment. Store them here for later retrieval.

In [None]:
ps.update({'local_model_data': local_model_data, 'remote_model_data': remote_model_data,
           'training_job_name': training_job_name, 'tuning_job_name': tuning_job_name,
           's3_input_train_uri': s3_input_train_uri, 's3_input_test_uri': s3_input_test_uri})

In [None]:
ps.store()