# Module 4: Model training

In module 4, you train a simple binary classification model with XGBoost, using the Amazon SageMaker open source XGBoost container (https://github.com/aws/sagemaker-xgboost-container). Using XGBoost as a framework provides more flexibility than using it as a built-in algorithm, because you can incorporate pre-processing and post-processing scripts, or any other custom logic, into your training script.

For training, you will use the pre-processed data generated by the processing job in the previous step.

Import the modules and initialise session variables.

In [None]:
import sagemaker
import boto3

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
prefix = 'end-to-end-ml'

print(region)
print(role)
print(bucket_name)

In [None]:
%store -r experiment_name

print(experiment_name)

## Training

The training code is implemented in the `source_dir/training.py` file.

The script parses the arguments that are passed when the XGBoost Docker container code invokes the script for execution. These arguments represent the hyperparameters that you specify when starting the training job, plus the locations of the training and validation data. The script then loads the training and validation data and runs XGBoost training with the provided parameters.

<strong>Note</strong>: this behavior, called Script Mode, is enabled by the sagemaker-training-toolkit library (https://github.com/aws/sagemaker-training-toolkit), which is installed in the XGBoost container and facilitates the development of SageMaker-compatible Docker containers.
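To make the argument handling described above more concrete, here is a minimal sketch of how a Script Mode entry point might parse its inputs. It is not the content of `source_dir/training.py` (the next cell prints the actual script). It assumes the toolkit passes hyperparameters as `--name value` command-line arguments and exposes data and model locations through the `SM_CHANNEL_TRAIN`, `SM_CHANNEL_VALIDATION`, and `SM_MODEL_DIR` environment variables; the `parse_args` helper and the fallback paths are illustrative choices.

```python
import argparse
import os


def parse_args(argv=None):
    """Illustrative parser for a Script Mode entry point (not the real training.py)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments; argparse converts
    # the string values to the declared types.
    parser.add_argument("--max_depth", type=int, default=3)
    parser.add_argument("--eta", type=float, default=0.1)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    parser.add_argument("--num_round", type=int, default=10)

    # Data and model locations are read from environment variables set inside
    # the container; the fallback paths are assumptions for local runs.
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--validation",
        default=os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"),
    )
    parser.add_argument(
        "--model_dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )

    # parse_known_args ignores any extra arguments the container may pass.
    args, _ = parser.parse_known_args(argv)
    return args


# Local usage example with two overridden hyperparameters.
args = parse_args(["--max_depth", "5", "--eta", "0.2"])
print(args.max_depth, args.eta, args.num_round)
```

Parsing with `parse_known_args` rather than `parse_args` is a defensive choice in this sketch: the script keeps working if it receives arguments it does not declare.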

In [None]:
!pygmentize source_dir/training.py

Once the script is ready, you can use the XGBoost estimator of the Amazon SageMaker Python SDK to start training.

In [None]:
from sagemaker.xgboost import XGBoost

hyperparameters = {
    "max_depth": "3",
    "eta": "0.1",
    "gamma": "0",
    "min_child_weight": "1",
    "silent": "0",
    "objective": "binary:logistic",
    "num_round": "10",
    "eval_metric": "auc"
}

entry_point = 'training.py'
source_dir = 'source_dir/'
output_path = 's3://{0}/{1}/output/'.format(bucket_name, prefix)
code_location = 's3://{0}/{1}/code'.format(bucket_name, prefix)

estimator = XGBoost(
    base_job_name="end-to-end-ml-sm-xgb",
    entry_point=entry_point,
    source_dir=source_dir,
    output_path=output_path,
    code_location=code_location,
    hyperparameters=hyperparameters,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="0.90-2",
    py_version="py3",
    role=role
)

In [None]:
from sagemaker.experiments.run import Run
import time

run_name=f'training-{time.strftime("%H-%M-%S", time.localtime())}'
run_display_name="xgboost-training"

train_config = sagemaker.TrainingInput('s3://{0}/{1}/data/preprocessed/train/'.format(
    bucket_name, prefix), content_type='text/csv')
val_config = sagemaker.TrainingInput('s3://{0}/{1}/data/preprocessed/val/'.format(
    bucket_name, prefix), content_type='text/csv')

with Run(
    experiment_name=experiment_name,
    run_name=run_name,
    run_display_name=run_display_name,
    sagemaker_session=sagemaker_session,
) as run:

    estimator.fit({'train': train_config, 'validation': val_config})

### Experiment analytics

As before, you can visualize experiment analytics either from the Experiments plug-in in Amazon SageMaker Studio ([learn more](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-view-compare.html)) or by using the SDK from a notebook, as follows:

In [None]:
from sagemaker.analytics import ExperimentAnalytics

analytics = ExperimentAnalytics(experiment_name=experiment_name)
analytics.dataframe()

## You have completed module 4

You have completed the training step. The serialized model is now saved in Amazon S3, under the `output_path` defined above.
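As a rough illustration of where the artifact ends up, the sketch below builds the S3 URI you would typically expect for a given training job. It assumes the usual SageMaker layout of `<output_path>/<job-name>/output/model.tar.gz`; the `expected_model_uri` helper, the bucket name, and the job name are hypothetical. In the notebook, the authoritative value is `estimator.model_data`, available after `fit` completes.

```python
def expected_model_uri(output_path, job_name):
    """Build the S3 URI where a training job's model artifact is expected.

    Assumes the layout <output_path>/<job-name>/output/model.tar.gz.
    """
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Hypothetical bucket and job name, for illustration only.
uri = expected_model_uri(
    "s3://my-bucket/end-to-end-ml/output/",
    "end-to-end-ml-sm-xgb-2024-01-01-00-00-00-000",
)
print(uri)
```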

Proceed to module 5 to deploy the model.