# Train a model & track your experiments

In this notebook you will train an XGBoost model on the feature datasets prepared in the previous module. The metrics and parameters of each training run are tracked in a SageMaker Experiment.

In [2]:
!pip install sagemaker==2.117
!pip install sagemaker-experiments

Collecting sagemaker-experiments
  Using cached sagemaker_experiments-0.1.42-py3-none-any.whl (42 kB)
Installing collected packages: sagemaker-experiments
Successfully installed sagemaker-experiments-0.1.42

In [3]:
import sagemaker
import json
import boto3
import numpy as np
import pandas as pd
import os
from sagemaker import get_execution_role
from datetime import datetime

# Get user profile name
# Use a context manager so the metadata file is closed after reading
with open('/opt/ml/metadata/resource-metadata.json') as metadataFile:
    metadata = json.load(metadataFile)
userprofileName = metadata['UserProfileName']

# Get default bucket
bucket = sagemaker.Session().default_bucket()
prefix = f'sagemaker/{userprofileName}/mlops-workshop'

# Get SageMaker Execution Role
role = get_execution_role()
region = boto3.Session().region_name

# SageMaker Session
sagemaker_session = sagemaker.session.Session()

### Retrieve variables from previous module

In [4]:
%store -r

In [5]:
print(train_uri)
print(test_uri)
print(val_uri)

s3://sagemaker-ca-central-1-222848388999/sklearn-marketing-process-pplhy997-inta-2022-12-15-16-23-27-984/output/train
s3://sagemaker-ca-central-1-222848388999/sklearn-marketing-process-pplhy997-inta-2022-12-15-16-23-27-984/output/test
s3://sagemaker-ca-central-1-222848388999/sklearn-marketing-process-pplhy997-inta-2022-12-15-16-23-27-984/output/validation


## Training

To train a model in SageMaker, you create a training job. The training job includes the following information:

* The Amazon Elastic Container Registry path where the training code is stored.
* The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you've stored the training data.
* The compute resources that you want SageMaker to use for model training. Compute resources are ML compute instances that are managed by SageMaker.
* The URL of the S3 bucket where you want to store the output of the job.
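The four pieces of information above map onto fields of the low-level `CreateTrainingJob` request. The sketch below only builds such a request dictionary and does not call AWS. The helper name `build_training_request` and every example value (account ID, bucket, image, role) are hypothetical; the estimator used later in this notebook assembles an equivalent request for you.

```python
def build_training_request(job_name, image_uri, train_s3_uri, output_s3_uri, role_arn,
                           instance_type="ml.m5.xlarge", instance_count=1):
    """Assemble a dict shaped like a boto3 `create_training_job` request (nothing is sent)."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        # 1. Container image (Amazon ECR path) that holds the training code
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        # 2. S3 location of the training data
        "InputDataConfig": [{
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # 3. Compute resources managed by SageMaker
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,
        },
        # 4. S3 location for the job output
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# Hypothetical placeholder values, for illustration only
request = build_training_request(
    job_name="example-xgboost-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-xgboost:1",
    train_s3_uri="s3://example-bucket/output/train",
    output_s3_uri="s3://example-bucket/xgb_model",
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",
)
print(sorted(request))
```

In practice a dictionary like this would be passed as keyword arguments to the SageMaker client's `create_training_job` call; the high-level `Estimator` hides that step.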

SageMaker built-in algorithms require the least implementation effort, and they scale well when the dataset is large and significant resources are needed to train and deploy the model. For this use case, we will use the built-in XGBoost algorithm in SageMaker.

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

In [6]:
image_uri = sagemaker.image_uris.retrieve(region=region, framework='xgboost', version='latest')

## Create an Experiment

To keep track of the parameters and metrics that correspond to each training job, we create an Experiment and add the training job to a Trial within that Experiment.

Experiments are organized as follows:
```
Experiment
    Trial
        Trial Component 1
        Trial Component 2
        ...
```

In this notebook, each run of the training job is recorded as a Trial Component, and Trial Components are grouped into Trials, one Trial per iteration of the experiment.
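To make the hierarchy concrete, here is a toy model of it in plain Python. These dataclasses are illustrative stand-ins, not the `smexperiments` classes, and the names and the second `max_depth` value are hypothetical. The sketch shows how flattening the tree gives one row per Trial Component, which is the shape you typically want when comparing runs.

```python
from dataclasses import dataclass, field


@dataclass
class TrialComponent:
    """One tracked step, for example a single training job."""
    display_name: str
    parameters: dict = field(default_factory=dict)


@dataclass
class Trial:
    """One iteration of the experiment; groups related components."""
    name: str
    components: list = field(default_factory=list)


@dataclass
class ExperimentTree:
    """Top-level container for all trials of one experiment."""
    name: str
    trials: list = field(default_factory=list)

    def rows(self):
        """Flatten the tree into one row per trial component."""
        return [
            {"experiment": self.name, "trial": t.name,
             "component": c.display_name, **c.parameters}
            for t in self.trials
            for c in t.components
        ]


# Hypothetical names; max_depth=5 is used later in this notebook, 7 is made up.
tree = ExperimentTree("xgboost-banking-dataset-experiment", trials=[
    Trial("trial-1", [TrialComponent("XGB-Training", {"max_depth": 5})]),
    Trial("trial-2", [TrialComponent("XGB-Training", {"max_depth": 7})]),
])

for row in tree.rows():
    print(row)
```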

In [7]:
current_time = datetime.now().strftime("%d-%m-%Y-%H-%M-%S")
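The timestamp keeps experiment and trial names unique across reruns of the notebook. As a rough sketch, the helper below builds such a name and checks it against a naming rule. The rule (alphanumerics separated by hyphens, at most 120 characters) is an assumption about the SageMaker naming constraint, and `timestamped_name` and `is_valid_name` are hypothetical helpers, not part of any SDK.

```python
import re
from datetime import datetime

# Assumed constraint: alphanumerics separated by hyphens, at most 120 characters.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,119}$")


def timestamped_name(base, now=None):
    """Append a day-month-year-time suffix, mirroring the notebook's format string."""
    now = now or datetime.now()
    return f"{base}-{now.strftime('%d-%m-%Y-%H-%M-%S')}"


def is_valid_name(name):
    """Return True if the name satisfies the assumed naming pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


# Fixed timestamp so the example is reproducible
name = timestamped_name("xgboost-banking-dataset-experiment",
                        datetime(2022, 12, 15, 16, 44, 29))
print(name, is_valid_name(name))
```

Underscores and leading or trailing hyphens fail this check, which is one reason the format string joins the date parts with hyphens only.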

### Create the Experiment

In [8]:
from smexperiments.experiment import Experiment

sm = boto3.client('sagemaker')
xgboost_experiment = Experiment.create(experiment_name=f'xgboost-banking-dataset-experiment-{current_time}')

### Create the Trial

In [9]:
trial = xgboost_experiment.create_trial(trial_name=f'trial-{current_time}')

An estimator is a high-level interface for SageMaker training. We create an estimator object by supplying the required parameters, such as the IAM role, the compute instance count and type, and the S3 output path.

We also supply hyperparameters for the algorithm and then call its `fit()` method to start training the model.

In [10]:
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

model_path = f"s3://{bucket}/{prefix}/xgb_model"

xgb_train = Estimator(
    image_uri=image_uri,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path=model_path,
    role=role,
    sagemaker_session=sagemaker_session
)
xgb_train.set_hyperparameters(
    objective="binary:logistic",
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    silent=0
)

xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=train_uri,
            content_type="text/csv"
        ),
        "validation": TrainingInput(
            s3_data=val_uri,
            content_type="text/csv"
        )
    },
    experiment_config={
        "ExperimentName": xgboost_experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "XGB-Training"
    }
)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: xgboost-2022-12-15-16-44-29-583


2022-12-15 16:44:29 Starting - Starting the training job...ProfilerReport-1671122669: InProgress
...
2022-12-15 16:45:19 Starting - Preparing the instances for training.........
2022-12-15 16:46:55 Downloading - Downloading input data
2022-12-15 16:46:55 Training - Training image download completed. Training in progress..Arguments: train
[2022-12-15:16:47:03:INFO] Running standalone xgboost training.
[2022-12-15:16:47:03:INFO] File size need to be processed in the node: 4.23mb. Available memory size in the node: 8296.21mb
[2022-12-15:16:47:03:INFO] Determined delimiter of CSV input is ','
[16:47:03] S3DistributionType set as FullyReplicated
[16:47:04] 28831x60 matrix with 1729860 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2022-12-15:16:47:04:INFO] Determined delimiter of CSV input is ','
[16:47:04] S3DistributionType set as FullyReplicated
[16:47:04] 6178x60 matrix with

In [None]:
trained_model_uri = xgb_train.model_data
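`model_data` is an S3 URI pointing at the model artifact, which for an estimator typically sits under the configured output path as `<job name>/output/model.tar.gz`. If a later step needs the bucket and key separately, a minimal sketch like the following can split the URI. The helper `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, shaped like the value stored in trained_model_uri
example_uri = "s3://example-bucket/sagemaker/example-user/mlops-workshop/xgb_model/example-job/output/model.tar.gz"
bucket_name, key = split_s3_uri(example_uri)
print(bucket_name)
print(key)
```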

In [None]:
training_image = xgb_train.image_uri

In [None]:
%store trained_model_uri
%store training_image

#### You can now move to the next section of the module `Track all models in a model registry`

The notebook used in that section is `sagemaker-register.ipynb`