# Model Training and Tuning

In this notebook we will train the model on the data we prepared in [Module 1: Preprocessing](../01_preprocessing/data_preprocessing.ipynb), using the AWS-managed TensorFlow container and a training script that defines the classification model.

## Import modules and initialize parameters for this notebook

In [None]:
import sagemaker
from sagemaker import get_execution_role
import boto3

role = get_execution_role()
sagemaker_session = sagemaker.Session()
boto3_session = boto3.Session()
sagemaker_client = boto3_session.client("sagemaker")

account = sagemaker_session.account_id()
region = sagemaker_session.boto_region_name
default_bucket = sagemaker_session.default_bucket() # or use your own custom bucket name
prefix = 'cv-sagemaker-immersionday'

In [None]:
processed_data_s3_uri = f's3://{default_bucket}/{prefix}/outputs'
model_output_path = f's3://{default_bucket}/{prefix}/outputs/model'

## Automatic Model Tuning

[Amazon SageMaker automatic model tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html), also known as hyperparameter optimization (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.
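To make the idea concrete, here is a minimal local sketch of a hyperparameter search: sample candidate values, score each one, and keep the best. It uses plain random sampling and a made-up `toy_objective` function in place of a real training job, so it only illustrates the loop. It is not how SageMaker picks candidates (the service uses Bayesian optimization by default), and the function names are hypothetical.

```python
import random

def toy_objective(dropout, batch_size):
    # Hypothetical stand-in for "train a model and report val_acc".
    # It peaks near dropout=0.6 and batch_size=32; purely illustrative.
    return 1.0 - abs(dropout - 0.6) - abs(batch_size - 32) / 1000

def random_search(n_trials, seed=0):
    # Sample candidates from the same ranges this notebook tunes over
    # and keep the one with the highest objective value.
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        candidate = {
            "dropout": rng.uniform(0.5, 0.8),
            "batch-size": rng.choice([8, 16, 32, 64, 128, 256]),
        }
        score = toy_objective(candidate["dropout"], candidate["batch-size"])
        if best is None or score > best["score"]:
            best = {"score": score, **candidate}
    return best

best = random_search(n_trials=10)
print(best)
```

In the managed service, each trial is a full training job, which is why the number of trials and the degree of parallelism are explicit settings.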

### Configure HPO Job
Next, specify the following configuration for the tuning job:
- hyperparameters that SageMaker Automatic Model Tuning will tune: `dropout`, `batch-size`
- maximum number of training jobs it will run to optimize the objective metric: `2` (kept low for this demo; raise it for a more thorough search)
- number of parallel training jobs that will run in the tuning job: `2`
- the objective metric that Automatic Model Tuning will optimize is the validation accuracy: `val_acc`
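In script mode, SageMaker passes each hyperparameter to the entry point as a command-line argument, so the training script has to parse the tuned values itself. The script `train-mobilenet.py` is not shown in this notebook; the sketch below is only an assumption about how such a script might read `dropout` and `batch-size` alongside the fixed hyperparameters. The default values are illustrative.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Tuned hyperparameters (names match the tuning ranges in this notebook).
    parser.add_argument("--dropout", type=float, default=0.5)
    parser.add_argument("--batch-size", type=int, default=32)
    # Fixed hyperparameters shared by every trial.
    parser.add_argument("--initial_epochs", type=int, default=5)
    parser.add_argument("--fine_tuning_epochs", type=int, default=20)
    parser.add_argument("--data_dir", type=str, default="/opt/ml/input/data")
    # parse_known_args ignores any extra arguments the platform may add.
    args, _ = parser.parse_known_args(argv)
    return args

# Simulate the arguments one tuning trial might receive.
args = parse_args(["--dropout", "0.65", "--batch-size", "64"])
print(args.dropout, args.batch_size)  # prints: 0.65 64
```

Note that `argparse` exposes `--batch-size` as `args.batch_size`, with the hyphen converted to an underscore.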

In [None]:
metric_definitions = [
    {'Name': 'loss',      'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'acc',       'Regex': 'accuracy: ([0-9\\.]+)'},
    {'Name': 'val_loss',  'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'val_acc',   'Regex': 'val_accuracy: ([0-9\\.]+)'}
]
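Before launching a job, it can help to check these patterns locally against a representative log line. The sketch below applies the same regular expressions with Python's `re` module. The `sample_line` is a hypothetical line in the style Keras prints at the end of an epoch, not output captured from this training script.

```python
import re

metric_definitions = [
    {"Name": "loss", "Regex": "loss: ([0-9\\.]+)"},
    {"Name": "acc", "Regex": "accuracy: ([0-9\\.]+)"},
    {"Name": "val_loss", "Regex": "val_loss: ([0-9\\.]+)"},
    {"Name": "val_acc", "Regex": "val_accuracy: ([0-9\\.]+)"},
]

# Hypothetical end-of-epoch log line (assumed format).
sample_line = (
    "10/10 - 3s - loss: 0.4213 - accuracy: 0.8125 "
    "- val_loss: 0.5120 - val_accuracy: 0.7812"
)

def extract_metrics(line, definitions):
    # Return every value each pattern captures in the given line.
    return {d["Name"]: re.findall(d["Regex"], line) for d in definitions}

matches = extract_metrics(sample_line, metric_definitions)
for name, values in matches.items():
    print(name, values)
```

On this sample line, `val_loss` and `val_acc` each capture exactly one value, while the unanchored `loss` and `acc` patterns also match inside `val_loss` and `val_accuracy`. The objective metric `val_acc` is unaffected, but the training-metric patterns may be worth tightening if those values matter to you.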

In [None]:
from sagemaker.inputs import TrainingInput
from sagemaker.tensorflow import TensorFlow

TF_FRAMEWORK_VERSION = '2.4.1'
DISTRIBUTION = {'parameter_server': {'enabled': False}}
DISTRIBUTION_MODE = 'FullyReplicated'  # every training instance receives a full copy of the data

training_instance_type = 'ml.c5.4xlarge'
training_instance_count = 1
# Hyperparameters that stay fixed across all tuning trials
shared_hyperparameters = {'initial_epochs': 5, 'fine_tuning_epochs': 20, 'data_dir': '/opt/ml/input/data'}

estimator = TensorFlow(
    entry_point="train-mobilenet.py",
    source_dir="code",
    instance_type=training_instance_type,
    instance_count=training_instance_count,   
    hyperparameters=shared_hyperparameters,    
    metric_definitions=metric_definitions,     
    role=role,
    framework_version=TF_FRAMEWORK_VERSION, 
    py_version='py37',     
    base_job_name=prefix,
    script_mode=True
)

In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "dropout": ContinuousParameter(0.5, 0.8),
    "batch-size": CategoricalParameter([8, 16, 32, 64, 128, 256]),
}

objective_metric_name = "val_acc"

train_in = TrainingInput(s3_data=processed_data_s3_uri +'/train', distribution=DISTRIBUTION_MODE)
val_in   = TrainingInput(s3_data=processed_data_s3_uri +'/valid', distribution=DISTRIBUTION_MODE)
test_in  = TrainingInput(s3_data=processed_data_s3_uri +'/test', distribution=DISTRIBUTION_MODE)

inputs = {'train':train_in, 'test': test_in, 'validation': val_in}

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions=metric_definitions,
    objective_type="Maximize",
    max_jobs=2,  # kept low for demo purposes
    max_parallel_jobs=2,
    base_tuning_job_name="cv-hpo",
)

tuner.fit(inputs)

## Viewing the experiment associated with the HPO job

[SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) helps you organize, track, compare, and evaluate machine learning (ML) experiments and model versions. Since ML is a highly iterative process, Experiments helps data scientists and ML engineers explore thousands of different models in an organized manner. This is especially useful with tools like [Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) and [Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html), which explore a large number of combinations automatically and let you quickly zoom in on high-performing models.

The tuning job has an experiment automatically associated with it and we can explore the results in the SageMaker user interface. 

![hpo-experiment](statics/hpo-experiment.png)

You can also access the best model programmatically, as shown below:

In [None]:
best_model_uri = tuner.best_estimator().latest_training_job.describe()['ModelArtifacts']['S3ModelArtifacts']

print(f"\nBest model artifact file is uploaded here: {best_model_uri}")
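If you want to see how the best job was chosen, or rank the jobs yourself, the per-job summaries of a tuning job can be compared directly. The sketch below picks the best job from a list of summary dictionaries. The field names follow the `TrainingJobSummaries` entries returned by the SageMaker `ListTrainingJobsForHyperParameterTuningJob` API as I understand them, and the sample data is invented for illustration.

```python
def best_training_job(summaries, maximize=True):
    # Only jobs that reported a final objective metric can be ranked.
    scored = [
        s for s in summaries
        if "FinalHyperParameterTuningJobObjectiveMetric" in s
    ]
    if not scored:
        return None
    pick = max if maximize else min
    return pick(
        scored,
        key=lambda s: s["FinalHyperParameterTuningJobObjectiveMetric"]["Value"],
    )

# Hypothetical summaries; real ones come from the SageMaker API.
sample_summaries = [
    {
        "TrainingJobName": "cv-hpo-001",
        "TunedHyperParameters": {"dropout": "0.55", "batch-size": "32"},
        "FinalHyperParameterTuningJobObjectiveMetric": {"MetricName": "val_acc", "Value": 0.81},
    },
    {
        "TrainingJobName": "cv-hpo-002",
        "TunedHyperParameters": {"dropout": "0.62", "batch-size": "16"},
        "FinalHyperParameterTuningJobObjectiveMetric": {"MetricName": "val_acc", "Value": 0.86},
    },
    {
        # A job without a final metric (for example, one that failed).
        "TrainingJobName": "cv-hpo-003",
        "TunedHyperParameters": {"dropout": "0.70", "batch-size": "64"},
    },
]

best_job = best_training_job(sample_summaries)
print(best_job["TrainingJobName"], best_job["TunedHyperParameters"])
```

With a completed tuning job, the summaries could be fetched with the `sagemaker_client` created at the top of this notebook, for example `sagemaker_client.list_training_jobs_for_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuner.latest_tuning_job.name)["TrainingJobSummaries"]`.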

Copy the best model into the prefix used for this project. Note that it is possible to reference the tuner's output directly, as above, but given the modularity of this workshop, the artifacts are copied to a known location in S3.

In [None]:
!aws s3 cp {best_model_uri} {model_output_path}/model.tar.gz
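The same copy can be done from Python instead of the AWS CLI. The sketch below shows a small helper, `split_s3_uri` (a hypothetical name), that splits an S3 URI into bucket and key, which is the form the boto3 S3 client expects. The example URI is a placeholder, and the actual copy call is left commented out because it needs AWS credentials and existing objects.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    # Split "s3://bucket/some/key" into ("bucket", "some/key").
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

# Placeholder URI for illustration only.
src_bucket, src_key = split_s3_uri(
    "s3://example-bucket/cv-hpo-001/output/model.tar.gz"
)
print(src_bucket, src_key)

# With real URIs (requires AWS credentials), the copy might look like:
# import boto3
# s3 = boto3.client("s3")
# dest_bucket, dest_key = split_s3_uri(f"{model_output_path}/model.tar.gz")
# s3.copy({"Bucket": src_bucket, "Key": src_key}, dest_bucket, dest_key)
```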