# Computer Vision (CV) on SageMaker - PyTorch

1. [Introduction](#Introduction)
2. [Prerequisites](#Prerequisites)
3. [Setup](#Setup)
4. [Dataset](#Dataset)
5. [Training a CV model](#Training-a-CV-model)
    1. [TFRecord Data Ingestion](#TFRecord-Data-Ingestion)
    2. [Create Experiment](#Create-Experiment)
    3. [Configure Training](#Configure-Training)
    4. [Analyzing Training Job](#Analyzing-Training-Job)
6. [Hyperparameter tuning Job](#Automatic-Model-Tuning)
    1. [Configure HPO Job](#Configure-HPO-Job)
    2. [Associate HPO to Experiment](#Associate-HPO-to-Experiment)
7. [Clean Up](#Clean-up)

# Introduction
This lab focuses on SageMaker training for computer vision (CV). We show an example of performant Pipe mode data ingestion, hyperparameter optimization (HPO), and experiment tracking. Future labs will show how experiment tracking can be automated through the native integration in SageMaker Pipelines. The model used in this notebook is a simple deep CNN based on the [SageMaker PyTorch examples](https://sagemaker-examples.readthedocs.io/en/latest/aws_sagemaker_studio/frameworks/pytorch_cnn_cifar10/pytorch_cnn_cifar10.html).

**Note: This notebook was tested on the Python 3 (PyTorch 1.8 Python 3.6) kernel for SageMaker.**

## Prerequisites

To run this notebook, you can simply execute each cell in order. To understand what's happening, you'll need:

- Access to the SageMaker default S3 bucket. All the files related to this lab will be stored under the "cv_pytorch_cifar10" prefix of the bucket.
- Familiarity with Python and NumPy.
- Basic familiarity with Amazon S3.
- Basic understanding of Amazon SageMaker.
- Basic familiarity with the AWS Command Line Interface (CLI). Ideally, you should have it set up with credentials to access the AWS account you're running this notebook from.
- SageMaker Studio is preferred for the full UI integration.

## Setup

Set up the environment, load the libraries, and define the parameters used throughout the notebook.

Run the cell below if the `smexperiments` package (installed as `sagemaker-experiments`) is missing from your kernel.

In [1]:
!pip install sagemaker-experiments  



In [2]:
import os
import time
import pytz
import boto3
import sagemaker 
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch
import torchvision, torch
import torchvision.transforms as transforms

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

In [3]:
sagemaker_session = sagemaker.Session()
sess = boto3.Session()
sm = sess.client("sagemaker")

role = get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = "cv_pytorch_cifar10"

print("Bucket: {}".format(bucket))
print("SageMaker ver: " + sagemaker.__version__)

Bucket: sagemaker-us-east-1-035622474239
SageMaker ver: 2.72.0


## Dataset
The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). Here are the classes in the dataset, as well as 10 random images from each:

![cifar10](../statics/CIFAR-10.png)

In this tutorial, we will train a deep CNN to recognize these images.

Downloading the test and training data takes around 5 minutes.

In [4]:
transform = transforms.Compose(
        [transforms.Resize(227),
         transforms.ToTensor(),
         transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# get some random training images
#dataiter = iter(trainloader)
#images, labels = next(dataiter)
#imshow(images[1])  # imshow is not defined in this notebook

Files already downloaded and verified
Files already downloaded and verified
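
To make the `transforms.Normalize` step in the cell above concrete: after `ToTensor` scales 8-bit values into `[0, 1]`, `Normalize` computes `(x - mean) / std` for each channel. The following is a minimal pure-Python sketch of that arithmetic for a single pixel, using the same mean and standard deviation constants as the cell above. The helper name `normalize_pixel` and the sample pixel are illustrative assumptions, not part of torchvision.

```python
# Per-channel constants copied from the transform in the cell above.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2023, 0.1994, 0.2010)


def normalize_pixel(rgb_uint8):
    """Mimic ToTensor + Normalize for a single (R, G, B) pixel of 0-255 ints."""
    # ToTensor scales 8-bit values into [0, 1].
    scaled = [v / 255.0 for v in rgb_uint8]
    # Normalize subtracts the channel mean and divides by the channel std.
    return [(x - m) / s for x, m, s in zip(scaled, MEAN, STD)]


# Hypothetical mid-gray pixel, chosen only for illustration.
print(normalize_pixel((128, 128, 128)))
```

The result is roughly zero-centered per channel, which is why the constants are dataset-specific statistics rather than arbitrary values.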


## Training a CV model

## Create Experiment

[SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) helps you organize, track, compare, and evaluate machine learning (ML) experiments and model versions. Since ML is a highly iterative process, Experiments helps data scientists and ML engineers explore thousands of different models in an organized manner. It is especially useful with tools like [Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) and [Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html), which explore a large number of combinations automatically, so that you can quickly zoom in on high-performance models.

We will first create an experiment for a training job, and then do an example for Automatic Model Tuning.

In [5]:
cv_experiment = Experiment.create(
    experiment_name=f"manual-experiment-cv-pytorch-{int(time.time())}",
    description="CV Workshop example",
    sagemaker_boto_client=sm,
)

Upload the data to S3:

In [6]:
from sagemaker.s3 import S3Uploader

dataset_location = S3Uploader.upload("data", "s3://{}/{}/data".format(bucket, prefix)) 
display(dataset_location)

's3://sagemaker-us-east-1-035622474239/cv_pytorch_cifar10/data'

## Configure Training

### Define Custom Metrics
SageMaker can extract training metrics directly from the training logs, using the regular expressions you define, and send them to CloudWatch metrics.

In [61]:
pytorch_metric_definition = [
   {
      "Name":"validation:accuracy", 
      "Regex":".*'val_acc': ([0-9\\.]+).*"
   },
   {
      "Name":"validation:loss", 
      "Regex":".*'val_loss': ([0-9\\.]+).*"
   }
]
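
To see what these regular expressions capture, the sketch below applies the same two patterns to a sample log line with Python's `re` module. This is only a local approximation of the matching SageMaker performs on the job's log stream. The helper `extract_metrics` and the sample line are assumptions for illustration, since the real line format depends on what `cifar10.py` prints.

```python
import re

# Same patterns as in pytorch_metric_definition above.
metric_definitions = [
    {"Name": "validation:accuracy", "Regex": ".*'val_acc': ([0-9\\.]+).*"},
    {"Name": "validation:loss", "Regex": ".*'val_loss': ([0-9\\.]+).*"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: float} for every definition whose regex matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            # The first capture group holds the numeric value.
            found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical log line; the real format depends on the training script.
sample = "epoch 1: {'val_loss': 1.2345, 'val_acc': 0.5678}"
print(extract_metrics(sample, metric_definitions))
```

Testing the patterns locally like this before launching a job is a cheap way to catch a regex that never matches, which would otherwise leave the metric empty.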

### Build A Training Estimator

In [62]:
hyperparameters = {"epochs": 2, "batch_size": 4} 

trial_name = f"cv-pytorch-training-job-{int(time.time())}"
cnn_trial = Trial.create(
    trial_name=trial_name,
    experiment_name=cv_experiment.experiment_name,
    sagemaker_boto_client=sm,
)

experiment_config={
            "ExperimentName": cv_experiment.experiment_name,
            "TrialName": cnn_trial.trial_name,
            "TrialComponentDisplayName": "Training",
} 

estimator = PyTorch(
    base_job_name="cv-pytorch-pipe",
    entry_point="source_dir/cifar10.py", 
    role=role,
    framework_version="1.4.0",
    py_version="py3",
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    metric_definitions=pytorch_metric_definition,
    enable_sagemaker_metrics=True,
    #input_mode="Pipe",    
)


In [63]:
estimator.fit(dataset_location, wait=True, logs=True, experiment_config=experiment_config)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: cv-pytorch-pipe-2022-04-21-17-40-08-869


2022-04-21 17:40:09 Starting - Starting the training job...
2022-04-21 17:40:38 Starting - Preparing the instances for trainingProfilerReport-1650562809: InProgress
.........
2022-04-21 17:42:07 Downloading - Downloading input data
2022-04-21 17:42:07 Training - Downloading the training image......
2022-04-21 17:42:58 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-04-21 17:42:55,625 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-04-21 17:42:55,629 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-04-21 17:42:55,640 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-04-21 17:42:55,643 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-04-21 1

The **```fit```** method creates a training job on an **ml.m5.xlarge** instance, as configured in the estimator above.

The training instance writes checkpoints and logs to the S3 bucket we set up earlier. If the bucket does not exist yet, **```sagemaker_session```** creates it for you. You can use these checkpoints and logs to restore the training job and to analyze its metrics.

## Analyzing Training Job

You can set `logs=True` in the `fit` call above to see the container logs directly in the notebook. Alternatively, open "Training Jobs" in the SageMaker console for a more user-friendly report with links to CloudWatch, where the full logs are kept indefinitely.

Since we specified an Experiment trial, you can also select the "SageMaker resources" icon in SageMaker Studio, choose "Experiments and trials", open the trial, and explore the trial details to view metric charts, summary statistics, and hyperparameters associated with the experiment.

![Experiment UI](../statics/Experiments.png)

## Automatic Model Tuning

[Amazon SageMaker automatic model tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html), also known as hyperparameter optimization (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.
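
The loop behind tuning (sample hyperparameters, run a training job, keep the best result) can be illustrated with a toy sketch. It uses plain random search rather than the Bayesian strategy that SageMaker uses by default, and `fake_validation_accuracy` is a made-up stand-in for a real training job. The ranges mirror the `lr` and `batch_size` ranges used later in this notebook, and sampling `lr` on a log scale is a choice made for this sketch only.

```python
import math
import random


def fake_validation_accuracy(lr, batch_size):
    """Made-up stand-in for a training job; peaks near lr=1e-4, batch_size=32."""
    lr_term = math.exp(-(math.log10(lr) + 4.0) ** 2)
    batch_term = 1.0 - abs(batch_size - 32) / 64.0
    return 0.5 * lr_term + 0.5 * batch_term


def random_search(max_jobs, seed=0):
    """Sample hyperparameters max_jobs times and return (best trial, all trials)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_jobs):
        # Sample lr log-uniformly in [1e-5, 1e-3] and batch_size from a fixed set.
        lr = 10 ** rng.uniform(-5, -3)
        batch_size = rng.choice([4, 32, 64])
        score = fake_validation_accuracy(lr, batch_size)
        trials.append({"lr": lr, "batch_size": batch_size, "score": score})
    # "Maximize" objective: keep the trial with the highest score.
    return max(trials, key=lambda t: t["score"]), trials


best, all_trials = random_search(max_jobs=6)
print(best)
```

A real tuning job replaces the fake objective with the metric reported by each training job, and a Bayesian strategy uses earlier results to choose the next candidates instead of sampling blindly.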

### Configure HPO Job
Next, specify the following configuration for the tuning job:
- hyperparameters that SageMaker Automatic Model Tuning will tune: `lr` (learning rate) and `batch_size`
- maximum number of training jobs it will run to optimize the objective metric: `6`
- number of parallel training jobs that will run in the tuning job: `2`
- objective metric that Automatic Model Tuning will use: `validation:accuracy`

**Note: you may run into resource limits in your account. If you do, please raise a support case to increase the limit.**

In [64]:
shared_hyperparameters = {"epochs": 4}


estimator = PyTorch(
    base_job_name="cv-pytorch-pipe",
    entry_point="source_dir/cifar10.py", 
    role=role,
    framework_version="1.4.0",
    py_version="py3",
    hyperparameters=shared_hyperparameters,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    metric_definitions=pytorch_metric_definition,
    enable_sagemaker_metrics=True,
    #input_mode="Pipe",    
) 

In [65]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.00001, 0.001),
    "batch_size": CategoricalParameter([4, 32, 64]),
    #"optimizer": CategoricalParameter(["sgd", "adam", "rmsprop"]),
}

objective_metric_name = "validation:accuracy"

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions=pytorch_metric_definition,
    objective_type="Maximize",
    max_jobs=6,
    max_parallel_jobs=2,
    base_tuning_job_name="cv-hpo",
)
tuner.fit(dataset_location)

#tuner.fit(inputs)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating hyperparameter tuning job with name: cv-hpo-220421-1746


..................................................................................................................................................................................................................................................!


## Associate HPO to Experiment
This step can be eliminated when the tuning job is executed from a [SageMaker Pipeline Tuning Step](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TuningStep).

After running the code below, you should see something like this in your studio environment:
![HPO Experiments](../statics/HPO_experiments.png)


In [66]:
from smexperiments.search_expression import Filter, Operator, SearchExpression
from smexperiments.trial_component import TrialComponent

In [67]:
# Get the most recently created tuning job

list_tuning_jobs_response = sm.list_hyper_parameter_tuning_jobs(
    SortBy="CreationTime", SortOrder="Descending"
)
print(f'Found {len(list_tuning_jobs_response["HyperParameterTuningJobSummaries"])} tuning jobs.')
tuning_jobs = list_tuning_jobs_response["HyperParameterTuningJobSummaries"]
most_recently_created_tuning_job = tuning_jobs[0]
tuning_job_name = most_recently_created_tuning_job["HyperParameterTuningJobName"]
experiment_name = "cv-hpo-experiment"
trial_name = tuning_job_name + "-trial"

print(f"Associate all training jobs created by {tuning_job_name} with trial {trial_name}")

Found 8 tuning jobs.
Associate all training jobs created by cv-hpo-220421-1746 with trial cv-hpo-220421-1746-trial


In [68]:
# Create the experiment if it doesn't exist
try:
    experiment = Experiment.load(experiment_name=experiment_name)
except Exception as ex:
    if "ResourceNotFound" not in str(ex):
        # Surface unexpected errors instead of leaving `experiment` undefined
        raise
    experiment = Experiment.create(experiment_name=experiment_name)


# Create the trial if it doesn't exist
try:
    trial = Trial.load(trial_name=trial_name)
except Exception as ex:
    if "ResourceNotFound" not in str(ex):
        # Surface unexpected errors instead of leaving `trial` undefined
        raise
    trial = Trial.create(experiment_name=experiment_name, trial_name=trial_name)

In [69]:
# Get the trial components derived from the training jobs

creation_time = most_recently_created_tuning_job["CreationTime"]
creation_time = creation_time.astimezone(pytz.utc)
creation_time = creation_time.strftime("%Y-%m-%dT%H:%M:%SZ")

created_after_filter = Filter(
    name="CreationTime",
    operator=Operator.GREATER_THAN_OR_EQUAL,
    value=str(creation_time),
)

# Training job names contain the tuning job name, and each trial component name
# contains its training job name, so filtering on the component name selects this tuning job's runs
source_arn_filter = Filter(
    name="TrialComponentName", operator=Operator.CONTAINS, value=tuning_job_name
)
source_type_filter = Filter(
    name="Source.SourceType", operator=Operator.EQUALS, value="SageMakerTrainingJob"
)

search_expression = SearchExpression(
    filters=[created_after_filter, source_arn_filter, source_type_filter]
)

# Search iterates over every page of results by default
trial_component_search_results = list(
    TrialComponent.search(search_expression=search_expression, sagemaker_boto_client=sm)
)
print(f"Found {len(trial_component_search_results)} trial components.")

Found 6 trial components.
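
The `CreationTime` filter in the cell above is given a UTC timestamp string. As a minimal sketch of that conversion using only the standard library (instead of `pytz`), the hypothetical helper `to_search_timestamp` below turns a timezone-aware `datetime` into the same `%Y-%m-%dT%H:%M:%SZ` format. The sample time and UTC offset are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_search_timestamp(dt):
    """Format an aware datetime as a UTC 'YYYY-MM-DDTHH:MM:SSZ' string."""
    if dt.tzinfo is None:
        # A naive datetime has no offset, so converting it would be a guess.
        raise ValueError("expected a timezone-aware datetime")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical creation time in a UTC-4 local zone, for illustration only.
local = datetime(2022, 4, 21, 13, 46, 0, tzinfo=timezone(timedelta(hours=-4)))
print(to_search_timestamp(local))
```

Converting to UTC before formatting matters because the trailing `Z` declares the value to be UTC; formatting a local time with that suffix would shift the filter window by the local offset.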


In [70]:
# Associate the trial components with the trial
for tc in trial_component_search_results:
    print(f"Associating trial component {tc.trial_component_name} with trial {trial.trial_name}.")
    trial.add_trial_component(tc.trial_component_name)
    # sleep to avoid throttling
    time.sleep(0.5)

Associating trial component cv-hpo-220421-1746-005-18be7b35-aws-training-job with trial cv-hpo-220421-1746-trial.
Associating trial component cv-hpo-220421-1746-006-4b7d34c4-aws-training-job with trial cv-hpo-220421-1746-trial.
Associating trial component cv-hpo-220421-1746-004-45b121df-aws-training-job with trial cv-hpo-220421-1746-trial.
Associating trial component cv-hpo-220421-1746-003-e81c7584-aws-training-job with trial cv-hpo-220421-1746-trial.
Associating trial component cv-hpo-220421-1746-001-a32ff3aa-aws-training-job with trial cv-hpo-220421-1746-trial.
Associating trial component cv-hpo-220421-1746-002-cb7aced8-aws-training-job with trial cv-hpo-220421-1746-trial.


## Clean up
To avoid incurring charges to your AWS account for the resources used in this tutorial, remove all data and model artifacts from the SageMaker S3 bucket.
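
One way to remove the uploaded dataset is sketched below. It assumes you pass in a boto3 S3 `Bucket` resource and relies on its `objects.filter(Prefix=...).delete()` collection call. The helper name `delete_s3_prefix` is hypothetical, and the usage lines are left as comments because they delete data and need AWS credentials.

```python
def delete_s3_prefix(s3_bucket, prefix):
    """Delete every object under `prefix`; `s3_bucket` is a boto3 Bucket resource."""
    # Guard against an empty prefix, which would match the whole bucket.
    if not prefix:
        raise ValueError("refusing to delete with an empty prefix")
    return s3_bucket.objects.filter(Prefix=prefix).delete()


# Usage in this notebook (not executed here; `bucket` and `prefix` come from Setup):
# import boto3
# delete_s3_prefix(boto3.resource("s3").Bucket(bucket), prefix + "/")
```

This only covers objects under the lab prefix. Training and tuning jobs may write model artifacts under other prefixes in the same bucket, so check the output location of each job before assuming the bucket is clean.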