# Dog breed classification using PyTorch and AWS SageMaker

This notebook uses AWS SageMaker to fine-tune a pretrained ResNet50 model for image classification, assigning each dog image to one of 133 breeds. Training runs through the SageMaker PyTorch estimator, and SageMaker Debugger is used to debug, monitor, and profile the training job.

This notebook lists all the steps needed to complete this project. You will need to complete all the TODOs in this notebook as well as in the README and the two Python scripts included with the starter code.

**TODO**: Give a helpful introduction to what this notebook is for. Remember that comments, explanations and good documentation make your project informative and professional.

In [2]:
!pip install smdebug

Collecting smdebug
  Using cached smdebug-1.0.12-py2.py3-none-any.whl (270 kB)
Collecting pyinstrument==3.4.2
  Using cached pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
Collecting pyinstrument-cext>=0.2.2
  Using cached pyinstrument_cext-0.2.4-cp37-cp37m-manylinux2010_x86_64.whl (20 kB)
Installing collected packages: pyinstrument-cext, pyinstrument, smdebug
Successfully installed pyinstrument-3.4.2 pyinstrument-cext-0.2.4 smdebug-1.0.12
[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: pip install --upgrade pip


In [3]:
import sagemaker
import boto3

## Dataset

The dataset used in this project is a labeled set of 8,351 real-world images covering 133 dog breeds recognized by the American Kennel Club (AKC). It is divided into 6,680 images for training, 836 for testing, and 835 for validation.  
The images vary in size and background, and the number of images per breed is uneven: some breeds have fewer images than others.

In [4]:
# Command to download and unzip data
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip

--2023-02-25 13:35:41--  https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.116.120
Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.116.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1132023110 (1.1G) [application/zip]
Saving to: ‘dogImages.zip’

dogImages.zip        16%[==>                 ] 178.26M  46.6MB/s    eta 21s    ^C
Archive:  dogImages.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of dogImages.zip or
        dogImages.zip.zip, and cannot find dogImages.zip.ZIP, period.
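
The split sizes and the per-breed imbalance described in the Dataset section can be checked once the archive is fully unpacked. The sketch below is a minimal example that assumes a `dogImages/<split>/<breed>/<image>` folder layout; that layout, the helper names `count_images` and `summarize`, and the list of image suffixes are assumptions made for illustration, not something this notebook confirms.

```python
from pathlib import Path

# Assumed image file extensions; extend if the archive contains other formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Return {split: {breed: image_count}} for a root/<split>/<breed>/<image> layout."""
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue
        per_breed = {}
        for breed_dir in sorted(split_dir.iterdir()):
            if breed_dir.is_dir():
                per_breed[breed_dir.name] = sum(
                    1
                    for f in breed_dir.iterdir()
                    if f.suffix.lower() in IMAGE_SUFFIXES
                )
        counts[split_dir.name] = per_breed
    return counts


def summarize(counts):
    """Print the total image count, breed count, and smallest/largest class per split."""
    for split, per_breed in counts.items():
        total = sum(per_breed.values())
        print(f"{split}: {total} images across {len(per_breed)} breeds")
        if per_breed:
            smallest = min(per_breed, key=per_breed.get)
            largest = max(per_breed, key=per_breed.get)
            print(f"  fewest: {smallest} ({per_breed[smallest]})")
            print(f"  most:   {largest} ({per_breed[largest]})")


# Only runs when the (assumed) dataset folder exists locally.
if Path("dogImages").is_dir():
    summarize(count_images("dogImages"))
```

If the download was interrupted, as in the output above, the folder will be missing and the final guard simply skips the summary.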


In [4]:
# Upload the data to Amazon S3

import os
from sagemaker import get_execution_role

bucket= 'sagemaker-us-east-1-519574148523'
region = 'us-east-1'
role =  get_execution_role()
print("Default Bucket: {}".format(bucket))
print("AWS Region: {}".format(region))
print("RoleArn: {}".format(role))

os.environ["DEFAULT_S3_BUCKET"] = bucket
!aws s3 sync ./dogImages s3://${DEFAULT_S3_BUCKET}/data/

Default Bucket: sagemaker-us-east-1-519574148523
AWS Region: us-east-1
RoleArn: arn:aws:iam::519574148523:role/service-role/AmazonSageMaker-ExecutionRole-20230209T103282


## Hyperparameter Tuning
**TODO:** This is the part where you will finetune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters. However you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and the ranges.

**Note:** You will need to use the `hpo.py` script to perform hyperparameter tuning.

In [14]:
# Create estimators for your HPs
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='hpo.py',
    role=role,
    framework_version='1.8.1',
    py_version='py3',
    instance_count=1,
    instance_type='ml.m5.large',
)
estimator.fit({'train': "s3://{}/data/".format(bucket)})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-02-27-10-26-22-467


2023-02-27 10:26:22 Starting - Starting the training job...
2023-02-27 10:26:38 Starting - Preparing the instances for training......
2023-02-27 10:27:26 Downloading - Downloading input data.........
2023-02-27 10:29:06 Training - Downloading the training image...
2023-02-27 10:29:32 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-02-27 10:29:35,573 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-02-27 10:29:35,578 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-02-27 10:29:35,595 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-02-27 10:29:35,604 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-02-27 10:29:35,895 sagemaker-trainin

In [3]:
# Declare your HP ranges, metrics etc.
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch_size": CategoricalParameter([32, 64, 128])
}

objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]

estimator = PyTorch.attach('pytorch-training-2023-02-27-10-26-22-467')


2023-02-27 10:45:07 Starting - Preparing the instances for training
2023-02-27 10:45:07 Downloading - Downloading input data
2023-02-27 10:45:07 Training - Training image download completed. Training in progress.
2023-02-27 10:45:07 Uploading - Uploading generated training model
2023-02-27 10:45:07 Completed - Training job completed
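
The tuner reads the objective metric by applying the `Regex` from `metric_definitions` to the training job's log output, so `hpo.py` has to print a line that the pattern matches. The sketch below checks the pattern locally with Python's `re` module. The sample log lines and the helper name `extract_metric` are hypothetical; the exact text that `hpo.py` prints is not shown in this notebook.

```python
import re

# Same pattern as in metric_definitions above.
METRIC_REGEX = "Test set: Average loss: ([0-9\\.]+)"


def extract_metric(log_line, pattern=METRIC_REGEX):
    """Return the captured metric as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None


# Hypothetical log lines, for illustration only.
matching_line = "Test set: Average loss: 1.2345, Accuracy: 50/836 (6%)"
other_line = "Train Epoch: 1 Loss: 0.5"

print(extract_metric(matching_line))
print(extract_metric(other_line))
```

If the second call style (no match) shows up for every line the script prints, the tuner has no objective value to optimize, so it is worth running a check like this before launching a tuning job.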


In [4]:

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=1,
    max_parallel_jobs=1,
    objective_type=objective_type,
)

In [5]:
# Set certain OS variables
os.environ['SM_MODEL_DIR']='s3://{}/model/'.format(bucket)
os.environ['SM_CHANNEL_TRAIN']='s3://{}/data/'.format(bucket)
os.environ['SM_OUTPUT_DATA_DIR']='s3://{}/output/'.format(bucket)

In [6]:
# Fit your HP Tuner
tuner.fit({'train': "s3://{}/data/".format(bucket)},wait=True) 

...................................................................................................................................................................................................................................................!


In [12]:
# Get the best estimators and the best HPs
best_estimator = tuner.best_estimator()

#Get the hyperparameters of the best trained model
best_estimator.hyperparameters()


2023-02-27 13:41:18 Starting - Preparing the instances for training
2023-02-27 13:41:18 Downloading - Downloading input data
2023-02-27 13:41:18 Training - Training image download completed. Training in progress.
2023-02-27 13:41:18 Uploading - Uploading generated training model
2023-02-27 13:41:18 Completed - Resource released due to keep alive period expiry


{'_tuning_objective_metric': '"average test loss"',
 'batch_size': '"32"',
 'lr': '0.08277499248803206',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"pytorch-training-2023-02-27-13-23-42-951"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-519574148523/pytorch-training-2023-02-27-10-26-22-467/source/sourcedir.tar.gz"'}

In [13]:
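# SageMaker returns hyperparameters as strings; batch_size is additionally
# wrapped in double quotes, so strip them before converting to int.
best_hps = best_estimator.hyperparameters()
hyperparameters = {
    "batch_size": int(best_hps['batch_size'].replace('"', '')),
    "learning_rate": best_hps['lr'],
}
hyperparameters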

{'batch_size': 32, 'learning_rate': '0.08277499248803206'}
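
In the output above the learning rate is still a string, and `batch_size` needed its quotes stripped by hand. Because the values shown by `best_estimator.hyperparameters()` look like JSON-encoded strings, one alternative is to decode them with `json.loads`. This is a minimal sketch under that assumption; `decode_hyperparameter` is a hypothetical helper, and the sample dictionary copies three entries from the output shown earlier.

```python
import json


def decode_hyperparameter(raw):
    """Decode one hyperparameter string into a Python value.

    Values such as '"32"' are JSON strings that themselves contain a number,
    so a second decoding pass is attempted and skipped if it fails.
    """
    value = json.loads(raw)
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value  # a plain string such as 'hpo.py'
    return value


# Entries copied from the best_estimator.hyperparameters() output above.
raw_hps = {
    "batch_size": '"32"',
    "lr": "0.08277499248803206",
    "sagemaker_program": '"hpo.py"',
}

decoded = {name: decode_hyperparameter(value) for name, value in raw_hps.items()}
print(decoded)
```

One caveat of the second decoding pass: a string value that happens to be valid JSON (for example `"true"`) would be converted as well, so it is safest to apply the helper only to the keys you tuned.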

## Model Profiling and Debugging
TODO: Using the best hyperparameters, create and finetune a new model

**Note:** You will need to use the `train_model.py` script to perform model profiling and debugging.

In [6]:
# TODO: Set up debugging and profiling rules and hooks
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig
from sagemaker.debugger import DebuggerHookConfig, ProfilerConfig, FrameworkProfile

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
collection_configs = [
    CollectionConfig(
        name="CrossEntropyLoss_output_0",
        parameters={
            "include_regex": "CrossEntropyLoss_output_0",
            "train.save_interval": "10",
            "eval.save_interval": "1",
        },
    )
]

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=1)
)
debugger_config=DebuggerHookConfig(
    collection_configs=collection_configs
)

In [9]:
# TODO: Create and fit an estimator
from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    entry_point="train_model.py",
    framework_version="1.8.1",
    py_version="py3",
    hyperparameters={'batch_size': 32, 'lr': '0.08277499248803206'},
    profiler_config=profiler_config, 
    debugger_hook_config=debugger_config, 
    rules=rules
)

estimator.fit({'train': "s3://{}/data/".format(bucket)},wait=True) 

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-02-28-03-03-42-245


2023-02-28 03:03:42 Starting - Starting the training job...
2023-02-28 03:04:12 Starting - Preparing the instances for training
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
ProfilerReport: InProgress
......
2023-02-28 03:05:12 Downloading - Downloading input data......
2023-02-28 03:06:13 Training - Downloading the training image..................
2023-02-28 03:09:14 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-02-28 03:09:23,098 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-02-28 03:09:23,125 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-02-28 03:09:23,129 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023

In [10]:
training_job_name = estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")

Training jobname: pytorch-training-2023-02-28-03-03-42-245


In [15]:
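# The smdebug profiler notebook utilities import `Markup` from jinja2,
# which Jinja2 3.1 removed, so pin an older release.
# A kernel restart may be needed before the pinned version is picked up.
!pip uninstall Jinja2 -y
!pip install Jinja2==3.0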

Found existing installation: Jinja2 3.1.2
Uninstalling Jinja2-3.1.2:
  Successfully uninstalled Jinja2-3.1.2
Collecting Jinja2==3.0
  Downloading Jinja2-3.0.0-py3-none-any.whl (133 kB)
     133.4/133.4 kB 2.0 MB/s eta 0:00:00
Installing collected packages: Jinja2
Successfully installed Jinja2-3.0.0
[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: pip install --upgrade pip


In [16]:
# TODO: Plot a debugging output.
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

training_job_name = estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")

trial = create_trial(estimator.latest_job_debugger_artifacts_path())

print(trial.tensor_names())
print(len(trial.tensor('CrossEntropyLoss_output_0').steps(mode=ModeKeys.TRAIN)))
print(len(trial.tensor('CrossEntropyLoss_output_0').steps(mode=ModeKeys.EVAL)))

ImportError: cannot import name 'Markup' from 'jinja2' (/opt/conda/lib/python3.7/site-packages/jinja2/__init__.py)
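
Once the import succeeds, the saved loss values can be pulled out of the trial and plotted. The sketch below assumes the trial object exposes `tensor(name)`, with `steps(mode=...)` and `value(step, mode=...)` on the returned tensor, which matches how the cell above already calls `trial.tensor(...).steps(...)`. The helper names `get_tensor_series` and `plot_loss` are hypothetical, and the plot itself is one possible presentation rather than the notebook author's.

```python
def get_tensor_series(trial, tensor_name, mode):
    """Return (steps, values) recorded for one tensor in the given mode."""
    tensor = trial.tensor(tensor_name)
    steps = list(tensor.steps(mode=mode))
    # Each saved loss value is expected to be a scalar; convert it to a float.
    values = [float(tensor.value(step, mode=mode)) for step in steps]
    return steps, values


def plot_loss(trial, tensor_name, train_mode, eval_mode):
    """Plot training and evaluation curves for one tensor on shared axes."""
    # Imported here so get_tensor_series has no plotting dependency.
    import matplotlib.pyplot as plt

    train_steps, train_values = get_tensor_series(trial, tensor_name, train_mode)
    eval_steps, eval_values = get_tensor_series(trial, tensor_name, eval_mode)

    fig, ax = plt.subplots()
    ax.plot(train_steps, train_values, label="train")
    ax.plot(eval_steps, eval_values, label="eval")
    ax.set_xlabel("step")
    ax.set_ylabel(tensor_name)
    ax.legend()
    plt.show()
```

With the names from the cell above, a call would look like `plot_loss(trial, 'CrossEntropyLoss_output_0', ModeKeys.TRAIN, ModeKeys.EVAL)`.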

In [None]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(rule_output_path)

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [None]:
# TODO: Display the profiler output
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive

# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]

import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")

## Model Deploying

In [None]:
# TODO: Deploy your model to an endpoint

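# deploy() needs an instance count and an instance type. The values below are
# placeholders (assumed, not taken from a completed run); adjust them as needed.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")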

In [None]:
# TODO: Run a prediction on the endpoint

image = # TODO: Your code to load and preprocess image to send to endpoint for prediction
response = predictor.predict(image)

In [None]:
# TODO: Remember to shut down/delete your endpoint once your work is done
predictor.delete_endpoint()