# TODO: Title

This notebook lists all the steps that you need to complete this project. You will need to complete all the TODOs in this notebook, as well as those in the README and the two Python scripts included with the starter code.


**TODO**: Give a helpful introduction to what this notebook is for. Remember that comments, explanations and good documentation make your project informative and professional.

**Note:** This notebook contains a number of code and markdown cells with TODOs that you have to complete. They are meant as guidelines to help you finish your project while meeting the requirements in the project rubric. Feel free to change the order of the TODOs and to use more than one code cell for each TODO.

In [None]:
# Install any packages that you might need
!pip install smdebug --upgrade

In [None]:
# Import all necessary packages
import sagemaker
from sagemaker.session import Session
import boto3
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Dataset
TODO: Explain what dataset you are using for this project. Consider giving a short overview of the classes, class distributions, etc., to help anyone not familiar with the dataset get a better understanding of it.

In [None]:
# Command to download and unzip data
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip

In [None]:
# Remove stray Jupyter checkpoint folders so they are not treated as class directories
!rm -rf ./dogImages/train/.ipynb_checkpoints
!rm -rf ./dogImages/valid/.ipynb_checkpoints
!rm -rf ./dogImages/test/.ipynb_checkpoints

In [None]:
basedir = './dogImages'

arr_train = []
arr_valid = []
arr_test = []

for subdir, dirs, files in os.walk(basedir):
    for file in files:
        path_arr = subdir.split('/')
        folder = path_arr[2]
        label = int(path_arr[3].split('.')[0])
        if (folder == 'train'):
            arr_train.append([label, subdir, file])
        elif (folder == 'valid'):
            arr_valid.append([label, subdir, file])
        elif (folder == 'test'):
            arr_test.append([label, subdir, file])

df_train = pd.DataFrame(arr_train, columns = ['label', 'subdir', 'file']).sort_values(['label', 'file'], ignore_index = True)
df_train['row'] = range(len(df_train))

df_valid = pd.DataFrame(arr_valid, columns = ['label', 'subdir', 'file']).sort_values(['label', 'file'], ignore_index = True)
df_valid['row'] = range(len(df_valid))

df_test = pd.DataFrame(arr_test, columns = ['label', 'subdir', 'file']).sort_values(['label', 'file'], ignore_index = True)
df_test['row'] = range(len(df_test))
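
The directory walk above relies on each class folder being named `<number>.<breed>`, for example `001.Affenpinscher`. The sketch below isolates that parsing step so it can be checked on its own. `parse_class_dir` is a hypothetical helper name, not part of the starter code, and it assumes the folder naming just described.

```python
import os

def parse_class_dir(subdir):
    """Split a path such as './dogImages/train/001.Affenpinscher'
    into (split name, numeric label, readable breed name)."""
    split_name = os.path.basename(os.path.dirname(subdir))   # 'train', 'valid' or 'test'
    class_dir = os.path.basename(subdir)                      # e.g. '001.Affenpinscher'
    number, _, breed = class_dir.partition('.')
    return split_name, int(number), breed.replace('_', ' ')

print(parse_class_dir('./dogImages/train/001.Affenpinscher'))
```

Parsing with `os.path` instead of fixed list positions means the result does not depend on how many path components precede the split folder.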

In [None]:
# Display number of classes
num_classes = df_train['label'].nunique()
print(f'There are {num_classes} classes in this dataset.')

# Examine a sample of rows from each dataframe.
# Only the last expression in a cell is rendered automatically, so display() is called explicitly.
from IPython.display import display

print(f'There are {len(df_train.index)} training images.')
display(df_train.sample(n = 10, random_state = 1))
print(f'There are {len(df_valid.index)} validation images.')
display(df_valid.sample(n = 10, random_state = 1))
print(f'There are {len(df_test.index)} testing images.')
display(df_test.sample(n = 10, random_state = 1))
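
For the dataset overview, it can help to look at how many images each class has. A minimal sketch using only the standard library follows. The `labels` list is made-up illustration data; in this notebook it would presumably be replaced by `df_train['label']`. `summarize_class_counts` is a hypothetical helper name.

```python
from collections import Counter

def summarize_class_counts(labels):
    """Return the number of classes plus the smallest and largest class sizes."""
    counts = Counter(labels)
    return {
        'num_classes': len(counts),
        'min_per_class': min(counts.values()),
        'max_per_class': max(counts.values()),
    }

# Hypothetical labels, for illustration only
labels = [1, 1, 1, 2, 2, 3]
print(summarize_class_counts(labels))
```

A large gap between the smallest and largest class sizes would suggest class imbalance, which is worth mentioning in the dataset description.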

In [None]:
df_breed_labels = df_train[['label', 'subdir']].copy().drop_duplicates().rename(columns = {'subdir': 'breed'}).set_index('label')
df_breed_labels['breed'] = df_breed_labels['breed'].apply(lambda row: row.split('/')[-1].split('.')[-1].replace('_', ' '))
df_breed_labels.head()

In [None]:
# Set up variables related to AWS account
session = sagemaker.Session()

bucket = session.default_bucket()
print(f'Default Bucket: {bucket}')

region = session.boto_region_name
print(f'AWS Region: {region}')

role = sagemaker.get_execution_role()
print(f'RoleArn: {role}')

In [None]:
# Upload dog images to S3
os.environ["DEFAULT_S3_BUCKET"] = bucket
!aws s3 sync ./dogImages/train/ s3://${DEFAULT_S3_BUCKET}/data/train/
!aws s3 sync ./dogImages/valid/ s3://${DEFAULT_S3_BUCKET}/data/valid/
!aws s3 sync ./dogImages/test/ s3://${DEFAULT_S3_BUCKET}/data/test/

## Hyperparameter Tuning
**TODO:** This is the part where you will fine-tune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters; however, you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and their ranges.

**Note:** You will need to use the `hpo.py` script to perform hyperparameter tuning.

In [None]:
# Some prerequisite imports
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner
)

# Declare hyperparameter ranges 
hyperparameter_ranges = {
    'lr': ContinuousParameter(1e-4, 0.1),
    'batch-size': CategoricalParameter([32, 64, 128, 256]),
    'epochs': IntegerParameter(6, 10)
}

# Declare metrics for our model
objective_metric_name = 'ValidationNumCorrect'
objective_type = 'Maximize'
metric_definitions = [
    {
        'Name': 'ValidationNumCorrect', 
        "Regex": 'Validation Accuracy: ([0-9]+)'
    },
    {
        'Name': 'ValidationLoss', 
        "Regex": 'Validation Set - Average Loss: ([0-9\\.]+)'
    }
]
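
SageMaker extracts each metric by applying its regex to the training job's log output and taking the captured group. The sketch below applies the same two patterns to sample log lines with Python's `re` module. The log lines are assumptions about what `hpo.py` prints (that script is not shown here), so adjust them to match the actual output. `extract_metrics` is a hypothetical helper name.

```python
import re

# Same patterns as in metric_definitions above
patterns = {
    'ValidationNumCorrect': r'Validation Accuracy: ([0-9]+)',
    'ValidationLoss': r'Validation Set - Average Loss: ([0-9\.]+)',
}

# Hypothetical log lines; the real format depends on hpo.py
log = (
    'Validation Set - Average Loss: 1.2345\n'
    'Validation Accuracy: 612/835\n'
)

def extract_metrics(log_text):
    """Return {metric name: first captured value as a string} for patterns that match."""
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, log_text)
        if match:
            found[name] = match.group(1)
    return found

print(extract_metrics(log))
```

Note that `([0-9]+)` captures only the leading run of digits, so a line such as `Validation Accuracy: 73.29` would be recorded as 73.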

In [None]:
# Create an estimator and a corresponding hyperparameter tuner
estimator = PyTorch(
    entry_point = 'hpo.py',
    role = role,
    py_version = 'py36',
    framework_version = '1.8',
    instance_count = 1,
    hyperparameters = {
        'num-classes': num_classes
    },
    instance_type = 'ml.g4dn.2xlarge'
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs = 10,                    # Up to how many combinations of HP's we want to try in total
    max_parallel_jobs = 2,            # Up to how many HPO jobs we want to run at once
    objective_type = objective_type,
)

In [None]:
# Set up data channels
data_channels = {
    'train': f's3://{bucket}/data/train',
    'valid': f's3://{bucket}/data/valid'
}

# Perform HPO using the tuner
tuner.fit(data_channels)

In [None]:
# Retrieve the estimator for the best training job found by HPO
best_estimator = tuner.best_estimator()

# Display the hyperparameters of the best trained model
tuned_hyperparameters = best_estimator.hyperparameters()
tuned_hyperparameters

## Model Profiling and Debugging
TODO: Using the best hyperparameters, create and finetune a new model

**Note:** You will need to use the `train_model.py` script to perform model profiling and debugging.

In [None]:
# Set up debugging and profiling rules and hooks
from sagemaker.debugger import Rule, rule_configs, DebuggerHookConfig, CollectionConfig
from sagemaker.debugger import ProfilerRule, ProfilerConfig, FrameworkProfile
from sagemaker.debugger import DetailedProfilingConfig, DataloaderProfilingConfig, PythonProfilingConfig

rules = [
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    Rule.sagemaker(
        base_config = rule_configs.loss_not_decreasing(),
        rule_parameters={'tensor_regex': 'CrossEntropyLoss_output_0'}
    )
]

hook_config = DebuggerHookConfig(
    collection_configs = [
        CollectionConfig(
            name = 'CrossEntropyLoss_output_0',
            parameters = {
                'include_regex': 'CrossEntropyLoss_output_0', 
                'train.save_interval': '25',
                'train.start_step': '1',
                'eval.save_interval': '5',
                'eval.start_step': '1',
            }
        )
    ],
)

profiler_config = ProfilerConfig(
    system_monitor_interval_millis = 500, 
    framework_profile_params = FrameworkProfile(
        num_steps = 10,
        detailed_profiling_config = DetailedProfilingConfig(),
        dataloader_profiling_config = DataloaderProfilingConfig(),
        python_profiling_config = PythonProfilingConfig()
    )
)

In [None]:
import re

# Perform some modifications on the hyperparameters to ensure compatibility with the model. This will perform the casting in place.

tuned_hyperparameters = { hp: value for hp, value in tuned_hyperparameters.items() if hp in ['batch-size', 'epochs', 'lr', 'num-classes'] }
tuned_hyperparameters['batch-size'] = int(re.findall(r'\d+', tuned_hyperparameters['batch-size'])[0])
tuned_hyperparameters['epochs'] = int(tuned_hyperparameters['epochs'])
tuned_hyperparameters['lr'] = float(tuned_hyperparameters['lr'])
tuned_hyperparameters['num-classes'] = int(tuned_hyperparameters['num-classes'])

tuned_hyperparameters
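
The casting step can also be written as a small, testable function. This is a sketch under the assumption that the tuner returns every value as a string and that categorical values carry an extra pair of quotes (for example `'"64"'`), which is what the regex above works around. The sample dictionary is hypothetical, and `cast_hyperparameters` is not part of the SageMaker SDK.

```python
import json

def cast_hyperparameters(raw):
    """Keep the four hyperparameters the training script needs and cast them."""
    casts = {'batch-size': int, 'epochs': int, 'lr': float, 'num-classes': int}
    cleaned = {}
    for name, cast in casts.items():
        value = raw[name]
        # json.loads strips one layer of JSON quoting: '"64"' -> '64'
        if isinstance(value, str) and value.startswith('"'):
            value = json.loads(value)
        cleaned[name] = cast(value)
    return cleaned

# Hypothetical tuner output, for illustration only
raw = {
    'batch-size': '"64"',
    'epochs': '8',
    'lr': '0.0031',
    'num-classes': '133',
    'sagemaker_program': '"hpo.py"',
}
print(cast_hyperparameters(raw))
```

Keys that the training script does not expect, such as `sagemaker_program` in the sample, are dropped because only the names listed in `casts` are copied.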

In [None]:
# Create an estimator with the hyperparameters from the best estimator
new_estimator = PyTorch(
    entry_point = 'train_model.py',
    role = role,
    py_version = 'py36',
    framework_version = '1.8',
    instance_count = 1,
    hyperparameters = tuned_hyperparameters,
    instance_type = 'ml.g4dn.xlarge',
    rules = rules,
    debugger_hook_config = hook_config,
    profiler_config = profiler_config
)

# Fit the new estimator
input_channels = {
    'train': f's3://{bucket}/data/train',
    'valid': f's3://{bucket}/data/valid',
    'test': f's3://{bucket}/data/test'
}

new_estimator.fit(input_channels, wait = True)

In [None]:
job_name = new_estimator.latest_training_job.name
client = new_estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName = job_name)
description
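
The `describe_training_job` response is a large dictionary; the Debugger and Profiler rule results sit under the `DebugRuleEvaluationStatuses` and `ProfilerRuleEvaluationStatuses` keys. The helper below is a sketch that pulls out just the rule names and statuses. `summarize_rule_statuses` is a hypothetical name, and the sample response is invented for illustration.

```python
def summarize_rule_statuses(description):
    """Map each rule configuration name to its evaluation status."""
    summary = {}
    for key in ('DebugRuleEvaluationStatuses', 'ProfilerRuleEvaluationStatuses'):
        for rule in description.get(key, []):
            summary[rule['RuleConfigurationName']] = rule['RuleEvaluationStatus']
    return summary

# Invented sample shaped like part of a describe_training_job response
sample = {
    'DebugRuleEvaluationStatuses': [
        {'RuleConfigurationName': 'Overfit', 'RuleEvaluationStatus': 'NoIssuesFound'},
        {'RuleConfigurationName': 'LossNotDecreasing', 'RuleEvaluationStatus': 'IssuesFound'},
    ],
    'ProfilerRuleEvaluationStatuses': [
        {'RuleConfigurationName': 'ProfilerReport', 'RuleEvaluationStatus': 'InProgress'},
    ],
}
print(summarize_rule_statuses(sample))
```

In this notebook it would be called as `summarize_rule_statuses(description)`.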

In [None]:
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys

trial = create_trial(new_estimator.latest_job_debugger_artifacts_path())

In [None]:
# Display a list of tensor values/variables that we can observe
trial.tensor_names()

In [None]:
# Starter code from class notes to plot the evolution of the Cross Entropy Loss.
from mpl_toolkits.axes_grid1 import host_subplot

def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode = mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals

def plot_tensor(trial, tensor_name):
    steps_train, vals_train = get_data(trial, tensor_name, mode = ModeKeys.TRAIN)
    steps_eval, vals_eval = get_data(trial, tensor_name, mode = ModeKeys.EVAL)

    fig = plt.figure(figsize=(10, 7))
    host = host_subplot(111)

    par = host.twiny()

    host.set_xlabel("Steps (TRAIN)")
    par.set_xlabel("Steps (EVAL)")
    host.set_ylabel(tensor_name)

    (p1,) = host.plot(steps_train, vals_train, label=tensor_name)
    (p2,) = par.plot(steps_eval, vals_eval, label="val_" + tensor_name)
    leg = plt.legend()

    host.xaxis.get_label().set_color(p1.get_color())
    leg.texts[0].set_color(p1.get_color())

    par.xaxis.get_label().set_color(p2.get_color())
    leg.texts[1].set_color(p2.get_color())

    plt.ylabel(tensor_name)

    plt.show()

In [None]:
plot_tensor(trial, 'CrossEntropyLoss_output_0')

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?
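
One anomaly worth checking for is a loss that stops decreasing. The sketch below is a rough heuristic, not the logic of SageMaker's built-in `loss_not_decreasing` rule: it compares the average of the first few recorded loss values with the average of the last few. The function name, window size, threshold, and sample values are all assumptions chosen for illustration.

```python
def loss_has_plateaued(losses, window=3, min_relative_drop=0.05):
    """Return True if the mean of the last `window` losses is not at least
    `min_relative_drop` (as a fraction) below the mean of the first `window`."""
    if len(losses) < 2 * window:
        raise ValueError('need at least 2 * window loss values')
    start = sum(losses[:window]) / window
    end = sum(losses[-window:]) / window
    return (start - end) < min_relative_drop * abs(start)

# Invented loss curves, for illustration only
print(loss_has_plateaued([4.9, 4.2, 3.6, 2.8, 2.1, 1.7]))   # steadily decreasing curve
print(loss_has_plateaued([4.9, 4.8, 4.9, 4.9, 4.8, 4.9]))   # flat curve
```

If such a check flagged a plateau, typical things to revisit would be the learning rate range and the number of epochs used during tuning.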

In [None]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

job = TrainingJob(job_name, region)
job.wait_for_sys_profiling_data_to_be_available()

In [None]:
# Display the profiler output
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = job.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader = None,
    select_dimensions = ['CPU', 'GPU'],
    select_events = ['total'],
)

In [None]:
rule_output_path = new_estimator.output_path + new_estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")

In [None]:
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive

## Model Deploying

In [None]:
# TODO: Deploy your model to an endpoint

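# TODO: Adjust the deployment configuration (instance type and number of instances); the values below are only an example
predictor = new_estimator.deploy(initial_instance_count = 1, instance_type = 'ml.m5.large')  # deploy the fitted estimator, not the HPO template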

In [None]:
# TODO: Run a prediction on the endpoint

image = None  # TODO: Replace None with your code to load and preprocess an image to send to the endpoint
response = predictor.predict(image)
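
How the response should be interpreted depends on the inference code, which is not shown here. Assuming the endpoint returns one flat list of raw class scores per image, and that class index `i` corresponds to folder label `i + 1` (folders sorted by name, labels starting at 1), the predicted breed could be looked up as sketched below. The scores and the three-entry breed table are sample values for illustration, and `predicted_label` is a hypothetical helper.

```python
def predicted_label(scores):
    """Return the 1-based label of the highest score in a flat list of class scores."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1

# Sample values: three classes instead of the full set
breeds = {1: 'Affenpinscher', 2: 'Afghan hound', 3: 'Airedale terrier'}
scores = [0.1, 2.7, -0.4]

label = predicted_label(scores)
print(label, breeds[label])
```

In this notebook the breed name could instead be read from `df_breed_labels`, which is indexed by the same numeric label.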

In [None]:
# TODO: Remember to shut down/delete your endpoint once your work is done
predictor.delete_endpoint()