## Using SageMaker Debugger and SageMaker Experiments for iterative model pruning

This notebook demonstrates how to use [SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) and [SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) to perform iterative model pruning. We start with a quick introduction to model pruning.

State-of-the-art deep learning models consist of millions of parameters and are trained on very large datasets. For transfer learning we take a pre-trained model and fine-tune it on a new and typically much smaller dataset. The new dataset may even consist of different classes, so the model is essentially learning a new task. This process allows us to quickly achieve state-of-the-art results without having to design and train our own model from scratch. However, a much smaller and simpler model may perform just as well on our dataset. With model pruning we estimate the importance of weights during training and remove the weights that contribute very little to the learning process. We can do this iteratively, removing a small percentage of weights in each iteration. Removing means eliminating the entries from the tensor so that its size shrinks, rather than merely setting them to zero.
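To make the difference between zeroing and removing concrete, here is a minimal NumPy sketch. The weight tensor, its shape and the pruned index are made up for illustration and are not taken from the model used in this notebook.

```python
import numpy as np

# Hypothetical convolution weights: (out_channels, in_channels, kernel_h, kernel_w)
rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 3, 7, 7))

filter_to_prune = 10  # arbitrary index chosen for illustration

# Masking: the filter is set to zero, but the tensor keeps its shape
masked = weight.copy()
masked[filter_to_prune] = 0.0

# Removing: the filter is deleted, so there is one output channel fewer
removed = np.delete(weight, filter_to_prune, axis=0)

print(masked.shape)                # (64, 3, 7, 7)
print(removed.shape)               # (63, 3, 7, 7)
print(weight.size - removed.size)  # 147, i.e. 3 * 7 * 7 weights
```

Only the second variant reduces the number of parameters, which is what we track across pruning iterations.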

We use SageMaker Debugger to retrieve weights, activation outputs and gradients during training. These tensors are used to compute the importance of weights. We use SageMaker Experiments to keep track of each pruning iteration: if we prune too much we may degrade model accuracy, so we monitor the number of parameters versus validation accuracy.
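One way to act on this trade-off is to pick the smallest model whose accuracy stays close to the best one observed. The sketch below is only an illustration: the helper `smallest_acceptable`, the tolerance and all numbers are hypothetical and are not results from this notebook.

```python
def smallest_acceptable(records, max_accuracy_drop=2.0):
    """Return the record with the fewest parameters whose accuracy (in percent)
    is within max_accuracy_drop of the best accuracy observed."""
    best_accuracy = max(r["accuracy"] for r in records)
    acceptable = [r for r in records
                  if best_accuracy - r["accuracy"] <= max_accuracy_drop]
    return min(acceptable, key=lambda r: r["parameters"])

# Hypothetical numbers for illustration only
records = [
    {"iteration": 0, "parameters": 11_000_000, "accuracy": 90.0},
    {"iteration": 1, "parameters": 9_000_000, "accuracy": 91.0},
    {"iteration": 2, "parameters": 7_000_000, "accuracy": 89.0},
    {"iteration": 3, "parameters": 5_000_000, "accuracy": 80.0},
]

chosen = smallest_acceptable(records)
print(chosen["iteration"])  # 2
```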


In [19]:
! pip -q install sagemaker
! pip -q install sagemaker-experiments

### Get training dataset

Next we get the [Caltech101](http://www.vision.caltech.edu/Image_Datasets/Caltech101/) dataset. This dataset consists of 101 image categories. 

In [20]:
import tarfile
import requests
import os

filename = '101_ObjectCategories.tar.gz'
# build the URL with plain string concatenation: os.path.join would insert "\" on Windows
data_url = "https://s3.us-east-2.amazonaws.com/mxnet-public/" + filename

r = requests.get(data_url, stream=True)
with open(filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk: 
            f.write(chunk)

print('Extracting {} ...'.format(filename))
tar = tarfile.open(filename, "r:gz")
tar.extractall('.')
tar.close()
print('Data extracted.')

Extracting 101_ObjectCategories.tar.gz ...
Data extracted.


And upload it to our SageMaker default bucket:

In [21]:
import sagemaker
import boto3

def upload_to_s3(path, directory_name, bucket, counter=None):
    # counter limits the number of files uploaded per directory; None uploads all files
    print("Upload files from " + path + " to " + bucket)
    client = boto3.client('s3')
    
    for root, subdirs, files in os.walk(path):
        root = root.replace("\\","/")
        print(root)
        for file in files[:counter]:
            client.upload_file(os.path.join(root, file), bucket, directory_name+'/'+root.split("/")[-1]+'/'+file)
            
boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)
bucket = sagemaker_session.default_bucket()

upload_to_s3("101_ObjectCategories", directory_name="101_ObjectCategories_train",  bucket=bucket)

# for the test dataset we only upload 4 images per class
upload_to_s3("101_ObjectCategories_test", directory_name="101_ObjectCategories_test", bucket=bucket, counter=4)

Upload files from 101_ObjectCategories to sagemaker-us-east-2-005166108777
101_ObjectCategories
101_ObjectCategories/tick
101_ObjectCategories/garfield
101_ObjectCategories/airplanes
101_ObjectCategories/bonsai
101_ObjectCategories/hedgehog
101_ObjectCategories/butterfly
101_ObjectCategories/elephant
101_ObjectCategories/lobster
101_ObjectCategories/Leopards
101_ObjectCategories/saxophone
101_ObjectCategories/Faces_easy
101_ObjectCategories/grand_piano
101_ObjectCategories/dragonfly
101_ObjectCategories/anchor
101_ObjectCategories/menorah
101_ObjectCategories/octopus
101_ObjectCategories/inline_skate
101_ObjectCategories/headphone
101_ObjectCategories/laptop
101_ObjectCategories/pizza
101_ObjectCategories/windsor_chair
101_ObjectCategories/wild_cat
101_ObjectCategories/hawksbill
101_ObjectCategories/Motorbikes
101_ObjectCategories/wheelchair
101_ObjectCategories/revolver
101_ObjectCategories/buddha
101_ObjectCategories/joshua_tree
101_ObjectCategories/camera
101_ObjectCategories/pyramid
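
The upload function keeps only the last directory component (the class name) when it builds the S3 object key. The following sketch reproduces just that key construction without calling S3. The helper name `s3_key` and the file name are hypothetical.

```python
def s3_key(directory_name, local_dir, file_name):
    # the last path component is the class folder, e.g. "tick"
    class_name = local_dir.replace("\\", "/").split("/")[-1]
    return directory_name + "/" + class_name + "/" + file_name

# "image_0001.jpg" is a made-up file name for illustration
key = s3_key("101_ObjectCategories_train", "101_ObjectCategories/tick", "image_0001.jpg")
print(key)  # 101_ObjectCategories_train/tick/image_0001.jpg
```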

### Load and save ResNet model

First we load a pre-trained [ResNet](https://arxiv.org/abs/1512.03385) model from the PyTorch model zoo.

In [22]:
import torch
from torchvision import models
from torch import nn

model = models.resnet18(pretrained=True)

Let's have a look at the model architecture:

In [23]:
model

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

As we can see above, the last Linear layer outputs 1000 values, which is the number of classes the model has originally been trained on. Here, we will fine-tune the model on the Caltech101 dataset: as it has only 101 classes, we need to set the number of output classes to 101.

In [24]:
nfeatures = model.fc.in_features
model.fc = torch.nn.Linear(nfeatures, 101)
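
Replacing the classifier already shrinks the model before any pruning. As a small worked example, the number of parameters of a fully connected layer is the size of its weight matrix plus one bias per output. The value 512 is `model.fc.in_features` for torchvision's ResNet18, and the helper `linear_parameters` is introduced here only for illustration.

```python
def linear_parameters(in_features, out_features):
    # weight matrix plus one bias per output
    return in_features * out_features + out_features

in_features = 512  # model.fc.in_features for torchvision's resnet18

original_fc = linear_parameters(in_features, 1000)
new_fc = linear_parameters(in_features, 101)

print(original_fc)  # 513000
print(new_fc)       # 51813
```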

Next we store the model definition and weights in an output file. 

In [25]:
checkpoint = {'model': model,
              'state_dict': model.state_dict()}

torch.save(checkpoint, 'src/model_checkpoint')     

The following code cell creates a SageMaker experiment:

In [26]:
import boto3
from datetime import datetime
from smexperiments.experiment import Experiment

sagemaker_boto_client = boto3.client("sagemaker")

#name of experiment
timestep = datetime.now()
timestep = timestep.strftime("%d-%m-%Y-%H-%M-%S")
experiment_name = timestep + "-model-pruning-experiment"

#create experiment
Experiment.create(
    experiment_name=experiment_name, 
    description="Iterative model pruning of ResNet trained on Caltech101", 
    sagemaker_boto_client=sagemaker_boto_client)

Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7fd3d1014e48>,experiment_name='02-05-2020-19-31-33-model-pruning-experiment',description='Iterative model pruning of ResNet trained on Caltech101',experiment_arn='arn:aws:sagemaker:us-east-2:005166108777:experiment/02-05-2020-19-31-33-model-pruning-experiment',response_metadata={'RequestId': '2bb37cd7-2487-4130-ac2b-1379efe77fbe', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '2bb37cd7-2487-4130-ac2b-1379efe77fbe', 'content-type': 'application/x-amz-json-1.1', 'content-length': '116', 'date': 'Sat, 02 May 2020 19:31:33 GMT'}, 'RetryAttempts': 0})

The following code cells install `smdebug` and load the lists of tensor names that will be used to compute filter ranks. The lists are defined in the Python script `model_resnet`.

In [27]:
!pip install smdebug



In [28]:
import model_resnet

activation_outputs = model_resnet.activation_outputs
gradients = model_resnet.gradients

### Iterative model pruning: step by step

Before we jump into the code for running the iterative model pruning we will walk through the code step by step. 

#### Step 0: Create trial and debugger hook configuration
First we create a new trial for each pruning iteration. This allows us to track our training jobs and see which models have the lowest number of parameters and the best accuracy. We use the `smexperiments` library to create a trial within our experiment.

In [29]:
from smexperiments.trial import Trial

trial = Trial.create(
        experiment_name=experiment_name,
        sagemaker_boto_client=sagemaker_boto_client
    )


Next we define `experiment_config`, a dictionary that will be passed to the SageMaker training job.

In [30]:
experiment_config = { "ExperimentName": experiment_name, 
                      "TrialName":  trial.trial_name,
                      "TrialComponentDisplayName": "Training"}

We create a debugger hook configuration to define a custom collection of tensors to be emitted. The custom collection contains all weights and biases of the model. It also includes individual layer outputs and their gradients which will be used to compute filter ranks. Tensors are saved every 100th iteration where an iteration represents one forward and backward pass. 
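
The custom collection selects tensors by name with a regular expression. As a rough illustration of which names such a pattern picks up, the sketch below applies it with Python's `re.search`. How SageMaker Debugger applies the pattern internally is an assumption here, and apart from `layer4.1.relu_0_output_0`, which appears later in this notebook, the tensor names are illustrative examples in the style of PyTorch parameter and buffer names.

```python
import re

include_regex = ".*relu|.*weight|.*bias|.*running_mean|.*running_var|.*CrossEntropyLoss"

# Illustrative tensor names (only the first one appears in this notebook)
names = [
    "layer4.1.relu_0_output_0",
    "layer1.0.conv1.weight",
    "layer1.0.bn1.running_mean",
    "fc.bias",
    "layer1.0.bn1.num_batches_tracked",
]

matched = [name for name in names if re.search(include_regex, name)]
print(matched)
```

In this sketch every name except `layer1.0.bn1.num_batches_tracked` is selected.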

In [31]:
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

debugger_hook_config = DebuggerHookConfig(
      collection_configs=[ 
          CollectionConfig(
                name="custom_collection",
                parameters={ "include_regex": ".*relu|.*weight|.*bias|.*running_mean|.*running_var|.*CrossEntropyLoss",
                             "save_interval": "100" })])

#### Step 1: Start training job
Now we define the SageMaker PyTorch Estimator. We will train the model on an `ml.p2.xlarge` instance. The model definition plus training code is defined in the entry_point file `train.py`. 

In [32]:
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(role=sagemaker.get_execution_role(),
                  train_instance_count=1,
                  train_instance_type='ml.p2.xlarge',
                  train_volume_size=400,
                  source_dir='src',
                  entry_point='train.py',
                  framework_version='1.3.1',
                  py_version='py3',
                  metric_definitions=[ {'Name':'train:loss', 'Regex':'loss:(.*?)'}, {'Name':'eval:acc', 'Regex':'acc:(.*?)'} ],
                  enable_sagemaker_metrics=True,
                  hyperparameters = {'epochs': 10},
                  debugger_hook_config=debugger_hook_config
        )
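
A note on the `metric_definitions` patterns: in Python's `re` module, a lazy group such as `(.*?)` with nothing after it is satisfied by matching zero characters. The sketch below shows this on a hypothetical log line (the actual output format of `train.py` is not shown in this notebook) together with one possible pattern that captures the number explicitly. Whether the patterns need adjusting for the real logs is something to verify against the metrics that appear in SageMaker.

```python
import re

# Hypothetical log line; the real format printed by train.py is not shown here
log_line = "epoch 1 loss:0.4321 acc:0.8750"

# The lazy group has nothing after it, so it matches the empty string
lazy_capture = re.search(r"loss:(.*?)", log_line).group(1)
print(repr(lazy_capture))  # ''

# A possible alternative that captures the number explicitly (an assumption)
loss = re.search(r"loss:\s*([0-9.]+)", log_line).group(1)
acc = re.search(r"acc:\s*([0-9.]+)", log_line).group(1)
print(loss, acc)  # 0.4321 0.8750
```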

Once we have defined the estimator object we can call `fit`, which launches an `ml.p2.xlarge` instance and starts the training on it. We pass `experiment_config`, which associates the training job with a trial and an experiment. If we don't specify an `experiment_config`, the training job will appear in SageMaker Experiments under `Unassigned trial components`.

In [33]:
estimator.fit(inputs={'train': 's3://{}/101_ObjectCategories_train'.format(bucket), 
                      'test': 's3://{}/101_ObjectCategories_test'.format(bucket)}, 
              experiment_config=experiment_config)


INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-19-31-36-075


2020-05-02 19:31:39 Starting - Starting the training job...
2020-05-02 19:31:41 Starting - Launching requested ML instances...
2020-05-02 19:32:39 Starting - Preparing the instances for training............
2020-05-02 19:34:17 Downloading - Downloading input data......
2020-05-02 19:35:25 Training - Downloading the training image...........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 19:37:17,613 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 19:37:17,642 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 19:37:20,654 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 19:37:21,697 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py.
Generating setup.py
2020-05-02 19:37:21

#### Step 2: Get gradients, weights, biases

Once the training job has finished, we retrieve its tensors, such as gradients, weights and biases. We use the `smdebug` library, which provides functions to read and filter tensors. First we create a [trial](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md#Trial) that reads the tensors from S3.

For clarification: in the context of SageMaker Debugger, a trial is an object that lets you query tensors for a given training job. In the context of SageMaker Experiments, a trial is part of an experiment and represents a collection of training steps involved in a single training job.

In [34]:
'''
%%bash
unameOut="$(uname -s)"
case "${unameOut}" in
    Linux*)     machine=Linux;;
    Darwin*)    machine=Mac;;
esac
if [ "$machine" = "Mac" ] ; then
    PROTOC_ZIP=protoc-3.7.1-osx-x86_64.zip
    brew install unzip
    echo "1"
else
    PROTOC_ZIP=protoc-3.7.1-linux-x86_64.zip
    echo "2"
    #apt-get install sudo
    #pip install unzip
fi
curl -OL https://github.com/google/protobuf/releases/download/v3.7.1/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local include/*
rm -f $PROTOC_ZIP
'''


In [35]:
!pip freeze | grep proto
!protoc --version

protobuf==3.11.2
protobuf-compiler==1.0.20
protobuf3-to-dict==0.1.5
libprotoc 3.7.1



In [36]:
from smdebug.trials import create_trial

path = estimator.latest_job_debugger_artifacts_path()
print(path)
smdebug_trial = create_trial(path)

s3://sagemaker-us-east-2-005166108777/pytorch-training-2020-05-02-19-31-36-075/debug-output
[2020-05-02 19:47:58.813 ip-172-16-82-160:32640 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-2-005166108777/pytorch-training-2020-05-02-19-31-36-075/debug-output


To access tensor values, we only need to call `smdebug_trial.tensor()`. For instance, to get the ReLU output named `layer4.1.relu_0_output_0` (in the last residual block, `layer4.1`) at step 0, we run `smdebug_trial.tensor('layer4.1.relu_0_output_0').value(0, mode=modes.TRAIN)`. Next we compute a filter rank for the convolutions.

Some definitions: a filter is a collection of kernels (one kernel per input channel), and each filter produces one feature map (output channel). In the image below, the convolution creates 64 feature maps (output channels) and uses a 5x5 kernel. Pruning a filter removes an entire feature map. In the example image below, the number of feature maps (output channels) would shrink to 63 and the number of learnable parameters (weights) would be reduced by 1x5x5, that is, one 5x5 kernel for the single input channel.
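
The arithmetic behind this reduction can be checked in a few lines. The sketch assumes, as the 1x5x5 figure suggests, a convolution with a single input channel and no bias. The helper `conv_weights` is introduced here only for illustration.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    # one kernel of size kernel_size x kernel_size per (input, output) channel pair
    return out_channels * in_channels * kernel_size * kernel_size

before = conv_weights(in_channels=1, out_channels=64, kernel_size=5)
after = conv_weights(in_channels=1, out_channels=63, kernel_size=5)

print(before)          # 1600
print(after)           # 1575
print(before - after)  # 25, i.e. 1 x 5 x 5
```

With more input channels the saving per pruned filter grows accordingly: it is `in_channels * 5 * 5` in general.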

![](images/convolution.png) 


#### Step 3: Compute filter ranks

In this notebook we compute filter ranks as described in the article ["Pruning Convolutional Neural Networks for Resource Efficient Inference"](https://arxiv.org/pdf/1611.06440.pdf). We identify filters that are less important for the final prediction of the model. The product of a layer's activation outputs and the gradients with respect to those outputs can be seen as a measure of importance. The product has the dimensions `(batch_size, out_channels, height, width)`, and we average over `axis=0,2,3` to obtain a single value (rank) for each filter.
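
As a toy illustration of this criterion, the sketch below computes ranks for made-up activation and gradient arrays. The shapes and values are arbitrary. One channel is given zero gradients to mimic a filter that has no influence on the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up tensors with shape (batch_size, out_channels, height, width)
activation_output = rng.random((8, 4, 5, 5))
gradient = rng.normal(size=(8, 4, 5, 5))
gradient[:, 2] = 0.0  # pretend the loss is insensitive to channel 2

# element-wise product, averaged over batch and spatial dimensions
rank = np.mean(activation_output * gradient, axis=(0, 2, 3))

print(rank.shape)  # (4,) -> one value per filter
least_important = int(np.argmin(np.abs(rank)))
print(least_important)  # 2
```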

In the following code we retrieve activation outputs and gradients and compute the filter rank. 

In [37]:
import numpy as np
from smdebug import modes

def compute_filter_ranks(smdebug_trial, activation_outputs, gradients):
    # maps each activation output name to its rank, accumulated over all saved steps
    filters = {}
    for activation_output_name, gradient_name in zip(activation_outputs, gradients):
        for step in smdebug_trial.steps(mode=modes.TRAIN):
            
            # tensors have shape (batch_size, out_channels, height, width)
            activation_output = smdebug_trial.tensor(activation_output_name).value(step, mode=modes.TRAIN)
            gradient = smdebug_trial.tensor(gradient_name).value(step, mode=modes.TRAIN)
            # element-wise product, averaged over batch and spatial dimensions:
            # one value per output channel (filter)
            rank = activation_output * gradient
            rank = np.mean(rank, axis=(0,2,3))

            if activation_output_name not in filters:
                filters[activation_output_name] = 0
            filters[activation_output_name] += rank
    return filters

filters = compute_filter_ranks(smdebug_trial, activation_outputs, gradients)

[2020-05-02 20:13:58.222 ip-172-16-82-160:32640 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec.
[2020-05-02 20:13:59.241 ip-172-16-82-160:32640 INFO trial.py:210] Loaded all steps


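Next we normalize the filter ranks. For each layer we take the absolute values and divide by the layer's L2 norm, so that ranks from different layers are comparable: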

In [38]:
def normalize_filter_ranks(filters):
    for activation_output_name in filters:
        rank = np.abs(filters[activation_output_name])
        rank = rank / np.sqrt(np.sum(rank * rank))
        filters[activation_output_name] = rank
    return filters

filters = normalize_filter_ranks(filters)
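
A small worked example with made-up rank values shows the effect of this normalization: after it, the ranks of a layer are non-negative and have unit L2 norm.

```python
import numpy as np

# Made-up ranks of a layer with three filters
rank = np.array([3.0, -4.0, 0.0])

normalized = np.abs(rank)
normalized = normalized / np.sqrt(np.sum(normalized * normalized))

# normalized is approximately [0.6, 0.8, 0.0]
unit_norm = float(np.linalg.norm(normalized))
print(unit_norm)
```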

We create a list of filters, sort it by rank and retrieve the smallest values:

In [40]:
def get_smallest_filters(filters, n):
    filters_list = []
    for layer_name in sorted(filters.keys()):
        for channel in range(filters[layer_name].shape[0]): 
            filters_list.append((layer_name, channel, filters[layer_name][channel], ))

    filters_list.sort(key = lambda x: x[2])
    filters_list = filters_list[:n]
    print("The", n, "smallest filters", filters_list)
    
    return filters_list

filters_list = get_smallest_filters(filters, 100)

The 100 smallest filters [('layer3.1.relu_0_output_0', 78, 1.5483498e-05), ('layer2.1.relu_0_output_0', 11, 4.0021172e-05), ('layer4.1.relu_0_output_0', 273, 5.297476e-05), ('layer4.0.relu_0_output_0', 170, 0.00013654676), ('layer4.1.relu_0_output_0', 208, 0.00017223414), ('layer4.0.relu_0_output_0', 300, 0.00017899263), ('layer4.0.relu_0_output_0', 73, 0.00018451516), ('layer4.1.relu_0_output_0', 304, 0.00022502147), ('layer4.1.relu_0_output_0', 34, 0.00025934185), ('layer4.0.relu_0_output_0', 118, 0.00026283372), ('layer4.1.relu_0_output_0', 255, 0.00027957495), ('layer4.1.relu_0_output_0', 192, 0.00032759082), ('layer4.0.relu_0_output_0', 152, 0.0003959918), ('layer2.1.relu_0_output_0', 111, 0.0004314011), ('layer3.1.relu_0_output_0', 46, 0.0004343964), ('layer2.0.relu_0_output_0', 125, 0.0004375938), ('layer4.1.relu_0_output_0', 382, 0.00048023413), ('layer2.1.relu_0_output_0', 101, 0.0005084389), ('layer4.0.relu_0_output_0', 288, 0.00052248844), ('layer4.0.relu_0_output_0', 276, 0
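
To see how the selected filters are distributed across layers, we can count the entries per layer name. The sketch below uses only the first five tuples of the list printed above. In the notebook one would pass the full `filters_list` instead.

```python
from collections import Counter

# First five entries of the list printed above (the full list has 100 items)
filters_list = [
    ("layer3.1.relu_0_output_0", 78, 1.5483498e-05),
    ("layer2.1.relu_0_output_0", 11, 4.0021172e-05),
    ("layer4.1.relu_0_output_0", 273, 5.297476e-05),
    ("layer4.0.relu_0_output_0", 170, 0.00013654676),
    ("layer4.1.relu_0_output_0", 208, 0.00017223414),
]

# number of filters to prune per layer
per_layer = Counter(layer_name for layer_name, channel, rank in filters_list)
print(per_layer.most_common(1))  # [('layer4.1.relu_0_output_0', 2)]
```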

#### Step 4 and step 5: Prune low ranking filters and set new weights

Next we prune the model, where we remove filters and their corresponding weights. 

In [41]:
step = smdebug_trial.steps(mode=modes.TRAIN)[-1]

model = model_resnet.prune(model,  
                    filters_list, 
                    smdebug_trial, 
                    step)


Reduce output channels for conv layer layer1.0. from 64 to 63
Reduce bn layer layer1.0. from 64 to 63
Reduce output channels for conv layer layer2.0. from 128 to 125
Reduce bn layer layer2.0. from 128 to 125
Reduce output channels for conv layer layer2.1. from 128 to 123
Reduce bn layer layer2.1. from 128 to 123
Reduce output channels for conv layer layer3.0. from 256 to 249
Reduce bn layer layer3.0. from 256 to 249
Reduce output channels for conv layer layer3.1. from 256 to 245
Reduce bn layer layer3.1. from 256 to 245
Reduce output channels for conv layer layer4.0. from 512 to 474
Reduce bn layer layer4.0. from 512 to 474
Reduce output channels for conv layer layer4.1. from 512 to 477
Reduce bn layer layer4.1. from 512 to 477
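
The log above shows that the convolution and its batch norm layer shrink together. The implementation of `model_resnet.prune` is not shown in this notebook, so the following NumPy sketch only illustrates the general idea under simplifying assumptions: removing output channels from a convolution also requires removing the matching batch norm entries and the matching input channels of the next convolution. The function `prune_output_channels` and all arrays are hypothetical.

```python
import numpy as np

def prune_output_channels(conv_weight, bn_params, next_conv_weight, channels):
    """Remove `channels` from a conv layer, its batch norm and the next conv's inputs."""
    keep = np.setdiff1d(np.arange(conv_weight.shape[0]), channels)
    pruned_conv = conv_weight[keep]                          # fewer output channels
    pruned_bn = {k: v[keep] for k, v in bn_params.items()}   # one entry per channel
    pruned_next = next_conv_weight[:, keep]                  # fewer input channels
    return pruned_conv, pruned_bn, pruned_next

# Placeholder tensors with the shapes of a 64-channel 3x3 block
conv = np.zeros((64, 64, 3, 3))
bn = {name: np.zeros(64) for name in ("weight", "bias", "running_mean", "running_var")}
next_conv = np.zeros((64, 64, 3, 3))

pruned_conv, pruned_bn, pruned_next = prune_output_channels(conv, bn, next_conv, channels=[5])
print(pruned_conv.shape, pruned_bn["weight"].shape, pruned_next.shape)
# (63, 64, 3, 3) (63,) (64, 63, 3, 3)
```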


#### Step 6: Start next pruning iteration
Once we have pruned the model, the new architecture and pruned weights are saved under `src` and will be used by the training job in the next pruning iteration.

In [42]:
# save pruned model
checkpoint = {'model': model,
              'state_dict': model.state_dict()}

torch.save(checkpoint, 'src/model_checkpoint')

#clean up
del model

#### Overall workflow
The overall workflow looks like the following:
 ![](images/workflow.png)

### Run iterative model pruning

After having gone through the code step by step, we are ready to run the full workflow. The following cell runs 10 pruning iterations: each iteration starts a new SageMaker training job, which emits gradients and activation outputs to Amazon S3. Once the job has finished, filter ranks are computed and the 100 lowest-ranked filters are removed.



In [47]:
# start iterative pruning
for pruning_step in range(10):
    
    #create new trial for this pruning step
    smexperiments_trial = Trial.create(
        experiment_name=experiment_name,
        sagemaker_boto_client=sagemaker_boto_client
    )
    experiment_config["TrialName"] = smexperiments_trial.trial_name

    print("Created new trial", smexperiments_trial.trial_name, "for pruning step", pruning_step)
    
    #start training job
    estimator = PyTorch(role=sagemaker.get_execution_role(),
                  train_instance_count=1,
                  train_instance_type='ml.p2.xlarge',
                  train_volume_size=400,
                  source_dir='src',
                  entry_point='train.py',
                  framework_version='1.3.1',
                  py_version='py3',
                  metric_definitions=[ {'Name':'train:loss', 'Regex':'loss:(.*?)'}, {'Name':'eval:acc', 'Regex':'acc:(.*?)'} ],
                  enable_sagemaker_metrics=True,
                  hyperparameters = {'epochs': 10},
                  debugger_hook_config = debugger_hook_config
        )
    
    #start training job
    estimator.fit(inputs={'train': 's3://{}/101_ObjectCategories_train'.format(bucket), 
                      'test': 's3://{}/101_ObjectCategories_test'.format(bucket)}, 
              experiment_config=experiment_config)


    print("Training job", estimator.latest_training_job.name, " finished.")
    
    # read tensors
    path = estimator.latest_job_debugger_artifacts_path()
    smdebug_trial = create_trial(path)
    
    # compute filter ranks and get 100 smallest filters
    filters = compute_filter_ranks(smdebug_trial, activation_outputs, gradients)
    filters_normalized = normalize_filter_ranks(filters)  
    filters_list = get_smallest_filters(filters_normalized, 100)
        
    #load previous model 
    checkpoint = torch.load("src/model_checkpoint")
    model = checkpoint['model']
    model.load_state_dict(checkpoint['state_dict'])
    
    #prune model
    step = smdebug_trial.steps(mode=modes.TRAIN)[-1]
    model = model_resnet.prune(model, 
                        filters_list, 
                        smdebug_trial, 
                        step)
    
    print("Saving pruned model")
    
    # save pruned model
    checkpoint = {'model': model,
                  'state_dict': model.state_dict()}
    torch.save(checkpoint, 'src/model_checkpoint')
    
    #clean up
    del model

Created new trial Trial-2020-05-02-202227-bcvd for pruning step 0


INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-20-22-27-327


2020-05-02 20:22:30 Starting - Starting the training job...
2020-05-02 20:22:32 Starting - Launching requested ML instances...
2020-05-02 20:23:30 Starting - Preparing the instances for training.........
2020-05-02 20:24:54 Downloading - Downloading input data......
2020-05-02 20:25:46 Training - Downloading the training image............
2020-05-02 20:27:57 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 20:27:58,007 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 20:27:58,033 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 20:27:58,321 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 20:27:59,230 sagemaker-containers INFO     Module default_user_module_name

INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-20-40-42-292


2020-05-02 20:40:45 Starting - Starting the training job...
2020-05-02 20:40:46 Starting - Launching requested ML instances...
2020-05-02 20:41:41 Starting - Preparing the instances for training............
2020-05-02 20:43:19 Downloading - Downloading input data......
2020-05-02 20:44:34 Training - Downloading the training image...........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 20:46:27,632 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 20:46:27,661 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 20:46:30,714 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 20:46:31,663 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py.
Generating setup.py
2020-05-02 20:46:31

INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-20-58-52-562


2020-05-02 20:58:55 Starting - Starting the training job...
2020-05-02 20:58:57 Starting - Launching requested ML instances...
2020-05-02 20:59:55 Starting - Preparing the instances for training............
2020-05-02 21:01:43 Downloading - Downloading input data......
2020-05-02 21:02:39 Training - Downloading the training image............
2020-05-02 21:04:45 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 21:04:46,569 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 21:04:46,598 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 21:04:46,599 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 21:04:47,509 sagemaker-containers INFO     Module default_user_module_nam

INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-21-17-05-830


2020-05-02 21:17:08 Starting - Starting the training job...
2020-05-02 21:17:10 Starting - Launching requested ML instances......
2020-05-02 21:18:34 Starting - Preparing the instances for training............
2020-05-02 21:20:25 Downloading - Downloading input data......
2020-05-02 21:21:33 Training - Downloading the training image..............
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 21:23:48,101 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 21:23:48,129 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 21:23:49,547 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 21:23:50,417 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py.
Generating setup.py
2020-05-02 21

INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-21-36-21-118


2020-05-02 21:36:23 Starting - Starting the training job...
2020-05-02 21:36:25 Starting - Launching requested ML instances......
2020-05-02 21:37:48 Starting - Preparing the instances for training.........
2020-05-02 21:39:16 Downloading - Downloading input data......
2020-05-02 21:40:22 Training - Downloading the training image...............
2020-05-02 21:42:38 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 21:42:39,840 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 21:42:39,870 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 21:42:42,899 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 21:42:43,711 sagemaker-containers INFO     Module default_user_module_

INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-21-54-57-619


2020-05-02 21:55:00 Starting - Starting the training job...
2020-05-02 21:55:01 Starting - Launching requested ML instances...
2020-05-02 21:55:59 Starting - Preparing the instances for training............
2020-05-02 21:57:35 Downloading - Downloading input data......
2020-05-02 21:58:52 Training - Downloading the training image...........
2020-05-02 22:00:39 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 22:00:40,122 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 22:00:40,150 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 22:00:46,365 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 22:00:47,168 sagemaker-containers INFO     Module default_user_module_name

INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-22-13-11-011


2020-05-02 22:13:13 Starting - Starting the training job...
2020-05-02 22:13:14 Starting - Launching requested ML instances...
2020-05-02 22:14:12 Starting - Preparing the instances for training............
2020-05-02 22:16:03 Downloading - Downloading input data.........
2020-05-02 22:17:34 Training - Downloading the training image............
2020-05-02 22:19:26 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 22:19:28,207 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 22:19:28,237 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 22:19:28,239 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 22:19:28,988 sagemaker-containers INFO     Module default_user_module_

INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-22-31-45-762


2020-05-02 22:31:47 Starting - Starting the training job...
2020-05-02 22:31:50 Starting - Launching requested ML instances......
2020-05-02 22:33:14 Starting - Preparing the instances for training.........
2020-05-02 22:34:38 Downloading - Downloading input data......
2020-05-02 22:35:42 Training - Downloading the training image..............
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 22:37:55,290 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 22:37:55,315 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 22:37:58,326 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 22:37:59,024 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py.
Generating setup.py
2020-05-02 22:37

INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-22-50-13-041


2020-05-02 22:50:14 Starting - Starting the training job...
2020-05-02 22:50:16 Starting - Launching requested ML instances...
2020-05-02 22:51:11 Starting - Preparing the instances for training............
2020-05-02 22:52:50 Downloading - Downloading input data......
2020-05-02 22:53:47 Training - Downloading the training image...........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 22:55:54,025 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 22:55:54,052 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 22:55:54,053 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 22:55:54,702 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py.
Generating setup.py
2020-05-02 22:55:54

INFO:sagemaker:Creating training-job with name: pytorch-training-2020-05-02-23-07-10-116


2020-05-02 23:07:12 Starting - Starting the training job...
2020-05-02 23:07:13 Starting - Launching requested ML instances...
2020-05-02 23:08:08 Starting - Preparing the instances for training.........
2020-05-02 23:09:30 Downloading - Downloading input data.........
2020-05-02 23:10:58 Training - Downloading the training image...........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-05-02 23:12:51,469 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-05-02 23:12:51,498 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-05-02 23:12:52,920 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-05-02 23:12:53,531 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py.
Generating setup.py
2020-05-02 23:12:53

While the iterative model pruning is running, we can track and visualize our experiment in SageMaker Studio. In our training script we use SageMaker Debugger's `save_scalar` method to store the number of parameters in the model and the model accuracy. We can then visualize those values in Studio or use the `ExperimentAnalytics` module to read and plot them directly in the notebook.

Initially the model consisted of 11 million parameters. After 11 iterations, the number of parameters was reduced to 270k. Accuracy increased to about 91% and then started dropping after 8 pruning iterations.

This means that the best accuracy is reached when the model has about 4 million parameters, which shrinks the model by roughly 3x!

![](images/results_resnet.png)

In [51]:
from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(experiment_name=experiment_name)
accuracy = trial_component_analytics.dataframe()['scalar/accuracy_EVAL - Max']
print(accuracy)

0     0.865672
1     0.868159
2     0.880597
3     0.893035
4     0.890547
5     0.907960
6     0.902985
7     0.900498
8     0.898010
9     0.880597
10    0.848259
Name: scalar/accuracy_EVAL - Max, dtype: float64
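
As a quick check on the values printed above, the following minimal sketch locates the row with the highest accuracy and the last row that still matches or beats the first one. It copies the printed values into a plain list and assumes that the row index follows the order of the pruning iterations, which the output itself does not confirm. The variable names are illustrative.

```python
# Accuracy values copied from the ExperimentAnalytics output above.
# Assumption: row index corresponds to the pruning iteration order.
accuracy = [
    0.865672, 0.868159, 0.880597, 0.893035, 0.890547, 0.907960,
    0.902985, 0.900498, 0.898010, 0.880597, 0.848259,
]

# Row with the highest validation accuracy.
best_row = max(range(len(accuracy)), key=lambda i: accuracy[i])
best_accuracy = accuracy[best_row]

# Last row whose accuracy is still at least as good as row 0.
baseline = accuracy[0]
last_row_at_baseline = max(i for i, a in enumerate(accuracy) if a >= baseline)

print(f"best row: {best_row}, accuracy: {best_accuracy:.4f}")
print(f"last row at or above row 0: {last_row_at_baseline}")
```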


### Additional: run iterative model pruning with custom rule

In the previous example, we have seen that accuracy drops once the model has fewer than about 4 million parameters. Clearly, we want to stop our experiment once we reach this point. We can define a custom rule that returns `True` if the accuracy drops by a certain percentage. You can find an example implementation in `custom_rule/check_accuracy.py`. Before we can use the rule we have to define a custom rule configuration:

```python

from sagemaker.debugger import Rule, CollectionConfig, rule_configs

check_accuracy_rule = Rule.custom(
    name='CheckAccuracy',
    image_uri='759209512951.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
    instance_type='ml.c4.xlarge',
    volume_size_in_gb=400,
    source='custom_rule/check_accuracy.py',
    rule_to_invoke='check_accuracy',
    rule_parameters={"previous_accuracy": "0.0",
                     "threshold": "0.05",
                     "predictions": "CrossEntropyLoss_0_input_0",
                     "labels": "CrossEntropyLoss_0_input_1"},
)
```

The rule reads the inputs to the loss function, which are the model predictions and the labels. It computes the accuracy and returns `True` if its value has dropped by more than 5%, and `False` otherwise.
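
The check described above can be sketched in plain Python. This is not the code in `custom_rule/check_accuracy.py`: the function names are hypothetical, the predictions are assumed to be per-class scores, and the 5% drop is assumed to be an absolute difference in accuracy rather than a relative one.

```python
def compute_accuracy(predictions, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(predictions, labels):
        # Index of the largest score is the predicted class.
        predicted_class = max(range(len(scores)), key=lambda i: scores[i])
        if predicted_class == label:
            correct += 1
    return correct / len(labels)


def accuracy_dropped(predictions, labels, previous_accuracy, threshold):
    """True if accuracy fell by more than `threshold` (absolute, assumed)."""
    current_accuracy = compute_accuracy(predictions, labels)
    return (previous_accuracy - current_accuracy) > threshold


# Toy batch: four samples, two classes, three predictions are correct.
predictions = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [1, 0, 0, 0]

print(compute_accuracy(predictions, labels))
print(accuracy_dropped(predictions, labels, previous_accuracy=0.90, threshold=0.05))
```

With a threshold of `0.05`, the toy batch triggers the check when the previous accuracy was `0.90` but not when it was `0.78`.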

In each pruning iteration, we need to pass the accuracy of the previous training job to the rule, which can be retrieved via the `ExperimentAnalytics` module.

```python
from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(experiment_name=experiment_name)
accuracy = trial_component_analytics.dataframe()['scalar/accuracy_EVAL - Max'][0]
```
We then overwrite the value in the rule configuration:

```python
check_accuracy_rule.rule_parameters["previous_accuracy"] = str(accuracy)
```

In the PyTorch estimator we need to add the argument `rules = [check_accuracy_rule]`.
To stop the training job when the rule triggers, we can create a CloudWatch alarm and use a Lambda function. Detailed instructions can be found [here](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_action_on_rule). In each iteration we check the job status, and if the previous job has been stopped, we exit the loop:

```python
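# This fragment runs inside the pruning loop, after the training job has finished.
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)

# Leave the pruning loop if the job was stopped.
if description['TrainingJobStatus'] == 'Stopped':
    break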
```
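
To show where this status check sits in the pruning loop, here is a minimal sketch that runs without AWS access. `FakeClient` is a hypothetical stand-in for `estimator.sagemaker_session.sagemaker_client`, and `job_was_stopped` and `run_pruning_loop` are illustrative names that do not appear in the notebook. Only `describe_training_job` and the `TrainingJobStatus` field mirror the real client call used above.

```python
def job_was_stopped(client, job_name):
    """Return True if the training job ended in the 'Stopped' state."""
    description = client.describe_training_job(TrainingJobName=job_name)
    return description["TrainingJobStatus"] == "Stopped"


class FakeClient:
    """Hypothetical stand-in for the SageMaker client, for illustration only."""

    def __init__(self, statuses):
        self._statuses = statuses

    def describe_training_job(self, TrainingJobName):
        return {"TrainingJobStatus": self._statuses[TrainingJobName]}


def run_pruning_loop(client, job_names):
    """Process jobs in order and exit at the first job that was stopped."""
    finished = []
    for job_name in job_names:
        # In the notebook, the training job for this iteration would run here.
        if job_was_stopped(client, job_name):
            break
        finished.append(job_name)
    return finished


client = FakeClient({
    "job-0": "Completed",
    "job-1": "Completed",
    "job-2": "Stopped",
    "job-3": "Completed",
})
finished = run_pruning_loop(client, ["job-0", "job-1", "job-2", "job-3"])
print(finished)
```

Wrapping the check in a small function keeps the loop body short and makes the exit condition easy to test with a stub client.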
