# Using SageMaker Debugger to monitor attentions in BERT model training

See: https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_custom_container

## Fine-Tune a RoBERTa Model and Create a Text Classifier (Sentiment Analysis)

BERT is built on the Transformer architecture, which relies on an attention mechanism. This is, not coincidentally, the name of the popular BERT Python library, “Transformers,” maintained by a company called Hugging Face. We will use a variant of BERT called [RoBERTa](https://arxiv.org/abs/1907.11692) - a Robustly Optimized BERT Pretraining Approach.

In [None]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Retrieve Pre-Processed Data

In [None]:
%store -r processed_train_data_s3_uri

In [None]:
print(processed_train_data_s3_uri)
!aws s3 ls $processed_train_data_s3_uri/

In [None]:
%store -r processed_validation_data_s3_uri

In [None]:
print(processed_validation_data_s3_uri)
!aws s3 ls $processed_validation_data_s3_uri/

In [None]:
%store -r processed_test_data_s3_uri

In [None]:
print(processed_test_data_s3_uri)
!aws s3 ls $processed_test_data_s3_uri/

# Specify S3 `Distribution Strategy`

In [None]:
from sagemaker.inputs import TrainingInput

s3_input_train_data = TrainingInput(s3_data=processed_train_data_s3_uri, 
                                         distribution='ShardedByS3Key') 
s3_input_validation_data = TrainingInput(s3_data=processed_validation_data_s3_uri, 
                                              distribution='ShardedByS3Key')
s3_input_test_data = TrainingInput(s3_data=processed_test_data_s3_uri, 
                                        distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

# Set Up Hyper-Parameters for Classification Layer

In [None]:
max_seq_len=64

In [None]:
model_name='roberta-base'
epochs=3
lr=2e-5
train_batch_size=64
train_steps_per_epoch=100
validation_batch_size=64
test_batch_size=64
seed=42
backend='gloo'
train_instance_count=2
train_instance_type='ml.p3.2xlarge'
train_volume_size=1024
enable_sagemaker_debugger=True
enable_checkpointing=False
input_mode='File'
run_validation=True
run_test=True
run_sample_predictions=True

In [None]:
hyperparameters={
        'model_name': model_name,
        'epochs': epochs,
        'lr': lr,
        'train_batch_size': train_batch_size,
        'train_steps_per_epoch': train_steps_per_epoch,
        'validation_batch_size': validation_batch_size,
        'test_batch_size': test_batch_size,
        'seed': seed,
        'max_seq_len': max_seq_len,
        'backend': backend,
        'enable_checkpointing': enable_checkpointing,
        'enable_sagemaker_debugger': enable_sagemaker_debugger,
        'run_validation': run_validation,
        'run_sample_predictions': run_sample_predictions}

# Set Up Metrics To Track Model Performance

This sample log line...
```
[step: 0] val_loss: 0.55 - val_acc: 74.64%
```

...will produce the following 2 metrics in CloudWatch:

`validation:loss` = 0.55

`validation:accuracy` = 74.64

<img src="img/cloudwatch_train_accuracy.png" width="50%" align="left">

<img src="img/cloudwatch_train_loss.png" width="50%" align="left">

In [None]:
metric_definitions = [
     {'Name': 'train:loss', 'Regex': 'train_loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'train_acc: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_acc: ([0-9\\.]+)'},
]
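
To see what these patterns capture, we can apply them locally to a log line. This is only an approximation using Python's `re` module, not SageMaker's own log scraper; the helper `extract_metrics` is a hypothetical name introduced for illustration.

```python
import re

# Same patterns as in the metric definitions above.
metric_definitions = [
    {'Name': 'train:loss', 'Regex': 'train_loss: ([0-9\\.]+)'},
    {'Name': 'train:accuracy', 'Regex': 'train_acc: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_acc: ([0-9\\.]+)'},
]

def extract_metrics(log_line, definitions):
    """Return {metric name: value} for every pattern that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], log_line)
        if match:
            # The first capture group holds the numeric value.
            found[definition['Name']] = float(match.group(1))
    return found

sample = '[step: 0] val_loss: 0.55 - val_acc: 74.64%'
print(extract_metrics(sample, metric_definitions))
```

Only the two validation patterns match this line, so two metrics are produced. The trailing `%` is not part of the capture group and is dropped.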

# Set Up SageMaker Debugger
Define Debugger Rules as described here:  https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html

SageMaker Debugger provides default collections for gradients, weights and biases. The default `save_interval` is 100 steps. A step represents the work done by the training job for one batch (i.e. forward and backward pass). 

In this example we are also interested in attention scores and in the query and key output tensors. We can emit them by defining a new [collection](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#collection). Here we call the collection `all` and give it the regex `.*`, which matches every tensor. The configuration below saves every iteration during both the training phase (`train.save_interval`) and the validation phase (`eval.save_interval`).
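
To make the effect of these settings concrete, here is a small plain-Python sketch. It assumes a simplified rule: a step is saved when its number is a multiple of `save_interval`, or when it is listed explicitly in `save_steps`. The function `saved_steps` is hypothetical and does not call smdebug; the real hook may apply further conditions.

```python
def saved_steps(total_steps, save_interval=None, save_steps=()):
    """List the step numbers in [0, total_steps) that would be saved.

    Assumption: a step is kept if it is a multiple of save_interval
    or appears in save_steps. This mimics, but does not call, smdebug.
    """
    explicit = set(save_steps)
    return [
        step for step in range(total_steps)
        if step in explicit or (save_interval and step % save_interval == 0)
    ]

print(saved_steps(5, save_interval=1))      # every step
print(saved_steps(500, save_interval=100))  # every 100th step
print(saved_steps(5, save_steps=[0]))       # only the first step
```

With an interval of 1, every batch produces tensors in S3, which is convenient for visualization but increases storage and I/O for long training jobs.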


We also add the following lines in the validation loop to record the string representation of input tokens:
```python
if hook.get_collections()['all'].save_config.should_save_step(modes.EVAL, hook.mode_steps[modes.EVAL]):  
   hook._write_raw_tensor_simple("input_tokens", input_tokens)
```

In [None]:
from sagemaker.debugger import Rule
from sagemaker.debugger import rule_configs
from sagemaker.debugger import CollectionConfig
from sagemaker.debugger import DebuggerHookConfig
from sagemaker.debugger import TensorBoardOutputConfig

In [None]:
debugger_hook_config = DebuggerHookConfig(
    s3_output_path='s3://{}'.format(bucket),
    hook_parameters={
        "save_interval": "1",
        "train.save_interval": "1",
        "eval.save_interval": "1"
    },
    collection_configs=[
        CollectionConfig(
            name="all",
            parameters={
                "include_regex": ".*",
                "train.save_interval": "1",
                "eval.save_interval": "1"
            }
        )
    ]
)

# Set Up Our RoBERTa + PyTorch Script to Run on SageMaker
Prepare our PyTorch model to run on the managed SageMaker service

In [None]:
from sagemaker.pytorch import PyTorch as PyTorchEstimator

estimator = PyTorchEstimator(
    entry_point='train.py',
    source_dir='src',
    role=role,
    instance_count=train_instance_count,
    instance_type=train_instance_type,
    volume_size=train_volume_size,
    py_version='py3',
    framework_version='1.6.0',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    input_mode=input_mode,
    # rules=rules,
    debugger_hook_config=debugger_hook_config
)

In [None]:
estimator.fit(inputs={'train': s3_input_train_data, 
                      'validation': s3_input_validation_data,
                      'test': s3_input_test_data
                     },
              wait=False)

In [None]:
training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))

In [None]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [None]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [None]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(bucket, training_job_name, region)))


In [None]:
estimator.latest_training_job.wait(logs=False)

# _Wait Until the ^^ Training Job ^^ Completes Above!_

We can check the S3 location of tensors:

In [None]:
tensor_path = estimator.latest_job_debugger_artifacts_path()
print('Tensors are stored in: {}'.format(tensor_path))

### Get tensors and visualize BERT model training in real-time
In this section, we will retrieve the tensors of our training job and create the attention-head view and neuron view as described in [Visualizing Attention in Transformer-Based Language Representation Models [1]](https://arxiv.org/pdf/1904.02679.pdf).

First we create the [trial](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md#Trial) that points to the tensors in S3:

In [None]:
from smdebug.trials import create_trial

trial = create_trial(tensor_path)

In [None]:
trial.tensor_names()

Next we import the scripts that implement the attention-head view and the neuron view in Bokeh.

In [None]:
from utils import attention_head_view, neuron_view
from ipywidgets import interactive

We will use the tensors from the validation phase. In the next cell we check if such tensors are already available or not.

In [None]:
import time
import numpy as np
from smdebug import modes

# Poll every 15 seconds until the first validation step has been written to S3
while True:
    if len(trial.steps(modes.EVAL)) == 0:
        print("Tensors from validation phase not available yet")
    else:
        step = trial.steps(modes.EVAL)[0]
        break
    time.sleep(15) 
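
The loop above runs forever if the training job fails before validation starts. As a sketch of a bounded alternative, the hypothetical helper below polls a callable until it returns a non-empty list or a timeout expires. The name `wait_for_steps` and its default values are assumptions, not part of smdebug.

```python
import time

def wait_for_steps(get_steps, timeout_s=1800, poll_s=15, sleep=time.sleep):
    """Poll get_steps() until it returns a non-empty list, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        steps = get_steps()
        if len(steps) > 0:
            return steps
        if time.monotonic() >= deadline:
            raise TimeoutError("No steps available after {} seconds".format(timeout_s))
        print("Tensors not available yet")
        sleep(poll_s)

# Stand-in for trial.steps(modes.EVAL): empty twice, then two steps.
responses = iter([[], [], [0, 1]])
print(wait_for_steps(lambda: next(responses), poll_s=0))
```

In the notebook this could be called as `wait_for_steps(lambda: trial.steps(modes.EVAL))`, using the `trial` and `modes` objects from the cells above.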

Once the validation phase has started, we can retrieve the tensors from S3. In particular, we are interested in the outputs of the attention cells, which give the attention scores. First we collect the tensor names; note that the regex `.*` used here matches every saved tensor, not only the attention scores:

In [None]:
tensor_names = []

for tname in sorted(trial.tensor_names(regex='.*')):
    tensor_names.append(tname)

Next we iterate over the available tensors of the validation phase. We retrieve tensor values with `trial.tensor(tname).value(step, modes.EVAL)`. Note: if training is still in progress, not all steps will be available yet. 

In [None]:
steps = trial.steps(modes.EVAL)
tensors = {}

for step in steps:
    print("Reading tensors from step", step)
    for tname in tensor_names: 
        if tname not in tensors:
            tensors[tname]={}
        tensors[tname][step] = trial.tensor(tname).value(step, modes.EVAL)
num_heads = tensors[tname][step].shape[1]

Next we get the query and key output tensor names:

In [None]:
layers = []
layer_names = {}

for index, (key, query) in enumerate(zip(trial.tensor_names(regex='.*key_output_'), trial.tensor_names(regex='.*query_output_'))):
    layers.append([key,query])
    layer_names[key.split('_')[1]] = index

We also retrieve the string representation of the input tokens that were fed into our model during validation.

In [None]:
input_tokens = trial.tensor('input_tokens').value(0, modes.EVAL)

#### Attention Head View

The attention-head view shows the attention scores between different tokens. The thicker the line the higher the score. For demonstration purposes, we will limit the visualization to the first 20 tokens. We can select different attention heads and different layers. As training progresses attention scores change and we can check that by selecting a different step. 
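
As a rough illustration of the data behind this view, the sketch below builds a synthetic attention tensor and slices out the scores for one head and the first 20 tokens. It assumes the layout *batch size, number of attention heads, sequence length, sequence length*; the sizes and the helper `head_scores` are made up for the example and are not taken from `utils.attention_head_view`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch_size, num_heads, seq_len = 2, 12, 64   # illustrative sizes only
attention = softmax(rng.normal(size=(batch_size, num_heads, seq_len, seq_len)))

def head_scores(attention, head, n_tokens, sample=0):
    """Scores between the first n_tokens tokens for one sample and one head."""
    return attention[sample, head, :n_tokens, :n_tokens]

scores = head_scores(attention, head=3, n_tokens=20)
print(scores.shape)  # (20, 20)
```

Each row of the full tensor sums to 1 over all tokens, so after truncating to the first 20 tokens a row generally sums to less than 1.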

**Note:** The following cells run fine in Jupyter. If you are using JupyterLab and encounter issues with the Jupyter widgets (e.g. dropdown menu not displaying), check the subsection at the end of the notebook.

In [None]:
n_tokens = 20
view = attention_head_view.AttentionHeadView(input_tokens, 
                                             tensors,  
                                             step=trial.steps(modes.EVAL)[0],
                                             layer='bertencoder0_transformer0_multiheadattentioncell0_output_1',
                                             n_tokens=n_tokens)

In [None]:
interactive(view.select_layer, layer=tensor_names)

In [None]:
interactive(view.select_head, head=np.arange(num_heads))

In [None]:
interactive(view.select_step, step=trial.steps(modes.EVAL))

The following code cell updates the dictionary `tensors` with the latest tensors from the training job. Once the dict is updated, we can go back to the code cell that creates `attention_head_view.AttentionHeadView` and re-execute it and the subsequent cells to plot the latest attention scores.

In [None]:
all_steps = trial.steps(modes.EVAL)
new_steps = sorted(set(all_steps) - set(steps))  # steps not read yet

for step in new_steps: 
    for tname in tensor_names:  
        if tname not in tensors:
            tensors[tname]={}
        tensors[tname][step] = trial.tensor(tname).value(step, modes.EVAL)

#### Neuron view

To create the neuron view as described in the paper [Visualizing Attention in Transformer-Based Language Representation Models [1]](https://arxiv.org/pdf/1904.02679.pdf), we need to retrieve the queries and keys from the model. The tensors are reshaped and transposed to have the shape: *batch size, number of attention heads, sequence length, attention head size*

**Note:** The following cells run fine in Jupyter. If you are using JupyterLab and encounter issues with the Jupyter widgets (e.g. dropdown menu not displaying), check the subsection at the end of the notebook.

In [None]:
queries = {}
steps = trial.steps(modes.EVAL)

for step in steps:
    print("Reading tensors from step", step)
    
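    for tname in trial.tensor_names(regex='.*query_output'):
        query = trial.tensor(tname).value(step, modes.EVAL)
        # Split the hidden dimension into heads: (batch, seq_len, num_heads, head_size)
        query = query.reshape((query.shape[0], query.shape[1], num_heads, -1))
        # Move heads before the sequence axis: (batch, num_heads, seq_len, head_size)
        query = query.transpose(0, 2, 1, 3)
        if tname not in queries:
            queries[tname] = {}
        queries[tname][step] = query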

Retrieve the key vectors:
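
The cell for this step is not shown here, so the following is a sketch rather than the original code. It mirrors the query cell above and assumes that the key tensors match the regex `.*key_output`, the same prefix used when building `layers`. The helper names `split_heads` and `collect_per_head` are hypothetical.

```python
import numpy as np

def split_heads(tensor, num_heads):
    """Reshape (batch, seq_len, hidden) to (batch, num_heads, seq_len, head_size)."""
    batch_size, seq_len, _ = tensor.shape
    return tensor.reshape((batch_size, seq_len, num_heads, -1)).transpose(0, 2, 1, 3)

def collect_per_head(trial, regex, mode, num_heads):
    """Build {tensor name: {step: per-head tensor}} for all tensors matching regex."""
    collected = {}
    for step in trial.steps(mode):
        for tname in trial.tensor_names(regex=regex):
            value = trial.tensor(tname).value(step, mode)
            collected.setdefault(tname, {})[step] = split_heads(value, num_heads)
    return collected

# In the notebook, with `trial`, `modes` and `num_heads` from the earlier cells:
# keys = collect_per_head(trial, '.*key_output', modes.EVAL, num_heads)

# Shape check on synthetic data: hidden size 768 split across 12 heads.
demo = np.zeros((2, 64, 768))
print(split_heads(demo, num_heads=12).shape)  # (2, 12, 64, 64)
```

The resulting `keys` dictionary has the same layout as `queries`, which is what the `NeuronView` call below expects for its `keys` argument.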

We can now select different query vectors and see how they produce different attention scores. We can also select different steps to see how attention scores, query and key vectors change as training progresses. The neuron view shows:
* Query
* Key
* Query x Key (element-wise product)
* Query * Key (dot product)
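
The four quantities listed above can be reproduced for a single attention head with a few lines of NumPy. This is a sketch on random data with made-up sizes; the scaled softmax at the end follows the standard Transformer attention formula and is an assumption about how the scores relate to the dot products, not code taken from `utils.neuron_view`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, head_size = 4, 8                            # illustrative sizes only
query_vector = rng.normal(size=head_size)             # query of the selected token
key_vectors = rng.normal(size=(n_tokens, head_size))  # one key vector per token

# Element-wise product: one row per token, one column per neuron.
elementwise = query_vector * key_vectors              # shape (n_tokens, head_size)

# Dot product: sum of the element-wise product over the neurons.
dot = elementwise.sum(axis=-1)                        # shape (n_tokens,)

# Scaled softmax over tokens turns the dot products into attention weights.
scaled = dot / np.sqrt(head_size)
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()

print(elementwise.shape, dot.shape)
print(np.allclose(dot, key_vectors @ query_vector))
```

Looking at the element-wise product shows which individual neurons contribute most to a large dot product between a query and a key.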

In [None]:
view = neuron_view.NeuronView(input_tokens, 
                              keys=keys, 
                              queries=queries, 
                              layers=layers, 
                              step=trial.steps(modes.EVAL)[0], 
                              n_tokens=n_tokens,
                              layer_names=layer_names)

In [None]:
interactive(view.select_query, query=np.arange(n_tokens))

In [None]:
interactive(view.select_layer, layer=list(layer_names.keys()))

In [None]:
interactive(view.select_step, step=trial.steps(modes.EVAL))

In [None]:
model_s3_uri = estimator.model_data
print(model_s3_uri)

In [None]:
!mkdir -p ./tmp/model/

In [None]:
!aws s3 cp $model_s3_uri ./tmp/model/model.tar.gz

In [None]:
!tar -xvzf ./tmp/model/model.tar.gz -C ./tmp/model/

# Analyze Debugger Rules

In [None]:
#estimator.latest_training_job.rule_job_summary()

In [None]:
training_job_debugger_artifacts_path = estimator.latest_job_debugger_artifacts_path()
print(training_job_debugger_artifacts_path)


In [None]:
from smdebug.trials import create_trial
trial = create_trial(training_job_debugger_artifacts_path)

In [None]:
trial.tensor_names()

# Pass Variables to the Next Notebook(s)

In [None]:
%store model_s3_uri

In [None]:
%store training_job_name

In [None]:
#%store training_job_debugger_artifacts_path

In [None]:
%store

# Release Resources

In [None]:
#%%javascript
#Jupyter.notebook.save_checkpoint();
#Jupyter.notebook.session.delete();