## Using SageMaker Debugger to monitor attention in BERT model training

[BERT](https://arxiv.org/abs/1810.04805) is a deep bidirectional transformer model that achieves state-of-the-art results in NLP tasks such as question answering and text classification.

The paper [Visualizing Attention in Transformer-Based Language Representation Models [1]](https://arxiv.org/pdf/1904.02679.pdf) shows that plotting attentions and individual neurons in the query and key vectors can help to identify causes of incorrect model predictions.
With SageMaker Debugger, we can retrieve those tensors and plot them in real time as training progresses, which may help us understand what the model is learning.

The animation below shows the attention scores of the first 20 input tokens for the first 10 iterations in the training.

<img src='images/attention_scores.gif' width='350' /> 
Fig. 1: Attention scores of the first head in the 7th layer 

[1] Jesse Vig. *Visualizing Attention in Transformer-Based Language Representation Models*. arXiv:1904.02679, 2019.

### Get tensors and visualize BERT model training in real-time
In this section, we will retrieve the tensors of our training job and create the attention-head view and neuron view as described in [Visualizing Attention in Transformer-Based Language Representation Models [1]](https://arxiv.org/pdf/1904.02679.pdf).

First we create the [trial](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md#Trial) that points to the saved tensors. The path can be an S3 location or, as here, a local directory:

In [None]:
import numpy as np

from smdebug.trials import create_trial
trial = create_trial('/tmp/tensors')

In [None]:
for i in trial.tensor_names():
    print(i)

Next we import a script that implements the attention-head view and the neuron view in Bokeh.

In [None]:
from utils import attention_head_view, neuron_view
from ipywidgets import interactive

Once the validation phase has started, we can retrieve the tensors from S3. In particular, we are interested in the outputs of the attention cells, which give the attention scores. First we get the tensor names of the attention scores:

In [None]:
# Sorted so that the layers appear in a stable order.
tensor_names = sorted(trial.tensor_names(regex='.*attention.dropout_output_0'))

Next we iterate over the available steps and retrieve the tensor values with `trial.tensor(tname).value(step)`. A mode such as `modes.EVAL` can be passed as a second argument to read validation-phase steps specifically. Note: if training is still in progress, not all steps will be available yet.

In [None]:
steps = trial.steps()
tensors = {}

for step in steps:
    print("Reading tensors from step", step)
    for tname in tensor_names:
        if tname not in tensors:
            tensors[tname] = {}
        tensors[tname][step] = trial.tensor(tname).value(step)

# Axis 1 of each attention tensor indexes the attention heads.
num_heads = tensors[tensor_names[0]][steps[0]].shape[1]
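
As a rough illustration of what the dictionary `tensors` holds, the sketch below builds a synthetic attention tensor. It assumes the layout (batch size, number of heads, sequence length, sequence length), which is consistent with reading `num_heads` from `shape[1]` above. The sizes and the random data are made up for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
batch_size, n_heads, seq_len = 2, 12, 32

rng = np.random.default_rng(0)
logits = rng.normal(size=(batch_size, n_heads, seq_len, seq_len))

# Softmax over the last axis turns raw scores into attention weights.
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Same access pattern as above: axis 1 indexes the attention heads.
num_heads_example = weights.shape[1]

# Attention of the first sample and first head, restricted to the first 20 tokens.
n_tokens_example = 20
head_scores = weights[0, 0, :n_tokens_example, :n_tokens_example]
print(num_heads_example, head_scores.shape)
```

Each row of `head_scores` describes how strongly one token attends to the other tokens in the selected window.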

Next we get the query and key output tensor names:

In [None]:
layers = []
layer_names = {}

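# Sort both lists so that each key tensor is paired with the query tensor of the same layer.
key_names = sorted(trial.tensor_names(regex='.*k_lin_output_0'))
query_names = sorted(trial.tensor_names(regex='.*q_lin_output_0'))

for index, (key, query) in enumerate(zip(key_names, query_names)):
    layers.append([key, query])
    layer_names[key.split('_')[0]] = index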
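
The pairing above relies on the key and query names lining up layer by layer. A minimal sketch with hypothetical tensor names (modelled on the `distilbert.transformer.layer.0.attention...` name used later in this notebook; the real names depend on the model) shows how the names are paired and what `layer_names` ends up containing:

```python
# Hypothetical tensor names; the real ones come from trial.tensor_names().
example_key_names = [
    "distilbert.transformer.layer.1.attention.k_lin_output_0",
    "distilbert.transformer.layer.0.attention.k_lin_output_0",
]
example_query_names = [
    "distilbert.transformer.layer.0.attention.q_lin_output_0",
    "distilbert.transformer.layer.1.attention.q_lin_output_0",
]

example_layers = []
example_layer_names = {}

# Sorting both lists pairs each key with the query of the same layer.
pairs = zip(sorted(example_key_names), sorted(example_query_names))
for index, (key, query) in enumerate(pairs):
    example_layers.append([key, query])
    # Everything before the first underscore serves as the layer label.
    example_layer_names[key.split("_")[0]] = index

print(example_layer_names)
```

The values in `example_layer_names` are positions in `example_layers`, so a label selected in a dropdown can be mapped back to its key and query tensors.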

We also retrieve the string representation of the input tokens that were fed into our model during validation.

In [None]:
input_tokens = trial.tensor('input_tokens').value(0)

#### Attention Head View

The attention-head view shows the attention scores between different tokens. The thicker the line, the higher the score. For demonstration purposes, we limit the visualization to the first 20 tokens. We can select different attention heads and different layers. As training progresses, the attention scores change, and we can inspect this by selecting a different step.

**Note:** The following cells run fine in Jupyter. If you are using JupyterLab and encounter issues with the Jupyter widgets (e.g. dropdown menu not displaying), check the subsection at the end of the notebook.

In [None]:
n_tokens = 20
view = attention_head_view.AttentionHeadView(input_tokens, 
                                             tensors,  
                                             step=trial.steps()[0],
                                             layer='distilbert.transformer.layer.0.attention.dropout_output_0',
                                             n_tokens=n_tokens)

In [None]:
interactive(view.select_layer, layer=tensor_names)

In [None]:
interactive(view.select_head, head=np.arange(num_heads))

The following code cell updates the dictionary `tensors` with the latest tensors from the training job. Once the dictionary is updated, go back to the cell above that creates `attention_head_view.AttentionHeadView` and re-execute it and the subsequent cells to plot the latest attention scores.

In [None]:
all_steps = trial.steps()
# Only steps that are available now but were not read before.
new_steps = sorted(set(all_steps) - set(steps))

for step in new_steps: 
    for tname in tensor_names:  
        if tname not in tensors:
            tensors[tname]={}
        tensors[tname][step] = trial.tensor(tname).value(step)
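
The step bookkeeping in the cell above can be illustrated without a training job. The sketch below uses made-up step numbers and plain Python sets to pick out the steps that have not been read yet. The names `known_steps` and `available_steps` are hypothetical stand-ins for `steps` and `trial.steps()`.

```python
# Hypothetical step numbers; in the notebook these come from trial.steps().
known_steps = [0, 500, 1000]
available_steps = [0, 500, 1000, 1500, 2000]

# Steps that are available now but were not read before.
unread_steps = sorted(set(available_steps) - set(known_steps))
print(unread_steps)

# Remember what has been read so that a later refresh only fetches newer steps.
known_steps = sorted(set(known_steps) | set(unread_steps))
```

A plain set difference is used here rather than a symmetric difference, since only the steps missing from the local copy are of interest.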

#### Neuron view

To create the neuron view as described in the paper [Visualizing Attention in Transformer-Based Language Representation Models [1]](https://arxiv.org/pdf/1904.02679.pdf), we need to retrieve the queries and keys from the model. The tensors are reshaped and transposed to have the shape *(batch size, number of attention heads, sequence length, attention head size)*.

**Note:** The following cells run fine in Jupyter. If you are using JupyterLab and encounter issues with the Jupyter widgets (e.g. dropdown menu not displaying), check the subsection at the end of the notebook.

In [None]:
queries = {}
steps = trial.steps()

for step in steps:
    print("Reading tensors from step", step)

    for tname in trial.tensor_names(regex='.*q_lin_output_0'):
        query = trial.tensor(tname).value(step)
        query = query.reshape((query.shape[0], query.shape[1], num_heads, -1))
        query = query.transpose(0, 2, 1, 3)
        if tname not in queries:
            queries[tname] = {}
        queries[tname][step] = query

Retrieve the key vectors:

In [None]:
keys = {}
steps = trial.steps()

for step in steps:
    print("Reading tensors from step", step)

    for tname in trial.tensor_names(regex='.*k_lin_output_0'):
        key = trial.tensor(tname).value(step)
        key = key.reshape((key.shape[0], key.shape[1], num_heads, -1))
        key = key.transpose(0, 2, 1, 3)
        if tname not in keys:
            keys[tname] = {}
        keys[tname][step] = key
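
The reshape and transpose applied to the queries and keys can be checked on a small synthetic array. The sketch assumes that the linear-layer output has the shape (batch size, sequence length, hidden size), as the reshape above implies. All sizes are made up.

```python
import numpy as np

# Hypothetical sizes; the hidden size must be divisible by the number of heads.
batch_size, seq_len, n_heads, head_size = 2, 5, 4, 3
hidden_size = n_heads * head_size

# Stand-in for the output of a query or key linear layer.
flat = np.arange(batch_size * seq_len * hidden_size, dtype=np.float32)
flat = flat.reshape(batch_size, seq_len, hidden_size)

# Split the hidden dimension into heads, then move the head axis forward.
per_head = flat.reshape(batch_size, seq_len, n_heads, -1).transpose(0, 2, 1, 3)
print(per_head.shape)

# Head h of token t is the slice [h * head_size, (h + 1) * head_size) of that token's hidden vector.
same = np.array_equal(per_head[0, 1, 2], flat[0, 2, head_size:2 * head_size])
print(same)
```

After the transpose, indexing `per_head[sample, head]` yields a (sequence length, head size) matrix, which is the form the neuron view works with.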

We can now select different query vectors and see how they produce different attention scores. We can also select different steps to see how attention scores, query and key vectors change as training progresses. The neuron view shows:
* Query
* Key
* Query x Key (element-wise product)
* Query * Key (dot product)
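
For a single query token, the quantities listed above can be computed as follows. This is a minimal sketch on random vectors, not the code in the `neuron_view` module. The scaling by the square root of the head size and the softmax follow standard scaled dot-product attention and are included here as an assumption about how the scores are derived.

```python
import numpy as np

rng = np.random.default_rng(0)
n_toks, head_dim = 6, 8  # hypothetical sizes

query_vecs = rng.normal(size=(n_toks, head_dim))  # one sample, one head
key_vecs = rng.normal(size=(n_toks, head_dim))

selected = 2                    # index of the query token to inspect
q = query_vecs[selected]        # Query, shape (head_dim,)

elementwise = q * key_vecs      # Query x Key, shape (n_toks, head_dim)
dot = elementwise.sum(axis=-1)  # Query * Key, shape (n_toks,)

# Scaled dot-product attention: softmax over all keys for this query.
scaled = dot / np.sqrt(head_dim)
attention = np.exp(scaled - scaled.max())
attention /= attention.sum()
print(attention.round(3))
```

Summing the element-wise product over the head dimension gives the dot product, which is why the view can show how individual neurons contribute to a score.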


In [None]:
view = neuron_view.NeuronView(input_tokens, 
                              keys=keys, 
                              queries=queries, 
                              layers=layers, 
                              step=trial.steps()[0], 
                              n_tokens=n_tokens,
                              layer_names=layer_names)

In [None]:
interactive(view.select_query, query=np.arange(n_tokens))

In [None]:
interactive(view.select_layer, layer=layer_names.keys())

#### Note: Jupyter widgets in JupyterLab

If you encounter issues with this notebook in JupyterLab, you may have to install JupyterLab extensions. You can do this by defining a SageMaker [Lifecycle configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html). A lifecycle configuration is a shell script that runs when you create a notebook instance or each time you start one. You can create a lifecycle configuration directly in the SageMaker console (more details [here](https://aws.amazon.com/blogs/machine-learning/customize-your-amazon-sagemaker-notebook-instances-with-lifecycle-configurations-and-the-option-to-disable-internet-access/)). When selecting `Start notebook`, copy and paste the following code. Once the configuration is created, attach it to your notebook instance and start the instance.

```sh
#!/bin/bash

set -e

# OVERVIEW
# This script installs a single jupyter notebook extension package in SageMaker Notebook Instance
# For more details of the example extension, see https://github.com/jupyter-widgets/ipywidgets

sudo -u ec2-user -i <<'EOF'

# PARAMETERS
PIP_PACKAGE_NAME=ipywidgets
EXTENSION_NAME=widgetsnbextension

source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv

pip install $PIP_PACKAGE_NAME
jupyter nbextension enable $EXTENSION_NAME --py --sys-prefix
jupyter labextension install @jupyter-widgets/jupyterlab-manager
# run the command in background to avoid timeout 
nohup jupyter labextension install @bokeh/jupyter_bokeh &

source /home/ec2-user/anaconda3/bin/deactivate

EOF
```