# Visualizing BERT

This notebook uses the ```bertviz``` library to visualize the attention of weigths of BERT. For more information on the this great library visit the original repository [here](https://github.com/jessevig/bertviz).

In [1]:
!pip install bertviz

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertviz
  Downloading bertviz-1.4.0-py3-none-any.whl (157 kB)
[K     |████████████████████████████████| 157 kB 5.0 MB/s 
[?25hCollecting transformers>=2.0
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
[K     |████████████████████████████████| 4.4 MB 27.6 MB/s 
[?25hCollecting boto3
  Downloading boto3-1.24.27-py3-none-any.whl (132 kB)
[K     |████████████████████████████████| 132 kB 2.7 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 5.7 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 34.9 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hu

In [2]:
from bertviz import head_view, model_view
from transformers import DistilBertTokenizer, DistilBertModel, DistilBertForMaskedLM, FillMaskPipeline
import torch

In [10]:
# Load model and retrieve attention weights
#model_version = "yabramuvdi/distilbert-job-ads"
model_version = 'distilbert-base-uncased'
model = DistilBertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = DistilBertTokenizer.from_pretrained(model_version)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [26]:
sentence = "As a leading firm in the [MASK] sector, we hire highly skilled software engineers."
inputs = tokenizer.encode_plus(sentence, return_tensors="pt")
inputs

Schedule:
 *10 hour shift
 *8 hour shift
 Work remotely:
 *No


{'input_ids': tensor([[  101,  6134,  1024,  1008,  2184,  3178,  5670,  1008,  1022,  3178,
          5670,  2147, 19512,  1024,  1008,  2053,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [13]:
# get the tokenized version of the sentence
tokenized_sent = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tokenized_sent

['[CLS]',
 'as',
 'a',
 'leading',
 'firm',
 'in',
 'the',
 '[MASK]',
 'sector',
 ',',
 'we',
 'hire',
 'highly',
 'skilled',
 'software',
 'engineers',
 '.',
 '[SEP]']

In [18]:
# remove [SEP] and [CLS] tokens
tokens_clean = tokenized_sent[1:-1]
inputs_ids = inputs["input_ids"][0][1:-1]
attention_masks = inputs["attention_mask"][0][1:-1]
inputs_new = {"input_ids": torch.unsqueeze(inputs_ids, 0), "attention_mask": torch.unsqueeze(attention_masks, 0)}
inputs_new

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[ 2004,  1037,  2877,  3813,  1999,  1996,   103,  4753,  1010,  2057,
          10887,  3811, 10571,  4007,  6145,  1012]])}

In [19]:
# get results from the model (do not accumulate gradients)
with torch.no_grad():
    output = model(**inputs_new, output_attentions=True)

# Head View
<b>The head view visualizes attention in one or more heads from a single Transformer layer.</b> Each line shows the attention from one token (left) to another (right). Line weight reflects the attention value (ranges from 0 to 1), while line color identifies the attention head. When multiple heads are selected (indicated by the colored tiles at the top), the corresponding  visualizations are overlaid onto one another.  For a more detailed explanation of attention in Transformer models, please refer to the [blog](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1).

## Usage
👉 **Hover** over any **token** on the left/right side of the visualization to filter attention from/to that token. <br/>
👉 **Double-click** on any of the **colored tiles** at the top to filter to the corresponding attention head.<br/>
👉 **Single-click** on any of the **colored tiles** to toggle selection of the corresponding attention head. <br/>
👉 **Click** on the **Layer** drop-down to change the model layer (zero-indexed).


In [20]:
print(sentence)
head_view(output.attentions, tokens_clean)

As a leading firm in the [MASK] sector, we hire highly skilled software engineers.


<IPython.core.display.Javascript object>

# Model View
<b>The model view provides a birds-eye view of attention throughout the entire model</b>. Each cell shows the attention weights for a particular head, indexed by layer (row) and head (column).  The lines in each cell represent the attention from one token (left) to another (right), with line weight proportional to the attention value (ranges from 0 to 1).  For a more detailed explanation, please refer to the [blog](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1).

## Usage
👉 **Click** on any **cell** for a detailed view of attention for the associated attention head (or to unselect that cell). <br/>
👉 Then **hover** over any **token** on the left side of detail view to filter the attention from that token.

In [21]:
model_view(output.attentions, tokens_clean)

<IPython.core.display.Javascript object>

# Filling the mask

In [22]:
# create a pipeline to get the most probable words
model_mlm = DistilBertForMaskedLM.from_pretrained(model_version)

# define pipeline
original_unmasker = FillMaskPipeline(model=model_mlm,
                                     tokenizer=tokenizer,
                                     device=-1,
                                     top_k=5)

# get predictions
original_output = original_unmasker(sentence)

Downloading:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/251M [00:00<?, ?B/s]

In [23]:
# print results
print(tokenizer.convert_ids_to_tokens(tokenizer(sentence)["input_ids"]))
for result in original_output:
    print("=========")
    print(result["token_str"], f"({result['token']})", "-----", result["score"])


Results for ORIGINAL MODEL
['[CLS]', 'as', 'a', 'leading', 'firm', 'in', 'the', '[MASK]', 'sector', ',', 'we', 'hire', 'highly', 'skilled', 'software', 'engineers', '.', '[SEP]']
l i k e d (4669) ----- 0.20024436712265015
b o r n e o (15688) ----- 0.05529235303401947
# # f f e (16020) ----- 0.04908163473010063
l i o n s (7212) ----- 0.017644567415118217
e s t a t e (3776) ----- 0.011381986550986767


# Classifying job postings

In [None]:
#model_version = "yabramuvdi/distilbert-wfh"
model = DistilBertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = DistilBertTokenizer.from_pretrained(model_version)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
sentence = "Schedule:\n *10 hour shift\n *8 hour shift\n Work remotely:\n *No"
inputs = tokenizer.encode_plus(sentence, return_tensors="pt")
# get the tokenized version of the sentence
tokenized_sent = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

Schedule:
 *10 hour shift
 *8 hour shift
 Work remotely:
 *No


{'input_ids': tensor([[  101,  6134,  1024,  1008,  2184,  3178,  5670,  1008,  1022,  3178,
          5670,  2147, 19512,  1024,  1008,  2053,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
# remove [SEP] token
tokens_clean = tokenized_sent[0:-1]
inputs_ids = inputs["input_ids"][0][0:-1]
attention_masks = inputs["attention_mask"][0][0:-1]
inputs_new = {"input_ids": torch.unsqueeze(inputs_ids, 0), "attention_mask": torch.unsqueeze(attention_masks, 0)}
inputs_new

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[ 2004,  1037,  2877,  3813,  1999,  1996,   103,  4753,  1010,  2057,
          10887,  3811, 10571,  4007,  6145,  1012]])}

In [None]:
# get results from the model (do not accumulate gradients)
with torch.no_grad():
    output = model(**inputs_new, output_attentions=True)

In [None]:
print(sentence)
head_view(output.attentions, tokens_clean)

As a leading firm in the [MASK] sector, we hire highly skilled software engineers.


<IPython.core.display.Javascript object>