In [1]:
from pathlib import Path

from traitlets.config.manager import BaseJSONConfigManager

# Stores the RISE settings in ~/.jupyter/nbconfig/rise.json
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
cm.update(
    "rise",
    {
        "theme": None,       # None removes any previously stored value
        "transition": None,  # None removes any previously stored value
        "start_slideshow_at": "selected",
        "leap_motion": {
            "naturalSwipe"  : True,      # Invert swipe gestures
            "pointerOpacity": 0.5,       # Set pointer opacity to 0.5
            "pointerColor"  : "#d80000"  # Red pointer
        },
        "header": "<h3>Francisco Perez-Sorrosal</h3>",
        "footer": "<h3>Machine Learning/Deep Learning</h3>",
        "scroll": True,
        "enable_chalkboard": True
    }
)

{'start_slideshow_at': 'selected',
 'leap_motion': {'naturalSwipe': True,
  'pointerOpacity': 0.5,
  'pointerColor': '#d80000'},
 'header': '<h3>Francisco Perez-Sorrosal</h3>',
 'footer': '<h3>Machine Learning/Deep Learning</h3>',
 'scroll': True,
 'enable_chalkboard': True}

In [None]:
%pip install emoji --upgrade

In [None]:
import emoji
print(emoji.emojize('Presenting stuff is easy!!! :thumbs_up:'))

In [None]:
# Emojis http://getemoji.com/

# What Does BERT Look At?
### An Analysis of BERT’s Attention

## Kevin Clark, Urvashi Khandelwal, Omer Levy and Christopher D. Manning

# + Jesse Vig's Blog Posts on BERT Deconstruction

### [Clark et al. (PDF)](https://arxiv.org/pdf/1906.04341.pdf)

### [Deconstructing BERT 1 (Dec 2018)](https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77)
### [Deconstructing BERT 2 (Jan 2019)](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1)


---

Francisco Perez-Sorrosal | 16 Mar 2021


# Summary (What the authors are selling)

💡 Pre-trained language models achieve high accuracy on supervised tasks, but it is not well understood why.

💡 Propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT.

💡 Claim that attention heads:

* Attend to delimiter tokens
* Attend to fixed positional offsets
* Attend broadly over the whole sentence
* Behave similarly when they are in the same layer
* In some cases, correspond to linguistic notions (e.g. syntax and coreference)

💡 Propose an attention-based probing classifier (not analyzed here)

💡 Use it to demonstrate that substantial syntactic information is captured in BERT’s attention

# Hypothesis

## Pre-training makes the model learn about the structure of the language

* What does it learn?


# Transformers and BERT

* Usual stuff on transformers:
 - When producing the representation of the current token:
   - Attention weights govern how “important” every other token is for this one
* Special tokens:
 - Authors claim that they play a special role in the attention mechanism
 - [CLS]: Added at the beginning of the text
 - [SEP]:
   - Marks sentence boundaries
   - Added at the end or...
   - If input has multiple separate texts (e.g. question and context), [SEP] tokens are also used to separate them
   - Additionally, BERT incorporates sentence-level (segment) embeddings (`token_type_ids` in Hugging Face Transformers), added at the input layer
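
The input layout described above can be sketched as follows. This is a toy illustration only: it splits on whitespace instead of using WordPiece, and `build_bert_input` is a hypothetical helper, not a Hugging Face function.

```python
def build_bert_input(text_a, text_b=None):
    """Toy BERT input layout: [CLS] A [SEP] (B [SEP]) plus segment ids.

    Assumption: whitespace tokenization stands in for WordPiece.
    """
    tokens = ["[CLS]"] + text_a.split() + ["[SEP]"]
    token_type_ids = [0] * len(tokens)          # segment 0: [CLS], text A, first [SEP]
    if text_b is not None:
        tokens_b = text_b.split() + ["[SEP]"]
        tokens += tokens_b
        token_type_ids += [1] * len(tokens_b)   # segment 1: text B, final [SEP]
    return tokens, token_type_ids


tokens, token_type_ids = build_bert_input("who wrote it", "the authors wrote it")
for tok, seg in zip(tokens, token_type_ids):
    print(seg, tok)
```

Each token gets a segment id, which is what the sentence-level embedding is looked up from.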

# BERT Surface-Level Attention Heads Study (Clark et al.)

* Wikipedia segments of at most 128 tokens, each made of two consecutive paragraphs
* [CLS]\<paragraph-1\>[SEP]\<paragraph-2\>[SEP]

## Relative Position

 - Most heads put little attention on the current token
 - **Early layers**: Some heads specialize in attending to the next/previous token
    - Previous: heads in layers 2, 4, 7 and 8
    - Next: heads in layers 1, 2, 3 and 6
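
A minimal sketch of how such a positional preference could be measured for one head, assuming an attention map `attn` where `attn[i][j]` is the attention from token `i` to token `j`. The numbers are made up and `offset_attention` is a hypothetical helper, not the authors' code.

```python
def offset_attention(attn, offset):
    """Average attention that position i puts on position i + offset."""
    n = len(attn)
    weights = [attn[i][i + offset] for i in range(n) if 0 <= i + offset < n]
    return sum(weights) / len(weights)


# Made-up 4-token attention map for a "next token" head (rows sum to 1).
attn = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.4, 0.2, 0.2, 0.2],
]
print("current :", offset_attention(attn, 0))
print("next    :", offset_attention(attn, +1))
print("previous:", offset_attention(attn, -1))
```

A head would count as a next-token head when the `+1` average clearly dominates the others.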
 
## Separator Tokens

 - Early layers: Attend to [CLS]
 - Middle layers (5-10): Attend to [SEP]
 - Later layers: Attend to commas and periods

 - More than half of the attention goes to the special tokens
  - Especially the delimiter token [SEP]
  - When the current token is [SEP], heads attend even more to [SEP]
    
![SEP Attention](images/sep_attn.png)
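
As a rough sketch of the measurement behind this kind of plot, the share of attention going to a given token can be averaged over all query positions. The attention map below is made up, and `attention_to_token` is a hypothetical helper rather than the paper's implementation.

```python
def attention_to_token(tokens, attn, target):
    """Average share of attention that each query position sends to `target`."""
    target_idx = [j for j, tok in enumerate(tokens) if tok == target]
    per_query = [sum(row[j] for j in target_idx) for row in attn]
    return sum(per_query) / len(per_query)


tokens = ["[CLS]", "a", "[SEP]", "b", "[SEP]"]
# Made-up attention map for one head; attn[i][j] = attention from i to j.
attn = [
    [0.2, 0.1, 0.3, 0.1, 0.3],
    [0.1, 0.1, 0.4, 0.1, 0.3],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.1, 0.1, 0.3, 0.1, 0.4],
    [0.0, 0.0, 0.5, 0.0, 0.5],
]
print("to [SEP]:", attention_to_token(tokens, attn, "[SEP]"))
print("to [CLS]:", attention_to_token(tokens, attn, "[CLS]"))
```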
 
### Analysis
 - [CLS] and [SEP] are never masked out
 - In their setup, [SEP] makes up only about 1/64 of the tokens (2 per 128-token segment)
 - Periods and commas are the most common tokens (after "the")

#### Hypothesis: [SEP] is used to aggregate segment-level information
 - Then it can be further read by other layers
 - Not very likely, because:
  - Attention heads processing [SEP] are not observed to attend broadly over the whole segment
 - Instead, attention heads processing [SEP] attend to themselves and the other [SEP]
    
#### Authors claim [SEP] is used by the model as a sort of no-op
 - Why? Non-nouns mostly attend to [SEP] (See figure below)
![Head 8-10](images/head8-10.png)
 - How? By measuring how much changing the attention to a token would change BERT's output (gradient-based feature importance)
  - Starting in layer 5, where attention to [SEP] is high, the gradients of this attention are small
  - This means that attending more or less to [SEP] does not substantially change BERT's output
![SEP attention gradient](images/sep_attn_grad.png)    

## Broad vs Local Attention
- How? Compute the average entropy of every head's attention distribution
 - Lower layers:
   - Some attention heads have very broad attention
   - They put at most 10% of their attention mass on any single word
   - Output: Roughly a BoW representation of the sentence
 - Last layer:
   - Broad attention/high entropy from [CLS] (3.89 nats on average)
   - [CLS] attends broadly to aggregate the whole input in the last layer
![Entropy Attention Distribution](images/entropy_attn_dist.png)
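
A small sketch of the entropy measure, assuming natural-log entropy (nats) over a single token's attention distribution. The two distributions are made up to contrast a broad head with a local one.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


n = 128                                            # segment length used in the study
broad = [1 / n] * n                                # attention spread evenly
local = [0.97] + [0.03 / (n - 1)] * (n - 1)        # almost all mass on one token

print("broad:", attention_entropy(broad))          # upper bound is log(128)
print("local:", attention_entropy(local))
```

For a 128-token segment the maximum possible entropy is ln(128) ≈ 4.85 nats, which gives a scale for reading the 3.89 value reported for [CLS].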

# 6 BERT Patterns (J. Vig)

Author disclaimer:

- Visualizations can act as a kind of Rorschach test
  - That is, they may lead to subjective interpretations
- He does not discuss linguistic patterns
  - synonymy, coreference, etc.

## Pattern 1/2: Attention to next word (1) / previous word (2)

- Clark et al. reach the same conclusion in the Relative Position section
- Loosely related to a sequential RNN (next=backward/prev=forward)
- Appears over multiple layers of the model (although he doesn't specify which ones)
- [SEP] apparently disrupts the next-word attention pattern
  - Attention from [SEP] is directed to [CLS]
  
![Next Token Attention](images/next_token_attn.png) 


## Pattern 3/4: Attention to identical/related words (3) / in the other sentence (4)

- This attention also seems to cross segments (see figures below)
- This can also be seen in Figure 5 of Clark et al. for heads 7-6, 9-6 and 5-4
- It doesn't seem to happen for the special tokens [CLS]/[SEP] (see below), although it appears faintly in the Clark et al. figures mentioned above

#### Pattern 3
![Token Self Attention](images/token_self_attn.png) 

#### Pattern 4

![Cross Segment Attention](images/cross_segments.png) 

## Pattern 5: Attention to other words predictive of word

- Related to the WordPiece tokenization used by BERT
- Attention from “straw” is directed to “##berries”, and most of the attention from “##berries” is focused on “straw”
- Seems to reinforce the idea that those two tokens are tightly tied together
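
To see where “straw” and “##berries” come from, here is a toy sketch of greedy longest-match-first WordPiece splitting. The vocabulary is a made-up four-entry set, not BERT's real one, and `wordpiece` is an illustrative helper.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate    # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                            # try a shorter candidate
        if piece is None:
            return [unk]                        # no piece matches: unknown word
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"straw", "##berries", "##berry", "berries"}   # hypothetical vocabulary
print(wordpiece("strawberries", toy_vocab))
```

Since neither piece is a full word on its own, each is highly predictive of the other, which matches the mutual attention in this pattern.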

## Pattern 6: Attention to delimiter tokens

- Clark et al. reach the same conclusion in the Separator Tokens section (Chicken or Egg)
- Does not go into detail about [CLS]/[SEP] differences
- Cites Clark and mentions that [SEP] is used as a “no-op”: an attention head focuses on the [SEP] tokens when it can’t find anything meaningful in the input sentence to focus on

![SEP Attention](images/sep_attn_vig.png)


# BERT Patterns from 2nd Blog Post (J. Vig)


## Delimiter-focused attention patterns (Pattern 6)

- How does BERT fixate on the [SEP] tokens?

![SEP Attention](images/sep_attn_vig2.png)

- Key vectors for the two [SEP] tokens have a distinctive signature
  - A small number of active neurons with strongly positive (blue) or negative (orange) values
  - A larger number of neurons with values close to zero (light blue/orange or white)
- Only a few neurons are required because the pattern is relatively simple
  - There is also little variation in the words that receive attention

![SEP Attention Neurons](images/sep_attn_vig2_neuron.png)

- The elementwise product of the query and [SEP] key vectors produces high values at the active neurons
  - Query vectors are assigned values that match the [SEP] key vectors at these positions
- Can we talk about [SEP]-matching neurons?
  
![Elemwise Product QxK](images/elemwise_prod.png)
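
The attention score is the scaled dot product of query and key, i.e. the sum of the elementwise product shown in the figure. A minimal sketch with made-up 4-neuron vectors (real heads in BERT-base use 64 dimensions; `qk_attention` is a hypothetical helper):

```python
import math


def qk_attention(query, keys):
    """Return per-key elementwise products and softmax attention weights."""
    scale = math.sqrt(len(query))
    products = [[q * k for q, k in zip(query, key)] for key in keys]
    scores = [sum(p) / scale for p in products]       # scaled dot products
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]        # numerically stable softmax
    total = sum(exps)
    return products, [e / total for e in exps]


# Made-up vectors: two strongly active neurons, the rest near zero.
query = [2.0, -2.0, 0.1, 0.0]
keys = [
    [3.0, -3.0, 0.0, 0.1],    # hypothetical [SEP] key
    [0.5, 0.4, -0.3, 0.2],    # hypothetical key of an ordinary word
]
products, weights = qk_attention(query, keys)
print(products[0])    # large positive values where query and [SEP] key agree
print(weights)        # nearly all attention lands on the [SEP] key
```

Two matching neurons are enough to dominate the softmax, which fits the point that this pattern needs only a few active neurons.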

## Bag of Words attention pattern

- Not discussed in Part 1
- Meaning: Attention is divided fairly evenly across all words in the same sentence
  - This is what Clark et al. discuss under Broad vs Local Attention
  - BERT is essentially computing a BoW embedding: an (almost) unweighted average of the word embeddings in the same sentence

![BoW Neuron](images/bow_neuron.png)
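
A short sketch of why even attention amounts to a bag of words: with uniform weights over the same-sentence tokens, the attention output is just the mean of their vectors. The vectors are made up, and in the real model they would be value projections rather than raw word embeddings.

```python
def attend(weights, values):
    """Attention output: weighted sum of the value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Three tokens in the current sentence plus one token from the other sentence.
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
weights = [1 / 3, 1 / 3, 1 / 3, 0.0]      # uniform inside the sentence, zero outside

output = attend(weights, values)
mean = [sum(v[d] for v in values[:3]) / 3 for d in range(2)]
print(output, mean)                        # both are the same-sentence average
```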

- Elementwise multiplication
  - q, k in the same sentence: the product shows high values (q, k have the same signs)
  - q, k in different sentences: the product shows negative values (q, k have opposite signs)
  
![BoW Neuron](images/bow_neuron.png)

- only a few neurons are required because the pattern is relatively simple
  - also little variation in the words that receive attention

## Next-word attention patterns (Pattern 1)

- Focusing on the next/previous word makes sense
  - Adjacent words are relevant for understanding the current word's meaning (at least in English)
  - n-gram LMs are based on this

- The pattern requires a high number of neurons to compute the attention score
  - Why? To keep track of which of the 128 (512) positions receives attention from a given position

![Next Token Attention Neuron](images/next_token_attn_neuron.png) 

- The elementwise product of the query vector q2 and the key vector k3 (the next word) is strongly positive across most neurons
  - See the strong blue patterns in the figure below
- For non-next tokens, the QxK product has a mix of positive and negative values
  
![Next Token Attention QxK](images/next_token_attn_elementwise.png) 

# Take Aways

💡 Current analyses of models focus on probing the vector representations they output, BUT...

 - Attention maps should be studied too, as they hold linguistic knowledge
 - These tools should be part of the NLP researcher's toolkit

# Reflections/Open Questions

- Pretty interesting paper/blog entries related to interpretability

- How do unused special tokens affect the input?
  - That is, can they learn structure or embed knowledge depending on the target task?

- Related:
 
 - [Jesse Vig](https://jessevig.com/)
 - [BERT Viz (Jesse Vig)](https://github.com/jessevig/bertviz)
 - [How does BERT’s attention change when you fine-tune? An analysis methodology and a case study in negation scope](https://www.aclweb.org/anthology/2020.acl-main.429.pdf) ACL 2020
 
 


# PDF Paper

In [2]:
from IPython.display import IFrame
IFrame("./1906.04341.pdf", width=1500, height=1200)