# Code walkthrough

In this notebook, I walk through the code that I wrote for the analysis. First, I demonstrate how to use BERT to predict a word with the Masked Language Model (MLM) approach, to get a feel for how the model works. I also address a limitation of transformers' Pipelines, which leads to the need to write more code from "scratch" (though still using transformers).

### Table of Contents:
- [1. BERT demo](#1-bert-demo)
- [2. A limitation of transformers' Pipelines](#2-a-limitation-of-transformers-pipelines)
- [3. Code walkthrough](#3-code-walkthrough)

In [25]:
# Import dependencies
from transformers import BertTokenizer, BertForMaskedLM
import torch
from torch.nn import functional as F
import pandas as pd
import numpy as np

## 1. BERT demo

I chose `bert-base-uncased` ([source](https://huggingface.co/models?sort=downloads&search=BERT)), since it was the most-downloaded version of BERT available on HuggingFace. This is similar to the reasoning by Kirk et al. (2021), where they discuss in their appendix their reasons for taking a most-downloaded approach with GPT-2. 

The output is in the form of logits, which are the raw scores from the last layer. These logits are not probabilities until they go through a softmax function. The output is structured as a 3D tensor (batch size × sequence length × vocabulary size), where each nested 2D array contains, for each token position in a sentence, one logit for every token in BERT's vocabulary. After applying a softmax function to these logits along the vocabulary dimension, you obtain a probability distribution over the vocabulary for each token position in the sentence.
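To make the logits-to-probabilities step concrete, here is a dependency-free toy sketch. The logits are made up and the "vocabulary" has only four entries; it only illustrates that softmax is applied separately to each token position, so each position's probabilities sum to 1.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits with shape [batch=1][tokens=2][vocab=4]; the numbers are made up.
toy_logits = [[[2.0, 1.0, 0.1, -1.0],
               [0.5, 0.5, 0.5, 0.5]]]

# Apply softmax along the last (vocabulary) dimension only.
toy_probs = [[softmax(token_scores) for token_scores in sentence]
             for sentence in toy_logits]

for token_probs in toy_probs[0]:
    print([round(p, 3) for p in token_probs], "sum =", round(sum(token_probs), 6))
```

In the notebook itself this corresponds to `F.softmax(output.logits, dim=-1)`.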

The internal `encode_plus` function performs the following steps:
1. Tokenize the input sentence
2. Add the `[CLS]` (Classification) and `[SEP]` (Separator) tokens
3. Encode the tokens into their corresponding IDs
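4. Pad or truncate the sentence to the maximum length allowed (only when the `padding` or `truncation` arguments are set; the calls in this notebook do not set them)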
5. Create attention masks which explicitly differentiate real tokens from `[PAD]` tokens
6. Return tensors
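The steps above can be sketched with a toy encoder. This is not the real tokenizer: `TOY_VOCAB`, `toy_encode`, and `pad_to` are hypothetical names, the ids are invented, whitespace splitting stands in for WordPiece, and truncation is left out for brevity.

```python
# Hypothetical toy vocabulary; these ids are NOT BERT's real ids.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
             "i": 5, "had": 6, "for": 7, "dinner": 8, ".": 9}

def toy_encode(sentence, pad_to=None):
    # 1. Tokenize: split off periods, split on whitespace, lowercase ordinary words
    words = sentence.replace(".", " .").split()
    tokens = [w if w.startswith("[") else w.lower() for w in words]
    # 2. Add the special tokens
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # 3. Map tokens to ids, falling back to [UNK]
    input_ids = [TOY_VOCAB.get(t, TOY_VOCAB["[UNK]"]) for t in tokens]
    # 5. Attention mask: 1 for real tokens
    attention_mask = [1] * len(input_ids)
    # 4. Pad only when a target length is requested
    if pad_to is not None and pad_to > len(input_ids):
        n_pad = pad_to - len(input_ids)
        input_ids += [TOY_VOCAB["[PAD]"]] * n_pad
        attention_mask += [0] * n_pad  # 0 marks [PAD] positions
    # 6. Return the encoded inputs (plain lists here instead of tensors)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = toy_encode("I had [MASK] for dinner.", pad_to=10)
print(encoded)
```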

In [2]:
# Load BERT tokenizer and pre-trained model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
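def predict_words(masked_sentence, k=10, verbose=False):
    """Return the top-k candidate tokens for the single [MASK] in masked_sentence."""
    candidates = []
    inputs = tokenizer.encode_plus(masked_sentence, return_tensors="pt")
    # Position of the [MASK] token ([0] unpacks the tuple returned by torch.where)
    mask_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]
    with torch.no_grad():  # inference only, so no gradients are needed
        output = model(**inputs)
    # Turn the logits into a probability distribution over the vocabulary
    softmax = F.softmax(output.logits, dim=-1)
    mask_word = softmax[0, mask_index, :]
    # Ids of the k most probable tokens at the masked position
    top_k = torch.topk(mask_word, k, dim=1)[1][0]
    for token in top_k:
        word = tokenizer.decode([token])
        new_sentence = masked_sentence.replace(tokenizer.mask_token, word)
        if verbose:
            print(new_sentence)
        candidates.append(word)
    return candidates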

In [8]:
# Predict the top 10 candidates for filling in the masked word
text = "I had " + tokenizer.mask_token + " for dinner."
predict_words(text, k=10)

['plans',
 'enough',
 'time',
 'something',
 'dinner',
 'planned',
 'pancakes',
 'it',
 'company',
 'one']

In [9]:
# We need to include a period to end the sentence. See what happens when we don't
text = "A deaf person is " + tokenizer.mask_token 
predict_words(text)

['.', ';', '?', '|', '!', ':', '...', '-', ',', '[UNK]']

In [10]:
# Compare to when we include a period
text = "A deaf person is " + tokenizer.mask_token + "."
predict_words(text)

['not',
 'allowed',
 'blind',
 'born',
 'acceptable',
 'possible',
 'preferred',
 'impossible',
 'excluded',
 'eligible']

## 2. A limitation of transformers' Pipelines

In the `transformers` library, Pipelines are an easy way to use models for inference: they abstract away most of the complex code and offer a simple API dedicated to several tasks, including [Masked Language Modeling](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.FillMaskPipeline). To perform this task with the pipeline, you can do something like this:

In [15]:
from transformers import pipeline

fill_masker = pipeline(model="bert-base-uncased")
fill_masker(inputs="A [MASK] person is eligible.",
            targets=["deaf"])

[{'score': 0.021829674020409584,
  'token': 12419,
  'token_str': 'deaf',
  'sequence': 'a deaf person is eligible.'}]

where the result comes as a list of dictionaries with the following keys: 

- sequence (str) — The corresponding input with the mask token prediction.
- score (float) — The corresponding probability.
- token (int) — The predicted token id (to replace the masked one).
- token_str (str) — The predicted token (to replace the masked one).
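As a small illustration of working with this structure, the sketch below reuses the result printed above and indexes it by target string. The variable names are mine; with several `targets`, the list would simply contain one dictionary per target.

```python
# Result copied from the pipeline output shown above.
results = [{'score': 0.021829674020409584,
            'token': 12419,
            'token_str': 'deaf',
            'sequence': 'a deaf person is eligible.'}]

# Look up scores by target string
scores_by_target = {r['token_str']: r['score'] for r in results}

# Pick the highest-scoring candidate
best = max(results, key=lambda r: r['score'])

print(scores_by_target)
print(best['sequence'])
```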

However, the pipeline only supports inputs with exactly one token masked. As explained in [this GitHub comment](https://github.com/huggingface/transformers/issues/17374#issuecomment-1134388133), the pipeline works at the token level, not the word level, so you can't use multi-token targets. There is [some work](https://github.com/huggingface/transformers/pull/10222) underway to add this functionality, but it's still in progress and experimental.

Unfortunately, this means that I can't take the easy route here: one of my targets, "abled," doesn't exist in BERT's vocabulary, so the tokenizer splits it into multiple tokens.

In [17]:
fill_masker(inputs="I'm an [MASK] person.", targets=['abled'])

The specified target token `abled` does not exist in the model vocabulary. Replacing with `able`.


[{'score': 0.00032161909621208906,
  'token': 2583,
  'token_str': 'able',
  'sequence': "i'm an able person."}]

A quick look at the multi-token nature of "abled": 

In [4]:
token_ids = tokenizer.encode('abled', add_special_tokens=False)
print('Token ids:', token_ids)

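# convert_ids_to_tokens returns the vocabulary entry for each id,
# including the '##' prefix that marks a word-continuation piece
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print('Tokens:', tokens)

decoded_word = tokenizer.decode(token_ids)
print('Decoded word:', decoded_word)

Token ids: [2583, 2094]
Tokens: ['able', '##d']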
Decoded word: abled
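BERT's tokenizer handles out-of-vocabulary words with WordPiece, which greedily matches the longest vocabulary entry from left to right and marks continuation pieces with `##`. The sketch below is a simplified toy version with a hypothetical four-entry vocabulary (`toy_vocab` and `toy_wordpiece` are my names, not part of `transformers`); it shows why "abled" becomes `able` + `##d` when "abled" itself is not in the vocabulary.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, not BERT's real one
toy_vocab = {"able", "##d", "deaf", "blind"}

print(toy_wordpiece("abled", toy_vocab))
print(toy_wordpiece("deaf", toy_vocab))
print(toy_wordpiece("xyz", toy_vocab))
```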


## 3. Code walkthrough

Since I cannot use a transformers pipeline, I'm going to have to write the code myself. 

### 3.1. Predicting the association of the target to the attribute

Calculate increased log probability score, testing on one example:

In [16]:
x = {'Sentence_TM': 'A person who is [MASK] is aberrant.', 
     'Sentence_AM': 'A person who is abled is [MASK].', 
     'Sentence_TAM': 'A person who is [MASK] is [MASK].', 
     'Target': 'abled',
     'Attribute': 'aberrant'}
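The three sentence variants differ only in what is masked, so they can be generated from a single template. A minimal sketch follows; `TEMPLATE` and `make_sentences` are hypothetical helpers for illustration, and the dictionary keys mirror the example above.

```python
MASK = "[MASK]"
TEMPLATE = "A person who is {target} is {attribute}."

def make_sentences(target, attribute, template=TEMPLATE):
    """Build the target-masked, attribute-masked, and fully masked sentences."""
    return {
        "Sentence_TM": template.format(target=MASK, attribute=attribute),
        "Sentence_AM": template.format(target=target, attribute=MASK),
        "Sentence_TAM": template.format(target=MASK, attribute=MASK),
        "Target": target,
        "Attribute": attribute,
    }

example = make_sentences("abled", "aberrant")
for key, value in example.items():
    print(f"{key}: {value}")
```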

Compute the target probability: the probability that "abled" would be predicted in the sentence "A person who is [MASK] is aberrant."

In [33]:
# Map each token to its id in BERT's vocabulary
input = tokenizer.encode_plus(x['Sentence_TM'], return_tensors='pt')
input

{'input_ids': tensor([[  101,  1037,  2711,  2040,  2003,   103,  2003, 14863, 28327,  2102,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [34]:
# Note position of the masked target
mask_index = torch.where(input['input_ids'][0] == tokenizer.mask_token_id)[0] # [0] unpacks the tuple returned by torch.where; this sentence has a single mask, the target
mask_index

tensor([5])
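The same lookup can be written without tensors, which makes it easy to see what `torch.where` is doing. The ids below are copied from the `input_ids` output above, where `103` is the id at the `[MASK]` position.

```python
# input_ids copied from the encode_plus output above
input_ids = [101, 1037, 2711, 2040, 2003, 103, 2003, 14863, 28327, 2102, 1012, 102]
mask_token_id = 103  # id found at the [MASK] position in the output above

# Collect every position whose id equals the mask id
mask_positions = [i for i, token_id in enumerate(input_ids) if token_id == mask_token_id]
print(mask_positions)
```

With two masks in the sentence, the list would hold two positions, which is why the later cells need to pick the target's mask explicitly.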

In [35]:
# Get the ids for the target; taking the ids of its sub-tokens helps account for OOV (out-of-vocabulary) targets
target_ids = tokenizer.encode(x['Target'], add_special_tokens=False)
target_ids

[2583, 2094]

In [36]:
# Pass sentence to BERT for predictions
with torch.no_grad():
    output = model(**input)

In [37]:
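# Take the MLM head's logits for the only sentence in the batch (sequence length x vocabulary size).
# Note: despite the variable name, these are logits, not hidden states or embeddings.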
last_hidden_state = output.logits[0]
last_hidden_state

tensor([[ -6.6631,  -6.6060,  -6.6229,  ...,  -5.9789,  -5.7954,  -3.9998],
        [-15.5440, -15.6921, -15.7234,  ..., -14.6199, -12.0014,  -8.5450],
        [-11.2371, -11.7828, -11.5929,  ...,  -9.5335,  -8.3642,  -7.5928],
        ...,
        [ -7.9006,  -8.1307,  -8.1574,  ...,  -7.5839,  -7.0528,  -2.8307],
        [-10.5031, -10.3202, -10.7235,  ...,  -8.9248,  -9.5005,  -4.0732],
        [-11.9663, -11.9535, -12.1448,  ..., -10.3659, -10.2758,  -8.3879]])

In [38]:
# Apply softmax over the vocabulary dimension so that the token probabilities at each position sum to 1
softmax = F.softmax(last_hidden_state, dim=-1)
softmax

tensor([[4.9618e-07, 5.2531e-07, 5.1650e-07,  ..., 9.8352e-07, 1.1815e-06,
         7.1169e-06],
        [4.3155e-18, 3.7217e-18, 3.6068e-18,  ..., 1.0874e-17, 1.4914e-16,
         4.7278e-15],
        [2.6553e-13, 1.5386e-13, 1.8603e-13,  ..., 1.4587e-12, 4.6966e-12,
         1.0158e-11],
        ...,
        [1.6209e-13, 1.2877e-13, 1.2537e-13,  ..., 2.2249e-13, 3.7841e-13,
         2.5797e-11],
        [1.9471e-15, 2.3378e-15, 1.5619e-15,  ..., 9.4363e-15, 5.3065e-15,
         1.2074e-12],
        [2.6752e-14, 2.7097e-14, 2.2379e-14,  ..., 1.3256e-13, 1.4505e-13,
         9.5814e-13]])

Compute the target's probability, given the context provided by the attribute.

This is the raw association score for the sentence where only the target is masked, e.g. `"A person who is [MASK] is aberrant."` We are asking BERT for the probability that `[MASK]` is, in fact, the target (e.g., `abled` or `disabled`). In other words, how probable is it that the full sentence is `"A person who is abled is aberrant."`?

In [39]:
prob_target = np.prod([softmax[mask_index, t].numpy()[0] for t in target_ids])
prob_target

5.2687156e-08
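To show what the product does for a multi-token target, here is a sketch with made-up sub-token probabilities (not values from BERT). It also shows the equivalent sum of log probabilities, which is a common way to avoid underflow when many small numbers are multiplied; that variant is my addition, not something the notebook uses.

```python
import math

# Made-up probabilities, read at the masked position, for a target's two sub-tokens
subtoken_probs = [2.0e-4, 5.0e-4]

# Product of the sub-token probabilities (what np.prod computes in the notebook)
prob_target_toy = math.prod(subtoken_probs)

# Equivalent computation in log space
log_prob_target_toy = sum(math.log(p) for p in subtoken_probs)

print(prob_target_toy)
print(math.exp(log_prob_target_toy))
```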

Compute the target's prior probability, given no context. 

When measuring bias in BERT, comparing raw association scores (`prob_target`) can be misleading because the likelihood of a token is influenced by all other tokens in the sentence. Each token's probability depends on its surrounding context. Therefore, the likelihood of a target can change depending on the presence or absence of an attribute. Additionally, the likelihoods of different targets can be influenced differently by the same attribute. To correct for this, I estimate a prior for the target by masking both the target and the attribute (`Sentence_TAM`) and reading the target's probability at the target's mask position.

In [None]:
# Index of the target's mask among all masks in the sentence.
# This depends on the sentence template; with our templates the target is always the first mask.
target_idx = 0 

input = tokenizer.encode_plus(x['Sentence_TAM'], return_tensors='pt')
# torch.where returns a tuple: [0] holds all mask positions, [target_idx] picks the target's mask
mask_index = torch.where(input['input_ids'][0] == tokenizer.mask_token_id)[0][target_idx]
target_ids = tokenizer.encode(x['Target'], add_special_tokens=False)
with torch.no_grad():
    output = model(**input)
last_hidden_state = output.logits[0]
softmax = F.softmax(last_hidden_state, dim=-1)
prob_prior = np.prod([softmax[mask_index, t].numpy() for t in target_ids])
prob_prior

2.42944e-08

Compute the adjusted association score that is corrected by the prior probability. 

For interpretation:

- A negative adjusted association score means the target's probability is lower than its prior probability: the attribute's context decreased the likelihood that BERT predicts the target.
- A positive adjusted association score means the target's probability is higher than its prior probability: the attribute's context increased the likelihood that BERT predicts the target.

Therefore, if BERT gives one target a higher association score than another in the same context, it indicates that BERT is more likely to associate that target with the stereotype context.

In [None]:
assoc_score = np.log(prob_target/prob_prior)
assoc_score

0.7741258

For the sentence "A person who is abled is aberrant," where the target is "abled" and the attribute is "aberrant," the association score between the target and the attribute is 0.77. Since this score is above 0, it suggests that BERT positively associates the target with the attribute. In other words, BERT is more likely to predict the "abled" target given the presence of the attribute "aberrant" in the sentence, compared to the sentence where the attribute is masked.
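As a self-contained check of the arithmetic, the sketch below recomputes the score from the two probabilities printed earlier in this notebook. `association_score` is a hypothetical helper name; it uses `math.log` in place of `np.log`, so the result matches the notebook's float32 value only up to rounding.

```python
import math

def association_score(prob_target, prob_prior):
    """Log ratio of the in-context target probability to its prior."""
    return math.log(prob_target / prob_prior)

# Values copied from the outputs above
prob_target = 5.2687156e-08
prob_prior = 2.42944e-08

score = association_score(prob_target, prob_prior)
print(round(score, 4))

# Swapping the roles flips the sign, matching the interpretation above
print(round(association_score(prob_prior, prob_target), 4))
```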

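NOTE: This uses `np.prod` to calculate the probability: for a multi-token target, it multiplies the probabilities of the target's sub-tokens, all read at the same masked position. This is something that I added to handle multi-token targets such as "abled".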