# Probability-based metrics - Masked tokens

In this notebook we explore the outputs of LLMs at the word level and use probability-based metrics to assess bias in those outputs.

At a surface level, masked tokens are gaps in an input sentence, for example:

"The UK is known as a [MASK] nation"

We want to find what words, according to the model, are most likely to appear in the [MASK] position. 

We can probe the model's bias by constructing sentence pairs that may lead the model to predict biased words. For example, one sentence may be "[MASK] is a programmer" and its counterpart "[MASK] is a nurse". If the most probable words for the first sentence are male-oriented and those for the second sentence are female-oriented, we could conclude that our model contains some form of bias.

The first method of quantifying the bias is the log probability bias score (LPBS) outlined by [Kurita et al.](https://arxiv.org/pdf/1906.07337).

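A token's probability $p_a$, based on the template "[MASK] is a [NEUTRAL ATTRIBUTE]", is normalised by the prior probability $p_\text{prior}$, based on the template "[MASK] is a [MASK]". The score is the difference between the normalised log probabilities of two tokens $i$ and $j$ (for example "he" and "she"):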
$$
\text{LPBS}(S) = \log \frac{p_{a_i}}{p_{\text{prior}_i}} - \log \frac{p_{a_j}}{p_{\text{prior}_j}}
$$
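
To make the arithmetic of the formula concrete, here is a minimal sketch that evaluates LPBS for a single pair of tokens. The function name `lpbs` and the four probabilities are hypothetical, chosen only to illustrate the calculation; they are not model outputs.

```python
import math


def lpbs(p_a_i: float, p_prior_i: float, p_a_j: float, p_prior_j: float) -> float:
    """Log probability bias score for two target tokens i and j."""
    return math.log(p_a_i / p_prior_i) - math.log(p_a_j / p_prior_j)


# Hypothetical probabilities (not produced by any model).
score = lpbs(p_a_i=0.30, p_prior_i=0.20, p_a_j=0.10, p_prior_j=0.20)
print(round(score, 4))  # log(1.5) - log(0.5) = log(3), about 1.0986
```

A score of zero means both tokens are boosted (or suppressed) by the attribute to the same degree relative to their priors; a positive score means token $i$ is boosted more than token $j$.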



Import model

In [1]:
from transformers import BertTokenizer, BertForMaskedLM
import torch
import torch.nn.functional as F

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Define our sentence and tokenisation

In [2]:
# Sentence
text = "The [MASK] is a sovereign nation"

# Tokenising
input_ids = tokenizer.encode(text, return_tensors='pt')
masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1].item()
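
The cell above assumes the sentence contains exactly one [MASK]: calling `.item()` on a tensor with more than one element raises an error. A template such as "[MASK] is a [MASK]" has two masks, so every position needs to be collected instead. The sketch below shows the idea on plain Python lists; the helper name `find_mask_positions` and the token ids are hypothetical stand-ins, not the real BERT vocabulary ids.

```python
from typing import List


def find_mask_positions(token_ids: List[int], mask_token_id: int) -> List[int]:
    """Return the index of every position that holds the mask token id."""
    return [pos for pos, tok in enumerate(token_ids) if tok == mask_token_id]


# Hypothetical token ids: 0 stands in for [CLS], 1 for [SEP], 9 for [MASK].
MASK_ID = 9
one_mask = [0, 5, 9, 6, 7, 1]
two_masks = [0, 9, 6, 7, 9, 1]

print(find_mask_positions(one_mask, MASK_ID))   # [2]
print(find_mask_positions(two_masks, MASK_ID))  # [1, 4]
```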


A logit is a raw, unnormalised score output by the model. Applying the softmax function to the logits at the masked position turns them into a probability distribution over the vocabulary, so each value can be read as the probability the model assigns to that token appearing in the [MASK] position.
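
As a small illustration of how softmax turns logits into probabilities, the sketch below applies it by hand to three made-up logits. The `softmax` helper and the `toy_logits` values are hypothetical; in the notebook itself the same step is done with `torch.softmax` over the whole vocabulary.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    # Subtracting the maximum does not change the result but avoids overflow.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]
toy_probs = softmax(toy_logits)
print([round(p, 3) for p in toy_probs])  # [0.659, 0.242, 0.099]
```

The outputs are all positive and sum to 1, and the ordering of the logits is preserved.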


In [3]:
# Logits
with torch.no_grad():
    outputs = model(input_ids)

logits = outputs.logits

# Logits for masked token
masked_logits = logits[0, masked_index, :]

# Apply softmax to get probabilities
probs = torch.softmax(masked_logits, dim=-1)

Next we look up the probability of a particular word appearing in the [MASK] position. For the sentence above, the code below outputs the probability that the masked word is "uk", which would give the sentence "The uk is a sovereign nation".

In [4]:
word = "uk"
word_id = tokenizer.convert_tokens_to_ids(word)
# Reuse the softmax probabilities (probs) computed in the previous cell
word_prob = probs[word_id].item()

print(word_prob)

0.04344436898827553


The top 5 words and their probabilities are shown below. The most likely prediction is "philippines" with a probability of 18.85%, while "uk" comes fifth with a probability of 4.34%, matching the value above.

In [5]:
# Predictions for top 5 words
top_probs, top_indices = torch.topk(probs, 5)

for i, (index, prob) in enumerate(zip(top_indices, top_probs)):
    # Map the vocabulary id back to its token string
    predicted_token = tokenizer.convert_ids_to_tokens(index.item())
    print(f"Prediction {i+1}: {predicted_token} (probability: {prob.item():.4f})")


Prediction 1: philippines (probability: 0.1885)
Prediction 2: country (probability: 0.1863)
Prediction 3: maldives (probability: 0.0828)
Prediction 4: netherlands (probability: 0.0473)
Prediction 5: uk (probability: 0.0434)


In [6]:
def get_probs(input_sentence: str, words):
    """Return the probability (as a percentage) of each word filling the single [MASK]."""
    input_ids = tokenizer.encode(input_sentence, return_tensors='pt')
    # Position of the [MASK] token; .item() assumes the sentence has exactly one mask
    masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1].item()

    # Logits at the masked position
    with torch.no_grad():
        outputs = model(input_ids)
    masked_logits = outputs.logits[0, masked_index, :]

    # Apply softmax once over the vocabulary, then look up each word
    vocab_probs = F.softmax(masked_logits, dim=-1)
    probabilities = {}
    for word in words:
        # Words that are not in the vocabulary map to the [UNK] token id
        word_id = tokenizer.convert_tokens_to_ids(word)
        probabilities[word] = vocab_probs[word_id].item() * 100  # Convert to percentage

    return probabilities

In [7]:
def subtract_probs(input_sentence: str, words):
    """Return the probability of words[0] minus that of words[1], in percentage points."""
    probs = get_probs(input_sentence, words)
    # Convert the dictionary values to a list
    values = list(probs.values())
    # Subtract the second value from the first
    result = values[0] - values[1]

    return result

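Simply subtracting the probabilities gives a first measure: a positive number means the first word in the list is more probable in the [MASK] position than the second, which indicates that the masked token is stereotypically associated with the target attribute. The result is in percentage points, since `get_probs` returns percentages.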

In [8]:
input_sentence = '[MASK] is a carer'
words = ['she', 'he']

subtract_probs(input_sentence, words)


15.167048573493958

In [9]:
# Templates considered:
#   [MASK] is a doctor
#   He is a [MASK]
#   She is a [MASK]
#   [MASK] is a [MASK]

# 1) "[MASK] is a doctor" -> probability of [MASK] being each of ['he', 'she']

# First get the probability of "<subject> is a [MASK]" -> doctor:
# then the probability for "he is a [MASK]",
# then the probability for "she is a [MASK]"

# 2) "['he', 'she'] is a [MASK]" -> probability of [MASK] being 'doctor' when the subject is 'he' or 'she'

In [10]:
# Probability (as a percentage) of 'he' filling the mask in "[MASK] is a doctor"
input_sentence = '[MASK] is a doctor'
words = ['he']

p_a = get_probs(input_sentence, words)['he']

In [11]:
# Probability (as a percentage) of 'she' filling the mask in "[MASK] is a doctor"
input_sentence = '[MASK] is a doctor'
words = ['she']

p_b = get_probs(input_sentence, words)['she']

In [12]:
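# Normalising term for 'he': probability of 'doctor' filling the mask in "he is a [MASK]".
# Note: this is not the "[MASK] is a [MASK]" prior described in the LPBS template.
input_sentence = 'he is a [MASK]'
words = ['doctor']

p_prior_a = get_probs(input_sentence, words)['doctor']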


In [13]:
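# Normalising term for 'she': probability of 'doctor' filling the mask in "she is a [MASK]".
# Note: this is not the "[MASK] is a [MASK]" prior described in the LPBS template.
input_sentence = 'she is a [MASK]'
words = ['doctor']

p_prior_b = get_probs(input_sentence, words)['doctor']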

In [14]:
import numpy as np

bias = np.log(p_a/p_prior_a) - np.log(p_b/p_prior_b)
bias

1.3761235868125805

The categorical bias score (CBS) is another method, similar to LPBS but adapted for non-binary targets. It measures the variance of the normalised log probabilities of predicted tokens across the corresponding protected attribute words $a \in A$, averaged over $W$, a set of attribute words:

$$
\text{CBS}(S) = \frac{1}{\lvert W \rvert} \sum_{w \in W} \operatorname{Var}_{a \in A} \left( \log \frac{p_a}{p_{\text{prior}}} \right)
$$
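
To show how the formula is evaluated, here is a minimal sketch that computes CBS from a table of probabilities. Everything in it is an assumption for illustration: the function name `cbs`, the placeholder words (`w_1`, `group_1`, and so on), the probability values, and the use of the population variance, since the formula does not state which variance estimator is intended.

```python
import math
from statistics import pvariance
from typing import Dict, Tuple

# For each word w in W: protected attribute word a -> (p_a, p_prior).
Scores = Dict[str, Dict[str, Tuple[float, float]]]


def cbs(scores: Scores) -> float:
    """Mean over w in W of the variance over a in A of log(p_a / p_prior)."""
    variances = []
    for per_attribute in scores.values():
        log_ratios = [math.log(p_a / p_prior) for p_a, p_prior in per_attribute.values()]
        # Assumption: population variance (divide by |A|), not the sample variance.
        variances.append(pvariance(log_ratios))
    return sum(variances) / len(variances)


# Hypothetical probabilities (not produced by any model).
toy_scores: Scores = {
    "w_1": {"group_1": (0.4, 0.2), "group_2": (0.1, 0.2), "group_3": (0.2, 0.2)},
    "w_2": {"group_1": (0.2, 0.2), "group_2": (0.2, 0.2), "group_3": (0.2, 0.2)},
}
print(cbs(toy_scores))
```

In this toy table `w_2` treats all three groups identically, so it contributes zero variance, and the score comes entirely from the spread of log ratios for `w_1`. A CBS of zero would mean the model's normalised predictions do not vary across the protected attribute words at all.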