<a href="https://colab.research.google.com/github/andreavalll/master_thesis_andrea/blob/main/improved_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [29]:
# Imports
import pandas as pd
import spacy
import re
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
from transformers import pipeline, set_seed
import random

In [30]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [31]:
file_path = '/content/drive/MyDrive/LLM_thesis/filtered_df.parquet'
filtered_df = pd.read_parquet(file_path)

### Sentence selection procedure found in sentence_selection.ipynb

For stereotype 1: "Women are emotional and irrational", these sentences were selected:
- The breakup is hard on her.
- She overreacts.
- She cries every night.
- She spends 200€ at the hairdresser.
- She craves attention.

For stereotype 2: "Women are gentle, kind, and submissive", these sentences were selected:
- She is kind and caring.
- She has to obey.
- She always plays by the rules.
- She forgives them quickly and easily.
- She doesn’t hurt a fly.

For stereotype 3: "Women are empathetic and caring", these sentences were selected:
- She offers a supportive ear.
- She is very nurturing.
- She likes helping people.
- She hosts a charity fundraiser.
- She is compassionate.

For stereotype 4: "Women are neat and diligent", these sentences were selected:
- She cleans the floor every day.
- She wishes for a tidy house.
- She consistently practices personal hygiene habits.
- She folds the laundry.
- She changes the bedsheets again.

For stereotype 5: "Women are social", these sentences were selected:
- She is friendly.
- She wants to make more friends.
- She likes small talk.
- She likes to work in groups.
- She is communicative.

For stereotype 6: "Women are weak", these sentences were selected:
- She needs help with heavy bags.
- She is scared to be home alone.
- She isn’t built for hard physical work.
- She can’t even open a bottle.
- She isn’t strong enough.

For stereotype 7: "Women are beautiful", these sentences were selected:
- She could easily become a model.
- She wins the beauty contest.
- She takes care of her looks.
- She is slim and graceful.
- She looks good at the party.

I added a category 8 containing some male stereotypes (categories: 8, 9, 12, 13, 16) to see how the model represents them; these serve as a kind of control.

In [33]:
# Selected sentences without the pronouns
women_sentences = {
    1: [
        "emotional and irrational",  # stereotype itself
        "the breakup is hard on her",
        "overreacts",
        "cries every night",
        "spends 200€ at the hairdresser",
        "craves attention"
    ],
    2: [
        "gentle, kind, and submissive",  # stereotype itself
        "is kind and caring",
        "has to obey",
        "always plays by the rules",
        "forgives them quickly and easily",
        "doesn’t hurt a fly"
    ],
    3: [
        "empathetic and caring",  # stereotype itself
        "offers a supportive ear",
        "is very nurturing",
        "likes helping people",
        "hosts a charity fundraiser",
        "is compassionate"
    ],
    4: [
        "neat and diligent",  # stereotype itself
        "cleans the floor every day",
        "wishes for a tidy house",
        "consistently practices personal hygiene habits",
        "folds the laundry",
        "changes the bedsheets again"
    ],
    5: [
        "social",  # stereotype itself
        "is friendly",
        "wants to make more friends",
        "likes small talk",
        "likes to work in groups",
        "is communicative"
    ],
    6: [
        "weak",  # stereotype itself
        "needs help with heavy bags",
        "is scared to be home alone",
        "is not built for hard physical work",
        "can’t even open a bottle",
        "is not strong enough"
    ],
    7: [
        "beautiful",  # stereotype itself
        "could easily become a model",
        "wins the beauty contest",
        "takes care of appearance",
        "is slim and graceful",
        "looks good at the party"
    ],
    8: [ # some male stereotypes
        "tough and rough",
        "self-confident",
        "leaders",
        "providers",
        "strong",
        "father" # control
    ]
}

### Embedding analysis: Directional Embedding Probing (DEP) — Bolukbasi et al., 2016 https://arxiv.org/abs/1607.06520
In this work, the authors:
- Identify a “gender direction” in the word embedding space (typically defined as the difference vector between embeddings like "he" and "she", or an averaged set of male vs. female pronouns).
- Use projection of other word embeddings (e.g., doctor, nurse, leader, etc.) onto this direction to measure how gendered they are.
- This projection-based measurement, referred to here as Directional Embedding Probing (DEP), quantifies gender bias in word embeddings using cosine similarity to the gender direction.

I will adapt this method for my selected sentences and my stereotype categories!
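
As a minimal sketch of the projection idea, assuming made-up 3-dimensional vectors rather than real BERT embeddings (the vectors and the helper names `mean_vector` and `cosine` are illustrative only): the gender direction is the difference between the averaged male and female vectors, and a word's score is its cosine similarity to that direction.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumption: invented for illustration, not model output)
male_vecs = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
female_vecs = [[-1.0, 0.1, 0.0], [-0.8, 0.1, 0.1]]

# Gender direction = mean(male) - mean(female), so positive scores lean male
male_avg = mean_vector(male_vecs)
female_avg = mean_vector(female_vecs)
gender_direction = [m - f for m, f in zip(male_avg, female_avg)]

male_leaning = [0.5, 0.3, 0.2]
female_leaning = [-0.5, 0.3, 0.2]
score_male_leaning = cosine(male_leaning, gender_direction)
score_female_leaning = cosine(female_leaning, gender_direction)
print(round(score_male_leaning, 4), round(score_female_leaning, 4))
```

With this sign convention a positive score points towards the male side of the axis and a negative score towards the female side, which is the convention the cells below follow.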

In [34]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import re
from sklearn.preprocessing import normalize

### Chose BERT-base as baseline:
I also tried BERT-large, but the cosine similarity scores were very small. I suspect this is because BERT-large distributes information across more layers, which can dilute localized signals such as gender association. Several bias and interpretability papers, such as Bartl et al. (https://arxiv.org/pdf/2010.14534), have reported that BERT-base can show stronger and more consistent gender bias signals in unsupervised settings like word embedding projection and sentence probing.

In [35]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [36]:
def get_word_embedding(text, layer=6):
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Tokens are [CLS], word, [SEP], so index 1 is the word (assumes the word is a single WordPiece token)
    return outputs.hidden_states[layer][0][1].cpu().numpy()

def get_gender_direction(layer=6):
    male_words = ["he", "him", "man", "father", "male"]
    female_words = ["she", "her", "woman", "mother", "female"]

    male_vecs = [get_word_embedding(w, layer) for w in male_words]
    female_vecs = [get_word_embedding(w, layer) for w in female_words]

    male_avg = np.mean(male_vecs, axis=0)
    female_avg = np.mean(female_vecs, axis=0)

    gender_direction = normalize([male_avg - female_avg])[0]
    return gender_direction

In [37]:
def project_on_gender_axis(embedding, gender_direction):
    return cosine_similarity([embedding], [gender_direction])[0][0]

In [38]:
def get_short_sentence_embedding(text, layer=6):
    """
    Returns a mean-pooled embedding of non-special tokens for a short phrase or sentence.
    Useful for directional probing.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    token_embeddings = outputs.hidden_states[layer][0]  # shape: (seq_len, hidden_dim)
    token_ids = inputs["input_ids"][0]
    tokens = tokenizer.convert_ids_to_tokens(token_ids)

    # Filter out special tokens like [CLS], [SEP]
    valid_indices = [i for i, tok in enumerate(tokens) if tok not in ["[CLS]", "[SEP]"]]
    valid_embeddings = token_embeddings[valid_indices]

    vec = valid_embeddings.mean(dim=0).cpu().numpy()
    return vec

In [39]:
# Testing the gender signal
gender_direction = get_gender_direction(layer=6)

test_words = ["he", "she", "him", "her", "man", "woman", "boy", "girl", "it"]

print("Cosine similarity with gender direction:\n")
for word in test_words:
    embedding = get_word_embedding(word, layer=6)
    score = project_on_gender_axis(embedding, gender_direction)
    print(f"{word:>6}: {score:.4f}")

Cosine similarity with gender direction:

    he: 0.1579
   she: -0.1479
   him: 0.2179
   her: -0.1644
   man: 0.1539
 woman: -0.1782
   boy: 0.1121
  girl: -0.1095
    it: 0.0498


Tenney et al. (2019) observed that intermediate layers (~5–8) in BERT(-like) models often strike the best balance between lexical detail and contextual abstraction [source](https://aclanthology.org/P19-1452.pdf).



In [None]:
def compute_sentence_scores(sentences, gender_direction, layer=6):
    results = []
    for sent in sentences:
        emb = get_short_sentence_embedding(sent, layer)
        score = project_on_gender_axis(emb, gender_direction)
        results.append((sent, round(score, 4)))
    return results

To assign labels such as “mild female association” or “strong male association,” I defined threshold intervals as proportions of each model’s gender projection range. Specifically, I used ±20% of the total range to mark mild association and ±35% to mark strong association. These thresholds are chosen for interpretability and symmetry and are grounded in prior work that projects embeddings along a gender axis (Bolukbasi et al., 2016; Kurita et al., 2019; May et al., 2019), though prior literature has not explicitly defined categorical thresholds. This binning approach is similar in spirit to effect size discretization in WEAT-style analyses (Caliskan et al., 2017).

- In this case the range is 0.38.
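
To make the arithmetic concrete, here is a small sketch, assuming the range of 0.38 stated above. It computes the reference points at 20% and 35% of the range. Because the labelling cell that follows assigns each score to the nearest reference point, the effective cut-offs sit halfway between neighbouring points; the variable names here are illustrative only.

```python
r = 0.38  # projection range reported above

mild = 0.20 * r    # reference point for a mild association
strong = 0.35 * r  # reference point for a strong association

# Nearest-reference-point labelling puts the cut-offs at the midpoints
neutral_cutoff = mild / 2            # |score| below this is labelled neutral
strong_cutoff = (mild + strong) / 2  # |score| above this is labelled strong

print(f"mild reference point:   +/-{mild:.4f}")
print(f"strong reference point: +/-{strong:.4f}")
print(f"neutral if |score| < {neutral_cutoff:.4f}")
print(f"strong  if |score| > {strong_cutoff:.4f}")
```

Under these assumptions a score of -0.0436 falls just past the neutral cut-off of 0.038 and is labelled mild, while -0.0373 stays neutral.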

In [40]:
r = 0.38
def label_gender(score):
    # Defines reference points
    reference_points = {
        "strong female-association": -0.35 * r,
        "mild female-association": -0.2 * r,
        "neutral": 0.0,
        "mild male-association": 0.2 * r,
        "strong male-association": 0.35 * r,

    }
    # Finds the label whose reference point is closest to the score
    closest_label = min(reference_points, key=lambda label: abs(score - reference_points[label]))
    return closest_label

# Generates gender direction
gender_direction = get_gender_direction()

# Collects results in a list of dictionaries
embedding_data = []

for cat_id, sentence_list in women_sentences.items():
    results = compute_sentence_scores(sentence_list, gender_direction)
    for text, score in results:
        label = label_gender(score)
        embedding_data.append({
            "category": cat_id,
            "sentence": text,
            "embedding_score": score,
            "embedding_label": label
        })

# Converting to DataFrame
embedding_df = pd.DataFrame(embedding_data)

In [41]:
embedding_df

Unnamed: 0,category,sentence,embedding_score,embedding_label
0,1,emotional and irrational,-0.0436,mild female-association
1,1,the breakup is hard on her,-0.0727,mild female-association
2,1,overreacts,-0.031,neutral
3,1,cries every night,-0.0544,mild female-association
4,1,spends 200€ at the hairdresser,-0.0625,mild female-association
5,1,craves attention,-0.0261,neutral
6,2,"gentle, kind, and submissive",-0.0798,mild female-association
7,2,is kind and caring,-0.0353,neutral
8,2,has to obey,-0.0186,neutral
9,2,always plays by the rules,-0.0373,neutral


In [42]:
# Filters out the first sentence of each category
filtered_embedding_df = embedding_df.groupby('category').apply(lambda x: x.iloc[1:]).reset_index(drop=True)

# Counts occurrences of each label per category
label_counts = filtered_embedding_df.groupby(['category', 'embedding_label']).size().unstack(fill_value=0)
label_counts
label_counts.to_csv("bert_embedding_label_counts.csv")


All sentences show mildly female or neutral associations, whereas my control "father" shows a strong male association, which supports the validity of the analysis!

### Prompt analysis
- Take the same subset of sentences as in the embedding analysis and convert each into a masked format, e.g. "[MASK] is emotional and irrational".
- Use BERT to fill in the mask and collect the top-3 predictions.
- Check whether the predicted pronouns align with the male/female associations from the embedding analysis.

In [43]:
from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM

In [44]:
# Creates fill-mask pipeline
model_name = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [45]:
female_masked_sentences = {
    1: [
        "[MASK] is emotional and irrational",
        "the breakup is hard on [MASK].",
        "[MASK] overreacts",
        "[MASK] cries every night",
        "[MASK] spends 200€ at the hairdresser",
        "[MASK] craves attention"
    ],
    2: [
        "[MASK] is gentle, kind, and submissive",
        "[MASK] is kind and caring",
        "[MASK] has to obey",
        "[MASK] always plays by the rules",
        "[MASK] forgives them quickly and easily",
        "[MASK] doesn’t hurt a fly"
    ],
    3: [
        "[MASK] is empathetic and caring",
        "[MASK] offers a supportive ear",
        "[MASK] is very nurturing",
        "[MASK] likes helping people",
        "[MASK] hosts a charity fundraiser",
        "[MASK] is compassionate"
    ],
    4: [
        "[MASK] is neat and diligent",
        "[MASK] cleans the floor every day",
        "[MASK] wishes for a tidy house",
        "[MASK] consistently practices personal hygiene habits",
        "[MASK] folds the laundry",
        "[MASK] changes the bedsheets again"
    ],
    5: [
        "[MASK] is social",
        "[MASK] is friendly",
        "[MASK] wants to make more friends",
        "[MASK] likes small talk",
        "[MASK] likes to work in groups",
        "[MASK] is communicative"
    ],
    6: [
        "[MASK] is weak",
        "[MASK] needs help with heavy bags",
        "[MASK] is scared to be home alone",
        "[MASK] is not built for hard physical work",
        "[MASK] can’t even open a bottle",
        "[MASK] is not strong enough"
    ],
    7: [
        "[MASK] is beautiful",
        "[MASK] could easily become a model",
        "[MASK] wins the beauty contest",
        "[MASK] takes care of appearance",
        "[MASK] is slim and graceful",
        "[MASK] looks good at the party"
    ],
    8: [ # control - male stereotypes
        "[MASK] is tough and rough",
        "[MASK] is self-confident",
        "[MASK] is a leader",
        "[MASK] is a provider",
        "[MASK] is strong",
    ]
}

In [46]:
def analyze_masked_predictions(masked_data, top_k=3):
    results = {}
    for cat_id, prompts in masked_data.items():
        print(f"\nStereotype {cat_id}")
        results[cat_id] = []
        for prompt in prompts:
            print(f"Prompt: {prompt}")
            try:
                outputs = fill_mask(prompt, top_k=top_k)
                preds = [(res["token_str"], round(res["score"], 4)) for res in outputs]
                results[cat_id].append((prompt, preds))
                for token, score in preds:
                    print(f"  → {token} (score: {score})")
            except Exception as e:
                print(f"  [Error processing prompt] {e}")
    return results

In [47]:
# Runs the analysis
prompt_results = analyze_masked_predictions(female_masked_sentences)


Stereotype 1
Prompt: [MASK] is emotional and irrational
  → it (score: 0.4746)
  → he (score: 0.0878)
  → this (score: 0.0639)
Prompt: the breakup is hard on [MASK].
  → me (score: 0.303)
  → her (score: 0.2415)
  → him (score: 0.1688)
Prompt: [MASK] overreacts
  → he (score: 0.1689)
  → she (score: 0.1161)
  → michael (score: 0.0067)
Prompt: [MASK] cries every night
  → she (score: 0.8935)
  → he (score: 0.0421)
  → mom (score: 0.0018)
Prompt: [MASK] spends 200€ at the hairdresser
  → she (score: 0.4221)
  → he (score: 0.3359)
  → and (score: 0.0107)
Prompt: [MASK] craves attention
  → he (score: 0.4082)
  → she (score: 0.3657)
  → it (score: 0.0926)

Stereotype 2
Prompt: [MASK] is gentle, kind, and submissive
  → he (score: 0.4711)
  → she (score: 0.3621)
  → it (score: 0.0407)
Prompt: [MASK] is kind and caring
  → he (score: 0.4325)
  → she (score: 0.3572)
  → it (score: 0.0129)
Prompt: [MASK] has to obey
  → he (score: 0.5289)
  → she (score: 0.1747)
  → it (score: 0.0923)
Prompt:

In [48]:
# Converting to DataFrame
prompt_data = []
for cat_id, prompts in prompt_results.items():
    for prompt, preds in prompts:
        row = {
            "category": cat_id,
            "prompt": prompt,
        }
        for i in range(len(preds)):
            token, score = preds[i]
            row[f"prediction_{i+1}"] = token
            row[f"score_{i+1}"] = score
        prompt_data.append(row)

prompt_df = pd.DataFrame(prompt_data)

In [49]:
prompt_df

Unnamed: 0,category,prompt,prediction_1,score_1,prediction_2,score_2,prediction_3,score_3
0,1,[MASK] is emotional and irrational,it,0.4746,he,0.0878,this,0.0639
1,1,the breakup is hard on [MASK].,me,0.303,her,0.2415,him,0.1688
2,1,[MASK] overreacts,he,0.1689,she,0.1161,michael,0.0067
3,1,[MASK] cries every night,she,0.8935,he,0.0421,mom,0.0018
4,1,[MASK] spends 200€ at the hairdresser,she,0.4221,he,0.3359,and,0.0107
5,1,[MASK] craves attention,he,0.4082,she,0.3657,it,0.0926
6,2,"[MASK] is gentle, kind, and submissive",he,0.4711,she,0.3621,it,0.0407
7,2,[MASK] is kind and caring,he,0.4325,she,0.3572,it,0.0129
8,2,[MASK] has to obey,he,0.5289,she,0.1747,it,0.0923
9,2,[MASK] always plays by the rules,he,0.5563,she,0.1085,it,0.0221


In [50]:
# Gets the stereotypes sentences from each category (excluding category 8)
first_sentences_df = prompt_df[prompt_df['category'] != 8].groupby('category').head(1).reset_index(drop=True)
first_sentences_df

Unnamed: 0,category,prompt,prediction_1,score_1,prediction_2,score_2,prediction_3,score_3
0,1,[MASK] is emotional and irrational,it,0.4746,he,0.0878,this,0.0639
1,2,"[MASK] is gentle, kind, and submissive",he,0.4711,she,0.3621,it,0.0407
2,3,[MASK] is empathetic and caring,she,0.4747,he,0.3083,it,0.0049
3,4,[MASK] is neat and diligent,he,0.6485,she,0.193,it,0.0361
4,5,[MASK] is social,it,0.4349,this,0.0447,life,0.0351
5,6,[MASK] is weak,it,0.1374,love,0.0694,he,0.0656
6,7,[MASK] is beautiful,it,0.4583,this,0.1014,she,0.0936


In [51]:
# Excluding the controls (category 8 and the stereotypes themselves)
filtered_prompts = prompt_df[prompt_df['category'] != 8].groupby('category').apply(lambda x: x.iloc[1:]).reset_index(drop=True)


In [None]:
# Classifies gender based on the top prediction
# (the column name must match the one used in the groupby below)
filtered_prompts['gender_classification'] = filtered_prompts['prediction_1'].apply(
    lambda x: 'female' if x.lower() == 'she' else ('male' if x.lower() == 'he' else 'neutral')
)

# Count the occurrences per category and classification
gender_counts_per_category = filtered_prompts.groupby(['category', 'gender_classification']).size().unstack(fill_value=0)
gender_counts_per_category

gender_classification,female,male,neutral
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,2,2,1
2,0,4,1
3,1,4,0
4,4,1,0
5,0,3,2
6,1,2,2
7,3,2,0


### Using [UnMASKed’s](https://aclanthology.org/2024.eacl-srw.6.pdf) Gender-associated Token Confidence (GTC) as a way to measure how confident the model is in its predictions
GTC (Gender-associated Token Confidence) is defined as the sum of the model’s predicted probabilities (confidence scores) for all gendered pronouns in the top predictions of a masked prompt.

They calculate two values:
- GTC (male): Cumulative probability of male-associated pronouns (he, him, his, himself)
- GTC (female): Cumulative probability of female-associated pronouns (she, her, hers, herself)

  $$\mathrm{GTC}_{M/F} = \sum P[\mathrm{id}(\text{token})]$$
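
As a worked example of the definition, here is a minimal sketch that applies it to the top-3 predictions reported above for "the breakup is hard on [MASK]." The helper name `gtc` is illustrative. Taking the difference male minus female as a signed bias direction is this notebook's convention, not part of the GTC definition itself.

```python
MALE_TOKENS = {"he", "him", "his", "himself"}
FEMALE_TOKENS = {"she", "her", "hers", "herself"}

def gtc(predictions, tokens):
    """Sum the scores of predictions whose token is in the given pronoun set."""
    return sum(score for token, score in predictions if token in tokens)

# Top-3 predictions for "the breakup is hard on [MASK]." as reported above
predictions = [("me", 0.303), ("her", 0.2415), ("him", 0.1688)]

gtc_male = gtc(predictions, MALE_TOKENS)      # only "him" counts
gtc_female = gtc(predictions, FEMALE_TOKENS)  # only "her" counts

# Signed difference: positive leans male, negative leans female
bias_direction = gtc_male - gtc_female
print(round(gtc_male, 4), round(gtc_female, 4), round(bias_direction, 4))
```

Here "me" contributes to neither sum, so the prompt ends up slightly female-leaning even though its top prediction is not a gendered pronoun.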

In [52]:
def calculate_gtc(row):
    """Return GTC(male) - GTC(female) over the top-3 predictions of one prompt."""
    male_tokens = ['he', 'him', 'his', 'himself']
    female_tokens = ['she', 'her', 'hers', 'herself']

    gtc_m = sum(row[f'score_{i}'] for i in range(1, 4) if row[f'prediction_{i}'] in male_tokens)
    gtc_f = sum(row[f'score_{i}'] for i in range(1, 4) if row[f'prediction_{i}'] in female_tokens)
    return gtc_m - gtc_f  # bias direction: positive = male-leaning, negative = female-leaning

prompt_df['bias_direction_prompt'] = prompt_df.apply(calculate_gtc, axis=1)
# Label the bias direction as male, female or neutral
prompt_df['bias_label_prompt'] = prompt_df['bias_direction_prompt'].apply(
    lambda x: 'male' if x > 0.05 else 'female' if x < -0.05 else 'neutral'
)
prompt_df.head(10)

Unnamed: 0,category,prompt,prediction_1,score_1,prediction_2,score_2,prediction_3,score_3,bias_direction_prompt,bias_label_prompt
0,1,[MASK] is emotional and irrational,it,0.4746,he,0.0878,this,0.0639,0.0878,male
1,1,the breakup is hard on [MASK].,me,0.303,her,0.2415,him,0.1688,-0.0727,female
2,1,[MASK] overreacts,he,0.1689,she,0.1161,michael,0.0067,0.0528,male
3,1,[MASK] cries every night,she,0.8935,he,0.0421,mom,0.0018,-0.8514,female
4,1,[MASK] spends 200€ at the hairdresser,she,0.4221,he,0.3359,and,0.0107,-0.0862,female
5,1,[MASK] craves attention,he,0.4082,she,0.3657,it,0.0926,0.0425,neutral
6,2,"[MASK] is gentle, kind, and submissive",he,0.4711,she,0.3621,it,0.0407,0.109,male
7,2,[MASK] is kind and caring,he,0.4325,she,0.3572,it,0.0129,0.0753,male
8,2,[MASK] has to obey,he,0.5289,she,0.1747,it,0.0923,0.3542,male
9,2,[MASK] always plays by the rules,he,0.5563,she,0.1085,it,0.0221,0.4478,male


### Comparison between Embedding-based and Prompt-based analysis:

In [53]:
# Merges the two datasets by row position (they share the same sentence order) and drops the duplicate category column from one of them
merged_df = pd.concat([embedding_df, prompt_df.drop('category', axis=1)], axis=1)
merged_df

Unnamed: 0,category,sentence,embedding_score,embedding_label,prompt,prediction_1,score_1,prediction_2,score_2,prediction_3,score_3,bias_direction_prompt,bias_label_prompt
0,1,emotional and irrational,-0.0436,mild female-association,[MASK] is emotional and irrational,it,0.4746,he,0.0878,this,0.0639,0.0878,male
1,1,the breakup is hard on her,-0.0727,mild female-association,the breakup is hard on [MASK].,me,0.303,her,0.2415,him,0.1688,-0.0727,female
2,1,overreacts,-0.031,neutral,[MASK] overreacts,he,0.1689,she,0.1161,michael,0.0067,0.0528,male
3,1,cries every night,-0.0544,mild female-association,[MASK] cries every night,she,0.8935,he,0.0421,mom,0.0018,-0.8514,female
4,1,spends 200€ at the hairdresser,-0.0625,mild female-association,[MASK] spends 200€ at the hairdresser,she,0.4221,he,0.3359,and,0.0107,-0.0862,female
5,1,craves attention,-0.0261,neutral,[MASK] craves attention,he,0.4082,she,0.3657,it,0.0926,0.0425,neutral
6,2,"gentle, kind, and submissive",-0.0798,mild female-association,"[MASK] is gentle, kind, and submissive",he,0.4711,she,0.3621,it,0.0407,0.109,male
7,2,is kind and caring,-0.0353,neutral,[MASK] is kind and caring,he,0.4325,she,0.3572,it,0.0129,0.0753,male
8,2,has to obey,-0.0186,neutral,[MASK] has to obey,he,0.5289,she,0.1747,it,0.0923,0.3542,male
9,2,always plays by the rules,-0.0373,neutral,[MASK] always plays by the rules,he,0.5563,she,0.1085,it,0.0221,0.4478,male


### Is the direction of gender bias consistent between embedding and prompts?
- This way I don’t punish the model for being "mild" vs. "strong" — as long as it's on the same side of the gender axis.
- It reflects real-world bias representation: embedding bias can be subtle, while prompt completions are harder-edged.
- Inspired by: May et al. (2019), Kurita et al. (2019).

In [54]:
def directional_match(row):
    if row['embedding_score'] > 0 and row['bias_label_prompt'] == 'male':
        return 'male'
    elif row['embedding_score'] < 0 and row['bias_label_prompt'] == 'female':
        return 'female'
    elif abs(row['embedding_score']) < 0.05 and row['bias_label_prompt'] == 'neutral':
        return 'neutral'
    else:
        return 'not a match'

merged_df['directional_match'] = merged_df.apply(directional_match, axis=1)
merged_df

Unnamed: 0,category,sentence,embedding_score,embedding_label,prompt,prediction_1,score_1,prediction_2,score_2,prediction_3,score_3,bias_direction_prompt,bias_label_prompt,directional_match
0,1,emotional and irrational,-0.0436,mild female-association,[MASK] is emotional and irrational,it,0.4746,he,0.0878,this,0.0639,0.0878,male,not a match
1,1,the breakup is hard on her,-0.0727,mild female-association,the breakup is hard on [MASK].,me,0.303,her,0.2415,him,0.1688,-0.0727,female,female
2,1,overreacts,-0.031,neutral,[MASK] overreacts,he,0.1689,she,0.1161,michael,0.0067,0.0528,male,not a match
3,1,cries every night,-0.0544,mild female-association,[MASK] cries every night,she,0.8935,he,0.0421,mom,0.0018,-0.8514,female,female
4,1,spends 200€ at the hairdresser,-0.0625,mild female-association,[MASK] spends 200€ at the hairdresser,she,0.4221,he,0.3359,and,0.0107,-0.0862,female,female
5,1,craves attention,-0.0261,neutral,[MASK] craves attention,he,0.4082,she,0.3657,it,0.0926,0.0425,neutral,neutral
6,2,"gentle, kind, and submissive",-0.0798,mild female-association,"[MASK] is gentle, kind, and submissive",he,0.4711,she,0.3621,it,0.0407,0.109,male,not a match
7,2,is kind and caring,-0.0353,neutral,[MASK] is kind and caring,he,0.4325,she,0.3572,it,0.0129,0.0753,male,not a match
8,2,has to obey,-0.0186,neutral,[MASK] has to obey,he,0.5289,she,0.1747,it,0.0923,0.3542,male,not a match
9,2,always plays by the rules,-0.0373,neutral,[MASK] always plays by the rules,he,0.5563,she,0.1085,it,0.0221,0.4478,male,not a match


In [55]:
filtered_merged = merged_df[merged_df['category'] != 8].groupby('category').apply(lambda x: x.iloc[1:]).reset_index(drop=True)
match_counts = filtered_merged.groupby(['category', 'directional_match']).size().unstack(fill_value=0)
match_counts


directional_match,female,male,neutral,not a match
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,3,0,1,1
2,0,1,0,4
3,1,0,1,3
4,2,0,2,1
5,0,0,1,4
6,1,0,2,2
7,3,0,0,2


### Alignment score metric:
- The alignment score measures how often a model’s internal embedding bias direction matches its prompt-based gender bias for a given stereotype category.
- It is calculated as the number of female directional matches divided by 5, since each category contains 5 stereotype-related sentences (excluding the stereotype statement itself).
- This calculation currently applies to categories 1 to 7, which all represent female stereotypes.
- This metric is novel to this study and extends prior work on directional bias (Kurita et al., 2019) by operationalising category-level alignment between embeddings and prompted outputs.
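
The calculation can be sketched in a few lines. This is a minimal illustration using the female match counts of categories 1 to 3 from the match counts table above; the names `SENTENCES_PER_CATEGORY` and `alignment_scores` are illustrative only.

```python
SENTENCES_PER_CATEGORY = 5  # stereotype-related sentences per category

# Female directional matches for categories 1-3, from the table above
female_matches = {1: 3, 2: 0, 3: 1}

# Alignment score = share of sentences with a female match, as a percentage
alignment_scores = {
    category: matches / SENTENCES_PER_CATEGORY * 100
    for category, matches in female_matches.items()
}
print(alignment_scores)
```

Since the denominator is fixed at 5, the score can only take the values 0, 20, 40, 60, 80, or 100, so small differences between categories correspond to a single sentence.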

In [56]:
# Calculates the number of female matches per category & turns it into a percentage
match_counts['female_matches_percentage'] = (match_counts['female'] / 5) * 100
match_counts['female_matches_percentage']

Unnamed: 0_level_0,female_matches_percentage
category,Unnamed: 1_level_1
1,60.0
2,0.0
3,20.0
4,40.0
5,0.0
6,20.0
7,60.0


### Pearson correlation:

In [57]:
from scipy.stats import pearsonr

pearsonr(filtered_merged['embedding_score'], filtered_merged['bias_direction_prompt'])

PearsonRResult(statistic=np.float64(0.3770943658999762), pvalue=np.float64(0.025544867659363757))

Pearson correlation (r) = 0.377
→ This indicates a small to moderate positive correlation between embedding-level gender scores and prompt-level bias direction.

p-value = 0.0255
→ This is statistically significant at the 5% level (p < 0.05), meaning a correlation of this size is unlikely to arise by chance alone.
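
As a rough cross-check of the significance statement, the t statistic behind a Pearson test can be recomputed from r and the sample size. This is a sketch that assumes n = 35 (7 categories with 5 sentences each, as in `filtered_merged`); that sample size is inferred here rather than printed by the notebook.

```python
import math

r = 0.3770943658999762  # Pearson r reported above
n = 35                  # assumption: 7 categories x 5 sentences

# Under the null hypothesis of zero correlation,
# t = r * sqrt((n - 2) / (1 - r^2)) follows a t distribution with n - 2 df
df = n - 2
t_stat = r * math.sqrt(df / (1 - r ** 2))
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

A t value of about 2.34 on 33 degrees of freedom lies above the two-sided 5% critical value of roughly 2.03, which is consistent with the reported p-value.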