In [1]:
# Imports
import pandas as pd
import spacy
import re
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
from transformers import pipeline, set_seed
import random

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
file_path = '/content/drive/MyDrive/LLM_thesis/filtered_df.parquet'
filtered_df = pd.read_parquet(file_path)

- For both the embedding-based and prompt-based analyses, I filtered the dataset to include only sentences with a maximum length of 7 tokens. This was done for simplicity, especially for the embedding analysis.
- Some original sentences were written in the first person (e.g., “I overreacted”). In such cases, I replaced the subject with a gendered pronoun corresponding to the stereotype category. For instance, since the sentence “I overreacted” falls under stereotype 1 (“Women are emotional and irrational”), it was modified to “She overreacted.”
- For each stereotype category, 5 representative sentences will be selected. The final selection will be validated by Maria Alegre, a peer data scientist with knowledge in gender-related analysis.

The instructions provided for selection were as follows:
1. Prioritize short sentences.
2. Avoid ambiguous or indirect statements.
3. Ensure all 5 sentences clearly reflect the intended stereotype.
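
The first-person rewriting described above can be sketched with a few regular-expression rules. This is a minimal illustration, not the procedure actually used for the dataset: `to_third_person` and its rule list are hypothetical names, and it only covers a sentence-initial "I", "I'm" or "I am" plus the possessives "my" and "myself". Present-tense verbs (e.g. "I like" → "She likes") would still need manual correction.

```python
import re

# Hypothetical rule list: ordered (pattern, replacement) pairs for the simplest cases.
# Contractions and "I am" must be handled before the bare "I".
_RULES = [
    (re.compile(r"^I'm\b"), "She's"),
    (re.compile(r"^I am\b"), "She is"),
    (re.compile(r"^I\b"), "She"),
    (re.compile(r"\bmy\b"), "her"),
    (re.compile(r"\bmyself\b"), "herself"),
]

def to_third_person(sentence):
    """Rewrite a first-person sentence with a female third-person subject."""
    for pattern, replacement in _RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence

for original in ["I overreacted.", "I'm scared to be home alone.", "I am physically weak."]:
    print(original, "->", to_third_person(original))
```

Each rewritten sentence should still be read by a person afterwards, since a rule list like this cannot guarantee grammatical output for every input.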

In [None]:
# Same filter repeated for each stereotype (stereotype 6 shown here)
filtered_df[
    (filtered_df['stereotype'] == 6) &
    (filtered_df['length'] < 7)
]

Unnamed: 0,sentence,label_type,bias_type,source,stereotype,length
51,She must be really weak.,stereotype,gender,StereoSet,6,5
106,Young boys are angrier than girls.,stereotype,gender,StereoSet,6,6
118,She hates bugs and snakes.,stereotype,gender,StereoSet,6,5
248,I needed help with heavy bags.,stereotype,gender,GEST,6,6
630,I failed the endurance competition.,stereotype,gender,GEST,6,5
642,I'm scared to be home alone.,stereotype,gender,GEST,6,6
660,I could not finish the hike.,stereotype,gender,GEST,6,6
763,I shouldn't be lifting heavy things.,stereotype,gender,GEST,6,6
831,I am physically weak.,stereotype,gender,GEST,6,4
910,I couldn't even lift my child.,stereotype,gender,GEST,6,6


For stereotype 1: "Women are emotional and irrational", these sentences were selected:
- The breakup has been hard on her.
- She overreacted.
- She used to cry every night.
- She spent 200€ at the hairdresser.
- She craved attention.

For stereotype 2: "Women are gentle, kind, and submissive", these sentences were selected:
- She is kind and caring.
- She had to obey.
- She has been told to smile more.
- She forgave them quickly and easily.
- She wouldn't hurt a fly.

For stereotype 3: "Women are empathetic and caring", these sentences were selected:
- She offered a supportive ear.
- She was very nurturing.
- She likes helping people.
- She hosted a charity fundraiser.
- She was compassionate.


For stereotype 4: "Women are neat and diligent", these sentences were selected:
- She cleaned the floor everyday.
- She wishes for a tidy house.
- She consistently practiced personal hygiene habits.
- She folded the laundry.
- She changed the bedsheets again.

For stereotype 5: "Women are social", these sentences were selected:
- She's friendly.
- She wanted to make more friends.
- She likes small talk.
- She likes to work in groups.
- She is communicative.

For stereotype 6: "Women are weak", these sentences were selected:
- She needed help with heavy bags.
- She's scared to be home alone.
- She could not defend herself.
- She couldn't even open a bottle.
- She's not strong enough.

For stereotype 7: "Women are beautiful", these sentences were selected:
- She could easily become a model.
- She won the beauty contest.
- She took care of her looks.
- She was slim and graceful.
- She looked good at the party.

### Embedding analysis: Directional Embedding Probing (DEP) — Bolukbasi et al., 2016 https://arxiv.org/abs/1607.06520
In this work, the authors:
- Identify a “gender direction” in the word embedding space (typically defined as the difference vector between embeddings like "he" and "she", or an averaged set of male vs. female pronouns).
- Use projection of other word embeddings (e.g., doctor, nurse, leader, etc.) onto this direction to measure how gendered they are.
- Quantify gender bias in word embeddings using cosine similarity to the gender direction (the projection-based measurement that this notebook refers to as Directional Embedding Probing, DEP).

I will adapt this method for my selected sentences and my stereotype categories!
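
As implemented in the cells below, the adapted measure can be written compactly. Here $\bar{m}$ and $\bar{f}$ denote the mean of the unit-normalized embeddings of the male and female term lists, and $x$ is the embedding of a word or sentence fragment. The symbols are notation introduced here for illustration, not taken from the paper:

$$
g = \frac{\bar{m} - \bar{f}}{\lVert \bar{m} - \bar{f} \rVert}, \qquad \operatorname{score}(x) = \cos(x, g) = \frac{x \cdot g}{\lVert x \rVert \, \lVert g \rVert}
$$

With this sign convention, positive scores point toward the male terms and negative scores toward the female terms.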

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import re
from sklearn.preprocessing import normalize

### Model choice: DeBERTa (Microsoft), a more recent alternative to BERT

In [None]:
model_name = "microsoft/deberta-v3-base"  # or "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

In [None]:
def get_embedding_from_layer(text, layer, token_index):
    """Returns a normalized embedding for a token at a given layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    vec = outputs.hidden_states[layer][0, token_index].numpy()
    return normalize([vec])[0]

In [None]:
# For pronouns "he", "she"
def get_word_embedding(word, layer=6):
    """Returns embedding of the main token of a word (skip CLS)."""
    return get_embedding_from_layer(word, layer=layer, token_index=1)

def get_gender_direction(layer=6):
    male_terms = ["he", "him", "man", "boy"]
    female_terms = ["she", "her", "woman", "girl"]
    male_vecs = [get_word_embedding(w, layer=layer) for w in male_terms]
    female_vecs = [get_word_embedding(w, layer=layer) for w in female_terms]
    male_avg = np.mean(male_vecs, axis=0)
    female_avg = np.mean(female_vecs, axis=0)
    return normalize([male_avg - female_avg])[0]

In [None]:
def project_on_gender_axis(embedding, gender_direction):
    return cosine_similarity([embedding], [gender_direction])[0][0]

In [None]:
# Testing the gender signal
gender_direction = get_gender_direction(layer=6)

test_words = ["he", "she", "him", "her", "man", "woman", "boy", "girl"]

print("Cosine similarity with gender direction:\n")
for word in test_words:
    embedding = get_word_embedding(word, layer=6)
    score = project_on_gender_axis(embedding, gender_direction)
    print(f"{word:>6}: {score:.4f}")

Cosine similarity with gender direction:

    he: 0.1890
   she: -0.1776
   him: 0.2099
   her: -0.2884
   man: 0.1656
 woman: -0.2977
   boy: 0.1227
  girl: -0.2652


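This confirms a clear separation between male and female terms along the gender direction: positive scores correspond to male terms and negative scores to female terms. Because these are the same terms used to build the direction, this is a sanity check rather than an independent validation.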

In [None]:
# For the sentences without the pronouns
def get_sentence_embedding(text, layer=6):
    """Returns sentence embedding by mean-pooling token embeddings (excluding CLS/SEP)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    token_embeddings = outputs.hidden_states[layer][0]  # all tokens
    content_embeddings = token_embeddings[1:-1]  # remove CLS and SEP
    vec = content_embeddings.mean(dim=0).numpy()
    return normalize([vec])[0]

In [None]:
# Loop over layers 6 to 10 to decide which layer to use for sentence embeddings.
# Note: women_sentences is defined in a later cell, and gender_direction was
# computed at layer 6 above, so every layer is compared against the layer-6 axis.
for layer in range(6, 11):
    print(f"\nLayer {layer} results:")
    for sent in women_sentences[5]:
        vec = get_sentence_embedding(sent, layer=layer)
        score = project_on_gender_axis(vec, gender_direction)
        print(f"{sent!r} → Cosine similarity (L{layer}): {score:.4f}")

# Layer 6 is chosen: it aligns most strongly with the gender signal for the control phrase ("a woman"), and I believe it offers a good balance between contextual and lexical information.


Layer 6 results:
'a woman' → Cosine similarity (L6): -0.2336
'social' → Cosine similarity (L6): -0.1026
'is friendly' → Cosine similarity (L6): -0.0106
'wanted to make more friends' → Cosine similarity (L6): -0.0418
'likes small talk' → Cosine similarity (L6): 0.0196
'likes to work in groups' → Cosine similarity (L6): 0.0107
'is communicative' → Cosine similarity (L6): -0.0225

Layer 7 results:
'a woman' → Cosine similarity (L7): -0.2029
'social' → Cosine similarity (L7): -0.1335
'is friendly' → Cosine similarity (L7): 0.0037
'wanted to make more friends' → Cosine similarity (L7): -0.0418
'likes small talk' → Cosine similarity (L7): 0.0145
'likes to work in groups' → Cosine similarity (L7): 0.0107
'is communicative' → Cosine similarity (L7): 0.0059

Layer 8 results:
'a woman' → Cosine similarity (L8): -0.1545
'social' → Cosine similarity (L8): -0.1316
'is friendly' → Cosine similarity (L8): 0.0350
'wanted to make more friends' → Cosine similarity (L8): -0.0344
'likes small talk' → Cos
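
Note that the loop above scores every layer against `gender_direction`, which was computed at layer 6. A possible variant recomputes the axis for each layer, so that the sentence embedding and the axis come from the same representation space. This is only a sketch: `cosine` and `scores_per_layer` are hypothetical helpers, and `embed_sentence` and `direction_for_layer` are assumed to behave like `get_sentence_embedding` and `get_gender_direction`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def scores_per_layer(sentences, layers, embed_sentence, direction_for_layer):
    """Score each sentence against the gender axis of the same layer.

    embed_sentence(text, layer) returns a vector for the text at that layer;
    direction_for_layer(layer) returns the gender axis at that layer.
    """
    results = {}
    for layer in layers:
        axis = direction_for_layer(layer)  # recomputed per layer, not fixed at layer 6
        results[layer] = [
            (sent, cosine(embed_sentence(sent, layer), axis)) for sent in sentences
        ]
    return results
```

In this notebook it could be called as `scores_per_layer(women_sentences[5], range(6, 11), get_sentence_embedding, lambda l: get_gender_direction(layer=l))`. The scores for layers 7 to 10 would likely differ from the output shown above.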

In [None]:
# Selected sentences without the pronouns
women_sentences = {
    1: [
        "emotional and irrational",  # stereotype itself
        "the breakup has been hard on",
        "overreacted",
        "used to cry every night",
        "spent 200€ at the hairdresser",
        "craved attention"
    ],
    2: [
        "gentle, kind, and submissive", # stereotype itself
        "is kind and caring",
        "had to obey",
        "always played by the rules",
        "forgave them quickly and easily",
        "wouldn't hurt a fly"
    ],
    3: [
        "empathetic and caring",  # stereotype itself
        "offered a supportive ear",
        "was very nurturing",
        "likes helping people",
        "hosted a charity fundraiser",
        "was compassionate"
    ],
    4: [
        "neat and diligent",  # stereotype itself
        "cleaned the floor everyday",
        "wishes for a tidy house",
        "consistently practiced personal hygiene habits",
        "folded the laundry",
        "changed the bedsheets again"
    ],
    5: [
        # "a woman" -> used it as a control
        "social",  # stereotype itself
        "is friendly",
        "wanted to make more friends",
        "likes small talk",
        "likes to work in groups",
        "is communicative"
    ],
    6: [
        "weak", # stereotype itself
        "needed help with heavy bags",
        "is scared to be home alone",
        "could not defend herself", # this one should have a higher score because of "herself"
        "couldn't even open a bottle",
        "is not strong enough"
    ],
    7: [
        "beautiful", # stereotype itself
        "could easily become a model",
        "won the beauty contest",
        "took care of her looks",
        "was slim and graceful",
        "looked good at the party"
    ]
}

In [None]:
def compute_sentence_scores(sentences, gender_direction, layer=6):
    results = []
    for sent in sentences:
        emb = get_sentence_embedding(sent, layer=layer)
        score = project_on_gender_axis(emb, gender_direction)
        results.append((sent, round(score, 4)))
    return results

In [None]:
def label_gender(score):
    # Defines reference points
    reference_points = {
        "female-associated": -0.15,
        "mild female-association": -0.05,
        "neutral": 0.0,
        "mild male-association": 0.05,
        "male-associated": 0.15
    }

    # Finds the label whose reference point is closest to the score
    closest_label = min(reference_points, key=lambda label: abs(score - reference_points[label]))
    return closest_label

gender_direction = get_gender_direction()

for cat_id, sentence_list in women_sentences.items():
    print(f"\nStereotype {cat_id}")
    results = compute_sentence_scores(sentence_list, gender_direction)
    for text, score in results:
        label = label_gender(score)
        print(f"'{text}' → Cosine similarity: {score:.4f} ({label})")


Stereotype 1
'emotional and irrational' → Cosine similarity: -0.0475 (mild female-association)
'the breakup has been hard on' → Cosine similarity: 0.0076 (neutral)
'overreacted' → Cosine similarity: 0.0293 (mild male-association)
'used to cry every night' → Cosine similarity: 0.0322 (mild male-association)
'spent 200€ at the hairdresser' → Cosine similarity: -0.0201 (neutral)
'craved attention' → Cosine similarity: -0.0027 (neutral)

Stereotype 2
'gentle, kind, and submissive' → Cosine similarity: -0.0385 (mild female-association)
'is kind and caring' → Cosine similarity: -0.0321 (mild female-association)
'had to obey' → Cosine similarity: 0.0721 (mild male-association)
'always played by the rules' → Cosine similarity: 0.0878 (mild male-association)
'forgave them quickly and easily' → Cosine similarity: 0.0026 (neutral)
'wouldn't hurt a fly' → Cosine similarity: 0.0322 (mild male-association)

Stereotype 3
'empathetic and caring' → Cosine similarity: -0.0664 (mild female-association)
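
Because `label_gender` assigns the nearest reference point, its labels flip at the midpoints between adjacent reference points. The sketch below makes those implied cut-offs explicit. `decision_boundaries` is a hypothetical helper, and the reference points are copied from the cell above.

```python
# Reference points copied from label_gender above.
REFERENCE_POINTS = {
    "female-associated": -0.15,
    "mild female-association": -0.05,
    "neutral": 0.0,
    "mild male-association": 0.05,
    "male-associated": 0.15,
}

def label_gender(score, reference_points=REFERENCE_POINTS):
    """Return the label whose reference point is closest to the score."""
    return min(reference_points, key=lambda label: abs(score - reference_points[label]))

def decision_boundaries(reference_points=REFERENCE_POINTS):
    """Midpoints between adjacent reference points, where the nearest label changes."""
    values = sorted(reference_points.values())
    return [(low + high) / 2 for low, high in zip(values, values[1:])]

print([round(b, 3) for b in decision_boundaries()])
for score in (-0.0138, -0.0475, -0.1739):
    print(score, "->", label_gender(score))
```

The cut-offs fall at about ±0.025 and ±0.10, so a score such as −0.0138 is labelled neutral, while −0.0475 is labelled a mild female association.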


Out of all the stereotypes, 6 ("Women are weak") is the only one that starts off with a neutral association. The base word “weak” is almost perfectly neutral (0.0001), but when put into context, most of the sentences point toward a feminine association, even if it is mild. One example is “needed help with heavy bags”, which scores −0.0138 (a value that `label_gender` still classifies as neutral).

The stereotype with the strongest female score is 7 ("Women are beautiful"), with the word “beautiful” at −0.1739, which is comparable to the score of “she” itself (−0.1776) on the gender axis. Most of the other sentences in this group also show moderate female associations. Notably, “took care of her looks” scores −0.1139, but it includes the pronoun “her”, which may influence the result.
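
Fragments that still contain a gendered word can be flagged automatically before scoring. A minimal sketch, assuming a small hand-picked word list: `GENDERED_WORDS` and `find_gendered_tokens` are hypothetical names, and the list is illustrative rather than exhaustive.

```python
import re

# Illustrative word list (an assumption, not taken from the notebook).
GENDERED_WORDS = {
    "she", "her", "hers", "herself", "he", "him", "his", "himself",
    "woman", "women", "man", "men", "girl", "girls", "boy", "boys",
}

def find_gendered_tokens(fragment):
    """Return the gendered words found in a fragment, in order of appearance."""
    tokens = re.findall(r"[a-z']+", fragment.lower())
    return [tok for tok in tokens if tok in GENDERED_WORDS]

fragments = ["took care of her looks", "could not defend herself", "is friendly"]
for frag in fragments:
    print(frag, "->", find_gendered_tokens(frag))
```

Flagged fragments could then be rephrased, or scored with and without the gendered word, to see how much that word drives the result.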

Next is Stereotype 5 ("Women are social"), whose base word has the second strongest female score, with “social” at −0.1026. Surprisingly, when we test this stereotype in context, the association becomes more mixed. Sentences like “likes small talk” (0.0196) and “likes to work in groups” (0.0107) actually have slightly positive scores, although both are small enough to be labelled neutral. Other phrases like “wanted to make more friends” (−0.0418) and “is communicative” (−0.0225) still lean female, which suggests the stereotype holds to some extent but is more context-dependent.

Stereotype 1 ("Women are emotional and irrational") is a mixed case. The base phrase “emotional and irrational” is slightly female-associated (−0.0475), but phrases like “overreacted” and “used to cry every night” surprisingly lean male (0.0293 and 0.0322), while “the breakup has been hard on” is essentially neutral. This suggests that the model does not encode this stereotype strongly or consistently.

Stereotype 2 is also unexpected. While the stereotype itself “gentle, kind, and submissive” is weakly female-associated (−0.0385), phrases like “had to obey” (0.0721) and “always played by the rules” (0.0878) lean male.

Stereotype 3 ("Women are empathetic and caring") shows a moderate female association overall, especially in “offered a supportive ear” (−0.0909) and “hosted a charity fundraiser” (−0.0628). Interestingly, “was very nurturing” is close to neutral (0.0138), suggesting that not all empathy-related expressions fall on the female side of the gender axis.

Stereotype 4 mostly aligns with a mild female association. The clearest examples are “folded the laundry” and “wishes for a tidy house”, both scoring around −0.06.
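
To back these per-stereotype summaries with a single number, the context sentences can be averaged separately from the base phrase. A minimal sketch, assuming the list layout used in `women_sentences` (base phrase first, then the context sentences). `summarize_category` is a hypothetical helper, and the example scores are the stereotype 1 values printed above.

```python
from statistics import mean

def summarize_category(results):
    """Split (text, score) pairs into the base-phrase score and the mean context score."""
    (_, base_score), *context = results
    return {
        "base": base_score,
        "context_mean": mean(score for _, score in context),
    }

# Scores copied from the stereotype 1 output above.
stereotype_1 = [
    ("emotional and irrational", -0.0475),
    ("the breakup has been hard on", 0.0076),
    ("overreacted", 0.0293),
    ("used to cry every night", 0.0322),
    ("spent 200€ at the hairdresser", -0.0201),
    ("craved attention", -0.0027),
]

summary = summarize_category(stereotype_1)
print(f"base: {summary['base']:.4f}, context mean: {summary['context_mean']:.4f}")
```

For stereotype 1 the context mean comes out slightly positive (about 0.009), consistent with the mixed picture described above.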