# Tutorial 7-1: Auditing AI – "Bias in the Machine"

**Course:** CSEN 342: Deep Learning  
**Topic:** Algorithmic Bias, Word Embeddings, and AI Ethics

## Objective
We often think of AI models as objective mathematical tools. However, models trained on human data inevitably learn human biases. As discussed in the lecture, this can lead to harmful stereotypes being amplified in downstream applications (e.g., resume screening, search results).

In this tutorial, we will **audit** a standard AI component, word embeddings, to measure and visualize gender bias. We will:
1.  **Load GloVe Embeddings:** Use pre-trained vectors that represent words in a high-dimensional space.
2.  **Measure Bias:** Mathematically calculate the association between neutral professions (e.g., "programmer", "nurse") and gendered words.
3.  **Attempt a Fix:** Implement a geometric "debiasing" technique to remove this signal.

---

## Part 1: Getting the Data (GloVe)

We will use pre-trained **GloVe (Global Vectors for Word Representation)** embeddings. These vectors were trained on 6 billion tokens from Wikipedia and the Gigaword corpus; this tutorial loads the 50-dimensional version (`glove.6B.50d.txt`).

**Note:** The download is handled by the `download_glove_embeddings` helper from `utils`. We use `wget` to ensure this works on cluster compute nodes.

In [None]:
import numpy as np
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Import utility functions
import os
import sys
sys.path.append(os.path.abspath(os.path.join('..')))
from utils import download_glove_embeddings

download_glove_embeddings()

### 1.1 Loading Embeddings
We will load the vectors into a Python dictionary mapping `word -> vector`.

In [None]:
def load_glove(path):
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

print("Loading embeddings into memory...")
embeddings = load_glove('../data/glove.6B.50d.txt')
print(f"Loaded {len(embeddings)} words.")

# Convert to PyTorch tensors for easier math later
def get_vec(word):
    return torch.tensor(embeddings[word])

---

## Part 2: Vector Arithmetic (The Analogy Test)

Word embeddings capture semantic meaning. We can perform algebra on words.
The classic example: $Vector(King) - Vector(Man) + Vector(Woman) \approx Vector(Queen)$.

In [None]:
# exclude_words defaults to an immutable tuple to avoid Python's shared mutable-default pitfall
def find_closest(target_vec, n=1, exclude_words=()):
    # Brute force search (slow but simple for tutorial)
    # In production, use FAISS or ScaNN
    scores = []
    target_vec = target_vec / target_vec.norm() # Normalize
    
    # Restrict the search to the first 20,000 entries for speed. The GloVe file
    # lists more frequent words first, so this keeps the common vocabulary.
    search_space = list(embeddings.keys())[:20000]
    
    for word in search_space:
        if word in exclude_words: continue
        vec = torch.tensor(embeddings[word])
        cosine_sim = torch.dot(target_vec, vec / vec.norm())
        scores.append((word, cosine_sim.item()))
    
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:n]

# Test: King - Man + Woman = ?
v_king = get_vec("king")
v_man = get_vec("man")
v_woman = get_vec("woman")

target = v_king - v_man + v_woman
result = find_closest(target, exclude_words=["king", "man", "woman"])
print(f"king - man + woman = {result[0][0]} (Score: {result[0][1]:.2f})")

---

## Part 3: The Bias Audit

Now we investigate bias. We define a **Gender Axis** by subtracting the vector for "man" from the vector for "woman" (the pair "he"/"she" could be used in the same way).

$$ g = v_{woman} - v_{man} $$

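We then project various profession words onto this axis. Because both the word vector and the axis are normalized first, each score is a cosine similarity between -1 and 1. If the projection is positive, the model associates the word more with "woman"; if negative, with "man". Ideally, gender-neutral professions should score near 0.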
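
To make the sign convention concrete, here is a minimal sketch that uses made-up 3-dimensional vectors rather than real GloVe embeddings. The names `toy_man`, `toy_woman`, `toy_word_*`, `unit`, and `bias_score` are hypothetical and exist only for this illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def bias_score(vec, g):
    """Cosine of the angle between vec and the unit axis g."""
    return float(np.dot(unit(vec), g))

# Hypothetical 3-d stand-ins for real embeddings (values are made up).
toy_man = np.array([1.0, 0.0, 0.5])
toy_woman = np.array([-1.0, 0.0, 0.5])
toy_word_a = np.array([-0.6, 0.8, 0.0])  # leans toward toy_woman
toy_word_b = np.array([0.6, 0.8, 0.0])   # leans toward toy_man
toy_word_c = np.array([0.0, 1.0, 0.0])   # no component along the axis

# Unit gender axis, pointing from "man" toward "woman".
toy_g = unit(toy_woman - toy_man)

score_a = bias_score(toy_word_a, toy_g)  # positive: "woman" side
score_b = bias_score(toy_word_b, toy_g)  # negative: "man" side
score_c = bias_score(toy_word_c, toy_g)  # zero: orthogonal to the axis

print(f"a: {score_a:+.2f}  b: {score_b:+.2f}  c: {score_c:+.2f}")
```

The cell below applies the same computation to real GloVe vectors with `torch` instead of NumPy.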

In [None]:
# 1. Define the Gender Direction
gender_direction = get_vec("woman") - get_vec("man")
# Normalize it
gender_direction = gender_direction / gender_direction.norm()

# 2. List of professions to audit
professions = [
    "programmer", "engineer", "scientist", "doctor", "architect", "boss", "leader", # Historically male-skewed
    "nurse", "homemaker", "teacher", "artist", "secretary", "dancer", "receptionist" # Historically female-skewed
]

# 3. Calculate projections
projections = []
for p in professions:
    vec = get_vec(p)
    # Project onto gender direction: dot product
    score = torch.dot(vec / vec.norm(), gender_direction).item()
    projections.append((p, score))

# 4. Visualize
projections.sort(key=lambda x: x[1])
words, scores = zip(*projections)

plt.figure(figsize=(10, 8))
colors = ['red' if s > 0 else 'blue' for s in scores]
plt.barh(words, scores, color=colors)
plt.axvline(0, color='black', linewidth=1)
plt.title("Projection onto 'Man' <---> 'Woman' Axis")
plt.xlabel("Gender Bias Score (Negative=Male, Positive=Female)")
plt.show()

### Discussion
You should see a clear pattern. Words like "nurse" and "receptionist" likely have strong positive scores (associated with "woman"), while "engineer" and "boss" have negative scores (associated with "man"). 

**Why this matters:** If we use these embeddings to rank resumes for a "Programmer" job, the model might mathematically penalize resumes containing female-coded language simply because the vector for "programmer" leans toward "man" and away from "woman" along the gender direction.
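
The following sketch illustrates that mechanism under strong simplifying assumptions: a hypothetical ranker that scores a resume by cosine similarity to a job vector, and made-up 2-dimensional vectors in which axis 0 stands for skill content and axis 1 for a gender signal. None of these names or values come from a real system.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: [skill content, gender signal].
# The job vector leans toward the male side of the gender axis (negative).
job_vec = np.array([1.0, -0.4])

# Two resumes with identical skill content, differing only in gender-coded wording.
resume_male_coded = np.array([1.0, -0.3])
resume_female_coded = np.array([1.0, 0.3])

score_m = cosine(job_vec, resume_male_coded)
score_f = cosine(job_vec, resume_female_coded)

print(f"male-coded resume:   {score_m:.3f}")
print(f"female-coded resume: {score_f:.3f}")
```

In this toy setup the female-coded resume receives the lower score even though the skill component is the same, purely because of where the job vector sits on the gender axis.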

---

## Part 4: Geometric Debiasing

We can attempt to fix this mathematically. We want to remove the component of the word vector that points in the gender direction.

**The Formula:**
$$ w_{debiased} = w - (w \cdot g)\, g $$
Where $w$ is the word vector and $g$ is the unit gender direction vector.
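
A short derivation, assuming $g$ has unit length (so $g \cdot g = 1$), shows why the debiased vector has no remaining component along the gender direction:

$$
\begin{aligned}
w_{debiased} \cdot g &= \bigl(w - (w \cdot g)\, g\bigr) \cdot g \\
&= w \cdot g - (w \cdot g)(g \cdot g) \\
&= w \cdot g - w \cdot g = 0
\end{aligned}
$$

Rescaling $w_{debiased}$ to unit length afterwards only multiplies this dot product by a constant, so it stays at zero.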

In [None]:
def neutralize(word, g_direction):
    w = get_vec(word)
    # Calculate component in direction of g
    bias_component = torch.dot(w, g_direction) * g_direction
    # Remove it
    w_debiased = w - bias_component
    return w_debiased / w_debiased.norm()

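# Apply neutralization to all professions.
# Iterate over `words` (the sorted order from Part 3) so that the debiased bars
# line up with the original scores and the y-axis labels in the plot below.
debiased_projections = []
for p in words: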
    # Get neutralized vector
    w_clean = neutralize(p, gender_direction)
    # Re-calculate score
    score = torch.dot(w_clean, gender_direction).item()
    debiased_projections.append((p, score))

# Visualize Before vs After
plt.figure(figsize=(10, 8))
y_pos = np.arange(len(words))

plt.barh(y_pos - 0.2, scores, height=0.4, label='Original', color='gray', alpha=0.5)
plt.barh(y_pos + 0.2, [s for _, s in debiased_projections], height=0.4, label='Debiased', color='green')

plt.yticks(y_pos, words)
plt.axvline(0, color='black')
plt.legend()
plt.title("Impact of Geometric Debiasing")
plt.xlabel("Gender Bias Score")
plt.show()

### Conclusion
The green bars should be extremely close to 0. We have successfully flattened the gender dimension for these words.

**Critical Thinking:** 
Does this solve the problem? 
* **Yes:** The projection of "doctor" onto the gender direction is now zero, so along this axis "doctor" no longer leans toward "woman" or "man".
* **No:** Real-world bias is more complex than a single linear direction. There may be other hidden correlations (e.g., "doctor" might still be close to "football" while "nurse" is close to "softball"). 
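
The "No" case can be sketched with made-up 3-dimensional vectors in which axis 0 is the gender direction and axis 1 is a hypothetical proxy feature that happens to correlate with gender. The names `neutralize_vec`, `g_toy`, and `word_a` through `word_c` are illustrative assumptions, not part of the tutorial's code.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def neutralize_vec(w, g):
    """Remove the component of w along the unit direction g, then renormalize."""
    return unit(w - np.dot(w, g) * g)

# Hypothetical vectors (values made up): [gender, correlated proxy, other].
g_toy = np.array([1.0, 0.0, 0.0])
word_a = np.array([0.7, 0.6, 0.2])    # female-leaning, positive on the proxy axis
word_b = np.array([0.6, 0.7, 0.2])    # female-leaning, positive on the proxy axis
word_c = np.array([-0.7, -0.6, 0.2])  # male-leaning, negative on the proxy axis

clean_a, clean_b, clean_c = (neutralize_vec(w, g_toy) for w in (word_a, word_b, word_c))

# The projection audit now reports (near) zero for every word...
residual_scores = [float(np.dot(v, g_toy)) for v in (clean_a, clean_b, clean_c)]

# ...but words that shared a gender lean still sit close together.
sim_ab = float(np.dot(clean_a, clean_b))  # cosine similarity (unit vectors)
sim_ac = float(np.dot(clean_a, clean_c))

print("residual gender scores:", residual_scores)
print(f"sim(a, b) = {sim_ab:.2f}, sim(a, c) = {sim_ac:.2f}")
```

In this toy case every word passes the single-direction audit, yet the two originally female-leaning words remain near neighbors and far from the male-leaning one, so the grouping is still recoverable from the other dimensions.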

Debiasing is an ongoing area of research, but auditing your models like this is the first step.