<a href="https://colab.research.google.com/github/cynthiacxzhang/ucb-nlp-code/blob/main/BERT_experimentation_phase1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BERT Experimentation Phase 1**
Note: all of this has to be transferred into a Jupyter notebook later, but Colab is easier for testing and experimentation

Note: add details for what this notebook does

Header Comment:

# Project Summary: Gender Semantic Change in News Media with BERT

## Objective
Quantitatively measure how the meaning and representation of gender has shifted in Washington Post articles over time, using BERT-based contextual embeddings and a set of carefully curated gender antonym pairs.

## Pipeline Steps

1. **Data Cleaning & Preparation**
    - Start with preprocessed articles; lowercase all text, replace underscores with spaces, and remove digits and sentence-ending punctuation (only `.`, `?`, and `!`; other punctuation is kept).
    - Remove all stopwords except gender-denoting terms (see antonym pairs list).
    - Store cleaned text in `paragraphs_cleaned`.

2. **Time Period Splitting**
    - Divide corpus into two periods:
        - Period 1: 1977-1991
        - Period 2: 2010-2024
    - Save each as a separate cleaned dataset.

3. **Model Fine-Tuning**
    - Fine-tune two independent BERT models (e.g., `bert-base-uncased`), one on each time period's articles, using masked language modeling (MLM).
    - Save each model for downstream analysis.

4. **Word Embedding Extraction**
    - For each BERT model, extract vector embeddings for all gender antonym words (e.g., `he/she`, `man/woman`, ...).

5. **Procrustes Alignment**
    - Use orthogonal Procrustes to align the vector spaces of the two models for direct comparison.

6. **Semantic Axis Construction**
    - For each antonym pair: compute the difference vector (female - male).
    - Average these difference vectors to get the **gender semantic axis** for each time period.

7. **Analysis**
    - Compute cosine similarity between each individual antonym axis across the two periods.
    - Compute cosine similarity between the average (full) gender axis across periods.
    - Output results as a CSV: one row per pair and one for the average.

8. **(Optional) Extensions**
    - Analyze relationship between gender axis and other topics (e.g., sports, business).
    - Find nearest words/names to each gender pole.
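
Steps 6 and 7 can be sketched with toy vectors. This is a minimal illustration, assuming NumPy; the 3-dimensional vectors and the dictionaries `emb_old` and `emb_new` are made-up stand-ins for aligned BERT embeddings, not project data.

```python
import numpy as np

# Hypothetical (male, female) pairs and toy 3-d "embeddings" for two periods.
pairs = [("he", "she"), ("man", "woman")]
emb_old = {"he": np.array([1.0, 0.0, 0.0]), "she": np.array([0.0, 1.0, 0.0]),
           "man": np.array([1.0, 0.0, 1.0]), "woman": np.array([0.0, 1.0, 1.0])}
emb_new = {"he": np.array([1.0, 0.0, 0.0]), "she": np.array([0.0, 0.0, 1.0]),
           "man": np.array([1.0, 1.0, 0.0]), "woman": np.array([0.0, 1.0, 1.0])}

def cosine(a, b):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def gender_axis(emb, pairs):
    """Mean of the (female - male) difference vectors over all pairs."""
    return np.mean([emb[f] - emb[m] for m, f in pairs], axis=0)

# Step 7a: one similarity per antonym pair, comparing its axis across periods.
pair_sims = {
    f"{m}/{f}": cosine(emb_old[f] - emb_old[m], emb_new[f] - emb_new[m])
    for m, f in pairs
}

# Step 7b: similarity of the averaged gender axis across periods.
axis_sim = cosine(gender_axis(emb_old, pairs), gender_axis(emb_new, pairs))

print(pair_sims, axis_sim)
```

A similarity near 1 would indicate that an axis points in nearly the same direction in both periods; lower values indicate a larger shift.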

## Deliverables

- Cleaned, split datasets for each period
- Two fine-tuned BERT models (period-specific)
- Procrustes-aligned embedding spaces
- Cosine similarities for each gender antonym pair and the aggregate axis
- Exported results as a CSV file for reporting

## Research Question

**How has the semantic meaning and associations of gender, as measured by gender semantic axes in news article language, changed from 1977-1991 to 2010-2024?**


In [None]:
# mounting google drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# transformers and datasets provide the BERT classes, Trainer, and Dataset used below
!pip install transformers datasets

# 1. Data Preprocessing

In [None]:
# Data Imports

import pandas as pd
import numpy as np
import time
import datetime
import random
import os

In [None]:
# NLTK Setup

import re

import nltk
from nltk.corpus import stopwords        # NLTK stopword lists
from nltk.tokenize import word_tokenize

# download the required NLTK resources (only needs to run once per runtime)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer tables required by word_tokenize in newer NLTK releases

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
# load data - from WaPo Analysis folder, file after paragraph preprocessing for NLP

# current data for rough notebook is stored in drive/research/"file"
filepath = "/content/drive/MyDrive/UCB_Research/2_5pct_1977_to_2024.parquet"

# Create a CSV path next to the parquet file
csv_path = os.path.splitext(filepath)[0] + ".csv"

# Read parquet to csv - only once!!
# df = pd.read_parquet(filepath)   # requires pyarrow (available by default in Colab)
# df.to_csv(csv_path, index=False)

# print(f"Done! CSV saved to: {csv_path}")

csv_path = "/content/drive/MyDrive/UCB_Research/2_5pct_1977_to_2024.csv"


In [None]:
df = pd.read_csv(csv_path)
print(df.columns)

Index(['publish_date', 'id', 'source', 'section', 'title', 'kicker', 'blurb',
       'subhead', 'credit', 'edition', 'page', 'language', 'authors',
       'paragraphs', 'captions', 'year', 'month', 'day', 'paragraph_length',
       'paragraphs_no_illeg', 'paragraphs_clean', 'section_new'],
      dtype='object')


In [None]:
df.head()

Unnamed: 0,publish_date,id,source,section,title,kicker,blurb,subhead,credit,edition,...,authors,paragraphs,captions,year,month,day,paragraph_length,paragraphs_no_illeg,paragraphs_clean,section_new
0,1977-04-03 00:00:00.050,archive/outlook/1977/04/03/gromykos-complaint/...,The Washington Post,Outlook,Gromyko's Complaint,,,,,M2,...,,"FOR A MAN who, Nikita Khruschev once said, wou...",,1977,4,3,562,"FOR A MAN who, Nikita Khruschev once said, wou...",man nikita khruschev said would sit cake ice c...,opinion editorial
1,1977-10-07 00:00:00.040,archive/style/1977/10/07/when-theres-frost-on-...,The Washington Post,Style,When There's Frost on the Green Tomatoes . . .,,,,,M2,...,Tom Stevenson,The first heavy frost will kill tomato plants ...,,1977,10,7,517,The first heavy frost will kill tomato plants ...,first heavy frost kill tomato plants garden ru...,life & leisure
2,1977-09-19 00:00:00.040,archive/a-section/1977/09/19/dutch-sextuplets/...,The Washington Post,A Section,Dutch Sextuplets,,,,From staff reports and news dispatches,M2,...,,A Dutch woman gave birth to sextuplets and hos...,,1977,9,19,76,A Dutch woman gave birth to sextuplets and hos...,dutch woman gave birth sextuplets hospital off...,news
3,1977-12-22 00:00:00.050,archive/sports/1977/12/22/holtz-suspends-three...,The Washington Post,Sports,Holtz Suspends Three Arkansas Starters,,,,Washington Post Staff Writer,M2,...,Byron Rosen,Coach Lou Holtz made no bones about jobbying -...,,1977,12,22,338,Coach Lou Holtz made no bones about jobbying -...,coach lou holtz made bones jobbying help oklah...,sports
4,1977-05-05 00:00:00.040,archive/a-section/1977/05/05/mengistu-in-mosco...,The Washington Post,A Section,Mengistu in Moscow,,,,From staff reports and news dispatches,M2,...,,"Ethiopian head of state Mengistu Haile Mariam,...",,1977,5,5,78,"Ethiopian head of state Mengistu Haile Mariam,...",ethiopian head state mengistu haile mariam who...,news


In [None]:
df

Unnamed: 0,publish_date,id,source,section,title,kicker,blurb,subhead,credit,edition,...,paragraphs,captions,year,month,day,paragraph_length,paragraphs_no_illeg,paragraphs_clean,section_new,paragraphs_cleaned
0,1977-04-03 00:00:00.050,archive/outlook/1977/04/03/gromykos-complaint/...,The Washington Post,Outlook,Gromyko's Complaint,,,,,M2,...,"FOR A MAN who, Nikita Khruschev once said, wou...",,1977,4,3,562,"FOR A MAN who, Nikita Khruschev once said, wou...",man nikita khruschev said would sit cake ice c...,opinion editorial,"man , nikita khruschev said , would sit cake i..."
1,1977-10-07 00:00:00.040,archive/style/1977/10/07/when-theres-frost-on-...,The Washington Post,Style,When There's Frost on the Green Tomatoes . . .,,,,,M2,...,The first heavy frost will kill tomato plants ...,,1977,10,7,517,The first heavy frost will kill tomato plants ...,first heavy frost kill tomato plants garden ru...,life & leisure,first heavy frost kill tomato plants garden ru...
2,1977-09-19 00:00:00.040,archive/a-section/1977/09/19/dutch-sextuplets/...,The Washington Post,A Section,Dutch Sextuplets,,,,From staff reports and news dispatches,M2,...,A Dutch woman gave birth to sextuplets and hos...,,1977,9,19,76,A Dutch woman gave birth to sextuplets and hos...,dutch woman gave birth sextuplets hospital off...,news,dutch woman gave birth sextuplets hospital off...
3,1977-12-22 00:00:00.050,archive/sports/1977/12/22/holtz-suspends-three...,The Washington Post,Sports,Holtz Suspends Three Arkansas Starters,,,,Washington Post Staff Writer,M2,...,Coach Lou Holtz made no bones about jobbying -...,,1977,12,22,338,Coach Lou Holtz made no bones about jobbying -...,coach lou holtz made bones jobbying help oklah...,sports,coach lou holtz made bones jobbying - help okl...
4,1977-05-05 00:00:00.040,archive/a-section/1977/05/05/mengistu-in-mosco...,The Washington Post,A Section,Mengistu in Moscow,,,,From staff reports and news dispatches,M2,...,"Ethiopian head of state Mengistu Haile Mariam,...",,1977,5,5,78,"Ethiopian head of state Mengistu Haile Mariam,...",ethiopian head state mengistu haile mariam who...,news,"ethiopian head state mengistu haile mariam , w..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85485,2003-11-09 00:00:00.050,archive/style/2003/11/09/culture-club/577ff5e9...,The Washington Post,Style,Culture Club,,,Republic Gardens' New Owners Are Looking for t...,Washington Post Staff Writer,M2,...,He should have been impressed by the club's ne...,"Bert Robinson, right, co-owner of the recently...",2003,11,9,3115,He should have been impressed by the club's ne...,he impressed clubs new flat screen monitors le...,life & leisure,he impressed club 's new flat-screen monitors ...
85486,2003-07-31 00:00:00.040,archive/sports/2003/07/31/redskins-look-to-res...,The Washington Post,Sports,Redskins Look To Re-Sign Bailey,,,Deal May Be Set Before Season Opens,Washington Post Staff Writer,M2,...,Washington Redskins owner Daniel Snyder has to...,,2003,7,31,682,Washington Redskins owner Daniel Snyder has to...,washington redskins owner daniel snyder told c...,sports,washington redskins owner daniel snyder told c...
85487,2003-02-16 00:00:00.050,archive/extra-prince-william/2003/02/16/snow-t...,The Washington Post,Extra - Prince William,Snow Threat Ices Friday Basketball,,,Postponements Portend a Busy Week,Washington Post Staff Writer,M2,...,Throughout Prince William County on Friday nig...,,2003,2,16,684,Throughout Prince William County on Friday nig...,throughout prince william county friday night ...,local news,throughout prince william county friday night ...
85488,2003-09-25 00:00:00.040,archive/metro/2003/09/25/richard-estep-dick-la...,The Washington Post,Metro,Richard Estep 'Dick' Lankford ...,,,,,M2,...,Richard Estep 'Dick' Lankford_Maryland Congres...,,2003,9,25,2158,Richard Estep 'Dick' Lankford_Maryland Congres...,richard estep dick lankford maryland congressm...,local news,richard estep 'dick ' lankford maryland congre...


### **Paragraph Filtering**

- remove periods, commas, sentence-ending punctuation
- create a semantic axis (later on?)

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


# NLTK "english" stop words
default_stop = set(stopwords.words('english'))

# Gender antonym pairs, each ordered (male, female); flattened to a set below
antonym_pairs = [
    ("he", "she"), ("him", "her"), ("his", "her"), ("man", "woman"), ("men", "women"),
    ("male", "female"), ("himself", "herself"), ("hes", "shes"), ("son", "daughter"),
    ("sons", "daughters"), ("brother", "sister"), ("brothers", "sisters"),
    ("father", "mother"), ("fathers", "mothers"), ("boy", "girl"), ("boys", "girls"),
    ("husband", "wife"), ("husbands", "wives")
]
# Get all unique terms (flatten the pair list)
gender_stopwords = set([item for sublist in antonym_pairs for item in sublist])

# Remove gendered words from stopword set
filtered_stop = default_stop - gender_stopwords

def clean_text(text):
    """
    Text cleaning for gender analysis.
    - Lowercase
    - Remove underscores (_)
    - Remove specified punctuation (., ?, !)
    - Remove numbers
    - Remove stop words *except* for all gender-denoting words in antonym list
    """
    if pd.isna(text):
        return ""
    text = text.lower()
    text = text.replace("_", " ")
    text = re.sub(r'[.?!]', '', text)  # Only remove . ? !
    text = re.sub(r'[0-9]', '', text)  # Remove digits
    # Optionally, remove other unwanted characters here if needed

    words = word_tokenize(text)
    words = [word for word in words if word not in filtered_stop]
    return " ".join(words)
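
To see what the cleaning rules do to a sentence, here is a simplified, dependency-free sketch. It is not the function above: `toy_stop` is a tiny made-up stopword list standing in for NLTK's English list, and whitespace splitting replaces `word_tokenize`, so the tokenization of real text will differ.

```python
import re

# Assumption: a small illustrative stopword list, not NLTK's.
toy_stop = {"the", "a", "to", "he", "she", "her", "his", "was", "and"}
gender_terms = {"he", "she", "her", "his"}

# Gender terms are subtracted, so they are no longer filtered out as stopwords.
kept_stop = toy_stop - gender_terms

def toy_clean(text):
    text = text.lower().replace("_", " ")
    text = re.sub(r"[.?!]", "", text)   # drop sentence-ending punctuation only
    text = re.sub(r"[0-9]", "", text)   # drop digits
    tokens = text.split()               # whitespace split instead of word_tokenize
    return " ".join(t for t in tokens if t not in kept_stop)

print(toy_clean("She gave the award to her brother in 1977."))
# prints: she gave award her brother in
```

Commas, apostrophes, and hyphens are not in the removal pattern, so they stay in the cleaned text.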


In [None]:
# apply cleaning to "paragraphs" - sanity check

df["paragraphs_cleaned"] = df["paragraphs"].apply(clean_text)

### **Time Period Work**

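Split the corpus at the 30th and 70th percentiles of article publication year, keeping the earliest ~30% and the latest ~30% of articles.
* leave the middle years out as a "black box" gap, reserved for contextual analysis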

In [None]:
## NOTE: +-15 years


# If not already extracted, make sure 'year' exists:
df['year'] = pd.to_datetime(df['publish_date'], errors='coerce').dt.year

# Remove rows with invalid/missing years
df = df.dropna(subset=['year'])

# Compute 30th and 70th percentiles of year
q_30, q_70 = df['year'].quantile([0.3, 0.7])

# Select first 30% years
first_30 = df[df['year'] <= q_30]

# Select last 30% years
last_30 = df[df['year'] >= q_70]

# For reporting purposes, get the actual year ranges:
print("First 30% period (years):", int(first_30['year'].min()), "-", int(first_30['year'].max()))
print("Last 30% period (years):", int(last_30['year'].min()), "-", int(last_30['year'].max()))
print("First 30% articles:", len(first_30))
print("Last 30% articles:", len(last_30))

# Save for downstream tasks
first_30.to_csv("corpus_first_30pct.csv", index=False)
last_30.to_csv("corpus_last_30pct.csv", index=False)


First 30% period (years): 1977 - 1984
Last 30% period (years): 1995 - 2003
First 30% articles: 25812
Last 30% articles: 28388
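
The percentile cut-offs are year values, and every article published in a cut-off year is included, so each subset can hold more than 30% of the articles. A minimal sketch with NumPy and made-up years (the array `years` is hypothetical, not the corpus):

```python
import numpy as np

# Hypothetical publication years, one entry per article, with many ties.
years = np.array([1977] * 4 + [1980] * 2 + [1990] * 2 + [2003] * 2)

# Same cut-offs as the cell above: 30th and 70th percentiles of the year values.
q30, q70 = np.quantile(years, [0.3, 0.7])

first = years[years <= q30]   # earliest period, ties at the cut-off included
last = years[years >= q70]    # latest period, ties at the cut-off included

print(q30, q70, len(first) / len(years), len(last) / len(years))
```

Here both subsets contain 4 of 10 articles (40%), because all articles from the cut-off years 1977 and 1990 are kept.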


In [None]:
# ERROR: NO DATA AFTER 2003 IN THIS SAMPLE, SO PERIOD 2 (2010-2024) IS EMPTY

# # Example: split by publish year (assuming there's a 'publish_date' or 'year' column)
# df['year'] = pd.to_datetime(df['publish_date']).dt.year

# # Define time periods
# period1 = df[(df['year'] >= 1977) & (df['year'] <= 1991)]
# period2 = df[(df['year'] >= 2010) & (df['year'] <= 2024)]

# # Save/inspect splits
# period1.to_csv("corpus_1977_1991.csv", index=False)
# period2.to_csv("corpus_2010_2024.csv", index=False)

In [None]:
# ERROR: NO DATA AFTER 2003 IN THIS SAMPLE, SO PERIOD 2 (2010-2024) IS EMPTY

# # debugging check 1

# print("Period 1 (1977–1991):", len(period1), "articles")
# print("Date range:", period1["year"].min(), "-", period1["year"].max())
# print("Period 2 (2010–2024):", len(period2), "articles")
# print("Date range:", period2["year"].min(), "-", period2["year"].max())

Period 1 (1977–1991): 48047 articles
Date range: 1977 - 1991
Period 2 (2010–2024): 0 articles
Date range: nan - nan


In [None]:
# ERROR: NO DATA AFTER 2003 IN THIS SAMPLE, SO PERIOD 2 (2010-2024) IS EMPTY

# # debugging check 2

# # Check some sample rows:
# print(period1["paragraphs_clean"].sample(3).to_list())
# print(period2["paragraphs_clean"].sample(3).to_list())

# # Optionally, check for any residual forbidden characters:
# import re
# def check_cleanliness(text_series):
#     prob_rows = text_series[text_series.str.contains(r'[_\.?!\d]', regex=True, na=False)]
#     return prob_rows

# # Find any problematic rows
# problems_p1 = check_cleanliness(period1["paragraphs_clean"])
# problems_p2 = check_cleanliness(period2["paragraphs_clean"])
# print("Problem rows in Period 1:", len(problems_p1))
# print("Problem rows in Period 2:", len(problems_p2))


In [None]:
import sys
import csv

# Increase the field size limit
csv.field_size_limit(sys.maxsize)

# Now try reading the CSVs again
df_1 = pd.read_csv("corpus_first_30pct.csv", engine='python')
df_2 = pd.read_csv("corpus_last_30pct.csv", engine='python')

period1 = df_1
period2 = df_2

print("First 30% articles:", len(period1))
print("Last 30% articles:", len(period2))

First 30% articles: 25812
Last 30% articles: 28388


# 2. Training

### **Start BERT Experimentation**

Fine-tuning BERT on the first and last time period after split

In [None]:
# Pre Imports

import torch
import matplotlib.pyplot as plt

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
#logging.basicConfig(level=logging.INFO)

In [None]:
from transformers import BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import Dataset

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)

# Assuming period1 and period2 are Pandas DataFrames with 'paragraphs_cleaned'
train_dataset_p1 = Dataset.from_pandas(period1[['paragraphs_cleaned']])
train_dataset_p2 = Dataset.from_pandas(period2[['paragraphs_cleaned']])

def tokenize_function(example):
    return tokenizer(example["paragraphs_cleaned"], truncation=True, padding="max_length", max_length=128)

tokenized_p1 = train_dataset_p1.map(tokenize_function, batched=True)
tokenized_p2 = train_dataset_p2.map(tokenize_function, batched=True)

# Remove redundant columns to leave only tokenized features
tokenized_p1 = tokenized_p1.remove_columns(['paragraphs_cleaned'])
tokenized_p2 = tokenized_p2.remove_columns(['paragraphs_cleaned'])


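With `truncation=True` and `max_length=128`, any tokens beyond the first 128 of an article are dropped before training. One possible alternative, shown only as a sketch, is to split each article's token ids into consecutive windows so that later text is also seen. The helper `chunk_token_ids` is hypothetical and works on plain Python lists; the default of 126 is an assumption that leaves room for `[CLS]` and `[SEP]` within a 128-token input.

```python
def chunk_token_ids(token_ids, max_len=126):
    """Split a list of token ids into consecutive windows of at most max_len ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# A made-up "article" of 300 token ids.
ids = list(range(300))
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])
# prints: [126, 126, 48]
```

Each window would still need the special tokens and padding added before it is passed to the model.
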
In [None]:
# Fine Tuning Per Period

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args_1 = TrainingArguments(
    output_dir="./bert_finetune_p1",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    save_steps=10000,
    overwrite_output_dir=True,
    logging_steps=500
)

# Fine-tune for first period
model_p1 = BertForMaskedLM.from_pretrained(model_name)
trainer_p1 = Trainer(
    model=model_p1,
    args=training_args_1,
    train_dataset=tokenized_p1,
    data_collator=data_collator,
)

# Run Training on Period 1
trainer_p1.train()
trainer_p1.save_model('./period_1_tuned')

# Fine-tune for second period
training_args_2 = TrainingArguments(
    output_dir="./bert_finetune_p2",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    save_steps=10000,
    overwrite_output_dir=True,
    logging_steps=500
)

model_p2 = BertForMaskedLM.from_pretrained(model_name)
trainer_p2 = Trainer(
    model=model_p2,
    args=training_args_2,
    train_dataset=tokenized_p2,
    data_collator=data_collator,
)

# Run Training on Period 2
trainer_p2.train()
trainer_p2.save_model('./period_2_tuned')



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




Step,Training Loss
500,4.116
1000,3.8709
1500,3.7546
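
The training loss above is the masked-language-modeling loss. With `mlm_probability=0.15`, roughly 15% of non-special tokens are selected per sequence; of those, about 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged, and only the selected positions contribute to the loss. The following is a plain-Python sketch of that scheme, not the library's implementation; `mask_tokens` is a hypothetical helper, and the ids are those of `bert-base-uncased`.

```python
import random

MASK_ID = 103        # [MASK] in bert-base-uncased
VOCAB_SIZE = 30522   # bert-base-uncased vocabulary size
IGNORE = -100        # label value that the loss skips

def mask_tokens(token_ids, special_ids=(101, 102, 0), mlm_probability=0.15, rng=None):
    """Return (masked inputs, labels) following the 80/10/10 MLM scheme."""
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Never mask [CLS] (101), [SEP] (102) or [PAD] (0); select others at random.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                      # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID              # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels

example = [101, 2002, 2056, 2016, 2001, 2182, 102, 0, 0]
masked, targets = mask_tokens(example, mlm_probability=0.5, rng=random.Random(1))
print(masked, targets)
```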


# 3. Semantic Axes Construction

In [None]:
# Extract gender antonym embeddings

import torch
import numpy as np
from transformers import BertTokenizer, BertModel

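def get_static_embedding(word, model, tokenizer):
    # Context-free input: the word alone, encoded as [CLS] word-pieces [SEP]
    input_ids = tokenizer.encode(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(input_ids)
    # Average over all word-piece tokens (excluding [CLS] and [SEP]), so a word
    # that the tokenizer splits into several pieces is still fully represented
    return outputs.last_hidden_state[0, 1:-1, :].mean(dim=0).cpu().numpy()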

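A word that the tokenizer splits into several word pieces yields several hidden-state rows between `[CLS]` and `[SEP]`. The sketch below contrasts two ways of reducing them to one vector: taking the first piece only, or averaging all pieces. It uses a made-up NumPy array in place of a real `last_hidden_state`; `pool_subwords` is a hypothetical helper, and whether any term in the antonym list is actually split depends on the tokenizer vocabulary and is not checked here.

```python
import numpy as np

def pool_subwords(hidden, strategy="mean"):
    """Reduce hidden states of shape [seq_len, dim] to one word vector.

    Row 0 is assumed to be [CLS] and the last row [SEP]; the rows in between
    are the word pieces of a single word.
    """
    pieces = hidden[1:-1]
    if strategy == "first":
        return pieces[0]
    if strategy == "mean":
        return pieces.mean(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

# Toy hidden states: [CLS], two word pieces, [SEP]; dimension 2.
hidden = np.array([[9.0, 9.0], [1.0, 0.0], [3.0, 2.0], [9.0, 9.0]])

print(pool_subwords(hidden, "first"))  # vector of the first piece only
print(pool_subwords(hidden, "mean"))   # average of both pieces
```

For a word that maps to a single piece, both strategies return the same vector.
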
In [None]:
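# Procrustes alignment
# NOTE: run this cell only after X1, X2 and word_list have been built
# (see the embedding-extraction cell in section 4); running it earlier
# raises "NameError: name 'X2' is not defined".

from scipy.linalg import orthogonal_procrustes

# X1: matrix of vectors from Model 1 (earlier period), one row per word
# X2: matrix of vectors from Model 2 (later period), same row order as X1

# R is the orthogonal matrix that minimizes ||X2 @ R - X1||
R, _ = orthogonal_procrustes(X2, X1)
X2_aligned = X2 @ R

# Aligned period-2 vectors keyed by word, as used by the cosine-similarity cell
embed_new_aligned = {word: X2_aligned[i] for i, word in enumerate(word_list)}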

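Orthogonal Procrustes looks for the rotation (or reflection) `R` that brings one set of vectors as close as possible to another. The sketch below is a NumPy-only version of the standard SVD solution, applied to synthetic data in which the second space is an exact rotation of the first; `procrustes_rotation` is a hypothetical helper, and the random matrices are not project embeddings.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F, via the SVD of A.T @ B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)

# 20 synthetic "word vectors" in 4 dimensions for the earlier period.
X1 = rng.normal(size=(20, 4))

# A random orthogonal matrix Q; the later space is X1 rotated by Q.T.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
X2 = X1 @ Q.T

# Recover the rotation that maps the later space back onto the earlier one.
R = procrustes_rotation(X2, X1)
X2_aligned = X2 @ R

print(np.allclose(X2_aligned, X1), np.allclose(R.T @ R, np.eye(4)))
```

In this notebook the matrices have one row per gender term, far fewer rows than the 768 dimensions of `bert-base-uncased`, so the rotation is only loosely constrained. Aligning on a larger shared vocabulary is one alternative worth considering.
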
In [None]:
# Cosine Similarity
# - compares each gender axis (female - male) across the two periods
# - requires embed_old and embed_new_aligned (word -> vector dicts) to exist

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

results = []
# antonym_pairs are stored as (male, female), e.g. ("he", "she")
for (m, f) in antonym_pairs:
    axis_old = embed_old[f] - embed_old[m]
    axis_new = embed_new_aligned[f] - embed_new_aligned[m]
    sim = cosine_similarity(axis_old, axis_new)
    results.append({'pair': f"{m}/{f}", 'cosine_similarity': sim})

# Average/overall axis calculation
axis_old = np.mean([embed_old[f] - embed_old[m] for (m, f) in antonym_pairs], axis=0)
axis_new = np.mean([embed_new_aligned[f] - embed_new_aligned[m] for (m, f) in antonym_pairs], axis=0)
average_sim = cosine_similarity(axis_old, axis_new)
results.append({'pair': 'average_gender_axis', 'cosine_similarity': average_sim})

# Save as CSV
import pandas as pd
df_results = pd.DataFrame(results)
df_results.to_csv("gender_axis_cosine_similarities.csv", index=False)

In [None]:
# using the gendered pairs

# 4. Hyperparameter Testing

In [None]:
# Load the fine-tuned models from the directories written by trainer.save_model()
# NOTE: if a directory is missing (e.g. training did not finish or the Colab
# runtime was reset), from_pretrained most likely treats the string as a Hub
# repo id and fails with "HFValidationError: Repo id must use alphanumeric chars ...".
model_dirs = {'p1': './period_1_tuned', 'p2': './period_2_tuned'}
for path in model_dirs.values():
    if not os.path.isdir(path):
        raise FileNotFoundError(f"{path} not found; run the fine-tuning cell first")

model_p1_tuned = BertModel.from_pretrained(model_dirs['p1'], local_files_only=True)
model_p2_tuned = BertModel.from_pretrained(model_dirs['p2'], local_files_only=True)

# Extract embeddings for every gender term from each model
embed_old = {}
embed_new = {}

for word in gender_stopwords:
    embed_old[word] = get_static_embedding(word, model_p1_tuned, tokenizer)
    embed_new[word] = get_static_embedding(word, model_p2_tuned, tokenizer)

# Create matrices X1 and X2 for Procrustes alignment
# Sort the words so the row order is identical in both matrices and across runs
word_list = sorted(gender_stopwords)
X1 = np.array([embed_old[word] for word in word_list])
X2 = np.array([embed_new[word] for word in word_list])

print("Shape of X1 (Period 1 embeddings):", X1.shape)
print("Shape of X2 (Period 2 embeddings):", X2.shape)