
# Import Packages

In [2]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=67f94789404114129f8cf4b2b4840aba49deeecffbca3156e62b7e3794da6553
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [3]:
import json
import os

from joblib import Parallel, delayed
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from pyspark.context import SparkContext
from pyspark.ml.feature import Word2Vec
from pyspark.sql import Row, SparkSession
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load Files

These files contain a sample of **social media posts** from the paper *RtGender: A Corpus for Studying Differential Responses to Gender* by Rob Voigt, David Jurgens, Vinodkumar Prabhakaran, Dan Jurafsky, and Yulia Tsvetkov. Documentation is available [here](https://nlp.stanford.edu/robvoigt/rtgender/). The sample includes an equal number of posts from each of the five data sources, balanced on the gender of the original poster. Sampling with replacement was used so that smaller sources are adequately represented.
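The balanced sampling described above can be sketched as follows. This is a minimal illustration under assumptions, not the code used to build the files: the helper name `balanced_sample`, the record layout, and the group size are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(records, n_per_group, seed=0):
    """Draw n_per_group records, with replacement, from every (source, op_gender) group."""
    rng = random.Random(seed)

    # Group records by the two balancing keys
    groups = defaultdict(list)
    for record in records:
        groups[(record['source'], record['op_gender'])].append(record)

    # Sampling with replacement lets small groups reach the target size
    sample = []
    for key in sorted(groups):
        sample.extend(rng.choices(groups[key], k=n_per_group))
    return sample


# Toy records (hypothetical); the TED groups contain a single post each
records = [
    {'source': 'RED', 'op_gender': 'M', 'response_text': 'a'},
    {'source': 'RED', 'op_gender': 'M', 'response_text': 'b'},
    {'source': 'RED', 'op_gender': 'W', 'response_text': 'c'},
    {'source': 'RED', 'op_gender': 'W', 'response_text': 'd'},
    {'source': 'TED', 'op_gender': 'M', 'response_text': 'e'},
    {'source': 'TED', 'op_gender': 'W', 'response_text': 'f'},
]
sample = balanced_sample(records, n_per_group=3)
```

Every source and gender combination contributes the same number of rows, so a group with a single post is simply repeated.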

In [None]:
filepath = '/content/drive/MyDrive/SIADS 696: Milestone II/Project/Data/RtGender/sample'
filepath_train = os.path.join(filepath, 'train_one_million.csv')
filepath_validate = os.path.join(filepath, 'validate_one_million.csv')
filepath_test = os.path.join(filepath, 'test_one_million.csv')

In [None]:
# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

# Load data, rename columns, drop nulls and a specific column
dataframe_train = (
    spark
    .read
    .csv(filepath_train, header=True)
    .withColumnRenamed(' op_gender', 'op_gender')
    .withColumnRenamed(' response_text', 'response_text')
    .dropna()
    .drop('stratify')
)

# Display the first five rows of the DataFrame
dataframe_train.show(5)

+------+---------+---------+--------------------+
|source|source_id|op_gender|       response_text|
+------+---------+---------+--------------------+
|   FIT|FIT262793|        M|oure welcome. Hop...|
|   FIT|FIT203911|        M|           Thank you|
|   TED|TED129836|        M|As someone that w...|
|   TED|TED101922|        M|                Neat|
|   RED|RED166641|        W|Seriously. My bat...|
+------+---------+---------+--------------------+
only showing top 5 rows



This file contains the **stop words** available in the Natural Language Toolkit. Gendered pronouns have been removed.

In [None]:
filepath_stopwords = '/content/drive/MyDrive/SIADS 696: Milestone II/Project/Data/RtGender/stop_words.txt'

In [None]:
# Load stopwords
with open(filepath_stopwords, 'r') as file:
    stopwords = json.load(file)['stop_words']
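For context, a stop word file with this layout could be produced roughly as shown below. This is only a sketch: the short word list and the pronoun set are illustrative stand-ins, and the only detail taken from the notebook is the top-level `stop_words` key.

```python
import json

# Illustrative subset of stop words (the real list comes from NLTK)
base_stop_words = ['the', 'a', 'he', 'she', 'him', 'her', 'and', 'his', 'hers', 'of']

# Hypothetical set of gendered pronouns that should stay in the text
gendered_pronouns = {'he', 'she', 'him', 'her', 'his', 'hers', 'himself', 'herself'}

# Drop gendered pronouns from the list so they survive stop word removal
filtered = [word for word in base_stop_words if word not in gendered_pronouns]

# Serialize with the same top-level key that the loading cell reads
payload = json.dumps({'stop_words': filtered})
restored = json.loads(payload)['stop_words']
```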

This file contains **nouns** from the HolisticBias dataset, a project of the Responsible Natural Language Processing team at Facebook Research. The dataset is described in the paper *I'm sorry to hear that: Finding New Biases in Language Models with a Holistic Descriptor Dataset* by Eric Michael Smith, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. Documentation is available [here](https://github.com/facebookresearch/ResponsibleNLP/tree/main/holistic_bias/dataset/v1.1).

In [None]:
filepath_nouns = '/content/drive/MyDrive/SIADS 696: Milestone II/Project/Data/RtGender/gendered_nouns.txt'
filepath_pronouns = '/content/drive/MyDrive/SIADS 696: Milestone II/Project/Data/RtGender/gendered_pronouns.txt'

In [None]:
# Load gendered nouns
with open(filepath_nouns, 'r') as file:
    nouns = json.load(file)

nouns_male = ' '.join([item for sublist in nouns['male'] for item in sublist])
nouns_female = ' '.join([item for sublist in nouns['female'] for item in sublist])

This file contains **pronouns** from Grammarly as described in the article *A Guide to Personal Pronouns and How They've Evolved*. The article includes additional neopronouns, pronouns that “refer to people entirely without reference to gender” (Grammarly, 2021). Documentation is available [here](https://www.grammarly.com/blog/gender-pronouns/).

In [None]:
# Load gendered pronouns
with open(filepath_pronouns, 'r') as file:
    pronouns = json.load(file)

pronouns_male = ' '.join(pronouns['male'])
pronouns_female = ' '.join(pronouns['female'])

# Tokenize Text

In [None]:
class LemmaTokenizer:
    """
    A tokenizer class that optionally applies NLTK's WordNetLemmatizer to both tokens and stop words,
    and removes stop words based on a custom JSON file. The class can be configured to perform
    lemmatization, stop word removal, both, or neither, ensuring consistency between token and stop word
    processing. Based on code developed by Daniel Atallah.
    """

    def __init__(self, use_lemmatization=False, remove_stopwords=False, stopwords_file=None):
        """
        Initializes the LemmaTokenizer instance with options for lemmatization and stop word removal,
        and loads (and optionally lemmatizes) stop words from a specified JSON file if stop word removal
        is enabled.
        """
        self.use_lemmatization = use_lemmatization
        self.remove_stopwords = remove_stopwords and stopwords_file is not None
        self.lemmatizer = WordNetLemmatizer() if use_lemmatization else None
        self.tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
        self.stop_words = self._load_stopwords(stopwords_file) if self.remove_stopwords else set()

    def _load_stopwords(self, stopwords_file):
        """
        Loads and optionally lemmatizes stop words from a JSON file.
        """
        with open(stopwords_file, 'r') as file:
            # The stop words are stored under the 'stop_words' key of the JSON file
            stopwords = set(json.load(file)['stop_words'])
        if self.use_lemmatization:
            return {self.lemmatizer.lemmatize(word) for word in stopwords}
        return stopwords

    def __call__(self, text):
        """
        Tokenizes and optionally lemmatizes and removes stop words from the input text.
        """
        tokens = self.tokenizer.tokenize(text)
        if self.use_lemmatization:
            tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        if self.remove_stopwords:
            tokens = [token for token in tokens if token.lower() not in self.stop_words]
        return tokens

In [None]:
def apply_tokenizer(row, text_column, tokenizer):
    """
    Tokenizes text in a specific column of a row.

    Parameters:
    - row: The row containing the text to tokenize.
    - text_column: The name of the column containing the text.
    - tokenizer: An instance of a tokenizer class, used to tokenize the text.

    Returns:
    A new Row object with the original content and an additional 'response_tokens' field.
    """
    # Tokenize the content of the specified text column
    tokens = tokenizer(row[text_column])

    # Return a new Row with the original row data and the new 'response_tokens' field
    return Row(**row.asDict(), response_tokens=tokens)

In [None]:
# Instantiate tokenizer
tokenizer = LemmaTokenizer()

In [None]:
# Tokenize each row through the RDD API, then convert the result back to a DataFrame
rdd_train = dataframe_train.rdd.map(lambda row: apply_tokenizer(row, 'response_text', tokenizer)).toDF()

In [None]:
# Tokenize gendered nouns and pronouns
tokens_nouns_male = tokenizer(nouns_male)
tokens_nouns_female = tokenizer(nouns_female)
tokens_pronouns_male = tokenizer(pronouns_male)
tokens_pronouns_female = tokenizer(pronouns_female)

# Train Model

In [None]:
# Instantiate Word2Vec in PySpark
wtv = Word2Vec(inputCol='response_tokens', outputCol='model', numPartitions=4)

In [None]:
# Fit data
model = wtv.fit(rdd_train)

In [None]:
model.getVectors().show()

+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|    professed|[9.20962076634168...|
|    pathogens|[0.05213452130556...|
|  advertencia|[-0.0230888389050...|
|     quotient|[0.02727679535746...|
|     incident|[-0.1272087246179...|
|synchronicity|[-0.0107862930744...|
|         buns|[-0.0331519544124...|
|      serious|[0.07881371676921...|
|        brink|[0.03099950402975...|
|         9/11|[0.10745714604854...|
|      acronym|[-0.0113178882747...|
|    foolproof|[0.03159939497709...|
|     youthful|[-0.0086543830111...|
|     sinister|[0.02207533083856...|
|       comply|[0.12050931155681...|
|        u0623|[0.07430828362703...|
|       breaks|[0.08198492228984...|
|        mesmo|[-0.0525239780545...|
|    subreddit|[-0.1006804108619...|
|          dns|[-0.0173437874764...|
+-------------+--------------------+
only showing top 20 rows



In [None]:
model.save('/content/drive/MyDrive/SIADS 696: Milestone II/Project/Models/initial_pyspark.model')

In [None]:
model.transform(rdd_train).head().model

DenseVector([0.0221, -0.0357, -0.1232, -0.0486, -0.1217, 0.0534, -0.2227, -0.1274, -0.066, -0.0054, -0.1524, -0.2124, -0.1227, 0.1874, 0.0208, 0.1111, -0.1775, 0.1581, 0.0088, 0.0062, -0.2181, -0.2011, 0.1006, 0.1279, 0.0911, -0.1285, -0.0128, 0.0132, 0.1514, -0.0339, 0.1665, -0.0664, 0.0891, -0.1419, 0.0577, -0.1358, -0.0763, -0.0473, -0.0999, 0.0603, 0.1182, 0.0541, -0.152, -0.0483, -0.0233, 0.05, -0.0262, 0.2023, 0.0367, -0.0621, 0.0328, 0.0207, -0.1272, -0.056, -0.0632, 0.0667, -0.0892, -0.09, 0.1078, -0.0839, 0.0822, 0.1061, -0.1389, -0.0053, 0.0502, -0.0267, 0.0864, -0.1637, -0.0572, -0.0763, -0.0304, 0.1591, 0.0447, 0.0632, -0.1084, -0.0133, -0.0009, 0.1762, 0.0375, 0.017, 0.0808, -0.0007, 0.0192, -0.0021, 0.0728, -0.1267, -0.1041, -0.0186, -0.0079, 0.0266, -0.0586, 0.1526, -0.0376, 0.0623, 0.0961, -0.153, -0.1867, -0.1506, -0.1198, -0.0786])
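The vector above is a document embedding: PySpark's `Word2VecModel.transform` represents each document by averaging the vectors of its words. A minimal NumPy sketch of that idea follows; the 3-dimensional vectors and the `document_vector` helper are made up for illustration, and out-of-vocabulary handling is left out.

```python
import numpy as np

# Made-up 3-dimensional word vectors (hypothetical values)
vectors = {
    'thank': np.array([1.0, 0.0, 2.0]),
    'you': np.array([3.0, 2.0, 0.0]),
}


def document_vector(tokens, vectors, size=3):
    """Average the vectors of the tokens; an empty document maps to zeros."""
    if not tokens:
        return np.zeros(size)
    return np.mean([vectors[token] for token in tokens], axis=0)


doc = document_vector(['thank', 'you'], vectors)
```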

# Calculate Bias

Garg et al. (2018) use a different approach to assess the association between a set of neutral words and two groups: for each neutral word, they take the difference between its distances to the two group vectors, then sum these differences across words. That approach gives equal weight to each word, unlike the approach below, which compares the mean embedding of each response with each group vector. The paper is available [here](https://pubmed.ncbi.nlm.nih.gov/29615513/).
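The per-word measure described above can be sketched in NumPy as follows. This is an illustration of the description in this notebook rather than the authors' code; the function name and the toy 2-dimensional vectors are assumptions.

```python
import numpy as np


def relative_norm_distance(neutral_vectors, group_one_vector, group_two_vector):
    """Sum, over neutral words, of the distance to group one minus the distance to group two."""
    neutral_vectors = np.asarray(neutral_vectors, dtype=float)
    distance_one = np.linalg.norm(neutral_vectors - group_one_vector, axis=1)
    distance_two = np.linalg.norm(neutral_vectors - group_two_vector, axis=1)
    return float(np.sum(distance_one - distance_two))


# Toy 2-dimensional vectors (hypothetical values)
male_vector = np.array([0.0, 0.0])
female_vector = np.array([3.0, 4.0])
neutral = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]

score = relative_norm_distance(neutral, male_vector, female_vector)
```

With this sign convention, a negative score means the neutral words sit closer to the first group overall, and each neutral word contributes one term to the sum.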

In [None]:
import numpy as np
import pandas as pd

def add_bias(dataframe, token_column, male_tokens, female_tokens, model):
    """
    Calculate bias scores for text data in a DataFrame based on the difference in distances
    from male and female token embeddings.

    Parameters:
    - dataframe (pd.DataFrame): DataFrame containing the text data.
    - token_column (str): Column name containing the lists of tokens.
    - male_tokens (list of str): List of tokens associated with male attributes.
    - female_tokens (list of str): List of tokens associated with female attributes.
    - model: Model with a `get_mean_vector` method to compute embeddings.

    Returns:
    - pd.DataFrame: DataFrame with an additional 'bias' column.
    """
    # Compute the mean embedding of each gendered token list once
    male_vector = model.get_mean_vector(male_tokens)
    female_vector = model.get_mean_vector(female_tokens)

    def calculate_bias(tokens):
        # Embed the response as the mean vector of its tokens
        embedding = model.get_mean_vector(tokens)
        # Negative values: the response is closer to the male vector; positive: closer to the female vector
        bias = np.linalg.norm(male_vector - embedding) - np.linalg.norm(female_vector - embedding)
        return bias

    # Apply the combined operation, ensuring tokens are passed correctly to calculate_bias
    dataframe['bias'] = dataframe[token_column].apply(calculate_bias)

    return dataframe

In [None]:
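# Note: add_bias expects a pandas DataFrame with a column of token lists and a model that
# exposes `get_mean_vector` (for example, gensim KeyedVectors), not the PySpark objects above.
add_bias(dataframe_train, 'tokens', tokens_pronouns_male, tokens_pronouns_female, model.wv)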

| | source | source_id | op_gender | response_text | stratify | tokens | bias |
|---|---|---|---|---|---|---|---|
| 0 | TED | TED5828 | W | Beautiful... If only more people could see thi... | TEDW | [beautiful, ..., if, only, more, people, could... | -0.054628 |
| 1 | RED | RED906982 | M | Idk man. Cubs striking has looked NASTY lately... | REDM | [idk, man, ., cubs, striking, has, looked, nas... | -0.053548 |
| 2 | FBW | FBW3327456 | W | Having a hard time right now!! Im right there😥 | FBWW | [having, a, hard, time, right, now, !, !, im, ... | -0.057289 |
| 3 | FIT | FIT189959 | W | Welcome! Fitocracy is a great place to track y... | FITW | [welcome, !, fitocracy, is, a, great, place, t... | -0.052358 |
| 4 | RED | RED650418 | W | Ich hab Sims4 deinstalliert nachdem ich eine H... | REDW | [ich, hab, sims, 4, deinstalliert, nachdem, ic... | -0.053050 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3499995 | RED | RED382010 | M | Bloody sex is the best sex behind Chloroform sex. | REDM | [bloody, sex, is, the, best, sex, behind, chlo... | -0.054982 |
| 3499996 | FBW | FBW4664155 | W | Adorable | FBWW | [adorable] | -0.054897 |
| 3499997 | RED | RED1203857 | W | I have been to 2 gynecologists. I have no phy... | REDW | [i, have, been, to, 2, gynecologists, ., i, ha... | -0.051896 |
| 3499998 | FIT | FIT21012 | M | hahaahaah | FITM | [hahaahaah] | -0.045187 |


This function uses Euclidean distance instead of cosine similarity. The advantage of cosine similarity is that it is normalized: it depends only on the angle between two vectors, not on their magnitudes. However, because the number of male and female nouns in the HolisticBias dataset is similar, a normalized measure is not strictly necessary, particularly if it comes at the cost of computational efficiency. Garg et al. (2018) also use Euclidean distance.
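The difference between the two measures can be seen on a small example. The sketch below uses made-up 2-dimensional vectors and a hypothetical `cosine_similarity` helper: scaling a vector changes its Euclidean distance to another vector but leaves the cosine similarity unchanged.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
b_scaled = 10 * b  # same direction as b, larger magnitude

# Euclidean distance grows when b is scaled
euclidean = float(np.linalg.norm(a - b))
euclidean_scaled = float(np.linalg.norm(a - b_scaled))

# Cosine similarity is unaffected by the scaling
cosine = cosine_similarity(a, b)
cosine_scaled = cosine_similarity(a, b_scaled)
```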

# Reduce Dimensions

In [None]:
def train_vectorizer(text_data, vectorizer=TfidfVectorizer, tokenizer=TweetTokenizer()):
    """
    Trains a vectorizer on the provided text data and returns the vectorizer instance,
    the document-term matrix, and the feature names.

    Parameters:
    - text_data: List of text documents to be vectorized.
    - vectorizer: Vectorizer class to be used for text vectorization. Defaults to TfidfVectorizer.
    - tokenizer: Tokenizer instance used for tokenizing the text documents. Defaults to a TweetTokenizer instance.

    Returns:
    - instance: The trained vectorizer instance.
    - matrix: The document-term matrix resulting from fitting the vectorizer on `text_data`.
    - features: An array of feature names generated by the vectorizer.
    """
    # Initialize the vectorizer with specified configurations
    instance = vectorizer(
        strip_accents=None,  # Do not strip accents
        lowercase=False,  # Do not convert characters to lowercase
        tokenizer=tokenizer.tokenize,  # Use the tokenize method of the tokenizer instance
        token_pattern=None,  # Since a tokenizer is provided, token_pattern is not used
        stop_words=list(stopwords),  # Remove the custom stop words loaded above
        ngram_range=(1, 1),  # Consider only single words (1-grams)
        min_df=0.01,  # Minimum document frequency for filtering terms
        max_df=0.99,  # Maximum document frequency for filtering terms
        max_features=None  # No limit on the number of features
    )

    # Fit the vectorizer on the provided text data and transform the data into a matrix
    matrix = instance.fit_transform(text_data)

    # Retrieve the feature names generated by the vectorizer
    features = instance.get_feature_names_out()

    return instance, matrix, features
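When `min_df` and `max_df` are floats, they are read as proportions of documents, so the settings above keep terms that appear in at least 1% and at most 99% of documents. A rough plain-Python sketch of that filter is shown below; it is not scikit-learn's implementation, and the helper name and toy documents are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=0.01, max_df=0.99):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)

    # Count the number of documents containing each term (not total occurrences)
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))

    return sorted(
        term for term, count in document_frequency.items()
        if min_df <= count / n_docs <= max_df
    )


docs = [['thank', 'you'], ['you', 'rock'], ['you', 'are', 'welcome'], ['thank', 'you', 'all']]
kept = filter_by_document_frequency(docs, min_df=0.5, max_df=0.99)
```

In this toy corpus, `you` appears in every document and is dropped by `max_df`, while the terms that appear in only one document are dropped by `min_df`.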

In [None]:
def train_svd(matrix, n_components=2, random_state=42):
    """
    Trains a Truncated Singular Value Decomposition (SVD) model on the given matrix.

    Parameters:
    - matrix: The input matrix to decompose.
    - n_components: Number of components to keep.
    - random_state: Seed for the random number generator.

    Returns:
    - A tuple containing the trained SVD model, the term-topic matrix (n_terms x n_components),
      the document-topic matrix (n_components x n_documents, taken from `components_`),
      and the array of singular values.
    """
    svd = TruncatedSVD(n_components=n_components, random_state=random_state)
    model = svd.fit(np.transpose(matrix))
    term_topic_matrix = svd.transform(np.transpose(matrix))
    document_topic_matrix = svd.components_
    singular_values = svd.singular_values_

    return model, term_topic_matrix, document_topic_matrix, singular_values
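Because the matrix is transposed before fitting, terms are the rows being projected and documents play the role of features. The shapes can be checked with a small sketch that uses plain `numpy.linalg.svd` as a stand-in for `TruncatedSVD`; the toy counts are hypothetical.

```python
import numpy as np

# Toy document-term matrix: 4 documents (rows) x 3 terms (columns), hypothetical counts
document_term = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 0.0],
])

k = 2
term_document = document_term.T  # terms become rows, as in train_svd

# Full SVD, truncated to the k largest singular values
u, s, vt = np.linalg.svd(term_document, full_matrices=False)
term_topic = u[:, :k] * s[:k]   # shape (n_terms, k)
topic_document = vt[:k, :]      # shape (k, n_documents)

# Projecting the input onto the components reproduces the term-topic matrix
projected = term_document @ topic_document.T
```

The second factor has one row per component and one column per document, which is why `plot_svd` indexes rows 0 and 1 of the document-topic matrix.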

# Visualize Data

In [None]:
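def plot_hist(dataframe, gender_column='op_gender', bias_column='bias'):
    """
    Plots overlaid, density-normalized histograms of bias scores by original poster gender.

    Parameters:
    - dataframe: DataFrame containing the gender and bias columns.
    - gender_column: Name of the column with gender labels ('M' for male, 'W' for female).
    - bias_column: Name of the column with bias scores.

    Returns:
    - None
    """
    fig, ax = plt.subplots()

    ax.hist(dataframe[dataframe[gender_column] == 'M'][bias_column], bins=100, density=True, alpha=0.5, label='Original Poster Male')
    ax.hist(dataframe[dataframe[gender_column] == 'W'][bias_column], bins=100, density=True, alpha=0.5, label='Original Poster Female')

    ax.set_title('Response Bias')
    ax.set_xlabel('Calculated Bias')
    ax.set_ylabel('Density')
    ax.legend()

    plt.show()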

In [None]:
def plot_svd(document_topic_matrix, mask):
    """
    Plots the SVD (Singular Value Decomposition) results, separating points by gender based on a mask.

    Parameters:
    - document_topic_matrix: The document-topic matrix obtained from SVD.
    - mask: An array of gender labels ('M' for male, 'W' for female) for each document.

    Returns:
    - None
    """
    # Boolean masks selecting the documents for each gender label
    mask_male = np.asarray(mask) == 'M'
    mask_female = np.asarray(mask) == 'W'

    fig, axs = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True, tight_layout=True)

    axs[0].scatter(document_topic_matrix[0][mask_male],
                   document_topic_matrix[1][mask_male],
                   alpha=0.1, color='C0')
    axs[0].set_title('Original Poster Male')

    axs[1].scatter(document_topic_matrix[0][mask_female],
                   document_topic_matrix[1][mask_female],
                   alpha=0.1, color='C1')
    axs[1].set_title('Original Poster Female')

    for ax in axs:
        ax.set_xlabel('SVD Component 1')
        ax.set_ylabel('SVD Component 2')

    plt.show()

# References

Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. *Proceedings of the National Academy of Sciences*, 115(16), E3635–E3644. https://doi.org/10.1073/pnas.1720347115

OpenAI. (2023). ChatGPT (Jan 30 version) [Large language model]. https://chat.openai.com/chat. Prompt: "Please annotate the following code and convert it into PEP 8."