<a href="https://colab.research.google.com/github/d-atallah/implicit_gender_bias/blob/main/pyspark_bias_calculation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Packages

In [6]:
import gensim.downloader as api
from gensim.matutils import cossim
from gensim.models import KeyedVectors
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import paired_euclidean_distances
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [7]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 2.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=c1c0a26829ea3210ca9a40b2e05b3573dfa364c58d5bedfe25480a7f2f0fad68
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [122]:
from pyspark.context import SparkContext
from pyspark.ml.feature import RegexTokenizer
from pyspark.sql.functions import explode
from pyspark.sql import SparkSession

## Load Files

**Annotations** contains crowdsourced annotations of response sentiment and relevance for source-response pairs, collected as described in the paper *RtGender: A Corpus for Studying Differential Responses to Gender* by Rob Voigt, David Jurgens, Vinodkumar Prabhakaran, Dan Jurafsky, and Yulia Tsvetkov. Documentation is available [here](https://nlp.stanford.edu/robvoigt/rtgender/).

In [2]:
file_path_annotations = '/content/drive/MyDrive/SIADS 696: Milestone II/Project/Data/RtGender/annotations.csv'
file_path_googlenews = '/content/drive/MyDrive/SIADS 696: Milestone II/Project/Data/RtGender/word2vec-google-news-300.model'
file_path_document_bias = '/content/drive/MyDrive/SIADS 696: Milestone II/Project/Data/RtGender/document-bias.csv'

In [12]:
spark = SparkSession.builder.getOrCreate()

In [116]:
dataframe_annotations = spark.read.csv(file_path_annotations, header=True)
# Show the first 5 rows without truncating long text columns
dataframe_annotations.show(5, truncate=False)

+-------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+---------+----------+
|source       |op_gender|post_text                                                                                                                                                                                                                   |response_text                                                           |sentiment|relevance |
+-------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------

**Google News** is a pre-trained Word2Vec model built from the Google News dataset, with 300-dimensional vectors for approximately 3 million words and phrases. Documentation is available [here](https://code.google.com/archive/p/word2vec/) and [here](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py).

In [21]:
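# One-time setup: download the pre-trained vectors and save them to file_path_googlenews.
# Uncomment for the first run only.
# model_googlenews = api.load('word2vec-google-news-300')
# model_googlenews.save(file_path_googlenews)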

In [20]:
# Memory-map the saved vectors read-only instead of loading them fully into RAM
model_googlenews = KeyedVectors.load(file_path_googlenews, mmap='r')
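
Word vectors like these are usually compared by cosine similarity, which gensim exposes as `model_googlenews.similarity(w1, w2)`. The following is a minimal sketch of that measure in plain Python; the function name `cosine_similarity` and the toy vectors are illustrative assumptions, not values from the Google News model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical, chosen only to show the two extremes).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction)  # close to 1: the vectors point the same way
print(orthogonal)      # 0.0: the vectors share no direction
```

Because the measure depends only on direction, scaling a vector does not change its similarity to any other vector.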

**HolisticBias** nouns are taken from v1.1 of the HolisticBias dataset, a project of the Responsible Natural Language Processing team at Facebook Research. The dataset is described in the paper *"I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset* by Eric Michael Smith, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. Each entry below pairs a noun's singular and plural forms, grouped as female, male, or neutral. Documentation is available [here](https://github.com/facebookresearch/ResponsibleNLP/tree/main/holistic_bias/dataset/v1.1).

In [22]:
nouns = {
    "female": [
        ["woman", "women"],
        ["lady", "ladies"],
        ["gal", "gals"],
        ["girl", "girls"],
        ["mother", "mothers"],
        ["mom", "moms"],
        ["daughter", "daughters"],
        ["wife", "wives"],
        ["grandmother", "grandmothers"],
        ["grandma", "grandmas"],
        ["sister", "sisters"],
        ["sista", "sistas"]
    ],
    "male": [
        ["man", "men"],
        ["bro", "bros"],
        ["guy", "guys"],
        ["boy", "boys"],
        ["father", "fathers"],
        ["dad", "dads"],
        ["son", "sons"],
        ["husband", "husbands"],
        ["grandfather", "grandfathers"],
        ["grandpa", "grandpas"],
        ["brother", "brothers"]
    ],
    "neutral": [
        ["individual", "individuals"],
        ["person", "people"],
        ["kid", "kids"],
        ["parent", "parents"],
        ["child", "children"],
        ["spouse", "spouses"],
        ["grandparent", "grandparents"],
        ["sibling", "siblings"],
        ["veteran", "veterans"]
    ]
}
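
One way a nested structure like `nouns` could be consumed is by flattening it into a word-to-group lookup, so that each token can be tagged in a single dictionary access. The sketch below is an assumption about usage rather than the notebook's own method; `nouns_sample`, `build_noun_lookup`, and the token list are hypothetical names and data, and the sample is an abbreviated copy so the block runs on its own.

```python
# Abbreviated copy of the `nouns` structure, kept small for illustration.
nouns_sample = {
    "female": [["woman", "women"], ["girl", "girls"]],
    "male": [["man", "men"], ["boy", "boys"]],
    "neutral": [["person", "people"], ["child", "children"]],
}


def build_noun_lookup(noun_groups):
    """Map every singular and plural noun to the name of its group."""
    lookup = {}
    for group, pairs in noun_groups.items():
        for singular, plural in pairs:
            lookup[singular] = group
            lookup[plural] = group
    return lookup


noun_lookup = build_noun_lookup(nouns_sample)

# Hypothetical token list; tokens outside the lists map to None.
tokens = ["the", "women", "and", "children", "met", "boy"]
tagged = [(token, noun_lookup.get(token)) for token in tokens]
print(tagged)
```

Passing the full `nouns` dictionary to `build_noun_lookup` would work the same way, since it has the same shape.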

## Tokenize Text

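The tokenizer reuses the default `token_pattern` of scikit-learn's `CountVectorizer`, `r'(?u)\b\w\w+\b'`. According to the CountVectorizer documentation, this pattern "Select[s] tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)." Setting `gaps=False` tells Spark's `RegexTokenizer` that the pattern matches the tokens themselves rather than the gaps between them. `RegexTokenizer` also lowercases its output by default.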
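
To see what this pattern keeps and drops, it can be tried with Python's `re` module before running it in Spark. This is a rough sketch: the `tokenize` helper and the sample sentence are hypothetical, and because Spark uses Java's regex engine, results may differ from Python's on non-ASCII text.

```python
import re

# Same pattern as scikit-learn's default token_pattern.
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')


def tokenize(text):
    """Lowercase the text, then return every run of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())


# Hypothetical sample sentence (not taken from the RtGender data).
sample = "It's a GREAT day, isn't it? I <3 NLP!"
tokens = tokenize(sample)
print(tokens)  # ['it', 'great', 'day', 'isn', 'it', 'nlp']
```

Single-character pieces such as `a`, `I`, `3`, and the `s` and `t` left over from the contractions are dropped, and punctuation only ever acts as a separator.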

In [132]:
tokenizer = RegexTokenizer(
    pattern=r'(?u)\b\w\w+\b',
    inputCol='response_text',
    outputCol='response_token',
    gaps=False,
)

In [135]:
dataframe_annotations_tokens = tokenizer.transform(dataframe_annotations)
dataframe_annotations_tokens.show()

+-------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       source|op_gender|           post_text|       response_text|           sentiment|           relevance|      response_token|
+-------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|facebook_wiki|        W|Stopped by Fashio...|You are Both Swee...|            Positive|              Poster|[you, are, both, ...|
|facebook_wiki|        M|Well guys, real p...|Give us the first...|               Mixed|             Content|[give, us, the, f...|
|facebook_wiki|        W|Tonight is going ...|this is my city w...|             Neutral|             Content|[this, is, my, ci...|
|facebook_wiki|        M|I know grandma Gi...|if grizzly Adams ...|             Neutral|             Content|[if, grizzly, ada...|
|facebook_wiki|        W|#NEWS to KNOW thi...|Good morning Lour...|            Posi
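
The `explode` function imported earlier turns a row holding an array into one row per array element, copying the other columns, and drops rows whose array is empty or null. The plain-Python sketch below mimics that behavior on a few hypothetical rows; `explode_rows` and the sample data are illustrative assumptions, not Spark code.

```python
def explode_rows(rows, column):
    """Emit one copy of each row per element of row[column]."""
    exploded = []
    for row in rows:
        # Like Spark's explode, an empty or missing array yields no rows.
        for item in row[column] or []:
            new_row = dict(row)
            new_row[column] = item
            exploded.append(new_row)
    return exploded


# Hypothetical rows with a token array column.
rows = [
    {"op_gender": "W", "response_token": ["you", "are", "both"]},
    {"op_gender": "M", "response_token": ["give", "us"]},
    {"op_gender": "W", "response_token": []},
]

exploded = explode_rows(rows, "response_token")
for row in exploded:
    print(row)
```

The three input rows become five output rows, and the row with the empty token list disappears.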

In [134]:
# Expand the token arrays so that each token gets its own row
dataframe_annotations_tokens.withColumn('response_token', explode('response_token')).show()

+-------------+---------+--------------------+--------------------+---------+---------+--------------+
|       source|op_gender|           post_text|       response_text|sentiment|relevance|response_token|
+-------------+---------+--------------------+--------------------+---------+---------+--------------+
|facebook_wiki|        W|Stopped by Fashio...|You are Both Swee...| Positive|   Poster|           you|
|facebook_wiki|        W|Stopped by Fashio...|You are Both Swee...| Positive|   Poster|           are|
|facebook_wiki|        W|Stopped by Fashio...|You are Both Swee...| Positive|   Poster|          both|
|facebook_wiki|        W|Stopped by Fashio...|You are Both Swee...| Positive|   Poster|         sweet|
|facebook_wiki|        W|Stopped by Fashio...|You are Both Swee...| Positive|   Poster|        ashley|
|facebook_wiki|        W|Stopped by Fashio...|You are Both Swee...| Positive|   Poster|       tisdale|
|facebook_wiki|        W|Stopped by Fashio...|You are Both Swee...| Posit