# Data Science for Social Justice Workshop: Word Embeddings and Language Bias - PROJECT

### Icons Used in This Notebook
💭 **Reflection**: Reflecting on ethical implications, biases, and social impact in data science.<br>

In [None]:
# Package imports
import os
import pandas as pd
import numpy as np

import pickle
from gensim.models import Word2Vec
import multiprocessing

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [None]:
# Replace this with your own pickle file!
with open('../../data/YOUR_FILE.pickle', 'rb') as f:
    posts = pickle.load(f)

In [None]:
# Split into lists of words for Word2Vec
post_list = [post.split() for post in posts]
post_list[0]

# Constructing a Word2Vec Model

In [None]:
cores = multiprocessing.cpu_count() # Number of cores at your disposal

n_features = 300     # Word vector dimensionality (how many features each word will be given)
min_word_count = 10  # Minimum word count to be taken into account
n_workers = cores    # Number of threads to run in parallel (equal to your number of cores)
window = 5           # Context window size
downsampling = 1e-2  # Downsample setting for frequent words
seed = 1             # Seed for the random number generator (to create reproducible results)
sg = 1               # Skip-gram = 1, CBOW = 0
epochs = 20          # Number of iterations over the corpus

model = Word2Vec(
    sentences=post_list,
    workers=n_workers,
    vector_size=n_features,
    min_count=min_word_count,
    window=window,
    sample=downsampling,
    seed=seed,
    epochs=epochs,
    sg=sg)
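
To make `window` and `sg = 1` more concrete, the sketch below lists the (center, context) word pairs a skip-gram model would learn from for a single tokenized post. It is a simplified illustration: `skipgram_pairs` and `example_post` are hypothetical names, not part of gensim, and the sketch ignores gensim's downsampling of frequent words and its random window shrinking.

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs within `window` positions of each token.

    Hypothetical helper for illustration only; gensim builds its training
    examples internally and also applies downsampling and window shrinking.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Made-up post, tokenized the same way as `post_list`
example_post = "word embeddings can encode social bias".split()
example_pairs = skipgram_pairs(example_post, window=2)
print(example_pairs[:4])
```

A larger `window` produces more pairs per word, so each word is associated with a broader context.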

In [None]:
# Save the model to disk
model.save('../../data/embeddings.emb')

In [None]:
# Load the model from disk
model = Word2Vec.load('../../data/embeddings.emb')

How many terms are in your vocabulary?

In [None]:
len(model.wv)
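
The vocabulary size depends heavily on `min_count`, because words that occur fewer times than that threshold are dropped before training. A rough way to anticipate the effect, assuming a tokenized corpus shaped like `post_list`, is to count tokens directly. `toy_posts` and `vocab_size_at` are made up for this sketch.

```python
from collections import Counter


def vocab_size_at(tokenized_posts, min_count):
    """Count distinct tokens that occur at least `min_count` times."""
    counts = Counter(token for post in tokenized_posts for token in post)
    return sum(1 for n in counts.values() if n >= min_count)


# Toy stand-in for `post_list` (illustrative data, not from the workshop)
toy_posts = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

for threshold in (1, 2, 3):
    print(threshold, vocab_size_at(toy_posts, threshold))
```

If a word you care about is missing from the model, a lower `min_word_count` may bring it back, at the cost of noisier vectors for rare words.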

# Word Similarity


In [None]:
def get_most_similar_terms(model, token, topn=20):
    """Look up the top N most similar terms to the token."""
    for word, similarity in model.wv.most_similar(positive=[token], topn=topn):
        print(f"{word}: {round(similarity, 3)}")

In [None]:
get_most_similar_terms(model, 'CHOOSE_WORD')
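
The scores printed by `most_similar` are cosine similarities between word vectors. The sketch below computes the same quantity by hand on invented three-dimensional vectors, so the ranking logic is visible. `toy_vectors` and `cosine_similarity` are illustrative names, and the numbers are not taken from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors; the embeddings in this notebook have 300 dimensions
toy_vectors = {
    "doctor": [0.9, 0.1, 0.3],
    "nurse": [0.8, 0.2, 0.4],
    "banana": [-0.2, 0.9, -0.5],
}

query = "doctor"
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(f"{word}: {round(score, 3)}")
```

Scores near 1 mean two words are used in similar contexts, scores near 0 mean little relation, and negative scores mean the vectors point in opposing directions.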

# Visualizing High Dimensional Spaces with $t$-SNE

Change `words` to include words you are interested in for your data (make sure they appear in your dataset!) in order to visualize their relations. You can make this list as long or short as you want.

💭 **Reflection**: Figuring out **which words** you are interested in exploring is one of the main challenges when doing work like this! It will depend on your subreddit and your research questions. Discuss this with your group.

In [None]:
words = ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5', 'WORD6', 'WORD7', 'WORD8']

# Extract the word vectors
word_vectors = np.array([model.wv[word] for word in words])
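
Indexing `model.wv` with a word that is not in the vocabulary raises a `KeyError`, so it can help to check the list first. The sketch below uses a small dictionary as a stand-in for `model.wv.key_to_index`; `split_by_vocabulary` and `toy_vocab` are hypothetical names for illustration.

```python
def split_by_vocabulary(words, key_to_index):
    """Separate `words` into those present in the vocabulary and those missing."""
    present = [w for w in words if w in key_to_index]
    missing = [w for w in words if w not in key_to_index]
    return present, missing


# Toy stand-in for `model.wv.key_to_index` (illustrative only)
toy_vocab = {"community": 0, "police": 1, "school": 2}

present, missing = split_by_vocabulary(["police", "school", "teh"], toy_vocab)
print("In vocabulary:", present)
print("Missing:", missing)
```

With a trained model, you could pass `model.wv.key_to_index` and build `word_vectors` from the `present` list only.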

In [None]:
# If you get an ImportError in the line tsne=TSNE(), you might need to install scikit-learn:
# %pip install -U scikit-learn 

In [None]:
# Reduce dimensionality using t-SNE
tsne = TSNE(n_components=2, random_state=2, perplexity=2)
reduced_vectors = tsne.fit_transform(word_vectors)
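
Recent versions of scikit-learn require `perplexity` to be smaller than the number of points, so the default of 30 would be rejected for a list of eight words. If you change the length of `words`, a small helper like the one below can pick a valid value. `pick_perplexity` is a hypothetical helper, and capping at 30 is only a heuristic that mirrors scikit-learn's default.

```python
def pick_perplexity(n_samples, preferred=30):
    """Return a perplexity strictly below `n_samples`, capped at `preferred`."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two points")
    return min(preferred, n_samples - 1)


for n in (8, 25, 5000):
    print(n, pick_perplexity(n))
```

For example, `TSNE(n_components=2, perplexity=pick_perplexity(len(words)))` would stay valid for any list of two or more words. Lower perplexity emphasizes very local neighborhoods, so it is worth trying a few values and comparing the plots.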

In [None]:
# Store the t-SNE vectors
words_df = pd.DataFrame(reduced_vectors,
                        index=pd.Index(words),
                        columns=['x', 'y'])

In [None]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, LabelSet

# Render Bokeh plots inline in the notebook
output_notebook()

In [None]:
# Add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(words_df)

# Create the plot and configure the title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings')

# Add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@index'))

# Draw the words as circles on the plot
# (scatter draws circles by default; circle() with size is deprecated in newer Bokeh releases)
tsne_plot.scatter('x', 'y',
                  source=plot_data,
                  color='blue',
                  size=10,
                  hover_line_color='black')

# Add labels to the points
# (render_mode is omitted: canvas was the default, and Bokeh 3 no longer accepts the argument)
labels = LabelSet(x='x', y='y', text='index', level='glyph',
                  x_offset=5, y_offset=5, source=plot_data)
tsne_plot.add_layout(labels)

# Engage!
show(tsne_plot)

Now let's apply $t$-SNE to **all** the word vectors in the vocabulary.

In [None]:
tsne = TSNE(init='pca', learning_rate='auto')
X_tsne = tsne.fit_transform(model.wv.vectors)

In [None]:
# Store the t-SNE vectors
tsne_df = pd.DataFrame(X_tsne,
                       index=pd.Index(model.wv.index_to_key),
                       columns=['x', 'y'])

In [None]:
# Create some filepaths to save our model
tsne_path = '../../data/tsne_model'
tsne_df_path = '../../data/tsne_df.pkl'

In [None]:
# Save to disk
with open(tsne_path, 'wb') as f:
    pickle.dump(X_tsne, f)

tsne_df.to_pickle(tsne_df_path)

Here's a convenient code block to reload this data, so you can start from this point without rerunning $t$-SNE:

In [None]:
with open(tsne_path, 'rb') as f:
    X_tsne = pickle.load(f)
    
tsne_df = pd.read_pickle(tsne_df_path)

Visualize with `bokeh`.

In [None]:
# Add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_df)

# Create the plot and configure the title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings')

# Add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@index'))

# Draw the words as circles on the plot
# (scatter draws circles by default; circle() with size is deprecated in newer Bokeh releases)
tsne_plot.scatter('x', 'y',
                  source=plot_data,
                  color='blue',
                  line_alpha=0.2,
                  fill_alpha=0.1,
                  size=10,
                  hover_line_color='black')

# Engage!
show(tsne_plot)

# Language Biases and Word Embeddings


In [None]:
# Import function to calculate biased words
from utils import calculate_biased_words

💭 **Reflection**: You will have to change the following words to words that are illustrative of a **target concept**, organized in some kind of binary. Think of "male" and "female", "Islam" and "Christianity", or "career" and "family". For some examples of target sets that have been used in the literature on language biases, check the [bottom of this notebook](#targets).

Discuss with your teammates which concepts you would like to explore, and how you would define the target concept.

In [None]:
target1 = ['WORD1' , 'WORD2' , 'WORD3' , 'WORD4' , 'WORD5' , 'WORD6' , 'WORD7' , 'WORD8']
target2 = ['WORD1' , 'WORD2' , 'WORD3' , 'WORD4' , 'WORD5' , 'WORD6' , 'WORD7' , 'WORD8']

In [None]:
model = Word2Vec.load('../../data/embeddings.emb')

In [None]:
b1, b2 = calculate_biased_words(model, target1, target2, 4)

Let's print some biases.

In [None]:
print('Biased words towards target set 1')
print(list(b1.keys()))

In [None]:
print('Biased words towards target set 2')
print(list(b2.keys()))
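
The internals of `calculate_biased_words` are not shown in this notebook. One common way to score bias, offered here only as an assumption and not necessarily what `utils` implements, is to compare a word's cosine similarity to the centroid (average vector) of each target set. All vectors and names below are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def bias_score(word_vec, target1_vecs, target2_vecs):
    """Positive: closer to target set 1. Negative: closer to target set 2."""
    return cosine(word_vec, centroid(target1_vecs)) - cosine(word_vec, centroid(target2_vecs))


# Invented 2-dimensional vectors (assumption: real ones would come from model.wv)
target1_vecs = [[1.0, 0.1], [0.9, 0.2]]
target2_vecs = [[0.1, 1.0], [0.2, 0.9]]
candidates = {"word_a": [0.8, 0.3], "word_b": [0.2, 0.7], "word_c": [0.5, 0.5]}

for word, vec in candidates.items():
    print(word, round(bias_score(vec, target1_vecs, target2_vecs), 3))
```

Under this scoring, words with large positive or negative scores would be the candidates for "biased towards" one target set, while scores near zero indicate no clear lean.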

## Visualizing Biases using $t$-SNE

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from sklearn.manifold import TSNE
%matplotlib inline

In [None]:
with open(tsne_path, 'rb') as f:
    X_tsne = pickle.load(f)
    
tsne_df = pd.read_pickle(tsne_df_path)

In [None]:
# Convert biased term keys to arrays
target1_idx = np.array([model.wv.key_to_index[key] for key in b1.keys()])
target2_idx = np.array([model.wv.key_to_index[key] for key in b2.keys()])

In [None]:
# Find t-SNE values for the biased sets
X_target1 = X_tsne[target1_idx]
X_target2 = X_tsne[target2_idx]

In [None]:
from bokeh.io import show, output_notebook, output_file
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet

# Set up the Bokeh plot
output_notebook()

p = figure()

# Create ColumnDataSource for X_target1 (blue)
source1 = ColumnDataSource(data=dict(x=X_target1[:, 0], y=X_target1[:, 1], label=[model.wv.index_to_key[idx] for idx in target1_idx]))

# Create ColumnDataSource for X_target2 (red)
source2 = ColumnDataSource(data=dict(x=X_target2[:, 0], y=X_target2[:, 1], label=[model.wv.index_to_key[idx] for idx in target2_idx]))

# Add scatter plot for X_target1 (blue)
p.scatter(x='x', y='y', color='blue', size=8, source=source1)

# Add scatter plot for X_target2 (red)
p.scatter(x='x', y='y', color='red', size=8, source=source2)

# Add labels for X_target1
labels1 = LabelSet(x='x', y='y', text='label', x_offset=6, y_offset=3, source=source1)
p.add_layout(labels1)

# Add labels for X_target2
labels2 = LabelSet(x='x', y='y', text='label', x_offset=6, y_offset=3, source=source2)
p.add_layout(labels2)

# Show the plot
show(p)

## 💭 Reflection

Note that these binary target concepts are often a product of ideology and normativity in society: the gender binary is a good example. When checking for biases towards certain concepts, make sure you consider the fact that you are the one creating / reproducing these concepts, and that you may be reinforcing a constructed binary!

Also note that determining your own target concepts and biases is an **iterative** process. Try changing some of the words in the target concepts to see how the biased words and plot change, and discuss with your group what you think makes for a coherent and robust target set.
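
One simple way to track how much the results change between iterations is to measure the overlap between the biased-word lists from two runs. The sketch below uses Jaccard similarity on invented lists; `jaccard`, `run1_biased`, and `run2_biased` are hypothetical names for illustration.

```python
def jaccard(words_a, words_b):
    """Share of words common to both collections (1.0 means identical sets)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Invented outputs from two runs with slightly different target sets
run1_biased = ["strong", "leader", "boss", "loud"]
run2_biased = ["strong", "leader", "captain", "loud"]

print(round(jaccard(run1_biased, run2_biased), 2))
```

With real results, you could pass `b1.keys()` from each run. A high overlap suggests the biased words do not hinge on any single target word, which is one sign of a robust target set.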

<a id='targets'></a>
# Existing Target Sets

Here are some other target sets that have been previously used in the literature:

* *Gender target sets taken from Nosek, Banaji, and Greenwald 2002.*
    - Female: `sister, female, woman, girl, daughter, she, hers, her`.
    - Male: `brother, male, man, boy, son, he, his, him`.
* *Religion target sets taken from Garg et al. 2018.*
    - Islam: `allah, ramadan, turban, emir, salaam, sunni, koran, imam, sultan, prophet, veil, ayatollah, shiite, mosque, islam, sheik, muslim, muhammad`.
    - Christianity: `baptism, messiah, catholicism, resurrection, christianity, salvation, protestant, gospel, trinity, jesus, christ, christian, cross, catholic, church`.
* *Racial target sets taken from Garg et al. 2017.*
    - White last names: `harris, nelson, robinson, thompson, moore, wright, anderson, clark, jackson, taylor, scott, davis, allen, adams, lewis, williams, jones, wilson, martin, johnson`.
    - Hispanic last names: `ruiz, alvarez, vargas, castillo, gomez, soto, gonzalez, sanchez, rivera, mendoza, martinez, torres, rodriguez, perez, lopez, medina, diaz, garcia, castro, cruz`.
    - Asian last names: `cho, wong, tang, huang, chu, chung, ng, wu, liu, chen, lin, yang, kim, chang, shah, wang, li, khan, singh, hong`.
    - Russian last names: `gurin, minsky, sokolov, markov, maslow, novikoff, mishkin, smirnov, orloff, ivanov, sokoloff, davidoff, savin, romanoff, babinski, sorokin, levin, pavlov, rodin, agin`.
* *Career/family target sets taken from Garg et al. 2018.*
    - Career: `executive, management, professional, corporation, salary, office, business, career`.
    - Family: `home, parents, children, family, cousins, marriage, wedding, relatives`.
    - Math: `math, algebra, geometry, calculus, equations, computation, numbers, addition`.
* *Arts/Science target sets taken from Garg et al. 2018.*
    - Arts: `poetry, art, sculpture, dance, literature, novel, symphony, drama`.
    - Science: `science, technology, physics, chemistry, Einstein, NASA, experiment, astronomy`.