# Final exam: Text Data Science

In one of our labs, we saw how embeddings inherit subtle types of 'gender bias' from the data they are trained on. In this exam we will further explore how embeddings
represent gender. Specifically, we will use gender pronouns to split lists of similar words. We do this without assuming or implying that gender is a binary distinction.


To begin, load in the usual modules.

In [None]:
from datascience import *
import numpy as np
import re
import gensim

import os
# this turns off some pesky warnings
os.environ["TF_CPP_MIN_LOG_LEVEL"]="3"

import warnings
warnings.filterwarnings("ignore",category=Warning)

# direct plots to appear within the cell, and set their style
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

We use the [gensim package](https://radimrehurek.com/gensim/index.html), as in our earlier labs. The following bit of code reads in 300-dimensional embedding vectors, trained using the [GloVe](https://nlp.stanford.edu/projects/glove/) algorithm on Wikipedia and Gigaword newswire text (hence the model name `glove-wiki-gigaword-300`). Specifically, it uses 6 billion tokens, with a 400,000 word vocabulary. You can find other precompiled embeddings [here](https://www.diycode.cc/projects/RaRe-Technologies/gensim-data); you might be interested in swapping one or two of them in and seeing if the results change.


In [None]:
import gensim
import gensim.downloader as gdl
from gensim.models import KeyedVectors
glove = gdl.load("glove-wiki-gigaword-300")

In [None]:
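# gensim 4.x exposes the vocabulary as `key_to_index`; older releases used `vocab`
words_index = glove.key_to_index if hasattr(glove, "key_to_index") else glove.vocab
vocab = set(words_index)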
print("The vocabulary size is %d" % len(vocab))

### Recall the gender bias in analogies

Recall that we uncovered several analogies that suggested 'societal bias' encoded in the embeddings. Bias of this kind has been the topic of several academic studies. The nice example found in class (by Liana) is the following.


In [None]:
result1 = glove.most_similar(positive=['smart', 'boy'], negative=['girl'])
result2 = glove.most_similar(positive=['smart', 'girl'], negative=['boy'])

print('smart - girl + boy: ')
print([x[0] for x in result1])

print('smart - boy + girl: ')
print([x[0] for x in result2])
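
The `positive` and `negative` arguments above amount to vector arithmetic followed by a nearest-neighbour search. The following is only a toy sketch of that idea, using invented 2-dimensional vectors and a hypothetical helper `toy_analogy`; gensim's `most_similar` follows the same principle, but its exact normalization and weighting differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors for illustration only (not real GloVe values):
# axis 0 loosely stands for "royalty", axis 1 for "gender".
toy = {
    'king':  [0.9, 0.8],
    'queen': [0.9, -0.8],
    'man':   [0.1, 0.9],
    'woman': [0.1, -0.9],
    'apple': [-0.7, 0.0],
}

def toy_analogy(positive, negative):
    """Rank the remaining words by similarity to sum(positive) - sum(negative)."""
    target = [0.0, 0.0]
    for w in positive:
        target = [t + x for t, x in zip(target, toy[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, toy[w])]
    # Like gensim, leave the query words themselves out of the ranking
    candidates = [w for w in toy if w not in positive + negative]
    return sorted(candidates, key=lambda w: cosine(target, toy[w]), reverse=True)

# king - man + woman
analogy_ranking = toy_analogy(positive=['king', 'woman'], negative=['man'])
print(analogy_ranking)
```

In this toy vocabulary the top-ranked word is the one whose vector points in nearly the same direction as `king - man + woman`. With real embeddings, any bias in the training text shapes those directions in the same way.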



We'll now write a function that divides a set of words into two types: "masculine" and "feminine." A word will be said to be more "masculine" than "feminine" if it is more similar to traditional male pronouns than it is to traditional female pronouns. 
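The comparison described here rests on cosine similarity, which is what `glove.similarity` reports for a pair of words. As a minimal sketch, assuming made-up 3-dimensional vectors and placeholder names `word_a` and `word_b` (none of these are real GloVe values), the decision for a single word looks like this:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors for illustration only
toy = {
    'he':     [1.0, 0.2, 0.0],
    'she':    [0.2, 1.0, 0.0],
    'word_a': [0.9, 0.3, 0.4],
    'word_b': [0.3, 0.8, 0.5],
}

labels = {}
for w in ['word_a', 'word_b']:
    sim_he = cosine(toy[w], toy['he'])
    sim_she = cosine(toy[w], toy['she'])
    # Same test as in the exam's split: strictly closer to 'he' counts as "masculine"
    labels[w] = 'he' if sim_she < sim_he else 'she'
    print(w, round(sim_he, 3), round(sim_she, 3), 'closer to', labels[w])
```

Because the test is a strict inequality, a word that is exactly equidistant from both pronouns falls into the second group.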



In [None]:
def split_words(words):
    male = []
    female = []
    for w in words:
        if glove.similarity(w, 'she') < glove.similarity(w, 'he'):
            male.append(w)
        else:
            female.append(w)
    return male, female
    
def split_similar_words(word):
    words = [w[0] for w in glove.most_similar(word)]
    return split_words(words)


Here's an example. We'll split the list of words {'his', 'hers', 'mother', 'father', 'football', 'gymnastics', 'hockey', 'skating'} into those that are closer to 'he' and those that are closer to 'she'.


In [None]:
male, female = split_words(['his', 'hers', 'mother', 'father', 'football', 'gymnastics', 'hockey', 'skating'])
print(male)
print(female)

### 1. Explore some lists and comment

Try out this function by evaluating it on different lists of words. Choose three lists, show the results, and comment in a Markdown cell on what you find. Why are the lists interesting? What do they imply about the embeddings?

### 2. Splitting similar words

Here's an example of the `split_similar_words` function. All this function does is call `glove.most_similar` (which returns the ten most similar words by default) and then `split_words` on the result.
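
To make the first step concrete, here is a toy sketch of a nearest-neighbour lookup. The vocabulary, the vectors, and the helper `toy_most_similar` are all invented for illustration, and gensim's implementation differs in detail. The sketch shows why the code takes `w[0]` from each result: every neighbour comes back as a `(word, score)` pair, best match first.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented 2-dimensional vectors for illustration only
toy = {
    'family':  [0.9, 0.1],
    'parents': [0.8, 0.3],
    'friends': [0.6, 0.6],
    'tractor': [-0.2, 0.9],
}

def toy_most_similar(word, topn=2):
    """Return up to topn (neighbour, score) pairs, best first, excluding the query word."""
    scored = [(w, cosine(toy[word], v)) for w, v in toy.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

neighbours = toy_most_similar('family')
words = [w[0] for w in neighbours]  # keep the words, drop the scores
print(neighbours)
print(words)
```

The resulting list of plain words is what gets passed on to be split by pronoun similarity.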

In [None]:
male, female = split_similar_words('family')
print(male)
print(female)

As above, try out the `split_similar_words` function by evaluating it on different words. Choose three words, show the results, and comment in a Markdown cell on what you find. Why are the lists interesting? 

### 3. Modifying the function

Now modify the function `split_words` (as `split_words2` below) to divide a list based on two sets of pronouns, the male list {'he', 'him', 'his'} and the female list {'she', 'her', 'hers'}. To do this, define the function `similarity(w, word_list)` to return the *sum of the similarities of the word `w` to each pronoun* in `word_list`.


In [None]:
# modified to use two sets of gender pronouns, by 
male_pronouns = ['he', 'him', 'his']
female_pronouns = ['she', 'her', 'hers']

def similarity(w, word_list):
    return ...

def split_words2(words):
    male = []
    female = []
    for w in words:
        if similarity(w, female_pronouns) < similarity(w, male_pronouns):
            male.append(w)
        else:
            female.append(w)
    return male, female

def split_similar_words2(word):
    words = [w[0] for w in glove.most_similar(word)]
    return split_words2(words)


### 4. How does the function change?

Implement the function `similarity` above. Then, compare the functions `split_similar_words2` and `split_similar_words` on several examples. Does the new function work any better? Worse? Taken together, what do the examples you find say about the representation of gender in embeddings?


### 5. Gender cycles

To finish, we'll explore the interesting fact that when we alternate between finding
masculine and feminine words that are similar to a starting word, we quickly get a 2-cycle.

This is best illustrated with an example. Starting with 'football,' we go through the most similar words and stop when we get to one that is "more masculine than feminine."  Starting with that word, we go through its most similar words and stop when we get to one that is "more feminine than masculine." We then repeat, alternating between masculine words and feminine words.

Starting with 'football,' here is the sequence we find:

football  -> soccer (m) -> volleyball (f) -> basketball (m) -> volleyball (f) -> basketball (m) -> volleyball (f) -> ...

We can thus identify ('basketball', 'volleyball') as a "male"/"female" pair.



In [None]:
def most_masculine(word):
    for x in glove.most_similar(word, topn=100):
        if similarity(x[0], female_pronouns) < similarity(x[0],male_pronouns):
            return x[0]
    return None
        
def most_feminine(word):
    for x in glove.most_similar(word, topn=100):
        if similarity(x[0], female_pronouns) > similarity(x[0],male_pronouns):
            return x[0]
    return None


### Write a function to find gender cycles

To complete this exam, you now need to do two things. First, uncomment the following code block and run it with the starting word 'football'. This depends on your having defined the function `similarity` properly. Verify that you do indeed get the above cycle between 'basketball' and 'volleyball', then try it out with a few other starting words.

Then, use this code to define a function `gender_pairs(word)` that takes the starting word `word` and returns a pair `(male_word, female_word)` that corresponds to a 2-cycle.
Run the function on a few starting words that give interesting results. Comment on what you found in a Markdown cell.

In [None]:
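#word = 'football'
#print('%s ' % word, end='')
#for i in np.arange(10):
#    prev = word
#    # `nxt` rather than `next`, to avoid shadowing the Python builtin
#    nxt = most_masculine(prev)
#    print(' -> %s (m)' % nxt, end='')
#    word = most_feminine(nxt)
#    print(' -> %s (f)' % word, end='')
#    # stop once the feminine word repeats, i.e. we have entered a 2-cycle
#    if word == prev:
#        break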

    
        

In [None]:
def gender_pairs(word):
    ...

### Submit your exam

When you are finished, submit your exam on Canvas in the usual way, printing to html/pdf and also uploading your notebook. Thank you!


