Notebook to accompany paper in preparation by Arseniev-Koehler and Foster.
All code last checked on Python 3 in Windows 6/11/2018. Please do not cite or reuse this code yet. This code is still in preparation and may contain errors. 

# More Manly or Womanly? Measure Biases in Word2Vec Models with a Support Vector Machine

This project explores how language in the news is loaded with meanings of gender, morality, healthiness, and socio-economic status (SES). For example, which words are more masculine or feminine? Are certain words loaded with meanings of immorality or morality? 

In this code, we develop and then train a model to classify words with respect to each of these four dimensions (gender, morality, healthiness, and SES) on a set of training words. Then, we test model performance on a fresh set of testing words. 

Finally, we look at how language about body weight, such as "obese" and "slender,"  to see how these words are connoted with gender, morality, health, and social class. You might use this code to look at meanings of langauge in other arenas too - such as occupations, academic disciplines, or food. You might also extend this code to other types of meaning, or to other data sources. 

This notebook, Part B of our project, uses classical machine-learning methods, like a Support Vector Machine (SVM). For two alternate methods to check the robustness of your findings, look at code for [Part B and Part C](https://github.com/arsena-k/Word2Vec-bias-extraction). 

We start by loading up a trained Word2Vec model on news. We suggest a pre-trained model if you don't have one, or see [Part A](https://github.com/arsena-k/Word2Vec-bias-extraction) of this project for a tutorial on training your own Word2Vec model.  

**Table of Contents**

* Part 1. [Load up libraries and a Word2Vec Model](#Starting)
* Part 2: [Explanation of Classification Method](#Motivation)
* Part 3: [Load up Training/Testing Words](#LoadUp)
* Part 4: [Robustness Checks](#Robustness)
* Part 5: [Visualize how this Dimension Classifies words according to Gender, Morality, Health, and SES](#Results)


*This is a long notebook. You can skip Part 3 (robustness checks) if you want to get right to the results.*

<a id='Staring'></a> 
# Part 1. Load up libraries and a Word2Vec Model

In [192]:
from sklearn import tree
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from gensim.models import Word2Vec,KeyedVectors
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold 
from sklearn import svm
from sklearn.neural_network import MLPClassifier
import sklearn
import csv
import statistics
from sklearn import datasets, decomposition, preprocessing
import gensim
np.set_printoptions(threshold=np.inf) #do this if you want to print full output
import os
import seaborn as sns
from pylab import rcParams
from pylab import xlim
from IPython.display import Image
from IPython.core.display import HTML
%matplotlib inline
cwd= os.getcwd()

**Load up a pretrained Word2Vec Model**

*Don't have a model? Use a pretrained Word2Vec Model from Google, trained on Google News*
* Download a pre-trained model on GoogleNews, find link to download on this [site](https://code.google.com/archive/p/word2vec/) or direct link to [download here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).
* Extract the files, and make sure you have the one called "GoogleNews-vectors-negative300.bin.gz" in your working directory. Your working directory is the folder where this Jupyter notebook is currently saved. Currently, the code assumes that your downloads folder is your working directory. 
* Some of the vocabulary words used in this notebook may not exist, since the vocabulary words used in this notebook were selected based on a model trained on the New York Times, however the code will still run fine. 

In [21]:
#   An example for a PC computer if your model is in your downloads folder, and you're using the Google model 
#currentmodel=  KeyedVectors.load_word2vec_format('C:/Users/Alina Arseniev/Downloads/GoogleNews-vectors-negative300.bin.gz', binary=True)

#   An example for a Mac if your model is in your downloads folder, and you're using the Google model 
#currentmodel=  KeyedVectors.load_word2vec_format("~/Downloads/GoogleNews-vectors-negative300.bin.gz", binary=True)

#   Example based on my set-up of folders:
currentmodel=  KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

*Have your own Word2Vec model? Load it up below.*
* Some of the vocabulary words used in this notebook may not exist, since the vocabulary words used in this notebook were selected based on a model trained on the New York Times, however the code will still run fine. You might consider curating the words more to your vocabulary, especially if many words are missing in your vocabulary. 

In [22]:
#    An example for a PC computer if your model is in your downloads folder, and you're using a model named "modelA_ALLYEARS_500dim_10CW" 
#currentmodel=  Word2Vec.load("C:/Users/Alina Arseniev/Downloads/modelA_ALLYEARS_500dim_10CW")

#   An example for a Mac if your model is in your downloads folder, and you're using a model named "modelA_ALLYEARS_500dim_10CW" 
#currentmodel=  Word2Vec.load("~/Downloads/modelA_ALLYEARS_500dim_10CW")

#   Example based on my set-up of folders:
currentmodel=  Word2Vec.load("modelA_ALLYEARS_300dim_10CW") #load up a trained Word2Vec model. You'll need to tailor this path to your computer

<a id='Motivation'></a> 
# Part 2. Explanation of The Classification Method

This method trains a four machine-learning classifiers (mainly,  **Support Vector Machines**, or SVM) to classify words according to four dimensions: as feminine/masculine, moral/immoral, healthy/unhealthy, and high/low socio-econonic status. We feed in a training set of word-vectors we strongly expect to lie at one end of the dimensions or the other. For each of the four dimensions, a SVM model *learns* a hyperplane separate the two classes. 

A limitation of this machine-learning method, in contrast with the geomteric classification methods in Parts C and C, is that SVM models **risk being overparametertrized and overfitted.** This is because word-vectors tend to be a few hundred-dimensions and we have usually only have around 100 training words to learn a dimensions, so we often have more "features" than "training examples." Cross-validation and the use of a fresh set of testing words helps us understand how much our model may be overfitted. Using cross-validation, we found that using PCA to reduce the dimensionality of word-vectors further did *not* improve results. We also only use a linear SVM model, and try out other possible machine-learning models (like a random forest) as comparison machine-learned models. 

* In our experiments, we did find that the SVM model to classify gender **initially did not corroborate our results** about how langauge around body weight is gendered, compared to the classification methods in Parts B and C. In fact, langauge about body weight was seemingly classified without pattern. This was also surprising to us given the large body of qualitaitve literature suggesting how our concepts of body weight is gendered. We speculated that this mismatch may be because gender training words are the most "sharply" or "widely" separated - **our training words were too easy**. Indeed, many of our training words for gender are explicilty and largely defined by their gendered meanings, such as "he", "him", and "man", "man" and "machismo."  In contrast, training words for other dimensions, like "yuppie" for classifying SES, carry more layers of meaning than just social class. "Yuppie," for example, also carries meaning about age and urbanization. 
* Thus, the gendered differences in our training set may be *too easy* to classify, leading the model to overfit to these training cases. To test this, we added some noisy words into our training words for gender, using words we thought were more implicitly gendered (e.g., “independent” as masculine, and “dependent” as feminine). We used two versions of updated training words, to vary the number of noisy word-vectors added. In both cases, words about body weight were **now classified in ways that corroborated findings from our other two methods.** Testing accuracy also increased from XX% to XX%. When we used these revised training sets for the other two classification methods, there were no changes in initial empirical findings, suggesting that our other two methods are more robust to overfitting. 



<a id='LoadUp'></a> 
# Part 3. Create a Dataset of Training and Testing Word-Vectors

Training Set

In [196]:
#a function to select the training words to find a dimensions
def select_training_set(trainingset): #options are: gender, moral, health, ses
    #gender is the main training set used to extract gender
    #gender_2 has fewer precise gender words like "he" vs "she" than set 1,  and some more noise via words that are gendered but less clearcut than Set1. This set was used for experimenting acorss different methods. 
    #gender_3 even fewer precise gender words like "he" vs "she" than set 1,  and same added noise as training set 2. This set was used for experimenting acorss different methods.
    if trainingset=='gender':
        pos_word_list=['womanly', 'my_wife', 'my_mom', 'my_grandmother', 'woman', 'women', 'girl', 'girls', 'her', 'hers', 'herself', 'she', 
            'lady', 'gal', 'gals', 'madame', 'ladies', 'lady',
          'mother', 'mothers', 'mom', 'moms', 'mommy', 'mama', 'ma', 'granddaughter', 'daughter', 'daughters', 'aunt', 'godmother', 
          'grandma', 'grandmothers', 'grandmother', 'sister', 'sisters', 'aunts', 'stepmother', 'granddaughters', 'niece',
          'fiancee', 'ex_girlfriend', 'girlfriends', 'wife', 'wives', 'girlfriend', 'bride', 'brides', 'widow',
           'twin_sister', 'younger_sister', 'teenage_girl', 'teenage_girls', 'eldest_daughter','estranged_wife', 'schoolgirl',
          'businesswoman', 'congresswoman' , 'chairwoman', 'councilwoman', 'waitress', 'hostess', 'convent', 'heiress', 
           'saleswoman', 'queen', 'queens', 'princess', 'nun' , 'nuns', 'heroine', 'actress', 'actresses', 'uterus', 'vagina', 'ovarian_cancer',
           'maternal', 'maternity', 'motherhood', 'sisterhood', 'girlhood', 'matriarch', 'sorority', 
         'older_sister', 'oldest_daughter', 'stepdaughter']
        neg_word_list=['manly', 'my_husband', 'my_dad','my_grandfather', 'man', 'men', 'boy', 'boys', 'him', 'his', 'himself', 'he', 'guy', 'dude',
            'dudes', 'sir', 'guys', 'gentleman','father', 'fathers', 'dad', 'dads', 'daddy', 'papa', 'pa', 'grandson' , 'son', 'sons', 'uncle', 'godfather', 
           'grandpa', 'grandfathers', 'grandfather', 'brother', 'brothers' , 'uncles', 'stepfather', 'grandsons', 'nephew',
           'fiance', 'ex_boyfriend', 'boyfriends', 'husband', 'husbands', 'boyfriend', 'groom', 'grooms', 'widower',
            'twin_brother', 'younger_brother', 'teenage_boy', 'teenage_boys', 'eldest_son', 'estranged_husband', 'schoolboy',
            'businessman', 'congressman', 'chairman', 'councilman', 'waiter', 'host', 'monastery', 'heir', 'salesman', 
            'king', 'kings', 'prince', 'monk', 'monks', 'hero', 'actor', 'actors', 'prostate', 'penis', 'prostate_cancer', 
           'paternal', 'paternity', 'fatherhood', 'brotherhood', 'boyhood', 'patriarch', 'fraternity', 
           'older_brother', 'oldest_son', 'stepson']
        pos_word_replacement='woman' #here's the generic replacement for feminine words
        neg_word_replacement='man' #here's the generic replacement for masculine words
    elif trainingset=='moral':
        pos_word_list= ['good', 'benevolent', 'nice', 'caring', 'conscientious', 'polite', 'fair', 'virtue', 'respect', 'responsible', 
            'selfless', 'unselfish', 'sincere', 'truthful', 'wonderful', 'justice', 'innocent', 'innocence',
           'complement', 'sympathetic', 'virtue', 'right', 'proud', 'pride','respectful', 'appropriate', 'pleasing', 'pleasant', 
            'pure', 'decent', 'pleasant', 'compassion' , 'compassionate', 'constructive','graceful', 'gentle', 'reliable',
           'careful', 'help', 'decent' , 'moral', 'hero', 'heroic', 'heroism', 'honest', 'honesty',
           'selfless', 'humility', 'humble', 'generous', 'generosity', 'faithful', 'fidelity', 'worthy', 'tolerant',
            'obedient', 'pious', 'saintly', 'angelic', 'virginal', 'sacred', 'reverent', 'god', 'hero', 'heroic', 
            'forgiving', 'saintly','holy', 'chastity', 'grateful', 'considerate', 'humane', 
            'trustworthy', 'loyal', 'loyalty', 'empathetic', 'empathy', 'clean', 'straightforward', 'pure']
        neg_word_list= ['bad', 'evil', 'mean', 'uncaring', 'lazy', 'rude', 'unfair', 'sin', 'disrespect','irresponsible', 
           'self_centered', 'selfish', 'insincere', 'lying', 'horrible', 'injustice', 'guilty', 'guilt', 
            'insult', 'unsympathetic', 'vice', 'wrong', 'ashamed', 'shame', 'disrespectful', 'inappropriate', 'vulgar', 'crude', 
            'dirty', 'obscene', 'offensive', 'cruelty','brutal', 'destructive', 'rude', 'harsh', 'unreliable',
            'careless', 'harm', 'indecent', 'immoral', 'coward', 'cowardly', 'cowardice', 'dishonest', 'dishonesty',
            'narcissistic', 'arrogance', 'arrogant', 'greedy', 'greed', 'betray', 'betrayal', 'unworthy', 'intolerant', 
             'defiant', 'rebellious', 'demonic','devilish', 'promiscuous', 'profane', 'irreverent', 'devil', 'villain', 'villainous', 
            'vindictive', 'diabolical', 'unholy', 'promiscuity', 'ungrateful', 'thoughtless', 'inhumane',
            'untrustworthy', 'treacherous', 'treachery', 'callous', 'indifference', 'dirty', 'manipulative', 'impure' ]
        pos_word_replacement='moral' #here's the generic replacement for moral words
        neg_word_replacement='immoral' #here's the generic replacement for immoral words
    elif trainingset=='health':
        pos_word_list= ['fertile', 'help_prevent', 'considered_safe', 'safer', 'healthy', 'healthy', 'healthy', 'healthy', 'healthy',
            'healthful', 'well_balanced', 'natural', 'healthy', 'athletic','physically_active', 'health',
            'health', 'nutritious','nourishing', 'stronger', 'strong','wellness', 'safe', 'nutritious_food','exercise',
            'physically_fit', 'unprocessed', 'healthier_foods', 'nutritious_foods', 'nutritious', 'nutritious',
           'healthy_eating', 'healthy_diet', 'healthy_diet', 'nourishing', 'nourished', 'regular_exercise', 'safety', 'safe', 
            'helpful', 'beneficial', 'healthy', 'healthy', 'sturdy', 'lower_risk', 'reduced_risk', 'decreased_risk', 'nutritious_foods', 'whole_grains', 'healthier_foods',
            'healthier_foods', 'physically_active', 'physical_activity', 'nourished', 'vitality', 'energetic', 'able_bodied',
            'resilience', 'strength', 'less_prone', 'sanitary', 'clean',  'healing', 'heal', 'salubrious']   
        neg_word_list= ['infertile', 'cause_harm','potentially_harmful','riskier', 'unhealthy', 'sick', 'ill', 'frail', 'sickly', 
            'unhealthful','unbalanced', 'unnatural', 'dangerous', 'sedentary', 'inactive', 'illness', 
            'sickness', 'toxic', 'unhealthy', 'weaker', 'weak', 'illness', 'unsafe', 'unhealthy_foods', 'sedentary',
            'inactive', 'highly_processed', 'processed_foods', 'junk_foods', 'unhealthy_foods', 'junk_foods',
               'processed_foods', 'processed_foods', 'fast_food', 'unhealthy_foods', 'deficient', 'sedentary', 'hazard','hazardous', 
            'harmful', 'injurious',  'chronically_ill', 'seriously_ill', 'frail', 'higher_risk', 'greater_risk', 'increased_risk', 'fried_foods', 'fried_foods',
            'fatty_foods', 'sugary_foods', 'sedentary', 'physical_inactivity', 'malnourished', 'lethargy', 'lethargic', 'disabled',
            'susceptibility', 'weakness', 'more_susceptible', 'filthy', 'dirty', 'harming', 'hurt', 'deleterious']
        pos_word_replacement='healthy' #here's the generic replacement for healthy words
        neg_word_replacement='ill' #here's the generic replacement for unhealthy words
    elif trainingset=='ses':
        pos_word_list=['wealth', 'wealthier', 'wealthiest', 'affluence', 'prosperity', 'wealthy', 'affluent', 'affluent', 'prosperous',
                'prosperous','prosperous','disposable_income',  'wealthy','suburban','luxurious','upscale','upscale', 'luxury', 
                'richest', 'privileged', 'moneyed', 'privileged', 'privileged', 'educated', 'employed', 
                'elite', 'upper_income', 'upper_class', 'employment', 'riches', 'millionaire', 'aristocrat', 'college_educated',
                'abundant', 'lack', 'luxury', 'profitable', 'profit', 'well_educated', 'elites', 'heir', 'well_heeled', 
                'white_collar', 'higher_incomes', 'bourgeois', 'fortunate', 'successful','economic_growth', 'prosper', 'suburbanites']
        neg_word_list= ['poverty', 'poorer', 'poorest', 'poverty', 'poverty', 'impoverished', 'impoverished',  'needy',  'impoverished',
                 'poor', 'needy', 'broke', 'needy', 'slum', 'ghetto', 'slums', 'ghettos', 'poor_neighborhoods', 
                'poorest', 'underserved', 'disadvantaged','marginalized', 'underprivileged', 'uneducated', 'unemployed', 
                'marginalized', 'low_income', 'underclass','unemployment', 'rags', 'homeless', 'peasant', 'college_dropout', 
                'lacking', 'abundance', 'squalor', 'bankrupt', 'debt', 'illiterate' ,'underclass', 'orphan',  'destitute', 
                'blue_collar', 'low_income', 'neediest', 'less_fortunate', 'unsuccessful', 'economic_crisis', 'low_wage', 'homeless']
        pos_word_replacement='wealthy' #here's the generic replacement for rich words
        neg_word_replacement='poor' #here's the generic replacement for poor words
    elif trainingset=='gender_2':
        pos_word_list=[ 'girl', 'girls', 'her', 'hers', 'herself', 'she', 
            'lady', 'gal', 'gals', 'madame', 'ladies', 'lady',
          'mother', 'mothers', 'mom', 'moms', 'mommy', 'mama', 'ma', 'granddaughter', 'daughter', 'daughters', 'aunt', 'godmother', 
          'grandma', 'grandmothers', 'grandmother', 'sister', 'sisters', 'aunts', 'stepmother', 'granddaughters', 'niece',
        'fiancee', 'ex_girlfriend', 'girlfriends', 'wife', 'wives', 'girlfriend', 'bride', 'brides', 'widow',
           'twin_sister', 'younger_sister', 'teenage_girl', 'teenage_girls', 'eldest_daughter','estranged_wife', 'schoolgirl',
        'businesswoman', 'congresswoman' , 'chairwoman', 'councilwoman', 'waitress', 'hostess', 'convent', 'heiress', 
           'saleswoman', 'queen', 'queens', 'princess', 'nun' , 'nuns', 'heroine', 'actress', 'actresses', 'uterus', 'vagina', 'ovarian_cancer',
        'maternal', 'maternity', 'motherhood', 'sisterhood', 'girlhood', 'matriarch', 'sorority', 'mare', 'hen', 'hens', 'filly', 'fillies',
          'deer', 'older_sister', 'oldest_daughter', 'stepdaughter', 'pink',  'cute', 'dependent', 'nurturing', 'hysterical', 'bitch',  'dance', 'dancing'] 
        neg_word_list=['boy', 'boys', 'him', 'his', 'himself', 'he', 'guy', 'dude',
            'dudes', 'sir', 'guys', 'gentleman','father', 'fathers', 'dad', 'dads', 'daddy', 'papa', 'pa', 'grandson' , 'son', 'sons', 'uncle', 'godfather', 
        'grandpa', 'grandfathers', 'grandfather', 'brother', 'brothers' , 'uncles', 'stepfather', 'grandsons', 'nephew',
           'fiance', 'ex_boyfriend', 'boyfriends', 'husband', 'husbands', 'boyfriend', 'groom', 'grooms', 'widower',
            'twin_brother', 'younger_brother', 'teenage_boy', 'teenage_boys', 'eldest_son', 'estranged_husband', 'schoolboy',
            'businessman', 'congressman', 'chairman', 'councilman', 'waiter', 'host', 'monastery', 'heir', 'salesman', 
            'king', 'kings', 'prince', 'monk', 'monks', 'hero', 'actor', 'actors', 'prostate', 'penis', 'prostate_cancer', 
        'paternal', 'paternity', 'fatherhood', 'brotherhood', 'boyhood', 'patriarch', 'fraternity', 'stallion', 'rooster', 'roosters', 'colt',
           'colts', 'bull', 'older_brother', 'oldest_son', 'stepson', 'blue' ,'manly', 'independent', 'aggressive', 'angry', 'jerk', 'wrestle', 'wrestling'  ]
        pos_word_replacement='woman'
        neg_word_replacement='man'
    elif trainingset=='gender_3':
        pos_word_list=['madame', 'ladies', 'lady',
          'mother', 'mothers', 'mom', 'mama', 'granddaughter', 'daughter', 'daughters', 'aunt', 'godmother', 
          'grandma', 'grandmothers', 'grandmother', 'sister', 'sisters', 'aunts', 'stepmother', 'granddaughters', 'niece',
        'fiancee', 'ex_girlfriend', 'girlfriends', 'wife', 'wives', 'girlfriend', 'bride', 'brides', 'widow',
           'twin_sister', 'younger_sister', 'teenage_girl', 'teenage_girls', 'eldest_daughter','estranged_wife', 'schoolgirl',
        'businesswoman', 'congresswoman' , 'chairwoman', 'councilwoman', 'waitress', 'hostess', 'convent', 'heiress', 
           'saleswoman', 'queen', 'queens', 'princess', 'nun' , 'nuns', 'heroine', 'actress', 'actresses', 'uterus', 'vagina', 'ovarian_cancer',
        'maternal', 'maternity', 'motherhood', 'sisterhood', 'girlhood', 'matriarch', 'sorority', 'mare', 'hen', 'hens', 'filly', 'fillies',
          'deer', 'older_sister', 'oldest_daughter', 'stepdaughter', 'pink', 'cute', 'dependent', 'nurturing', 'hysterical', 'bitch',  'dance', 'dancing']
        neg_word_list=['sir', 'guys', 'gentleman','father', 'fathers', 'dad', 'papa', 'grandson' , 'son', 'sons', 'uncle', 'godfather', 
        'grandpa', 'grandfathers', 'grandfather', 'brother', 'brothers' , 'uncles', 'stepfather', 'grandsons', 'nephew',
           'fiance', 'ex_boyfriend', 'boyfriends', 'husband', 'husbands', 'boyfriend', 'groom', 'grooms', 'widower',
            'twin_brother', 'younger_brother', 'teenage_boy', 'teenage_boys', 'eldest_son', 'estranged_husband', 'schoolboy',
            'businessman', 'congressman', 'chairman', 'councilman', 'waiter', 'host', 'monastery', 'heir', 'salesman', 
            'king', 'kings', 'prince', 'monk', 'monks', 'hero', 'actor', 'actors', 'prostate', 'penis', 'prostate_cancer', 
        'paternal', 'paternity', 'fatherhood', 'brotherhood', 'boyhood', 'patriarch', 'fraternity', 'stallion', 'rooster', 'roosters', 'colt',
           'colts', 'bull', 'older_brother', 'oldest_son', 'stepson', 'blue' ,'manly', 'independent', 'aggressive', 'angry', 'jerk', 'wrestle', 'wrestling'  ]
        pos_word_replacement='woman'
        neg_word_replacement='man'
    elif trainingset=='gender_4':
        pos_word_list=['madame', 'ladies', 'lady',
          'mother', 'mothers', 'mom', 'mama', 'granddaughter', 'daughter', 'daughters', 'aunt', 'godmother', 
          'grandma', 'grandmothers', 'grandmother', 'sister', 'sisters', 'aunts', 'stepmother', 'granddaughters', 'niece',
        'fiancee', 'ex_girlfriend', 'girlfriends', 'wife', 'wives', 'girlfriend', 'bride', 'brides', 'widow',
           'twin_sister', 'younger_sister', 'teenage_girl', 'teenage_girls', 'eldest_daughter','estranged_wife', 'schoolgirl',
        'businesswoman', 'congresswoman' , 'chairwoman', 'councilwoman', 'waitress', 'hostess', 'convent', 'heiress', 
           'saleswoman', 'queen', 'queens', 'princess', 'nun' , 'nuns', 'heroine', 'actress', 'actresses', 'uterus', 'vagina', 'ovarian_cancer',
        'maternal', 'maternity', 'motherhood', 'sisterhood', 'girlhood', 'matriarch', 'sorority', 'mare', 'hen', 'hens', 'filly', 'fillies',
          'deer', 'older_sister', 'oldest_daughter', 'stepdaughter', 'pink', 'cute', 'dependent', 'nurturing', 'hysterical', 'bitch',  'dance', 'dancing']
        neg_word_list=['sir', 'guys', 'gentleman','father', 'fathers', 'dad', 'papa', 'grandson' , 'son', 'sons', 'uncle', 'godfather', 
        'grandpa', 'grandfathers', 'grandfather', 'brother', 'brothers' , 'uncles', 'stepfather', 'grandsons', 'nephew',
           'fiance', 'ex_boyfriend', 'boyfriends', 'husband', 'husbands', 'boyfriend', 'groom', 'grooms', 'widower',
            'twin_brother', 'younger_brother', 'teenage_boy', 'teenage_boys', 'eldest_son', 'estranged_husband', 'schoolboy',
            'businessman', 'congressman', 'chairman', 'councilman', 'waiter', 'host', 'monastery', 'heir', 'salesman', 
            'king', 'kings', 'prince', 'monk', 'monks', 'hero', 'actor', 'actors', 'prostate', 'penis', 'prostate_cancer', 
        'paternal', 'paternity', 'fatherhood', 'brotherhood', 'boyhood', 'patriarch', 'fraternity', 'stallion', 'rooster', 'roosters', 'colt',
           'colts', 'bull', 'older_brother', 'oldest_son', 'stepson', 'blue' ,'manly', 'independent', 'aggressive', 'angry', 'jerk', 'wrestle', 'wrestling'  ]
        pos_word_replacement='woman'
        neg_word_replacement='man'
    
    pos_words=[]
    neg_words=[]
    pos_word_list_checked=[]
    neg_word_list_checked=[]
    for i in pos_word_list:
        try:
            pos_words.append(currentmodel[i])
            pos_word_list_checked.append(i)
        except KeyError:
            #print(str(i) +  ' was not in this Word2Vec models vocab, and has been replaced with: ' + str(pos_word_replacement) ) #uncomment this to be alerted each time a pos training word-vector is replaced
            pos_words.append(currentmodel[pos_word_replacement])
            pos_word_list_checked.append(pos_word_replacement)
    for i in neg_word_list:
        try:
            neg_words.append(currentmodel[i])
            neg_word_list_checked.append(i)
        except KeyError:
            #print(str(i) +  ' was not in this Word2Vec models vocab, and has been replaced with: ' + str(neg_word_replacement) ) #uncomment this to be alerted each time a neg training word-vector is replaced
            neg_words.append(currentmodel[neg_word_replacement])
            neg_word_list_checked.append(neg_word_replacement)

    print('\033[1m' + "Number of pos train words: "+ '\033[0m' + str(len(pos_words)) + '\033[1m' + " Number of neg train words: " + '\033[0m' + str(len(neg_words)) )
    train_classes= np.concatenate((np.array(np.repeat(1, len(pos_words))), np.array(np.repeat(0, len(neg_words))))) #1 is feminine/moral/healthy/rich by default 0 is masculine/immoral/unhealthy/poor by default    
    words= np.concatenate((np.asarray(pos_words), np.asarray(neg_words)))
    words= preprocessing.normalize(np.asarray(words), norm='l2')
    pos_word_list_checked.extend(neg_word_list_checked) #pos_word_list now includes neg words
    
    return(pos_word_list_checked, words, train_classes)

Testing Set

In [195]:
def select_testing_set(testingset):
    if testingset=='gender':
        test_word_list= ['goddess', 'single_mother', 'girlish', 'feminine', 'young_woman', 'little_girl', 'ladylike', 'my_mother', 
           'teenage_daughter', 'mistress', 'great_grandmother', 'adopted_daughter', 'femininity', 'motherly', 'matronly', 
           'showgirl', 'housewife', 'vice_chairwoman', 'co_chairwoman', 'spokeswoman', 'governess', 'divorcee', 'spinster', 
           'maid', 'countess', 'pregnant_woman', 'landlady', 'seamstress', 'young_girl', 'waif', 'femme_fatale','comedienne',
            'boyish', 'masculine',  'lad', 'policeman', 'macho', 'gentlemanly', 'machismo',  'teenage_son', 
            'beau', 'great_grandfather', 'tough_guy', 'masculinity', 'bad_boy', 'spokesman', 'baron', 'adult_male', 'landlord', 'fireman', 'mailman', 'vice_chairman', 
           'co_chairman','young_man', 'bearded', 'mustachioed', 'con_man', 'homeless_man', 'gent', 'strongman']
        test_classes=np.repeat(1, 32).tolist() #1 is feminine
        masc2=np.repeat(0, 28).tolist() #0 is masculine
        for i in masc2:
            test_classes.append(i) 
    elif testingset=='moral':
        test_word_list= ['great', 'best', 'faith', 'chaste', 'wholesome', 'noble', 'honorable', 'immaculate', 'gracious', 
           'courteous', 'delightful', 'earnest', 'amiable', 'admirable', 'disciplined', 'patience', 'integrity',
            'restraint', 'upstanding', 'diligent', 'dutiful', 'loving', 'righteous','respectable', 'praise', 'devout', 'forthright',
            'depraved', 'repulsive', 'repugnant', 'corruption', 'vicious', 'unlawful', 'outrage',  'shameless', 'perverted',
            'filthy', 'lewd', 'subversive', 'sinister', 'murderous', 'perverse', 
           'monstrous', 'homicidal', 'indignant', 'misdemeanor', 'degenerate', 'malevolent', 'illegal','terrorist','terrorism',  
             'cheated', 'vengeful', 'culpable','vile', 'hateful', 'abuse', 'abusive', 'criminal', 'deviant']
        test_classes=np.repeat(1, 27 ).tolist() #1 is feminine
        masc2=np.repeat(0,33).tolist() #0 is masculine
        for i in masc2:
            test_classes.append(i)
    elif testingset=='health':
        test_word_list= [ 'balanced_diet', 'healthfulness', 'fiber', 'jogging', 'stopping_smoking', 'vigor', 
          'active', 'fit', 'flourishing', 'sustaining', 'hygienic', 'hearty', 'enduring', 'energized', 'wholesome', 
           'holistic', 'healed', 'fitter', 'health_conscious', 'more_nutritious', 'live_longer',  'exercising_regularly',
           'healthier_choices', 'healthy_habits', 'healthy_lifestyle', 'healthful_eating', 'immune', 
            'deadly', 'diseased',  'adverse', 'risky', 'fatal', 'filthy', 'epidemic', 'crippling', 'carcinogenic', 'carcinogen',
           'crippled', 'afflicted', 'contaminated', 'fatigued', 'detrimental', 'bedridden', 'incurable', 'hospitalized',
           'infected', 'ailing', 'debilitated', 'poisons', 'disabling', 'life_threatening', 'debilitating', 
           'chronic_illness', 'artery_clogging', 'hypertension','disease', 'stroke',
            'plague', 'fatty', 'smoking']
        test_classes=np.repeat(1, 27).tolist() #1 is feminine
        masc2=np.repeat(0, 33 ).tolist() #0 is masculine
        for i in masc2:
            test_classes.append(i) 
    elif testingset=='ses':
        test_word_list= ['rich', 'billionaire', 'banker',  'fortune', 'heiress', 'cosmopolitan', 'ornate', 'entrepreneur', 'sophisticated',
                'aristocratic', 'investor', 'highly_educated', 'better_educated',  'splendor', 
               'businessman', 'opulent', 'multimillionaire', 'philanthropist', 'estate', 'estates', 'chateau', 'fortunes', 
               'financier', 'young_professionals','tycoon', 'baron', 'grandeur', 'magnate', 
               'investment_banker', 'venture_capitalist', 'upwardly_mobile', 'highly_skilled', 'yuppies', 'genteel',
                         'homelessness', 'ruin', 'ruined', 'downtrodden', 'less_affluent',
                'housing_project', 'homeless_shelters', 'indigent', 'jobless', 'welfare',  
                'temporary_shelters','housing_projects', 'subsidized_housing', 'starving', 'beggars', 'orphanages',
                'dispossessed', 'uninsured', 'welfare_recipients', 'food_stamps', 
                'malnutrition',  'underemployed', 'disenfranchised', 'servants', 'displaced', 'poor_families'] 
        test_classes=np.repeat(1, 34).tolist()#1 is feminine
        masc2=np.repeat(0, 26).tolist() #0 is masculine
        for i in masc2:
            test_classes.append(i) 
    elif testingset=='gender_stereotypes':
        test_word_list=['petite', 'cooking', 'graceful',  'housework', 'soft', 'whisper', 'flirtatious', 'accepting', 'blonde', 'blond', 'doll', 'dolls','nurse',  'estrogen', 'lipstick','pregnant', 'nanny', 'pink', 
                 'sewing', 'modeling', 'dainty', 'gentle', 'children','pregnancy', 'nurturing', 'depressed', 'nice', 'emotional','depression', 'home', 'kitchen', 'quiet', 'submissive',
                   'soldier', 'army', 'drafted', 'military',   'beard', 'mustache', 'genius', 'engineering', 'math', 
                  'brilliant', 'strong', 'strength',  'politician', 'programmer','doctor', 'sexual', 'aggressive', 
                    'testosterone', 'tall', 'competitive', 'big', 'powerful', 'mean', 'sports', 'fighting', 'confident', 'rough', 'loud', 'worldly',
                   'experienced', 'insensitive', 'ambitious', 'dominant']
        test_classes=np.repeat(1, 33 ).tolist() #1 is feminine
        masc2=np.repeat(0,33).tolist() #0 is masculine
        for i in masc2:
            test_classes.append(i) 
    else:
        print('choose a testing set: gender, moral, health, or ses')
        
    test_words=[]
    test_word_list_checked=[]
    test_classes_checked=[] 
    for i in test_word_list:
        try:
            test_words.append(currentmodel[i])
            test_word_list_checked.append(i)
            test_classes_checked.append(test_classes[test_word_list.index(i)]) 
        except KeyError:
            continue
            #print(str(i) +  ' was not in this Word2Vec models vocab, and has been removed as a test word') #uncomment this to be alerted each time a test word is not included in your model's vocabulary
            #index_missing= test_word_list.index(i) #new
            #del(test_classes[index_missing]) 
            #test_words.append(currentmodel[test_word_replacement])
            #test_word_list_checked.append(test_word_replacement)
            #get index of word, and remove this from classes, and do not append to list of vectors and word-list

    test_words= preprocessing.normalize(np.asarray(test_words), norm='l2')

    test_classes_checked=np.asarray(test_classes_checked)
    print('\033[1m'+ "Number of test words in model vocabulary, out of 60: " + '\033[0m' + str(len(test_words)))
    return(test_word_list_checked, test_words, test_classes_checked)

<a id='Robustness'></a> 
# Part 4. Robustness Checks

Select the dimension you are interested in (gender, moral, health, or ses)

In [302]:
train_word_list_checked, train_words, train_classes= select_training_set('gender')

[1mNumber of pos train words: [0m85[1m Number of neg train words: [0m85


Do **cross validation** to see how accuracy changes at classifying words if we just use a subset of training words to extract the dimension and and look at how the dimension classifies that subset of words and how it classifes the held-out words. This tells us, for example, how robust our methods are to our word choices. It is also a way we can try different models, and different parametrizations of models, and see how overfitted our models are.  

In [303]:
#cross validation
kf= KFold(n_splits=len(words), shuffle=True)  
#n_splits written here is leave-one-word-out cross validation, which is the maximum for n_splits. Try various quantities of n_splits. 

#NOTE that we are only using our training words here, but dividing it into a "sub" training set, and an unseen set of the trainign words, we call in this code block the "test" set. But we do not use our unseen true "test" words until later on. 

trainacc=[] 
testacc=[] 

for train_index, test_index in kf.split(words): #only need the indices on pos words or neg words, then will be the same indices to use for both    
    
    clf=svm.SVC(kernel='linear', C=1) #Use linear kernel, since not much data. More complex kernels performed worse and very high SD on accuracy. #C is 1 by default and it’s a reasonable default choice. If you have a lot of noisy observations you should decrease it. It corresponds to regularize more the estimation.
    #clf = RandomForestClassifier(n_estimators=100, max_depth=3, max_features=None, random_state=234) #max_features=None means all features are tried, rather than a sample. So the only randomness is the data. Default is that a sample of sqrt(n_features) is tried out for each tree, but this doesn't perform as well on training data, and doesn't make sense theoretically since I expect there is a few specific features that carry most of gender information. I tried betwen max depth of 2-4; 3 seems best for gender but on smaller sets consider 2.
    #clf = MLPClassifier(hidden_layer_sizes=(5)) #to try neural network rather than SVM, but really not enough data here, its just a sample of how to change the ML classifier here. 
    clf= clf.fit(train_words[train_index], train_classes[train_index] )
    
    #Now get predictions on the "training" set of the fold
    predictions_training=clf.predict(train_words[train_index])
    trainacc.append(accuracy_score(train_classes[train_index], predictions_training)) #append accuracy from this specific training subset

    #Now get predictions on subset of unseen trainning words (i.e., validation set)
    predictions_testing=clf.predict(train_words[test_index])
    testacc.append(accuracy_score(train_classes[test_index], predictions_testing)) #append accuracy from this specific 'testing' subset

    
print('\033[1m' +'Mean Accuracy across Training Subsets:'  + '\033[0m'+ str(statistics.mean(trainacc)))
print('\033[1m' +'Standard Deviation of Accuracy across Training Subsets:'  + '\033[0m'+ str(statistics.stdev(trainacc)))
print('\033[1m' +  'Mean Accuracy across Held-Out Subsets: ' + '\033[0m'+ str(statistics.mean(testacc)))
print('\033[1m' +'Standard Deviation of Accuracy across Held-Out Subsets: ' + '\033[0m' + str(statistics.stdev(testacc)))

[1mMean Accuracy across Training Subsets:[0m0.952832438879
[1mStandard Deviation of Accuracy across Training Subsets:[0m0.0033484709842960454
[1mMean Accuracy across Held-Out Subsets: [0m0.884615384615
[1mStandard Deviation of Accuracy across Held-Out Subsets: [0m0.320721458628893


### Trying out different types of machine-learning classifers on all training data:

Try SVM on all training data, still using dimension selected above

In [107]:
clf=svm.SVC(kernel='linear', C=1) #after grid search, for Gender it seems that between C=1 to C=5 is ideal, and C=3 is best
clf = clf.fit(train_words, train_classes)
predictions =clf.predict(train_words) #consider using proba rather than binary
print('\033[1m' +'Accuracy on Training Data with SVM Classifier:'  + '\033[0m'+ str(accuracy_score(train_classes, predictions)))

[1mAccuracy on Training Data with SVM Classifier:[0m0.941176470588


Try Decision Tree on all training data, still using dimension selected above

In [108]:
#fit tree with chosen depth, for training set
clf = tree.DecisionTreeClassifier(max_depth=3) #tried betwen max depth of 2-4; 3 seems best for gender but on smaller sets consider 2.
clf = clf.fit(train_words, train_classes)
predictions =clf.predict(train_words) #consider using proba rather than binary
print('\033[1m' +'Accuracy on Training Data with Decision Tree Classifier:'  + '\033[0m'+ str(accuracy_score(train_classes, predictions)))

[1mAccuracy on Training Data with Decision Tree Classifier:[0m0.917647058824


Try a Random Forest on all training data, still using dimension selected above

In [109]:
clf = RandomForestClassifier(n_estimators=100, max_depth=3, max_features=None, random_state=234) #max_features=None means all features are tried, rather than a sample. So the only randomness is the data. Default is that a sample of sqrt(n_features) is tried out for each tree, but this doesn't perform as well on training data, and doesn't make sense theoretically since I expect there is a few specific features that carry most of gender information. I tried betwen max depth of 2-4; 3 seems best for gender but on smaller sets consider 2.
clf = clf.fit(train_words, train_classes)
predictions =clf.predict(train_words) #consider using proba rather than binary
print('\033[1m' +'Accuracy on Training Data with a Random Forest Classifier:'  + '\033[0m'+ str(accuracy_score(train_classes, predictions)))

[1mAccuracy on Training Data with a Random Forest Classifier:[0m0.994117647059


<a id='Results'></a> 
# Part 5. Results

Select the dimension you're interested in:

In [342]:
train_word_list_checked, train_words, train_classes= select_training_set('gender_3')  #note that many of these dimensions are likely picking up similar signal of "valence." We can see this, in part, because if we use a trainig set from one dimensiion we still do well on the testing set form another dimensions.  In Parts B and C we saw that the cosine smilarity of the extracted dimension was not 1, meaning that these methods are NOT picking up the same thing. Its a little less clear with this ML method, but this ML method is still an interesting experiment.  
test_word_list_checked, test_words, test_classes = select_testing_set('gender')

[1mNumber of pos train words: [0m80[1m Number of neg train words: [0m80
[1mNumber of test words in model vocabulary, out of 60: [0m60


Get Results

In [343]:
clf=svm.SVC(kernel='linear', C=1, probability=True) #after grid search, for Gender it seems that between C=1 to C=5 is ideal, and C=3 is best
clf = clf.fit(train_words, train_classes)
train_predictions =clf.predict(train_words) #consider using proba rather than binary
train_proba_predictions =clf.predict_proba(train_words) 

#clf = clf.fit(test_words, test_classes)
test_predictions =clf.predict(test_words) #consider using proba rather than binary
test_proba_predictions =clf.predict_proba(test_words) 

print('\033[1m' +'% Accuracy on Training Data with SVM Classifier:'  + '\033[0m'+ str(accuracy_score(train_classes, train_predictions)) )
print('\033[1m' +'% Accuracy on Testing Data with SVM Classifier:'  + '\033[0m'+ str(accuracy_score(test_classes, test_predictions)) )

print('\033[1m' +'N Accuracy on Training Data with SVM Classifier:'  + '\033[0m'+ str(accuracy_score(train_classes, train_predictions, normalize=False)) )
print('\033[1m' +'N Accuracy on Testing Data with SVM Classifier:'  + '\033[0m'+ str(accuracy_score(test_classes, test_predictions, normalize=False)) )

[1m% Accuracy on Training Data with SVM Classifier:[0m0.98125
[1m% Accuracy on Testing Data with SVM Classifier:[0m0.983333333333
[1mN Accuracy on Training Data with SVM Classifier:[0m157
[1mN Accuracy on Testing Data with SVM Classifier:[0m59


Visualize Training Results

In [None]:
#WORK HERE!!!!

rcParams['figure.figsize'] = 9,9
#xlim([-.03, .03])

train_classes_relabeled=[] #quick hack to get legend to show Positive/Negative instead of 0/1
for i in train_classes:
    if i==1:
        train_classes_relabeled.append('Positive')
    else:
        train_classes_relabeled.append('Negative')

myplot= sns.stripplot(train_proba_predictions, train_word_list_checked, train_classes_relabeled, jitter=True, size=10)
plt.legend(loc="upper left", bbox_to_anchor=[0, 1],
           ncol=2, shadow=True, title="True Class", fancybox=True)

plt.axvline(x=0, color='r', linestyle='-')
#plt.title('Train Words \n Positive is feminine/moral/healthy/high-ses \n Negative is masculine/immoral/unhealthy/low-ses')
#plt.xlabel('Predicted')
#plt.show()

Visualize Testing Results

Visualize Results about Body Weight Langauge

In [344]:
obese_words_seeifinmodel=['obese', 'obesity', 'diabetic', 'diabetes', 'weight', 'overweight', 'thin', 'slender', 'burly',
                'muscular', 'diet', 'dieting', 'health', 'healthy', 'unhealthy', 'fat', 'anorexic', 'anorexia', 'bulimia', 
                'beautiful', 'handsome', 'overeating', 'exercise', 'sedentary', 'bulimic', 'morbidly_obese', 'normal_weight',
                'seriously_overweight']
obese_word_list= []
obese_words=[]

#check if these words are in your model
for i in range(0, len(obese_words_seeifinmodel)):
    try:
        currentmodel[obese_words_seeifinmodel[i]]
        obese_words.append(currentmodel[obese_words_seeifinmodel[i]])
        obese_word_list.append(obese_words_seeifinmodel[i])
    except:
        print(str(i) + " was not in this model's vocabulary and has been removed")
        continue
obese_words=np.asarray(obese_words)
obese_words = preprocessing.normalize(obese_words, norm='l2')
obese_predictions =clf.predict(obese_words) #consider using proba rather than binary
obese_proba_predictions =clf.predict_proba(obese_words) #consider using proba rather than binary

In [345]:
for i in range(0, len(obese_words)):
    print(obese_word_list[i], obese_predictions[i], obese_proba_predictions[i])
    #print(obese_word_list[i], obese_predictions[i])

obese 0 [ 0.83567817  0.16432183]
obesity 0 [ 0.97957666  0.02042334]
diabetic 1 [ 0.29098905  0.70901095]
diabetes 0 [ 0.44302918  0.55697082]
weight 1 [ 0.24125158  0.75874842]
overweight 0 [ 0.93422192  0.06577808]
thin 1 [ 0.13589542  0.86410458]
slender 1 [  3.24099906e-06   9.99996759e-01]
burly 1 [ 0.10080493  0.89919507]
muscular 1 [ 0.00970164  0.99029836]
diet 1 [ 0.22915466  0.77084534]
dieting 0 [ 0.99010841  0.00989159]
health 1 [ 0.00821638  0.99178362]
healthy 1 [  6.11019209e-07   9.99999389e-01]
unhealthy 0 [ 0.99176605  0.00823395]
fat 1 [ 0.36354711  0.63645289]
anorexic 0 [ 0.98520339  0.01479661]
anorexia 0 [ 0.98359532  0.01640468]
bulimia 0 [ 0.99771094  0.00228906]
beautiful 1 [ 0.00394709  0.99605291]
handsome 1 [  6.66233590e-07   9.99999334e-01]
overeating 0 [  9.99754328e-01   2.45672244e-04]
exercise 1 [ 0.15248789  0.84751211]
sedentary 1 [ 0.04108203  0.95891797]
bulimic 0 [ 0.98745063  0.01254937]
morbidly_obese 1 [ 0.27716831  0.72283169]
normal_weight 