In this notebook, we'll combine two different works: we'll use the geometry of word embeddings used in Kozlowski et al. (2019), "[The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings](https://journals.sagepub.com/doi/full/10.1177/0003122419877135)" to explore axes of cultural meaning in So and Rowland (2020), "[Race and Distant Reading](https://www.mlajournals.org/doi/abs/10.1632/pmla.2020.135.1.59)".

In [1]:
import re
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import numpy as np
from pandas import DataFrame
import numpy.linalg as LA

First, let's load up two sets of pre-trained embeddings.  `fiction.embeddings.txt` is trained on 296 works of contemporary fiction (published between 1924 and 2020); `bbip.embeddings.txt` is trained on 45 works written by Black writers, selected from the [Black Book Interactive Project](https://bbip.ku.edu).
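
Calling `load_word2vec_format` with `binary=False` expects the plain-text word2vec layout: a header line giving the vocabulary size and the number of dimensions, then one line per word with its vector components. The sketch below parses a tiny made-up example by hand to show that layout. The helper `read_word2vec_text` and the toy words and values are illustrative assumptions, not part of gensim and not the contents of the real files.

```python
import io
import numpy as np

# Toy file contents in word2vec text format (made-up words and values):
# first line is "<vocabulary size> <dimensions>", then one word per line.
toy_file = io.StringIO(
    "3 4\n"
    "rich 0.1 0.2 0.3 0.4\n"
    "poor -0.1 0.0 0.2 0.5\n"
    "jazz 0.7 -0.3 0.0 0.1\n"
)

def read_word2vec_text(handle):
    """Parse word2vec text format into a {word: vector} dict (illustration only)."""
    vocab_size, dims = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array(parts[1:], dtype=float)
    # Sanity checks against the header line.
    assert len(vectors) == vocab_size
    assert all(v.shape == (dims,) for v in vectors.values())
    return vectors

toy_vectors = read_word2vec_text(toy_file)
print(toy_vectors["rich"])
```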

In [2]:
fiction_vectors = KeyedVectors.load_word2vec_format("../data/fiction.embeddings.txt", binary=False)

In [3]:
bbip_vectors = KeyedVectors.load_word2vec_format("../data/bbip.embeddings.txt", binary=False)

In [4]:
def get_vector(word, vectors):
    return vectors[word]/LA.norm(vectors[word])

In [5]:
def get_affluence_vector(vectors):
    
    """ affluence word pairs from Kozlowski et al. (2019) """
    
    vecs=[]
    vecs.append(get_vector("rich", vectors)-get_vector("poor", vectors))
    vecs.append(get_vector("richer", vectors)-get_vector("poorer", vectors))
    vecs.append(get_vector("richest", vectors)-get_vector("poorest", vectors))
    vecs.append(get_vector("affluence", vectors)-get_vector("poverty", vectors))
    vecs.append(get_vector("affluent", vectors)-get_vector("destitute", vectors))
    vecs.append(get_vector("wealthy", vectors)-get_vector("impoverished", vectors))
    vecs.append(get_vector("costly", vectors)-get_vector("economical", vectors))
    vecs.append(get_vector("expensive", vectors)-get_vector("inexpensive", vectors))
    vecs.append(get_vector("exquisite", vectors)-get_vector("ruined", vectors))
    vecs.append(get_vector("invaluable", vectors)-get_vector("cheap", vectors))
    vecs.append(get_vector("lavish", vectors)-get_vector("economical", vectors))
    vecs.append(get_vector("luxurious", vectors)-get_vector("threadbare", vectors))
    vecs.append(get_vector("luxury", vectors)-get_vector("cheap", vectors))
    vecs.append(get_vector("plush", vectors)-get_vector("threadbare", vectors))
    vecs.append(get_vector("precious", vectors)-get_vector("cheap", vectors))
    vecs.append(get_vector("priceless", vectors)-get_vector("worthless", vectors))
    vecs.append(get_vector("successful", vectors)-get_vector("unsuccessful", vectors))
    vecs.append(get_vector("sumptuous", vectors)-get_vector("plain", vectors))
    vecs.append(get_vector("rich", vectors)-get_vector("penniless", vectors))
    vecs.append(get_vector("posh", vectors)-get_vector("plain", vectors))

    # commenting out word pairs from Kozlowski et al. 2019 not in the vocabulary here 
#     vecs.append(get_vector("advantaged", vectors)-get_vector("needy", vectors))
#     vecs.append(get_vector("exorbitant", vectors)-get_vector("impecunious", vectors))
#     vecs.append(get_vector("extravagant", vectors)-get_vector("necessitous", vectors))
#     vecs.append(get_vector("flush", vectors)-get_vector("skint", vectors))
#     vecs.append(get_vector("luxuriant", vectors)-get_vector("penurious", vectors))
#     vecs.append(get_vector("moneyed", vectors)-get_vector("unmonied", vectors))
#     vecs.append(get_vector("opulent", vectors)-get_vector("indigent", vectors))
#     vecs.append(get_vector("privileged", vectors)-get_vector("underprivileged", vectors))
#     vecs.append(get_vector("propertied", vectors)-get_vector("bankrupt", vectors))
#     vecs.append(get_vector("prosperous", vectors)-get_vector("unprosperous", vectors))
#     vecs.append(get_vector("developed", vectors)-get_vector("underdeveloped", vectors))
#     vecs.append(get_vector("solvency", vectors)-get_vector("insolvency", vectors))
#     vecs.append(get_vector("swanky", vectors)-get_vector("basic", vectors))
#     vecs.append(get_vector("thriving", vectors)-get_vector("disadvantaged", vectors))
#     vecs.append(get_vector("upscale", vectors)-get_vector("squalid", vectors))
#     vecs.append(get_vector("valuable", vectors)-get_vector("valueless", vectors))
#     vecs.append(get_vector("classy", vectors)-get_vector("beggarly", vectors))
#     vecs.append(get_vector("ritzy", vectors)-get_vector("ramshackle", vectors))
#     vecs.append(get_vector("opulence", vectors)-get_vector("indigence", vectors))
#     vecs.append(get_vector("solvent", vectors)-get_vector("insolvent", vectors))
#     vecs.append(get_vector("moneyed", vectors)-get_vector("moneyless", vectors))
#     vecs.append(get_vector("affluence", vectors)-get_vector("penury", vectors))

    return np.mean(vecs, axis=0)

In [6]:
def get_gender_vector(vectors):
    
    """ gender word pairs from Kozlowski et al. (2019) """

    vecs=[]
    vecs.append(get_vector("man", vectors)-get_vector("woman", vectors))
    vecs.append(get_vector("men", vectors)-get_vector("women", vectors))
    vecs.append(get_vector("he", vectors)-get_vector("she", vectors))
    vecs.append(get_vector("him", vectors)-get_vector("her", vectors))
    vecs.append(get_vector("his", vectors)-get_vector("her", vectors))
    vecs.append(get_vector("his", vectors)-get_vector("hers", vectors))
    vecs.append(get_vector("boy", vectors)-get_vector("girl", vectors))
    vecs.append(get_vector("boys", vectors)-get_vector("girls", vectors))
    vecs.append(get_vector("male", vectors)-get_vector("female", vectors))

    # commenting out word pairs from Kozlowski et al. 2019 not in the vocabulary here 
    #     vecs.append(get_vector("masculine", vectors)-get_vector("feminine", vectors))

    return np.mean(vecs, axis=0)

In [7]:
def get_race_vector(vectors):
    
    """ race word pairs from Kozlowski et al. (2019) """
    
    vecs=[]
    vecs.append(get_vector("black", vectors)-get_vector("white", vectors))
    vecs.append(get_vector("blacks", vectors)-get_vector("whites", vectors))
    vecs.append(get_vector("african", vectors)-get_vector("european", vectors))
    vecs.append(get_vector("african", vectors)-get_vector("caucasian", vectors))
    
    # commenting out word pairs from Kozlowski et al. 2019 not in the vocabulary here 
    #     vecs.append(get_vector("afro", vectors)-get_vector("anglo", vectors))
    
    return np.mean(vecs, axis=0)

In [8]:
def get_sentiment_vector(vectors):
    
    """ let's add a sentiment dimension of our own """
        
    vecs=[]
    vecs.append(get_vector("good", vectors)-get_vector("bad", vectors))
    vecs.append(get_vector("better", vectors)-get_vector("worse", vectors))
    vecs.append(get_vector("best", vectors)-get_vector("worst", vectors))
    vecs.append(get_vector("great", vectors)-get_vector("terrible", vectors))

    return np.mean(vecs, axis=0)

In [9]:
def get_scores(terms, vectors):
    racial_vector=get_race_vector(vectors)
    affluence_vector=get_affluence_vector(vectors)
    gender_vector=get_gender_vector(vectors)
    sentiment_vector=get_sentiment_vector(vectors)
    all_scores=[]
    
    for term in terms:
        scores=[]
        scores.append(term)
        
        # cosine_similarities returns a one-element array here; take [0] to format a scalar
        scores.append("%.3f" % vectors.cosine_similarities(racial_vector, [vectors[term]])[0])
        scores.append("%.3f" % vectors.cosine_similarities(affluence_vector, [vectors[term]])[0])
        scores.append("%.3f" % vectors.cosine_similarities(gender_vector, [vectors[term]])[0])
        scores.append("%.3f" % vectors.cosine_similarities(sentiment_vector, [vectors[term]])[0])

        all_scores.append(scores)
        
    print(DataFrame(all_scores, columns=["term", "racial", "affluence", "gender", "sentiment"]))
        

As Kozlowski et al. (2019) point out, we can orient individual words along these cultural axes by seeing where they fall along the continuum defined by the endpoints.  Word vectors with higher positive cosine similarities have stronger orientation toward "rich", "man", "black" and "positive" (along the affluence, gender, race and sentiment axes, respectively); word vectors with higher negative cosine similarities have stronger orientation toward "poor", "woman", "white" and "negative".
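
The construction described here can be sketched on toy data: average the differences between unit-normalized antonym pairs to get an axis, then take the cosine similarity between that axis and a word vector. The three-dimensional vectors below are made up for illustration, and the names `unit`, `cosine`, and `toy` are hypothetical; the real notebook uses the loaded embeddings and the functions defined above.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1, as get_vector does in the notebook."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-d "embeddings", for illustration only.
toy = {
    "rich":         np.array([0.9, 0.1, 0.2]),
    "poor":         np.array([-0.8, 0.2, 0.1]),
    "wealthy":      np.array([0.7, 0.3, 0.0]),
    "impoverished": np.array([-0.6, 0.1, 0.3]),
    "yacht":        np.array([0.5, 0.5, 0.1]),
    "debt":         np.array([-0.4, 0.6, 0.2]),
}

# The axis is the mean of (positive pole - negative pole) over the seed pairs.
pairs = [("rich", "poor"), ("wealthy", "impoverished")]
axis = np.mean([unit(toy[a]) - unit(toy[b]) for a, b in pairs], axis=0)

# Positive scores lean toward the "rich" pole, negative toward the "poor" pole.
for word in ["yacht", "debt"]:
    print(word, round(cosine(axis, toy[word]), 3))
```

Because cosine similarity ignores vector length, a score reflects only the direction of a word relative to the axis, not how frequent or "strong" the word is.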

First, let's check the orientation for the seed terms that partly define the axes to get a sense of the bounds for each of the word vectors we have.

In [10]:
terms=["black", "white", "rich", "poor", "man", "woman", "good", "bad"]

In [11]:
get_scores(terms, fiction_vectors)

    term  racial affluence  gender sentiment
0  black  -0.036    -0.087   0.052    -0.005
1  white  -0.215     0.010  -0.082     0.071
2   rich  -0.053     0.192  -0.162     0.216
3   poor   0.009    -0.331  -0.134    -0.088
4    man   0.073    -0.106   0.174     0.045
5  woman   0.009    -0.068  -0.280     0.012
6   good  -0.102     0.041  -0.006     0.296
7    bad  -0.015    -0.124   0.011    -0.249


In [12]:
get_scores(terms, bbip_vectors)

    term  racial affluence  gender sentiment
0  black   0.302    -0.043  -0.031     0.057
1  white   0.012    -0.034  -0.078     0.133
2   rich   0.239     0.058  -0.066     0.107
3   poor   0.235    -0.004  -0.106     0.097
4    man   0.080     0.152   0.276     0.106
5  woman   0.058     0.202  -0.238     0.069
6   good  -0.080     0.005   0.112     0.237
7    bad  -0.019     0.026   0.121    -0.195


Next, explore this semantic field by querying the orientation for other terms.  How does the affiliation of these terms with race, gender, affluence and sentiment accord with your expectations?  How could you use this method to interrogate the representation of race, gender, and affluence in the datasets these works are drawn from?  Come up with a few terms of your own to query.
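
When you try your own terms, keep in mind that `get_scores` looks up `vectors[term]` directly, so a word missing from a model's vocabulary will raise an error. A minimal sketch of checking terms first is shown below. The helper `split_by_vocabulary` is hypothetical, and a plain dict stands in for a loaded model so the example runs on its own; it assumes only that the object supports `word in vectors`.

```python
def split_by_vocabulary(terms, vectors):
    """Return (known, missing) for anything that supports `word in vectors`."""
    known = [t for t in terms if t in vectors]
    missing = [t for t in terms if t not in vectors]
    return known, missing

# Stand-in for a loaded model (made-up values), used so the sketch is self-contained.
toy_vectors = {"freedom": [0.1, 0.2], "jazz": [0.3, 0.4]}

known, missing = split_by_vocabulary(["freedom", "jazz", "zydeco"], toy_vectors)
print("in vocabulary:", known)
print("not in vocabulary:", missing)
```

With the real models, the same idea would be to call `split_by_vocabulary(terms, fiction_vectors)` and pass only the known terms to `get_scores`, assuming membership tests behave the same way on the loaded object.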

In [13]:
terms=["freedom", "slavery", "jazz", "opera", "family"]

In [14]:
get_scores(terms, fiction_vectors)

      term  racial affluence  gender sentiment
0  freedom  -0.053     0.137  -0.038     0.128
1  slavery   0.107    -0.007   0.052    -0.164
2     jazz   0.134     0.065  -0.010     0.099
3    opera   0.084     0.242  -0.168     0.207
4   family   0.060     0.313  -0.105     0.173


In [15]:
get_scores(terms, bbip_vectors)

      term  racial affluence  gender sentiment
0  freedom   0.192     0.278   0.051    -0.005
1  slavery   0.219     0.384   0.029    -0.076
2     jazz  -0.149     0.125  -0.158    -0.153
3    opera  -0.264     0.056  -0.075    -0.134
4   family   0.170     0.286  -0.009     0.221


Finally, let's take the terms identified in So and Rowland (2020) that contribute most to the misclassification of James Baldwin's *Giovanni's Room*.  By using the geometry of word embeddings, we can examine *how* the words used in these texts are being deployed.  How is this approach different from the methods used in So and Rowland (2020)?  How would you relate the conclusions you draw to their argument?

In [16]:
terms=["absolutely", "very", "course", "appalled", "might", "white"]

In [17]:
get_scores(terms, fiction_vectors)

         term  racial affluence  gender sentiment
0  absolutely  -0.131    -0.042  -0.188    -0.130
1        very  -0.088    -0.050  -0.113     0.034
2      course   0.027     0.067  -0.143     0.027
3    appalled  -0.067    -0.120  -0.013    -0.224
4       might  -0.105    -0.077  -0.053    -0.045
5       white  -0.215     0.010  -0.082     0.071


In [18]:
get_scores(terms, bbip_vectors)

         term  racial affluence  gender sentiment
0  absolutely   0.034     0.142  -0.076    -0.410
1        very   0.001     0.105  -0.123     0.100
2      course   0.069     0.234  -0.004    -0.022
3    appalled  -0.300     0.086  -0.099    -0.132
4       might  -0.139     0.144  -0.025     0.077
5       white   0.012    -0.034  -0.078     0.133
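
One way to read the two tables side by side is to subtract the scores term by term and see which axis shifts most between corpora. The sketch below does this for three of the terms, using scores copied from the In [17] and In [18] outputs; the helper `largest_shift` is hypothetical. Since the two embedding spaces were trained separately, treat the differences as a rough descriptive aid rather than a precise measurement.

```python
import numpy as np

axes = ["racial", "affluence", "gender", "sentiment"]

# Scores copied from the In [17] (fiction) and In [18] (BBIP) outputs.
fiction = {
    "absolutely": [-0.131, -0.042, -0.188, -0.130],
    "appalled":   [-0.067, -0.120, -0.013, -0.224],
    "white":      [-0.215,  0.010, -0.082,  0.071],
}
bbip = {
    "absolutely": [0.034,  0.142, -0.076, -0.410],
    "appalled":   [-0.300, 0.086, -0.099, -0.132],
    "white":      [0.012, -0.034, -0.078,  0.133],
}

def largest_shift(term):
    """Return the axis with the largest absolute change (BBIP minus fiction) and that change."""
    diff = np.array(bbip[term]) - np.array(fiction[term])
    i = int(np.argmax(np.abs(diff)))
    return axes[i], round(float(diff[i]), 3)

for term in fiction:
    print(term, largest_shift(term))
```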
