# Week 4 - Exploring Semantic Spaces (Word Embeddings)
This week, we build on last week's topic modeling techniques. We take a text corpus we have developed, specify an underlying number of dimensions, and train a neural network auto-encoder (one of Google's word2vec algorithms) that best describes corpus words in their local linguistic contexts. We then explore the words' locations in the resulting space to learn about the discursive culture that produced them.

This is the third document representation we have learned. First, we used word counts. Second, we used LDA topic models built around term co-occurrence in the same document (i.e., a "bag of words"). Third, documents here are represented as densely indexed locations in dimensions, so that distances between those documents (and words) contain more information, though they require the full vector of dimension loadings (rather than just a few selected topic loadings) to describe. We will explore these spaces to understand complex, semantic relationships between words, index documents with descriptive words, identify the likelihood that a given document would have been produced by a given vector model, and explore how semantic categories can help us understand the cultures that produced them.

Note that most modern natural language processing (NLP) research, at least in computer science, uses word embeddings, which are the foundation of most state-of-the-art models.

Also note that the code in this Notebook can take many minutes or even hours to run. This is the case for most NLP research these days, and it's a good opportunity to start thinking about how to manage high-compute workloads, such as running code on small samples to test it, loading datafiles in [chunks](https://stackoverflow.com/a/25962187), or [multiprocessing](https://en.wikipedia.org/wiki/Multiprocessing).
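
As one way to act on the chunking suggestion, the sketch below streams a CSV through pandas in fixed-size pieces so that only one piece is in memory at a time. The in-memory CSV, the column names `doc_id` and `text`, the helper name `count_tokens_in_chunks`, and the chunk size are illustrative assumptions, not part of this notebook's data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file on disk.
csv_text = (
    "doc_id,text\n"
    "1,first bill text\n"
    "2,second bill text\n"
    "3,third bill\n"
    "4,fourth bill text here\n"
    "5,fifth\n"
)

def count_tokens_in_chunks(source, chunksize=2):
    """Sum whitespace-token counts over a CSV read `chunksize` rows at a time."""
    total_tokens = 0
    n_chunks = 0
    # With chunksize set, read_csv yields DataFrames instead of loading everything.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total_tokens += chunk["text"].str.split().str.len().sum()
        n_chunks += 1
    return int(total_tokens), n_chunks

total_tokens, n_chunks = count_tokens_in_chunks(io.StringIO(csv_text))
print(total_tokens, n_chunks)
```

The same loop works with a file path in place of the `StringIO` buffer, which also makes it easy to test on the first chunk alone before committing to a long run.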

## <font color="red">*Pitch Your Project*</font>

<font color="red">In the three cells immediately following, describe **WHAT** you are planning to analyze for your final project (i.e., texts, contexts and the social game, world and actors you intend to learn about through your analysis) (<200 words), **WHY** you are going to do it (i.e., why would theory and/or the average person benefit from knowing the results of your investigation) (<200 words), and **HOW** you plan to investigate it (i.e., what are the approaches and operations you plan to perform, in sequence, to yield this insight) (<400 words).</font>

### ***What?***
For our project, Michael Plunkett and I will be analyzing congressional and supreme court abortion legislation. Particularly, for the congressional legislation, we will be analyzing all legislation since 1973 until 2024. This legislation was pulled from the [congress.gov legislation search](https://www.congress.gov/advanced-search/legislation?congressGroup%5B%5D=0&congresses%5B%5D=118&congresses%5B%5D=117&congresses%5B%5D=116&congresses%5B%5D=115&congresses%5B%5D=114&congresses%5B%5D=113&congresses%5B%5D=112&congresses%5B%5D=111&congresses%5B%5D=110&congresses%5B%5D=109&congresses%5B%5D=108&congresses%5B%5D=107&congresses%5B%5D=106&congresses%5B%5D=105&congresses%5B%5D=104&congresses%5B%5D=103&congresses%5B%5D=102&congresses%5B%5D=101&congresses%5B%5D=100&congresses%5B%5D=99&congresses%5B%5D=98&congresses%5B%5D=97&congresses%5B%5D=96&congresses%5B%5D=95&congresses%5B%5D=94&congresses%5B%5D=93&legislationNumbers=&restrictionType=field&restrictionFields%5B%5D=allBillTitles&restrictionFields%5B%5D=summary&summaryField=billSummary&enterTerms=%22reproductive+health+care%22%2C+%22reproduction%22%2C+%22abortion%22&legislationTypes%5B%5D=hr&legislationTypes%5B%5D=hjres&legislationTypes%5B%5D=s&legislationTypes%5B%5D=sjres&public=true&private=true&chamber=all&actionTerms=&legislativeActionWordVariants=true&dateOfActionOperator=equal&dateOfActionStartDate=&dateOfActionEndDate=&dateOfActionIsOptions=yesterday&dateOfActionToggle=multi&legislativeAction=Any&sponsorState=One&member=&sponsorTypes%5B%5D=sponsor&sponsorTypeBool=OR&dateOfSponsorshipOperator=equal&dateOfSponsorshipStartDate=&dateOfSponsorshipEndDate=&dateOfSponsorshipIsOptions=yesterday&committeeActivity%5B%5D=0&committeeActivity%5B%5D=3&committeeActivity%5B%5D=11&committeeActivity%5B%5D=12&committeeActivity%5B%5D=4&committeeActivity%5B%5D=2&committeeActivity%5B%5D=5&committeeActivity%5B%5D=9&satellite=null&search=&submitted=Submitted), where we filtered for any legislation within this time period that could have 
become bills and included the following keywords in the bill text or summary: 'abortion', 'reproduction', or 'reproductive health care'. For the SCOTUS abortion decisions, we targeted the cases listed on supreme.justia.com, which [provides a list of abortion-relevant SCOTUS decisions](https://supreme.justia.com/cases-by-topic/abortion-reproductive-rights/) from 1965-2022. Through this analysis, we plan to uncover the ways that the legislative abortion discourse has changed over time, and the prominent congressional bills that recur throughout subsequent legislation. We will supplement this analysis by situating legislation in its political history, taking note of the political affiliations of members of Congress, SCOTUS justices, and the presidency.

### ***Why?***
This analysis in general will give us an understanding of how abortion discourse has changed over time, and specifically what arguments were used in the passing of the 1973 Roe v. Wade decision--and with it, assertion of the constitutional right to abortion--and also in its eventual reversal in the 2022 Dobbs v. Jackson decision. From this, we will be able to uncover what political mechanisms were at play that enabled this regression in legislation, and how such large cases went on to influence congressional legislation that followed. The average person will in general be able to understand how sensitive practices are discussed in the political sphere and what aspects in particular are targeted for protection or not. Moreover, this analysis will allow individuals who want to write reproductive healthcare legislation to understand what arguments are more likely to work over others within certain political contexts.

### ***How?***
<font color="red">400 words</font>

## <font color="red">*Pitch Your Sample*</font>

<font color="red">In the cell immediately following, describe the rationale behind your proposed sample design for your final project. What is the social game, social work, or social actors about whom you are seeking to make inferences? What are its virtues with respect to your research questions? What are its limitations? What are alternatives? What would be a reasonable path to "scale up" your sample for further analysis (i.e., high-profile publication) beyond this class? (<300 words).

### ***Which (words)?***
<300 words</font>

## Set Up

In [4]:
# Installations
# %pip install -U git+https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git
# %pip install fasttext
# %pip install cython
#%pip install ksvd



In [17]:
# data 
import numpy as np
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
import seaborn
from IPython.display import Image
%matplotlib inline

# sklearn
import sklearn.metrics.pairwise
import sklearn.manifold
import sklearn.decomposition
from sklearn.metrics.pairwise import cosine_similarity

# gensim
import gensim
from gensim.models.doc2vec import TaggedDocument
from gensim.models import KeyedVectors, Word2Vec
from gensim import corpora, models, similarities
from gensim.test.utils import datapath

# models
import fasttext
from ksvd import ApproximateKSVD
from tqdm import tqdm
tqdm.pandas()

# misc
import lucem_illud
import re, string, cython, requests, nltk, copy, pickle, math
from random import seed, sample
import os
import os.path

## Helper Functions

In [5]:
# From the homework notebook
def normalize(vector):
    normalized_vector = vector / np.linalg.norm(vector)
    return normalized_vector

def dimension(model, positives, negatives):
    diff = sum([normalize(model[x]) for x in positives]) - sum([normalize(model[y]) for y in negatives])
    return diff

# Plotting
def Coloring(Series):
    x = Series.values
    y = x-x.min()
    z = y/y.max()
    c = list(plt.cm.rainbow(z))
    return c

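def PlotDimension(ax,df, dim):
    ax.set_frame_on(False)
    ax.set_title(dim, fontsize = 20)
    colors = Coloring(df[dim])
    for i, word in enumerate(df.index):
        # .iloc makes the positional lookup explicit (df is indexed by word)
        ax.annotate(word, (0, df[dim].iloc[i]), color = colors[i], alpha = 0.6, fontsize = 12)
    MaxY = df[dim].max()
    MinY = df[dim].min()
    # set limits and ticks on the axis that was passed in, not the current pyplot axis
    ax.set_ylim(MinY,MaxY)
    ax.set_yticks(())
    ax.set_xticks(())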
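
To see what `normalize` and `dimension` compute, here is a minimal sketch that builds a semantic axis from one antonym pair and projects other words onto it. The two-dimensional `toy_model` dictionary, its words, and the `project` helper are made up for illustration; the helpers only need an object that can be indexed by word, which a trained `KeyedVectors` also supports.

```python
import numpy as np

def normalize(vector):
    return vector / np.linalg.norm(vector)

def dimension(model, positives, negatives):
    # Sum of unit vectors at the positive pole minus those at the negative pole.
    return sum(normalize(model[x]) for x in positives) - sum(normalize(model[y]) for y in negatives)

# Hypothetical 2-d vectors; real embeddings have many more dimensions.
toy_model = {
    "protect": np.array([1.0, 0.2]),
    "harm": np.array([-1.0, 0.1]),
    "safeguard": np.array([0.9, 0.4]),
    "kill": np.array([-0.8, 0.3]),
}

axis = dimension(toy_model, ["protect"], ["harm"])

def project(word):
    # Cosine similarity between the word vector and the axis.
    return float(np.dot(normalize(toy_model[word]), normalize(axis)))

projections = {w: project(w) for w in ["safeguard", "kill"]}
print(projections)
```

Words near the positive pole get positive projections and words near the negative pole get negative ones, which is the quantity `PlotDimension` lays out along the vertical axis.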

In [6]:
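def calc_syn0norm(model):
    """Return unit-length word vectors, since syn0norm is now deprecated."""
    # gensim 4 stores the raw matrix as wv.vectors (wv.syn0 was removed)
    vectors = model.wv.vectors
    return (vectors / np.sqrt((vectors ** 2).sum(-1))[..., np.newaxis]).astype(np.float32)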

In [7]:
def smart_procrustes_align_gensim(base_embed, other_embed, gen=None, words=None):
    # `gen` is the vocabulary-alignment function; resolve the default at call time
    if gen is None:
        gen = intersection_align_gensim
    # shallow copies: the underlying .wv objects are still shared with the inputs
    base_embed = copy.copy(base_embed)
    other_embed = copy.copy(other_embed)

    # make sure vocabulary and indices are aligned
    in_base_embed, in_other_embed = gen(base_embed, other_embed, words=words)

    # get the embedding matrices; iterate one ordered key list (not a set) so that
    # row i of both matrices is the same word and rows follow other_embed's index
    keys = in_other_embed.wv.index_to_key
    base_vecs= [in_base_embed.wv.get_vector(w,norm=True) for w in keys]
    other_vecs= [in_other_embed.wv.get_vector(w,norm=True) for w in keys]

    # just a matrix dot product with numpy
    m = np.array(other_vecs).T.dot(np.array(base_vecs))

    # SVD method from numpy (the third return value is already transposed)
    u, _, v = np.linalg.svd(m)

    # orthogonal matrix that best rotates other_embed onto base_embed
    ortho = u.dot(v)

    # Replace original array with modified one
    # i.e. multiplying the normalized embedding matrix by "ortho"
    other_embed.wv.vectors =(np.array(other_vecs)).dot(ortho)
    return other_embed
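
The core of the alignment step above is an orthogonal Procrustes problem, which can be checked on synthetic data. The sketch below is a NumPy-only illustration under assumed inputs: a random `base` matrix stands in for one embedding, and `other` is the same matrix rotated by a random orthogonal matrix, so the recovered rotation should map `other` back onto `base`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "base" embedding: 20 words in 5 dimensions, rows scaled to unit length.
base = rng.normal(size=(20, 5))
base /= np.linalg.norm(base, axis=1, keepdims=True)

# Build the "other" embedding by rotating the base with a random orthogonal matrix.
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
other = base @ q

# Orthogonal Procrustes: the SVD of other^T base yields the best rotation u @ vh.
m = other.T @ base
u, _, vh = np.linalg.svd(m)
ortho = u @ vh

aligned = other @ ortho
max_error = float(np.abs(aligned - base).max())
print(max_error)
```

With real embeddings the two spaces are not exact rotations of each other, so the residual after alignment is what carries the signal about words whose contexts differ.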

In [8]:
def intersection_align_gensim(m1,m2, words=None):
    """
    Intersect two gensim word2vec models, m1 and m2.
    Only the shared vocabulary between them is kept.
    If 'words' is set (as list or set), then the vocabulary is intersected with this list as well.
    Indices are re-organized from 0..N in order of descending frequency (=sum of counts from both m1 and m2).
    These indices correspond to the new vectors array in both gensim models:
        -- so that Row 0 of m1.wv.vectors will be for the same word as Row 0 of m2.wv.vectors
        -- you can find the index of any word with model.wv.key_to_index[word] => 2
    The index_to_key list and key_to_index dictionary are also updated for each model.
    """

    # Get the vocab for each model
    vocab_m1 = set(m1.wv.index_to_key)
    vocab_m2 = set(m2.wv.index_to_key)

    # Find the common vocabulary
    common_vocab = vocab_m1&vocab_m2
    if words: common_vocab&=set(words)

    # If no alignment necessary because vocab is identical...
    if not vocab_m1-common_vocab and not vocab_m2-common_vocab:
        return (m1,m2)

    # Otherwise sort by frequency (summed for both)
    common_vocab = list(common_vocab)
    common_vocab.sort(key=lambda w: m1.wv.get_vecattr(w, "count")  + m2.wv.get_vecattr(w, "count") ,reverse=True)

    # Then for each model...
    for m in [m1,m2]:
        # Collect unit-normalized vectors for the shared vocabulary, in the new order
        new_arr = [m.wv.get_vector(w,norm=True) for w in common_vocab]

        # Rebuild the key list and key-to-index mapping for the shared vocabulary
        new_vocab = []
        k2i={}
        for new_index,word in enumerate(common_vocab):
            new_vocab.append(word)
            k2i[word]=new_index
        m.wv.index_to_key=new_vocab
        m.wv.key_to_index=k2i
        m.wv.vectors=np.array(new_arr)

    return (m1,m2)
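
The vocabulary step in `intersection_align_gensim` can be illustrated without any trained model. In this sketch the word counts are invented, and `shared_vocab_by_frequency` is a hypothetical helper that mirrors only the intersect-then-sort logic, not the gensim bookkeeping.

```python
# Hypothetical word counts for two models' vocabularies.
counts_m1 = {"abortion": 50, "court": 30, "privacy": 10, "senate": 5}
counts_m2 = {"abortion": 40, "court": 5, "privacy": 45, "justice": 8}

def shared_vocab_by_frequency(c1, c2, words=None):
    """Keep words present in both vocabularies, most frequent (summed) first."""
    common = set(c1) & set(c2)
    if words:
        common &= set(words)
    return sorted(common, key=lambda w: c1[w] + c2[w], reverse=True)

common_vocab = shared_vocab_by_frequency(counts_m1, counts_m2)
# Row i of each re-indexed vector matrix would then belong to common_vocab[i].
key_to_index = {word: i for i, word in enumerate(common_vocab)}
print(common_vocab)
```

Words that appear in only one vocabulary ("senate", "justice") drop out, which is why alignment across many categories can shrink the comparable vocabulary considerably.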

In [9]:
def compareModels(df, raw_models, category, text_column_name='normalized_sents', sort = True, embeddings_raw=None):
    """If you are using time as your category sorting is important"""
    # avoid a mutable default argument; train the raw models only if none were supplied
    if not embeddings_raw:
        embeddings_raw = raw_models(df, category, text_column_name, sort)
    cats = sorted(set(df[category]))
    #These are much quicker
    embeddings_aligned = {}
    for catOuter in cats:
        embeddings_aligned[catOuter] = [embeddings_raw[catOuter]]
        for catInner in cats:
            # pass the vocabulary-alignment function explicitly as the third argument
            embeddings_aligned[catOuter].append(smart_procrustes_align_gensim(embeddings_aligned[catOuter][-1], embeddings_raw[catInner], intersection_align_gensim))
    return embeddings_raw, embeddings_aligned

In [10]:
def rawModels(df, category, text_column_name='normalized_sents', sort = True):
    embeddings_raw = {}
    cats = sorted(set(df[category]))
    for cat in cats:
        #This can take a while
        print("Embedding {}".format(cat), end = '\r')
        subsetDF = df[df[category] == cat]
        #You might want to change the W2V parameters
        embeddings_raw[cat] = gensim.models.word2vec.Word2Vec(subsetDF[text_column_name].sum())
    return embeddings_raw

In [11]:
def getDivergenceDF(word, embeddingsDict):
    cats = sorted(set(embeddingsDict.keys()))
    dists = {}
    print(word)
    for cat in cats:
        dists[cat] = []
        for embed in embeddingsDict[cat][1:]:
            dists[cat].append(np.abs(1 - sklearn.metrics.pairwise.cosine_similarity(np.expand_dims(embeddingsDict[cat][0].wv[word], axis = 0),
                                                                             np.expand_dims(embed.wv[word], axis = 0))[0,0]))
    return pd.DataFrame(dists, index = cats)
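
The quantity `getDivergenceDF` tabulates is a cosine distance between one word's vectors in different aligned embeddings. The following sketch computes it directly with NumPy; the three periods and their three-dimensional vectors are invented for illustration and assume the embeddings have already been aligned.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors for one word in three already-aligned period embeddings.
word_by_period = {
    1973: np.array([1.0, 0.0, 0.0]),
    1992: np.array([1.0, 1.0, 0.0]),
    2022: np.array([0.0, 1.0, 0.0]),
}

base_period = 1973
distances = {
    period: cosine_distance(word_by_period[base_period], vec)
    for period, vec in word_by_period.items()
}
print(distances)
```

A distance of 0 means the word points the same way as in the base period, and larger values indicate that its contexts have drifted.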

In [12]:
def findDiverence(word, embeddingsDict):
    cats = sorted(set(embeddingsDict.keys()))

    dists = []
    for embed in embeddingsDict[cats[0]][1:]:
        try:
            dists.append(1 - sklearn.metrics.pairwise.cosine_similarity(np.expand_dims(embeddingsDict[cats[0]][0].wv[word], axis = 0), np.expand_dims(embed.wv[word], axis = 0))[0,0])
        except KeyError:
            # skip embeddings whose vocabulary lacks this word
            pass
    return np.mean(dists)

In [13]:
def findMostDivergent(embeddingsDict, comparedEmbeddings):
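    # start from the vocabulary of the first category rather than a hard-coded key
    first_cat = sorted(comparedEmbeddings.keys())[0]
    original_words = comparedEmbeddings[first_cat][0].wv.index_to_key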
    for embeds in embeddingsDict.values():
        for embed in embeds:
            original_words = set(original_words).intersection(set(embed.wv.index_to_key))
    words = set(original_words)
    print("Found {} words to compare".format(len(words)))
    return sorted([(w, findDiverence(w, embeddingsDict)) for w in words], key = lambda x: x[1], reverse=True)

In [14]:
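def file_to_embeddings(address, kind):
    import ast
    rawEmbeddings = {}
    for file in os.listdir(address):
        if "embedding_"+kind in file:
            e, kind_, kind_type = file.split("_")
            # parse the category label as a Python literal instead of calling eval on a filename
            kind_type = ast.literal_eval(kind_type)
            # load from the directory that was listed, not the current working directory
            rawEmbeddings[kind_type] = Word2Vec.load(os.path.join(address, file))
    return rawEmbeddings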

In [24]:
# get random indices
def random_indices(df_len, num_indices):
    """
    Generate a list of random indices.

    returns: list of random indices
    """
    random_indices = np.random.choice(df_len, num_indices, replace=False)
    random_indices = list(random_indices)
    random_indices.sort()

    return random_indices

## Data Loading

In [19]:
# loading congressional data
congress_df = pd.read_csv('../data/congress_legislation_cleaned.csv')
congress_df = congress_df.loc[:, ['congress_num', 'legislation number', 'title', 'cleaned_text']]
congress_df.head()

| | congress_num | legislation number | title | cleaned_text |
|---|---|---|---|---|
| 0 | 118 | H.R. 2907 | Let Doctors Provide Reproductive Health Care Act | Congressional Bills 118th Congress From the U.... |
| 1 | 118 | S. 1297 | Let Doctors Provide Reproductive Health Care Act | Congressional Bills 118th Congress From the U.... |
| 2 | 118 | H.R. 4901 | Reproductive Health Care Accessibility Act | Congressional Bills 118th Congress From the U.... |
| 3 | 118 | S. 2544 | Reproductive Health Care Accessibility Act | Congressional Bills 118th Congress From the U.... |
| 4 | 118 | H.R. 4147 | Reproductive Health Care Training Act of 2023 | Congressional Bills 118th Congress From the U.... |


In [22]:
# loading scotus data
scotus_df = pd.read_csv('../data/scotus_cases_cleaned.csv')
scotus_df = scotus_df.loc[:, ['case', 'year', 'author', 'cleaned_text']]
scotus_df.head()

| | case | year | author | cleaned_text |
|---|---|---|---|---|
| 0 | Dobbs v. Jackson Women's Health Organization | 2022 | Samuel A. Alito, Jr. | 1 (Slip Opinion) OCTOBER TERM, 2021 Syllabus N... |
| 1 | Whole Woman's Health v. Hellerstedt | 2016 | Stephen Breyer | 1 (Slip Opinion) OCTOBER TERM, 2015 Syllabus N... |
| 2 | Gonzales v. Carhart | 2007 | Anthony Kennedy | 550US1 U nit: U31 07 28 10 12:14:15 P A GES PG... |
| 3 | Stenberg v. Carhart | 2000 | Stephen Breyer | OCTOBER TERM, 1999 Syllabus STENBERG, ATTORNEY... |
| 4 | Planned Parenthood of Southeastern Pennsylvani... | 1992 | Anthony Kennedy, David Souter, Sandra Day O’Co... | 505us3u117 07 09 96 09:34:02 PAGES OPINPGT 833... |


### Tokenization

In [34]:
# tokenize and normalize sentences
# congress_df['tokenized_sents'] = congress_df['cleaned_text'].progress_apply(lambda x: [lucem_illud.word_tokenize(s, MAX_LEN=5000000) for s in lucem_illud.sent_tokenize(x)])

100%|██████████| 1243/1243 [2:33:00<00:00,  7.39s/it]   


In [35]:
# congress_df['normalized_sents'] = congress_df['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, lemma=False) for s in x])



In [None]:
congress_df = pd.read_csv('../data/congress_legislation_cleaned_normalized.csv')

In [36]:
congress_df.to_csv("congress_legislation_tokenized_sents.csv")
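
One caveat when saving tokenized columns: `to_csv` writes each list as its string representation, so a later `read_csv` returns strings rather than lists of tokens. The sketch below shows one way to restore them with `ast.literal_eval`; the tiny DataFrame and the in-memory buffer are illustrative assumptions standing in for the real files.

```python
import ast
import io
import pandas as pd

# Hypothetical frame with one nested-list column, like tokenized sentences.
df = pd.DataFrame(
    {"normalized_sents": [[["abortion", "rights"], ["court", "ruled"]], [["privacy"]]]}
)

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# A plain read gives back strings such as "[['abortion', 'rights'], ...]".
buffer.seek(0)
as_strings = pd.read_csv(buffer)

# A converter parses each cell back into a Python list while reading.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"normalized_sents": ast.literal_eval})

print(type(as_strings["normalized_sents"][0]), type(restored["normalized_sents"][0]))
```

Formats such as pickle avoid the parsing step entirely, at the cost of files that are not human-readable.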

## Exercises

### <font color="red">*Exercise 1*</font>

<font color="red">Construct cells immediately below this that embed documents related to your final project using at least two different specification of `word2vec` and/or `fasttext`, and visualize them each with two separate visualization layout specifications (e.g., TSNE, PCA). Then interrogate critical word vectors within your corpus in terms of the most similar words, analogies, and other additions and subtractions that reveal the structure of similarity and difference within your semantic space. What does this pattern reveal about the semantic organization of words in your corpora? Which estimation and visualization specification generate the most insight and appear the most robustly supported and why?

<font color="red">***Stretch***: Explore different vector calculations beyond addition and subtraction, such as multiplication, division or some other function. What does this exploration reveal about the semantic structure of your corpus?

#### Create Word2Vec Model

In [37]:
congress_w2v = gensim.models.word2vec.Word2Vec(congress_df['normalized_sents'].sum(), sg=0)
print('"abortion" vector: ', congress_w2v.wv['abortion'][:10])
print('word at index 10: ', congress_w2v.wv.index_to_key[10])

"abortion" vector:  [-4.8669734  -0.6166447   1.3110815   0.72449046  0.7318664  -0.4321819
 -6.2501717   1.7904253   1.7648962  -1.5254838 ]
word at index 10:  united


#### Word2Vec Similarity analysis

In [38]:
# determine words most similar to abortion
congress_w2v.wv.most_similar('abortion')

[('abortions', 0.7530140280723572),
 ('sterilization', 0.5249667763710022),
 ('induced', 0.5214768052101135),
 ('counsels', 0.517237663269043),
 ('obstetric', 0.4906849265098572),
 ('overt', 0.4834439754486084),
 ('incest', 0.45907044410705566),
 ('mother', 0.4565138518810272),
 ('heartbeat', 0.443374365568161),
 ('unemancipated', 0.43996715545654297)]

In [39]:
# determine words most similar to privacy
congress_w2v.wv.most_similar('privacy')

[('safeguards', 0.622541606426239),
 ('disclosure', 0.5984277725219727),
 ('confidentiality', 0.5983681082725525),
 ('liberties', 0.5723415017127991),
 ('identity', 0.5707226991653442),
 ('protects', 0.5446917414665222),
 ('retaliation', 0.537446141242981),
 ('whistleblowers', 0.5234901905059814),
 ('protections', 0.5112161040306091),
 ('safeguard', 0.49264729022979736)]

In [40]:
# determine words most similar to protect
congress_w2v.wv.most_similar('protect')


[('protects', 0.6279311180114746),
 ('preserve', 0.6121134161949158),
 ('safeguard', 0.6113435626029968),
 ('protecting', 0.5907005667686462),
 ('assure', 0.5389564633369446),
 ('desirable', 0.5373063087463379),
 ('avert', 0.5347046256065369),
 ('precautions', 0.5292848348617554),
 ('prevent', 0.5214537382125854),
 ('ensuring', 0.508388340473175)]

In [41]:
# determine words most similar to fetus
congress_w2v.wv.most_similar('fetus')

[('mother', 0.7277065515518188),
 ('vaginally', 0.7140676975250244),
 ('kills', 0.7091748118400574),
 ('conception', 0.7035084962844849),
 ('heartbeat', 0.6972429156303406),
 ('fertilization', 0.690980076789856),
 ('detectable', 0.6764907240867615),
 ('embryonic', 0.673604428768158),
 ('overt', 0.6729521155357361),
 ('infanticide', 0.6714892983436584)]

In [42]:
# determine words most similar to baby
congress_w2v.wv.most_similar('baby')

[('infant', 0.7119036316871643),
 ('clothing', 0.622071385383606),
 ('perinatal', 0.6124640107154846),
 ('maternity', 0.599256694316864),
 ('diapers', 0.5970709919929504),
 ('childcare', 0.5838642716407776),
 ('infants', 0.5803161859512329),
 ('postnatal', 0.5798137784004211),
 ('male', 0.5770944356918335),
 ('otc', 0.5748292803764343)]

In [44]:
# determine words most similar to viability
congress_w2v.wv.most_similar('viability')

[('depends', 0.6447806358337402),
 ('heartbeat', 0.6290205121040344),
 ('harmful', 0.6250562071800232),
 ('cortex', 0.6246358156204224),
 ('connections', 0.6210643649101257),
 ('perception', 0.6148685216903687),
 ('rests', 0.6061360239982605),
 ('obstacle', 0.5973332524299622),
 ('thalamus', 0.591398298740387),
 ('react', 0.5808981657028198)]

In [47]:
# determine words most similar to sexuality
congress_w2v.wv.most_similar('sexuality')

[('helps', 0.6987822651863098),
 ('breastfeeding', 0.6119071841239929),
 ('abstinence', 0.6102300882339478),
 ('adulthood', 0.6010406613349915),
 ('culturally', 0.5891531109809875),
 ('literacy', 0.5780025720596313),
 ('preschool', 0.5665664076805115),
 ('paraprofessionals', 0.566216230392456),
 ('adolescent', 0.5647426247596741),
 ('stem', 0.5641573071479797)]

In [50]:
# determine words most similar to contraception
congress_w2v.wv.most_similar('contraception')

[('infertility', 0.7163260579109192),
 ('infections', 0.6544108986854553),
 ('childbirth', 0.6419183611869812),
 ('postpartum', 0.6369994878768921),
 ('infection', 0.6148290038108826),
 ('prenatal', 0.614122211933136),
 ('interventions', 0.6081894040107727),
 ('motherhood', 0.5975733995437622),
 ('breastfeeding', 0.5904054045677185),
 ('intervention', 0.5898716449737549)]

In [52]:
# find the word that does not belong with the others in the list
congress_w2v.wv.doesnt_match(['abortion', 'healthcare', 'right', 'choice'])

'healthcare'
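
An odd-one-out query of this kind can be understood as picking the word whose unit vector is least similar to the average of the group. The sketch below illustrates that idea on invented two-dimensional vectors; `odd_one_out` is a hypothetical helper and is not claimed to reproduce gensim's implementation exactly.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def odd_one_out(vectors):
    """Return the word least similar to the mean direction of the group."""
    words = list(vectors)
    normed = np.vstack([unit(vectors[w]) for w in words])
    center = unit(normed.mean(axis=0))
    similarities = normed @ center
    return words[int(np.argmin(similarities))]

# Hypothetical vectors: three words point one way, one points elsewhere.
toy_vectors = {
    "abortion": np.array([1.0, 0.1]),
    "right": np.array([0.9, 0.2]),
    "choice": np.array([0.8, 0.3]),
    "healthcare": np.array([0.1, 1.0]),
}
odd = odd_one_out(toy_vectors)
print(odd)
```

Because the result depends on the whole group, changing any one of the other words in the list can change which word is flagged.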

In [None]:
# use semantic equations to find words most analogous to [INSERT ANALOGY HERE]
# equation: X + Y - Z = ___ (Z is to X as Y is to ___)
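
The arithmetic behind such an analogy query can be sketched with NumPy on a toy vocabulary. The three-dimensional vectors and the `analogy` helper below are assumptions made for illustration; with a trained model, gensim's `most_similar(positive=[X, Y], negative=[Z])` plays the corresponding role.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the input words
    scores = {w: float(unit(v) @ query) for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Hypothetical axes: (royalty, male, female).
toy_vectors = {
    "king": np.array([1.0, 1.0, 0.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "throne": np.array([1.0, 0.0, 0.0]),
    "girl": np.array([0.0, 0.1, 1.0]),
}

# king + woman - man: "man is to king as woman is to ___"
best = analogy(toy_vectors, ["king", "woman"], ["man"])[0][0]
print(best)
```

Excluding the input words matters: without it, the nearest neighbor of the query vector is often one of the inputs themselves.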

#### Vector Visualization

In [None]:
# dimensionality reduction
numWords = 50
indices = random_indices(len(congress_w2v.wv.index_to_key), numWords)
# index_to_key is a plain list, so select the sampled words one index at a time
targetWords = [congress_w2v.wv.index_to_key[i] for i in indices]
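
Once a sample of words is chosen, their vectors can be reduced to two dimensions for plotting. The sketch below uses a plain NumPy SVD as a stand-in for a PCA projection; the random `vectors` matrix, the placeholder `words`, and the `pca_2d` helper are assumptions, and a library routine such as scikit-learn's PCA or t-SNE could be substituted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for 50 sampled word vectors of dimension 100.
words = [f"word{i}" for i in range(50)]
vectors = rng.normal(size=(50, 100))

def pca_2d(matrix):
    """Project rows onto their first two principal directions."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vh are principal directions, ordered by decreasing singular value.
    _, _, vh = np.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:2].T

coords = pca_2d(vectors)
print(coords.shape)
```

Each row of `coords` gives the x and y position for the word at the same position in `words`, ready to be passed to a scatter plot with text annotations.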

### <font color="red">*Exercise 2*</font>

<font color="red">Construct cells immediately below this that embed documents related to your final project using `doc2vec`, and explore the relationship between different documents and the word vectors you analyzed in the last exercise. Consider the most similar words to critical documents, analogies (doc _x_ + word _y_), and other additions and subtractions that reveal the structure of similarity and difference within your semantic space. What does this pattern reveal about the documentary organization of your semantic space?

### <font color="red">*Exercise 3*</font>

<font color="red">Construct cells immediately below this that embed documents related to your final project, then generate meaningful semantic dimensions based on your theoretical understanding of the semantic space (i.e., by subtracting semantically opposite word vectors) and project another set of word vectors onto those dimensions. Interpret the meaning of these projections for your analysis. Which of the dimensions you analyze explain the most variation in the projection of your words and why?

<font color="red">***Stretch***: Average together multiple antonym pairs to create robust semantic dimensions. How do word projections on these robust dimensions differ from single-pair dimensions?

### <font color="red">*Exercise 4*</font>

<font color="red">Construct cells immediately below this that align word embeddings over time or across domains/corpora. Interrogate the spaces that result and ask which words changed most and least over the entire period or between contexts/corpora. What does this reveal about the social game underlying your space?