# Week 4 - Word Embeddings Supplemental

This notebook contains two additional uses for word embeddings

For this notebook we will be using the following packages

In [2]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
#Make sure you update it before starting this notebook
import lucem_illud #pip install -U git+git://github.com/Computational-Content-Analysis-2018/lucem_illud.git

#All these packages need to be installed from pip
import gensim#For word2vec, etc
import requests #For downloading our datasets
import nltk #For stop words and stemmers
import numpy as np #For arrays
import pandas #Gives us DataFrames
import matplotlib.pyplot as plt #For graphics
import seaborn #Makes the graphics look nicer
import sklearn.metrics.pairwise #For cosine similarity
import sklearn.manifold #For T-SNE
import sklearn.decomposition #For PCA
import copy

#gensim uses a couple of deprecated features
#we can't do anything about them so lets ignore them 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

#This 'magic' command makes the plots work better
#in the notebook, don't use it outside of a notebook.
#Also you can ignore the warning
%matplotlib inline

import os #For looking through files
import os.path #For managing file paths

# The Score Function

The score function is a simple calculation developed by [Matt Taddy](https://arxiv.org/pdf/1504.07295.pdf) to calculate the likelihood that a given text would have been generated by a word-embedding model by summing the inner product between each pair of the text's word vectors. 

Here, we explore this using a model trained with millions of resumes from the CareerBuilder website (we can't share the private resumes...but we can share a model built with them :-):

In [None]:
resume_model  = gensim.models.word2vec.Word2Vec.load('../data/resumeAll.model')

We can examine the vacabularies of this model by building a word-index map:

In [None]:
vocab = resume_model.index2word

Let's just load the sample and take a look at it. The sentences in each job description are already tokenized and normalized.

In [None]:
sampleDF = pandas.read_csv('../data/SampleJobAds.csv', index_col = False)
#We need to convert the last couple columns from strings to lists
sampleDF['tokenized_sents'] = sampleDF['tokenized_sents'].apply(lambda x: eval(x))
sampleDF['normalized_sents'] = sampleDF['normalized_sents'].apply(lambda x: eval(x))
sampleDF

Let's define a function to calculate the likelihood of each job description. The idea is borrowed from [Matt Taddy](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/deepir.ipynb), who shows how a document can be characterized as the inner product of the distance between its words. In other words, this analysis will show which job ads are most likely to find an appropriate pool of workers in the resume bank that generated our word embedding.  

In [None]:
def adprob(ad, model):
    sen_scores = model.score(ad, len(ad))
    ad_score = sen_scores.mean()
    return ad_score

Let's apply this function to every job description.

In [None]:
sampleDF['likelihood'] = sampleDF['normalized_sents'].apply(lambda x: adprob(x, resume_model))

Let's take a look at the top 5 job descriptions that have the highest likelihood.

In [None]:
for ad in sampleDF.sort_values(by = 'likelihood', ascending = False)['jobDescription'][:5]:
    print (ad + '\n\n')

Let's take a look at the bottom 5 job descriptions that have the lowest likelihood to be matched by the resumes.

In [None]:
for ad in sampleDF.sort_values(by = 'likelihood')['jobDescription'][:5]:
    print (ad + '\n\n')

We can do the same for phrases corresponding to job skills.

In [None]:
adprob([["python", "programming"]], resume_model)

In [None]:
adprob([["basic", "programming"]], resume_model)

Basic programming appears to be more likely in this pool of resumes than python programming. 

We can also do some simple statistics. Unfortunately, we don't have a large sample here. Nevertheless, let's first look at the mean likelihood score of each hiring organization. Some organizations will do well to hire on CareerBuilder...while others will not.

In [None]:
sampleDF.groupby("hiringOrganization_organizationName")[['likelihood']].mean().sort_values('likelihood', ascending = False)

We can also look at the mean likelihood of each state.

In [None]:
sampleDF.groupby("jobLocation_address_region")[['likelihood']].mean().sort_values('likelihood', ascending = False)

You would increase the sample size if you want to do a more serious study.

## <span style="color:red">*Exercise 1a*</span>

<span style="color:red">**Do only 1a or 1b.** Construct cells immediately below this that calculate the scores for a small sample of documents from outside your corpus to identify which are *closest* to your corpus. Then calculate the scores for a few phrases or sentences to identify the ones most likely to have appeared in your corpus. Interrogate patterns associated with these document/phrase scores (e.g., which companies produced job ads most or least likely to find jobseekers in the resume corpus?) What do these patterns suggest about the boundaries of your corpus?

In [3]:
amazonRev5DF = pandas.read_csv('../4-Word-Embedding/amazonRev5DF.csv', index_col = 0)
amazonRev5DF[1:10]

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,tokenized_sents,normalized_sents,tokenized_words,normalized_words,uniqueID,TaggedReviews
7,616719923X,"[2, 3]",5,Creamy white chocolate infused with Matcha gre...,"07 5, 2013",A3FIVHUOGMUMPK,greenlife,So Delicious!!,1372982400,"[['Creamy', 'white', 'chocolate', 'infused', '...","[['creamy', 'white', 'chocolate', 'infused', '...","['Creamy', 'white', 'chocolate', 'infused', 'w...","['creamy', 'white', 'chocolate', 'infused', 'm...",A3FIVHUOGMUMPK1372982400,"LabeledSentence(['creamy', 'white', 'chocolate..."
8,616719923X,"[0, 0]",5,After hearing mixed opinions about these Kit K...,"06 14, 2013",A27FSPAMTQF1J8,Japhyl,These are my favorite candies ever!,1371168000,"[['After', 'hearing', 'mixed', 'opinions', 'ab...","[['hearing', 'mixed', 'opinions', 'kit', 'kats...","['After', 'hearing', 'mixed', 'opinions', 'abo...","['hearing', 'mixed', 'opinions', 'kit', 'kats'...",A27FSPAMTQF1J81371168000,"LabeledSentence(['hearing', 'mixed', 'opinions..."
10,616719923X,"[6, 8]",5,I ordered these in Summer so they of course ar...,"10 2, 2013",A220GN2X2R47JE,Jeremy,Amazing!,1380672000,"[['I', 'ordered', 'these', 'in', 'Summer', 'so...","[['ordered', 'summer', 'course', 'arrived', 'm...","['I', 'ordered', 'these', 'in', 'Summer', 'so'...","['ordered', 'summer', 'course', 'arrived', 'me...",A220GN2X2R47JE1380672000,"LabeledSentence(['ordered', 'summer', 'course'..."
11,616719923X,"[2, 3]",5,These are definitely THE BEST candy bar out th...,"05 26, 2013",A3C5Z05IKSSFB9,"M. Magpoc ""maliasuperstar""",I wish I could find these in a store instead o...,1369526400,"[['These', 'are', 'definitely', 'THE', 'BEST',...","[['definitely', 'best', 'candy', 'bar'], ['wis...","['These', 'are', 'definitely', 'THE', 'BEST', ...","['definitely', 'best', 'candy', 'bar', 'wish',...",A3C5Z05IKSSFB91369526400,"LabeledSentence(['definitely', 'best', 'candy'..."
12,616719923X,"[0, 0]",5,Yes - this is one of the most expensive candie...,"07 6, 2013",AHA6G4IMEMAJR,"M. Zinn ""mczinn""",Thank goodness they are expensive,1373068800,"[['Yes', '-', 'this', 'is', 'one', 'of', 'the'...","[['yes', 'one', 'expensive', 'candies', 'aroun...","['Yes', '-', 'this', 'is', 'one', 'of', 'the',...","['yes', 'one', 'expensive', 'candies', 'around...",AHA6G4IMEMAJR1373068800,"LabeledSentence(['yes', 'one', 'expensive', 'c..."
13,616719923X,"[0, 0]",5,"I love the green tea kitkat, taste so good, no...","06 8, 2013",A1Q2E3W9PRG313,Sabrina,it is good,1370649600,"[['I', 'love', 'the', 'green', 'tea', 'kitkat'...","[['love', 'green', 'tea', 'kitkat', 'taste', '...","['I', 'love', 'the', 'green', 'tea', 'kitkat',...","['love', 'green', 'tea', 'kitkat', 'taste', 'g...",A1Q2E3W9PRG3131370649600,"LabeledSentence(['love', 'green', 'tea', 'kitk..."
16,9742356831,"[0, 0]",5,This curry paste makes a delicious curry. I j...,"05 28, 2013",A23RYWDS884TUL,Another Freak,Delicious!,1369699200,"[['This', 'curry', 'paste', 'makes', 'a', 'del...","[['curry', 'paste', 'makes', 'delicious', 'cur...","['This', 'curry', 'paste', 'makes', 'a', 'deli...","['curry', 'paste', 'makes', 'delicious', 'curr...",A23RYWDS884TUL1369699200,"LabeledSentence(['curry', 'paste', 'makes', 'd..."
17,9742356831,"[1, 2]",5,I've purchased different curries in the grocer...,"09 17, 2012",A945RBQWGZXCK,Cheryl,Great flavor,1347840000,"[['I', ""'ve"", 'purchased', 'different', 'curri...","[['purchased', 'different', 'curries', 'grocer...","['I', ""'ve"", 'purchased', 'different', 'currie...","['purchased', 'different', 'curries', 'grocery...",A945RBQWGZXCK1347840000,"LabeledSentence(['purchased', 'different', 'cu..."
18,9742356831,"[2, 2]",5,I love ethnic foods and to cook them. I recent...,"08 3, 2013",A1TCSC0YWT82Q0,GinSing,OMG! What a treasure find!,1375488000,"[['I', 'love', 'ethnic', 'foods', 'and', 'to',...","[['love', 'ethnic', 'foods', 'cook'], ['recent...","['I', 'love', 'ethnic', 'foods', 'and', 'to', ...","['love', 'ethnic', 'foods', 'cook', 'recently'...",A1TCSC0YWT82Q01375488000,"LabeledSentence(['love', 'ethnic', 'foods', 'c..."


In [10]:
amazonRev5W2V = gensim.models.word2vec.Word2Vec(amazonRev5DF['normalized_sents'].sum(),hs=1,negative=0)

In [5]:
amazonRev1DF = pandas.read_csv('../4-Word-Embedding/amazonRev1DF.csv', index_col = 0)
amazonRev1DF[1:10]

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,tokenized_sents,normalized_sents,tokenized_words,normalized_words,uniqueID,TaggedReviews
32,B00004S1C5,"[8, 11]",1,This product is no where near natural / organi...,"03 29, 2013",A14YSMLYLJEMET,Amazon Customer,Not natural/organic at all,1364515200,"[['This', 'product', 'is', 'no', 'where', 'nea...","[['product', 'near', 'natural', 'wish', 'seen'...","['This', 'product', 'is', 'no', 'where', 'near...","['product', 'near', 'natural', 'wish', 'seen',...",A14YSMLYLJEMET1364515200,"LabeledSentence(['product', 'near', 'natural',..."
75,B0000CCZYY,"[1, 4]",1,"Licorice is my favorite candy, and it promotes...","04 5, 2013",A3OH4OZFZGEH75,Amazon Customer,Not soft at all. Basically same as cheap licor...,1365120000,"[['Licorice', 'is', 'my', 'favorite', 'candy',...","[['licorice', 'favorite', 'candy', 'promotes',...","['Licorice', 'is', 'my', 'favorite', 'candy', ...","['licorice', 'favorite', 'candy', 'promotes', ...",A3OH4OZFZGEH751365120000,"LabeledSentence(['licorice', 'favorite', 'cand..."
82,B0000CCZYY,"[6, 11]",1,"This is an awesome product, natural, not a lot...","05 7, 2013",A2OUNVRPRWH0,"The Kittie ""Kittie""",Love this candy!,1367884800,"[['This', 'is', 'an', 'awesome', 'product', ',...","[['awesome', 'product', 'natural', 'lot', 'ing...","['This', 'is', 'an', 'awesome', 'product', ','...","['awesome', 'product', 'natural', 'lot', 'ingr...",A2OUNVRPRWH01367884800,"LabeledSentence(['awesome', 'product', 'natura..."
85,B0000CD06J,"[0, 3]",1,"As soon as I had a couple of sips, my eczema s...","03 6, 2013",AX04H2SPKO02S,"J. Wang ""jyswang""",NOT gluten free,1362528000,"[['As', 'soon', 'as', 'I', 'had', 'a', 'couple...","[['soon', 'couple', 'sips', 'eczema', 'started...","['As', 'soon', 'as', 'I', 'had', 'a', 'couple'...","['soon', 'couple', 'sips', 'eczema', 'started'...",AX04H2SPKO02S1362528000,"LabeledSentence(['soon', 'couple', 'sips', 'ec..."
162,B0000CNU1X,"[0, 0]",1,unsure if I just got a bad batch or what...the...,"01 23, 2013",A1M9L949MA66I3,orlandodawg,Not good,1358899200,"[['unsure', 'if', 'I', 'just', 'got', 'a', 'ba...","[['unsure', 'got', 'bad', 'batch', 'flavor', '...","['unsure', 'if', 'I', 'just', 'got', 'a', 'bad...","['unsure', 'got', 'bad', 'batch', 'flavor', 'b...",A1M9L949MA66I31358899200,"LabeledSentence(['unsure', 'got', 'bad', 'batc..."
217,B0000DGDMO,"[1, 2]",1,Misleading. The reason this is cheaper than t...,"08 31, 2012",A30JPZ9TZ7I61U,"Christopher Barrett ""Evil Corgi""",Why is the picture showing the 24 pack?????,1346371200,"[['Misleading', '.'], ['The', 'reason', 'this'...","[['misleading'], ['reason', 'cheaper', 'flavor...","['Misleading', '.', 'The', 'reason', 'this', '...","['misleading', 'reason', 'cheaper', 'flavors',...",A30JPZ9TZ7I61U1346371200,"LabeledSentence(['misleading', 'reason', 'chea..."
265,B0000DID5R,"[6, 27]",1,"Well, I guess I'm the fly in this reviewer oin...","01 14, 2007",A34PAZQ73SL163,"Bernard Chapin ""Ora Et Labora!""",The Only One I Avoid.,1168732800,"[['Well', ',', 'I', 'guess', 'I', ""'m"", 'the',...","[['well', 'guess', 'fly', 'reviewer', 'ointmen...","['Well', ',', 'I', 'guess', 'I', ""'m"", 'the', ...","['well', 'guess', 'fly', 'reviewer', 'ointment...",A34PAZQ73SL1631168732800,"LabeledSentence(['well', 'guess', 'fly', 'revi..."
270,B0000DID5R,"[3, 4]",1,"O.k., I'm going to offer a counterpoint to all...","11 11, 2012",A2MPW1R13SHA2S,Dangrenade,"Unpleasant Heat, and No Flavor",1352592000,"[['O.k.', ',', 'I', ""'m"", 'going', 'to', 'offe...","[['going', 'offer', 'counterpoint', 'positive'...","['O.k.', ',', 'I', ""'m"", 'going', 'to', 'offer...","['going', 'offer', 'counterpoint', 'positive',...",A2MPW1R13SHA2S1352592000,"LabeledSentence(['going', 'offer', 'counterpoi..."
290,B0000DID5R,"[4, 24]",1,I just tried this sauce moments ago. Someone h...,"03 1, 2011",A3FHWQ3H3ZT2YE,"Patrice M. Christian ""Trixie.in.Dixie""",Maybe my taste buds are different.,1298937600,"[['I', 'just', 'tried', 'this', 'sauce', 'mome...","[['tried', 'sauce', 'moments', 'ago'], ['someo...","['I', 'just', 'tried', 'this', 'sauce', 'momen...","['tried', 'sauce', 'moments', 'ago', 'someone'...",A3FHWQ3H3ZT2YE1298937600,"LabeledSentence(['tried', 'sauce', 'moments', ..."


In [11]:
def adprob(ad, model):
    sen_scores = model.score(ad, len(ad))
    ad_score = sen_scores.mean()
    return ad_score

In [12]:
amazonRev1DF['likelihood'] = amazonRev1DF['normalized_sents'].apply(lambda x: adprob(x, amazonRev5W2V))

In [14]:
for ad in amazonRev1DF.sort_values(by = 'likelihood', ascending = False)['reviewText'][:5]:
    print (ad + '\n\n')

i only got one little packet of these, not the 30 count that is says.it tasted like medicine and made me very jittery.it also made my heart race.i didn't like it


The tea I received was old, old, old, the flavor was gone, and none of it had any taste to it. The tins are very cute, but unless you're into collecting cute little tins, I would skip this one.


This item tastes odd and I find the experation date of less than two months from the date of receipt is too short. AT least I was able to return it even tho I loose half the price of this order just to return it but better safe than sorry. From now on I will get my mayo at my local store even if I can't find the 64 oz. jar.


I have found Stash teas to be very hit or miss.  I love Moroccan Mint teas in general and decided to give Stash's a try.  Upon opening the packet there was no mint scent at all.  That was a red alert as mint has a crisp and strong scent for a very long time (I found a box of mint tea in my pantry that was two y

In [24]:
for ad in amazonRev1DF.sort_values(by = 'likelihood')['reviewText'][:20]:
    print (ad )

nan
nan
nan
nan
DOES NOT WORK for:-Oil Pulling-Skin Nutrition-Cooking
didn't like it
Yuk, just yuk.
so nasty... the taste was like having herpes.. ewwwwwwwwwwwwwwwwwwwwwwwwhahahah, lOLnasty, toilet, garbage, this product should not even exist... it's so nasty.
Awful tasting....
These are probably sensible but taste awful, you can't even tell which fruit you are eating, yuckkkkkkkkkkkkkkkk.
my dog loved them. I can not stand the, this is not jerky at all.
Not at all what I expected - no mellow flavor just cardboard.  Very disappointed.  Won't buy it again.
I am all about adventure but this was worse than the sinkig of the titanic...What ever flavor this is supposed to be it isn't, it is just funkyyyyyy....
This is not what I was looking for at all. I am looking for simple microwave products. I will not order them again.
I served this for Thanksgiving and I and my guests did not like any of the flavors.  It was too preservative tasting.
This French Vanilla Cappuccino was just plain awful

In [36]:
adprob([["tasty"]], amazonRev5W2V)

0.0

In [41]:
amazonRev5W2V.index2word

AttributeError: 'Word2Vec' object has no attribute 'index2word'

# Linguistic Change

Below is code that aligns the dimensions of multiple embeddings arrayed over time or some other dimension and allow identification of semantic chanage as the word vectors change their loadings for focal words. This code comes from the approach piloted at Stanford by William Hamilton, Daniel Jurafsky and Jure Lescovec [here](https://arxiv.org/pdf/1605.09096.pdf). 

In [None]:
def calc_syn0norm(model):
    """since syn0norm is now depricated"""
    return (model.wv.syn0 / np.sqrt((model.wv.syn0 ** 2).sum(-1))[..., np.newaxis]).astype(np.float32)

def smart_procrustes_align_gensim(base_embed, other_embed, words=None):
    """Procrustes align two gensim word2vec models (to allow for comparison between same word across models).
    Code ported from HistWords <https://github.com/williamleif/histwords> by William Hamilton <wleif@stanford.edu>.
    (With help from William. Thank you!)
    First, intersect the vocabularies (see `intersection_align_gensim` documentation).
    Then do the alignment on the other_embed model.
    Replace the other_embed model's syn0 and syn0norm numpy matrices with the aligned version.
    Return other_embed.
    If `words` is set, intersect the two models' vocabulary with the vocabulary in words (see `intersection_align_gensim` documentation).
    """
    base_embed = copy.copy(base_embed)
    other_embed = copy.copy(other_embed)
    # make sure vocabulary and indices are aligned
    in_base_embed, in_other_embed = intersection_align_gensim(base_embed, other_embed, words=words)

    # get the embedding matrices
    base_vecs = calc_syn0norm(in_base_embed)
    other_vecs = calc_syn0norm(in_other_embed)

    # just a matrix dot product with numpy
    m = other_vecs.T.dot(base_vecs) 
    # SVD method from numpy
    u, _, v = np.linalg.svd(m)
    # another matrix operation
    ortho = u.dot(v) 
    # Replace original array with modified one
    # i.e. multiplying the embedding matrix (syn0norm)by "ortho"
    other_embed.wv.syn0norm = other_embed.wv.syn0 = (calc_syn0norm(other_embed)).dot(ortho)
    return other_embed
    
def intersection_align_gensim(m1,m2, words=None):
    """
    Intersect two gensim word2vec models, m1 and m2.
    Only the shared vocabulary between them is kept.
    If 'words' is set (as list or set), then the vocabulary is intersected with this list as well.
    Indices are re-organized from 0..N in order of descending frequency (=sum of counts from both m1 and m2).
    These indices correspond to the new syn0 and syn0norm objects in both gensim models:
        -- so that Row 0 of m1.syn0 will be for the same word as Row 0 of m2.syn0
        -- you can find the index of any word on the .index2word list: model.index2word.index(word) => 2
    The .vocab dictionary is also updated for each model, preserving the count but updating the index.
    """

    # Get the vocab for each model
    vocab_m1 = set(m1.wv.vocab.keys())
    vocab_m2 = set(m2.wv.vocab.keys())

    # Find the common vocabulary
    common_vocab = vocab_m1&vocab_m2
    if words: common_vocab&=set(words)

    # If no alignment necessary because vocab is identical...
    if not vocab_m1-common_vocab and not vocab_m2-common_vocab:
        return (m1,m2)

    # Otherwise sort by frequency (summed for both)
    common_vocab = list(common_vocab)
    common_vocab.sort(key=lambda w: m1.wv.vocab[w].count + m2.wv.vocab[w].count,reverse=True)

    # Then for each model...
    for m in [m1,m2]:
        # Replace old syn0norm array with new one (with common vocab)
        indices = [m.wv.vocab[w].index for w in common_vocab]
        old_arr = calc_syn0norm(m)
        new_arr = np.array([old_arr[index] for index in indices])
        m.wv.syn0norm = m.wv.syn0 = new_arr

        # Replace old vocab dictionary with new one (with common vocab)
        # and old index2word with new one
        m.index2word = common_vocab
        old_vocab = m.wv.vocab
        new_vocab = {}
        for new_index,word in enumerate(common_vocab):
            old_vocab_obj=old_vocab[word]
            new_vocab[word] = gensim.models.word2vec.Vocab(index=new_index, count=old_vocab_obj.count)
        m.wv.vocab = new_vocab

    return (m1,m2)

In order to explore this, let's get some data that follows a time trend. We'll look at conference proceedings from the American Society for Clinical Oncologists.

In [None]:
ascoDF = pandas.read_csv("../data/ASCO_abstracts.csv", index_col=0)

Prepare for wor2vec

In [None]:
ascoDF['tokenized_sents'] = ascoDF['Body'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])
ascoDF['normalized_sents'] = ascoDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, stopwordLst = lucem_illud.stop_words_basic) for s in x])

We will be creating many embeddings so we have created this function to do most of the work. It creates two collections of embeddings, one the original and one the aligned.

In [None]:
def compareModels(df, category, sort = True):
    """If you are using time as your category sorting is important"""
    embeddings_raw = {}
    cats = sorted(set(df[category]))
    for cat in cats:
        #This can take a while
        print("Embedding {}".format(cat), end = '\r')
        subsetDF = df[df[category] == cat]
        #You might want to change the W2V parameters
        embeddings_raw[cat] = gensim.models.word2vec.Word2Vec(subsetDF['normalized_sents'].sum())
    #These are much quicker
    embeddings_aligned = {}
    for catOuter in cats:
        embeddings_aligned[catOuter] = [embeddings_raw[catOuter]]
        for catInner in cats:
            embeddings_aligned[catOuter].append(smart_procrustes_align_gensim(embeddings_aligned[catOuter][-1], embeddings_raw[catInner]))
    return embeddings_raw, embeddings_aligned

Now we generate the models

In [None]:
rawEmbeddings, comparedEmbeddings = compareModels(ascoDF, 'Year')

We need to compare them across all permutions so we will define another function to help, we will be using 1 - cosine similarity as that gives a more intitive range of 0-2 with low values meaning little change and high meaning lots of change

In [None]:
def getDivergenceDF(word, embeddingsDict):
    dists = []
    cats = sorted(set(embeddingsDict.keys()))
    dists = {}
    for cat in cats:
        dists[cat] = []
        for embed in embeddingsDict[cat][1:]:
            dists[cat].append(np.abs(1 - sklearn.metrics.pairwise.cosine_similarity(np.expand_dims(embeddingsDict[cat][0][word], axis = 0),
                                                                             np.expand_dims(embed[word], axis = 0))[0,0]))
    return pandas.DataFrame(dists, index = cats)

Lets look at a couple words

In [None]:
targetWord = 'breast'

pltDF = getDivergenceDF(targetWord, comparedEmbeddings)
fig, ax = plt.subplots(figsize = (10, 7))
seaborn.heatmap(pltDF, ax = ax, annot = False) #set annot True for a lot more information
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
ax.set_ylabel("Final year")
ax.set_title("Yearly linguistic change for: '{}'".format(targetWord))
plt.show()

In [None]:
targetWord = 'triple'

pltDF = getDivergenceDF(targetWord, comparedEmbeddings)
fig, ax = plt.subplots(figsize = (10, 7))
seaborn.heatmap(pltDF, ax = ax, annot = False) #set annot True for a lot more information
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
ax.set_ylabel("Final year")
ax.set_title("Yearly linguistic change for: '{}'".format(targetWord))
plt.show()

We can also ask which words changed the most

In [None]:
def findDiverence(word, embeddingsDict):
    cats = sorted(set(embeddingsDict.keys()))
    
    dists = []
    for embed in embeddingsDict[cats[0]][1:]:
        dists.append(1 - sklearn.metrics.pairwise.cosine_similarity(np.expand_dims(embeddingsDict[cats[0]][0][word], axis = 0), np.expand_dims(embed[word], axis = 0))[0,0])
    return sum(dists)

def findMostDivergent(embeddingsDict):
    words = []
    for embeds in embeddingsDict.values():
        for embed in embeds:
            words += list(embed.wv.vocab.keys())
    words = set(words)
    print("Found {} words to compare".format(len(words)))
    return sorted([(w, findDiverence(w, embeddingsDict)) for w in words], key = lambda x: x[1], reverse=True)
    

In [None]:
wordDivergences = findMostDivergent(comparedEmbeddings)

The most divergent words are:

In [None]:
wordDivergences[:10]

And the least

In [None]:
wordDivergences[-10:]

In [None]:
targetWord = wordDivergences[0][0]

pltDF = getDivergenceDF(targetWord, comparedEmbeddings)
fig, ax = plt.subplots(figsize = (10, 7))
seaborn.heatmap(pltDF, ax = ax, annot = False) #set annot True for a lot more information
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
ax.set_ylabel("Final year")
ax.set_title("Yearly linguistic change for: '{}'".format(targetWord))
plt.show()

In [None]:
targetWord = wordDivergences[-1][0]

pltDF = getDivergenceDF(targetWord, comparedEmbeddings)
fig, ax = plt.subplots(figsize = (10, 7))
seaborn.heatmap(pltDF, ax = ax, annot = False) #set annot True for a lot more information
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
ax.set_ylabel("Final year")
ax.set_title("Yearly linguistic change for: '{}'".format(targetWord))
plt.show()

## <span style="color:red">*Exercise 1b*</span>

<span style="color:red">**Do only 3a or 3b.** Construct cells immediately below this that align word embeddings over time. Interrogate the spaces that result and ask which words change most of the whole period. What does this reveal about the social game underlying your space?