The meaning of words often changes over time.  In this homework, you will explore this phenomenon by comparing word embeddings trained on historical texts (largely published before 1923) with embeddings trained on contemporary texts, and by identifying words whose meaning has shifted over roughly one hundred years.

In [1]:
import re
from gensim.models import KeyedVectors
import operator

In [14]:
wiki = KeyedVectors.load_word2vec_format("../data/glove.6B.50d.50K.txt", binary=False)

In [15]:
guten = KeyedVectors.load_word2vec_format("../data/gutenberg.200.vectors.50K.txt", binary=False)

Q1. Before we jump in, select 5 words whose senses you believe have changed over the past 100 years. Ensure they are in the vocabulary of both models.  For each word, explain its earlier meaning and its current meaning.  This is an important step in stating your beliefs before you examine any empirical evidence; do not change these terms after you have run the models you develop below.  (Here we are only evaluating the rationales, not whether the terms *actually* undergo sense change, as measured below.)

In [16]:
# fill in terms here
terms=["gay", "bureau", "job", "intelligence", "cabinet"]
for term in terms:
    if term not in wiki or term not in guten:
        print("%s missing!" % term)

**Q1 response**.

Q2. Find the words that have changed the most by calculating the number of words that overlap in their 50 nearest neighbors.  That is, let $\mathcal{N}_{guten}(\textrm{awesome})$ be the 50 nearest neighbors for the word "awesome" in the Gutenberg embeddings and $\mathcal{N}_{wiki}(\textrm{awesome})$ be the 50 nearest neighbors for "awesome" in the Wikipedia embeddings.  Calculate the size of $\mathcal{N}_{guten}(\textrm{awesome}) \cap \mathcal{N}_{wiki}(\textrm{awesome})$.  Under this method, the words that share the *fewest* neighbors have moved the furthest apart.  Display the 100 words that have moved the furthest apart and the 100 words that have remained the closest together, along with their intersection score.
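
To make the overlap score concrete, here is a minimal sketch on made-up two-dimensional vectors; it is an illustration only, not the assignment's implementation. The helper names (`unit`, `nearest_neighbors`, `neighbor_overlap`), the toy vocabulary, and the angles are all assumptions chosen for the example, and it uses 2 nearest neighbors instead of 50.

```python
import math

def unit(degrees):
    """Hypothetical helper: a 2-d unit vector at the given angle."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(vectors, word, k):
    """Set of the k words most cosine-similar to `word`, excluding `word` itself."""
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return {other for _, other in scored[:k]}

def neighbor_overlap(vectors_a, vectors_b, word, k):
    """Number of neighbors of `word` shared between the two embedding spaces."""
    return len(nearest_neighbors(vectors_a, word, k)
               & nearest_neighbors(vectors_b, word, k))

# Toy "historical" space: "bureau" sits near the furniture words.
old = {"desk": unit(5), "cupboard": unit(8), "drawer": unit(11),
       "bureau": unit(20), "department": unit(80), "agency": unit(85)}
# Toy "contemporary" space: only "bureau" has moved, toward the organization words.
new = {"desk": unit(5), "cupboard": unit(8), "drawer": unit(11),
       "bureau": unit(82), "department": unit(80), "agency": unit(85)}

overlap_bureau = neighbor_overlap(old, new, "bureau", k=2)  # no shared neighbors
overlap_desk = neighbor_overlap(old, new, "desk", k=2)      # both neighbors shared
print(overlap_bureau, overlap_desk)
```

In this toy setup the word that moved scores 0 and the word that stayed put scores 2, which is the pattern the question asks you to look for at scale.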

In [17]:
def find_nn_overlap(vecs1, vecs2, num_neighbors=50):
    """Map each word found in both models to the number of nearest neighbors it shares across them."""

    def overlap(neighbors1, neighbors2):
        # most_similar returns (word, similarity) pairs; keep only the words
        set1 = {word for word, _ in neighbors1}
        set2 = {word for word, _ in neighbors2}
        return len(set1 & set2)

    sims = {}
    for word in vecs1.key_to_index:
        # Only words present in both vocabularies can be compared
        if word in vecs2.key_to_index:
            sims[word] = overlap(
                vecs1.most_similar(word, topn=num_neighbors),
                vecs2.most_similar(word, topn=num_neighbors),
            )
    return sims

In [18]:
sims=find_nn_overlap(wiki, guten)

In [19]:
# Moved the furthest away
sorted_x = sorted(sims.items(), key=operator.itemgetter(1), reverse=False)
for k, v in sorted_x[0:100]:
    print(v, k)

0 -
0 '
0 us
0 _
0 against
0 between
0 international
0 former
0 security
0 public
0 according
0 –
0 top
0 major
0 based
0 bush
0 washington
0 final
0 israel
0 al
0 film
0 air
0 prime
0 companies
0 least
0 talks
0 today
0 record
0 ...
0 deal
0 round
0 total
0 include
0 further
0 media
0 research
0 board
0 growth
0 special
0 named
0 term
0 korea
0 opposition
0 future
0 album
0 index
0 face
0 energy
0 key
0 post
0 job
0 middle
0 located
0 clear
0 un
0 available
0 civil
0 indian
0 ground
0 pressure
0 legal
0 budget
0 currently
0 vice
0 common
0 means
0 no.
0 independent
0 `
0 created
0 cooperation
0 poor
0 firm
0 cross
0 includes
0 scheduled
0 net
0 sides
0 award
0 addition
0 terms
0 joint
0 charge
0 atlanta
0 spending
0 coalition
0 lee
0 turkey
0 potential
0 fans
0 goals
0 unit
0 significant
0 upon
0 intelligence
0 employees
0 claims
0 terrorism
0 —
0 mass


In [20]:
# Remained the closest together
sorted_x = sorted(sims.items(), key=operator.itemgetter(1), reverse=True)
for k, v in sorted_x[0:100]:
    print(v, k)

44 38
44 37
44 39
43 48
43 43
43 49
42 59
41 36
41 42
41 41
41 46
38 6
38 33
37 57
36 32
35 5
35 44
35 55
35 1869
34 7
34 45
34 65
34 62
34 1865
34 1866
34 1843
33 2
33 8
33 9
33 47
33 56
33 1856
33 1844
32 35
32 52
32 67
32 68
32 61
32 1902
32 1850
32 1854
31 10
31 4
31 14
31 34
31 53
31 fifteen
31 1907
31 1858
31 1840
31 1845
31 1831
30 11
30 13
30 16
30 19
30 21
30 31
30 77
30 1908
30 1899
30 fourteen
30 1861
30 1859
30 1855
30 1829
29 1
29 3
29 17
29 40
29 iowa
29 kentucky
29 54
29 12th
29 63
29 13th
29 1905
29 1901
29 1903
29 1885
29 1876
29 thirteenth
28 20
28 15
28 12
28 18
28 25
28 51
28 4th
28 16th
28 15th
28 14th
28 twelve
28 5th
28 78
28 1900
28 79
28 7th
28 6th
28 9th


Now let's look at how much the candidate terms you defined above have changed their meaning as measured in these embeddings.  First, we can just print their neighborhoods:

In [21]:
def print_top(word):
    print("=== %s ===\n" % word)
    print("Gutenberg:")
    for k, v in guten.most_similar(word, topn=10):
        print("%.3f\t%s" % (v,k))

    print()
    print("Wikipedia:")
    for k, v in wiki.most_similar(word, topn=10):
        print("%.3f\t%s" % (v,k)) 
    print()

In [22]:
for term in terms:
    print_top(term)

=== gay ===

Gutenberg:
0.678	merry
0.659	joyous
0.645	lively
0.622	debonair
0.613	gayest
0.609	light-hearted
0.606	vivacious
0.606	cheerful
0.599	blithe
0.593	carefree

Wikipedia:
0.850	lesbian
0.801	abortion
0.801	gays
0.799	lesbians
0.790	homosexual
0.765	homosexuals
0.763	sex
0.725	bisexual
0.707	lgbt
0.699	transgender

=== bureau ===

Gutenberg:
0.680	desk
0.645	drawers
0.643	book-case
0.642	cupboard
0.639	bookcase
0.639	drawer
0.633	closet
0.609	writing-desk
0.605	washstand
0.604	dressing-table

Wikipedia:
0.872	department
0.805	agency
0.793	commerce
0.776	according
0.755	report
0.740	statistics
0.737	ministry
0.730	official
0.730	xinhua
0.727	directorate

=== job ===

Gutenberg:
0.606	eddication
0.593	chanst
0.590	stunt
0.589	meself
0.585	sassy
0.585	get-away
0.578	derned
0.572	diggin
0.569	sumpthing
0.566	dope

Wikipedia:
0.837	getting
0.825	doing
0.822	better
0.817	get
0.810	needs
0.809	good
0.805	working
0.800	putting
0.797	done
0.796	going

=== intelligence ===

Gutenberg:
0

**check+**. Let's make this a little more precise.  Rank all terms by the overlap score you created above, so that words with scores closer to 0 (i.e., no overlap in nearest neighbors) are ranked higher (i.e., closer to position 1). Measure how good your guesses were by calculating their [mean reciprocal rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) within this list.  (Again, we're not evaluating how good your guesses were above, but rather the correctness of your implementation of MRR.)
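
For reference, with a set of query terms $Q$ and $\textrm{rank}_i$ the 1-based position of the $i$-th query in the ranked list, $\textrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{1}{\textrm{rank}_i}$. The following is a minimal sketch of that calculation on made-up overlap scores, not the graded solution; the function name `mean_reciprocal_rank` and the toy scores are assumptions for illustration. Note that Python's `sorted` is stable, so words with tied scores keep their dictionary order, which makes the rank among ties somewhat arbitrary.

```python
def mean_reciprocal_rank(queries, overlap_scores):
    """Average of 1/rank over the queries, ranking words by ascending overlap score."""
    # Lowest overlap first; ties keep dictionary insertion order (sorted is stable).
    ranked = sorted(overlap_scores, key=overlap_scores.get)
    # Ranks are 1-based: the first word in the list has rank 1.
    rank_of = {word: i + 1 for i, word in enumerate(ranked)}
    return sum(1.0 / rank_of[q] for q in queries) / len(queries)

# Hypothetical overlap scores (not taken from the real embeddings).
toy_scores = {"bureau": 0, "job": 3, "cabinet": 7, "seven": 40}

# "bureau" has rank 1 and "cabinet" has rank 3, so MRR = (1/1 + 1/3) / 2.
mrr = mean_reciprocal_rank(["bureau", "cabinet"], toy_scores)
print("MRR: %.3f" % mrr)
```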

In [23]:
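def get_rank_of_queries(terms, sims):
    # Sort ascending so that words with the least neighbor overlap come first
    sorted_x = sorted(sims.items(), key=operator.itemgetter(1), reverse=False)
    ranked_list = [x for x, _ in sorted_x]
    query_rank = 0.
    for term in terms:
        idx = ranked_list.index(term)  # 0-based position in the ranked list
        print(term, idx)
        # Reciprocal rank uses 1-based ranks, so position 0 contributes 1/1
        # (this also avoids dividing by zero for the top-ranked word)
        query_rank += 1. / (idx + 1)

    print("MRR: %.3f" % (query_rank / len(terms)))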

In [24]:
get_rank_of_queries(terms, sims)

gay 216
bureau 142
job 50
intelligence 94
cabinet 104
MRR: 0.010
