# Exercise 2.1 - Train and Explore Word Vectors

# Working with word vectors
What's the NBA of baseball? One way to answer this is to form an analogy: basketball is to NBA as baseball is to ... ? The answer is MLB (Major League Baseball). One of the more interesting developments in the past decade for NLP has been new methods for learning word vectors that encode both the meaning of words _and_ relational knowledge to solve these kinds of analogy tests. In this notebook, you'll see how these work for yourself!

The homework is broken up into several steps: preprocessing code (a constant fact of life for NLP), training a [word2vec](https://en.wikipedia.org/wiki/Word2vec) model using the [gensim](https://radimrehurek.com/gensim_3.8.3/models/word2vec.html) library, and then exploring what the vectors have learned. For this notebook, you'll only train on a small set of the data to keep it quick (though this still learns a lot, as you'll see!); we've pre-computed a word2vec model for you on the full data as well, which you will use later in your evaluations. If you want to read up a bit more on word2vec, there are many great [blog](https://israelg99.github.io/2017-03-23-Word2Vec-Explained/) [posts](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and you could check out the [original paper](https://arxiv.org/pdf/1301.3781.pdf).

This notebook will show you how to train your own vectors in the future too. Methods that learn distributional word vectors (like word2vec, LSA, or even counting) depend on the corpus they're trained on. Here, we're using our list of Wikipedia biographies. This corpus is quite rich in people, places, occupations, and all the things that people do. What all might a model learn from this? You'll see some of it in the tests we have prepared for you, but the model will know a lot more than what we've shown. Once you finish the core tasks, in your own exploration see what else the model has learned. Are there new types of analogies it encodes?

As usual in NLP, there are many options and hyperparameters to choose for word2vec. We've chosen a few to show you here (e.g., tokenization options) to help you get started exploring the space. You can try training on more data, adjusting the window size, or changing the minimum token frequency to see how these impact performance and what the model learns. If you discover something interesting, feel free to discuss!

Finally, if you're feeling ambitious, try plotting pairs of words that have the same relationship, e.g., "NBA", "basketball", "MLB" and "baseball", and see if they have a geometric relationship. As an example of this type of plot, see the plots for the word2vec alternative [GloVe](https://nlp.stanford.edu/projects/glove/), which also encodes this type of information. For plotting, try showing the first two principal components of the vectors (you'll first need to run PCA) or use [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). If you make any interesting plots or discover something cool, please feel free to share in Slack!
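As a starting point for the plotting idea, here is a minimal sketch of projecting a handful of word vectors onto their first two principal components using NumPy's SVD. The vectors below are random stand-ins (an assumption for illustration, not real model output), and `project_2d` is a hypothetical helper name.

```python
import numpy as np

def project_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

words = ["NBA", "basketball", "MLB", "baseball"]

# Random stand-in for real embeddings: 4 words x 50 dimensions.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(len(words), 50))

coords = project_2d(toy_vectors)
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

In the notebook, you could swap `toy_vectors` for an array built from the trained vectors (for example `np.array([full_word_vectors[w] for w in words])`, once `full_word_vectors` is loaded) and pass the two columns of `coords` to `plt.scatter`.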

In [1]:
import gensim
import gzip
import json
import matplotlib.pyplot as plt
import numpy as np
import pickle
import re
import pandas as pd
from tqdm.notebook import tqdm
from gensim.models.word2vec import Word2Vec
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.test.utils import datapath

In [2]:
RANDOM_SEED = 655

To reduce the burden on memory, we use only the first 10,000 biographies and save them into a list that we'll call `bios`. We've pre-saved them as a pickle object `bios.p` so that you can load it directly.

In [3]:
bios = []
with open('assets/bios.p', 'rb') as f:
    bios = pickle.load(f)

Use NLTK's `word_tokenize` to split each biography in `bios` into words, and store each bio's words as a separate list in a list called `nltk_tokenized_bios`. Try wrapping the iteration in a `tqdm` call to see how long it takes.

In [4]:
# YOUR CODE HERE
nltk_tokenized_bios = [word_tokenize(i) for i in tqdm(bios)]


Let's compare NLTK's tokenization with a regular expression-based extraction. Use a regular expression that finds all sequences of word characters (`\w+`) to split each biography in `bios` into words, and store each bio's words as a separate list in a list called `re_tokenized_bios`. Try wrapping the iteration in a `tqdm` call to see how long it takes.

In [5]:
# YOUR CODE HERE

re_tokenized_bios = [re.findall(r"\w+",i) for i in tqdm(bios)]


A surprising speed difference! Let's count how many unique tokens we found from each. Use a `Counter` to count the number of unique words for each and call these `nltk_word_counts` and `re_word_counts` 

In [54]:
# YOUR CODE HERE
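# `set(xs)` counts each token at most once per biography, so these values are
# document frequencies (the number of bios that contain the token).
nltk_word_counts = Counter(x for xs in nltk_tokenized_bios for x in set(xs))
re_word_counts = Counter(x for xs in re_tokenized_bios for x in set(xs))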

In [55]:
len(nltk_word_counts), len(re_word_counts)

(294549, 219293)
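A `Counter` can tally either every occurrence of a token or the number of documents that contain it, and the two give the same vocabulary size but different counts. The sketch below shows the difference on a tiny made-up corpus (the documents are hypothetical, chosen only for illustration).

```python
from collections import Counter

# Tiny made-up corpus: two tokenized "biographies".
toy_docs = [
    ["she", "played", "chess", "and", "she", "won"],
    ["she", "studied", "chess"],
]

# Token frequency: every occurrence counts.
token_counts = Counter(tok for doc in toy_docs for tok in doc)

# Document frequency: each token counts at most once per document.
doc_counts = Counter(tok for doc in toy_docs for tok in set(doc))

print(token_counts["she"], doc_counts["she"])
print(len(token_counts) == len(doc_counts))
```

Both counters have the same keys, so `len(...)` reports the same number of unique tokens either way; only the per-token values differ.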

In [56]:
#hidden tests are within this cell

Let's see what common words we missed through the imprecise regex matching. Create another `Counter` that contains the counts of all words that are in `nltk_word_counts` but not in `re_word_counts`, and call this `unique_word_counts`.

HINT: Familiarize yourself with how to use `Counter` from `collections`.
Documentation: https://docs.python.org/3/library/collections.html

In [57]:
# YOUR CODE HERE
# Collect the distinct tokens that NLTK produced but the regex did not.
unique_word_counts = list({s for s in nltk_word_counts if s not in re_word_counts})

Print the 20 most common NLTK-unique words.

In [58]:
# YOUR CODE HERE
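# Each token appears exactly once in the list built earlier, so every count is 1
# and the order below reflects set iteration order rather than frequency.
unique_word_counts = Counter(unique_word_counts)
unique_word_counts.most_common(20)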

[('7:15', 1),
 ('posts.', 1),
 ('166–167', 1),
 ('Wierusz-Kowalski', 1),
 ('====1975====', 1),
 ('Limpsfield-Oxted', 1),
 ('15–20', 1),
 ("L'Action", 1),
 ('KLAC-TV', 1),
 ('=', 1),
 ('four-movie', 1),
 ('non-monogamous', 1),
 ('b.2014', 1),
 ('Anti-Federalist', 1),
 ('34.1', 1),
 ('Forty-seven', 1),
 ('—one', 1),
 ('pre-development', 1),
 ("L'important", 1),
 ('Philemon—', 1)]

In [59]:
len(unique_word_counts)

81881
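If you want `most_common` to rank the NLTK-only tokens by how often they actually occur, the counts need to be carried over from `nltk_word_counts`. Here is one way that could look, sketched on made-up counters (the words and numbers below are hypothetical, not taken from the corpus).

```python
from collections import Counter

# Made-up counts standing in for nltk_word_counts and re_word_counts.
toy_nltk_counts = Counter({"the": 50, "non-monogamous": 3, "KLAC-TV": 2, "posts.": 1})
toy_re_counts = Counter({"the": 50, "non": 4, "monogamous": 3, "KLAC": 2, "TV": 7, "posts": 5})

# Keep each NLTK-only token together with its original count.
toy_unique_counts = Counter(
    {word: count for word, count in toy_nltk_counts.items() if word not in toy_re_counts}
)

print(toy_unique_counts.most_common(2))
```

Because the dictionary comprehension preserves the counts, the ranking now reflects frequency rather than iteration order.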

In [60]:
#hidden tests are within this cell

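Interesting! Looks like the NLTK-only tokens are mostly words with attached or internal punctuation (hyphenated compounds, number ranges, leftover wiki markup) and a few abbreviations. The regex still captures the word pieces inside these tokens, so in this case we can probably get away with tokenizing with just a regex without missing too much.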
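To see why such tokens end up in only one vocabulary, it can help to tokenize a single sentence both ways. The sketch below compares `\w+` matching with plain whitespace splitting; whitespace splitting is only a rough stand-in for a punctuation-preserving tokenizer (it is not how NLTK works), and the sentence is made up.

```python
import re

text = "The four-movie deal aired on KLAC-TV at 7:15."

# \w+ keeps only runs of word characters, so punctuation acts as a separator.
regex_tokens = re.findall(r"\w+", text)

# Whitespace splitting keeps punctuation attached to the surrounding characters.
whitespace_tokens = text.split()

print(regex_tokens)
print(whitespace_tokens)

# Tokens that exist only in the punctuation-preserving tokenization.
whitespace_only = sorted(set(whitespace_tokens) - set(regex_tokens))
print(whitespace_only)
```

The regex version does not lose the words themselves: `four-movie` simply becomes `four` and `movie`.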

For simplicity, let's treat the regex-based tokens as our final solution: label them as a list called `all_tokenized_bios`, and refer to `re_word_counts` as just `word_counts`.

In [61]:
all_tokenized_bios = re_tokenized_bios 
word_counts = re_word_counts

In [62]:
len(word_counts)

219293

In [63]:
#hidden tests are within this cell

Since we're dealing with lots of named entities, we didn't lower-case anything when tokenizing, which lets us potentially separate out "Apple" (the company) from "apple" (the fruit). Did we add many tokens from doing this? Create another `Counter` called `lowercase_word_counts` that records all the lower-case word counts and print the number of unique words. How many new upper-case words did we learn?

In [64]:
# YOUR CODE HERE
# Sum the counts of all case variants so each lower-cased word keeps its total.
lowercase_word_counts = Counter()
for word, count in word_counts.items():
    lowercase_word_counts[word.lower()] += count

In [65]:
len(word_counts), len(lowercase_word_counts)

(219293, 191199)
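The gap between the two vocabulary sizes is the number of tokens that collapse into another entry once case is ignored. A small sketch on made-up counts (hypothetical words and numbers) shows how the merge works and how to list the words that had more than one casing.

```python
from collections import Counter, defaultdict

# Made-up cased counts, for illustration only.
toy_counts = Counter({"Apple": 3, "apple": 5, "NBA": 2, "The": 10, "the": 40})

# Merge counts across case variants.
toy_lower_counts = Counter()
variants = defaultdict(set)
for word, count in toy_counts.items():
    toy_lower_counts[word.lower()] += count
    variants[word.lower()].add(word)

# Number of vocabulary entries that exist only because of casing.
extra_cased_entries = len(toy_counts) - len(toy_lower_counts)
multi_cased = sorted(w for w, forms in variants.items() if len(forms) > 1)

print(toy_lower_counts["apple"], extra_cased_entries, multi_cased)
```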

In [66]:
#hidden tests are within this cell

In [67]:
%%time
len([w for w,c in word_counts.items() if c >= 100])

CPU times: user 39.9 ms, sys: 130 µs, total: 40.1 ms
Wall time: 39.4 ms


7791

Let's quickly train a word2vec model on the 10,000 tokenized biographies in `all_tokenized_bios`. This should train relatively quickly (~1 minute) and will let us do a few quick sanity tests. Call this model `quick_model`, since it should be relatively quick to train compared with training on the full set of biographies you have been using.

Here, we'll use arguments to specify:
* 50-dimensional vectors
* a window size of +/-2
* a minimum word frequency of 100
* 4 threads to process in parallel
* The class's `RANDOM_SEED`
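To make the "window size of +/-2" concrete, the sketch below lists the context words that fall within two positions of each center word in a short made-up sentence. This is a simplified illustration: gensim's actual training also randomly shrinks the window and subsamples frequent words, which is omitted here, and `context_pairs` is a hypothetical helper name.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within +/-window positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(i - window, 0)
        end = min(i + window + 1, len(tokens))
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

toy_sentence = ["he", "played", "in", "the", "NBA"]
pairs = context_pairs(toy_sentence, window=2)

# Context words seen for the center word "in".
in_contexts = [context for center, context in pairs if center == "in"]
print(in_contexts)
```

A larger window pulls in more topical context, while a small window like this one tends to emphasize words that play similar grammatical roles.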

*Note:* in the gensim version installed for this homework, the vector size is specified with the `size` argument. Other versions of the library use a different name: in gensim 4.0 and later, the same argument is called `vector_size`.
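If you want the same notebook to run under different gensim versions, one option is to pick the keyword name from the version string. The helper below is a hypothetical sketch (it is not part of gensim) and assumes the rename happened at major version 4.

```python
def vector_size_kwarg(gensim_version):
    """Return the Word2Vec keyword that sets the embedding size for this version."""
    major = int(gensim_version.split(".")[0])
    return "vector_size" if major >= 4 else "size"

# Example version strings; in the notebook you would pass gensim.__version__.
print(vector_size_kwarg("3.8.3"))
print(vector_size_kwarg("4.3.2"))
```

The result can then be unpacked into the constructor call, for example `Word2Vec(sentences=all_tokenized_bios, **{vector_size_kwarg(gensim.__version__): 50})`, alongside the other arguments.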

In [68]:
%%time
# YOUR CODE HERE
quick_model = Word2Vec(sentences=all_tokenized_bios, min_count=100, size=50, window=2,  workers=4, seed=RANDOM_SEED)

CPU times: user 1min 55s, sys: 2.49 s, total: 1min 58s
Wall time: 35.2 s


The word vectors are easily accessed through the `.wv` field of our model.

In [69]:
quick_word_vectors = quick_model.wv

In [70]:
quick_word_vectors.similar_by_word("chemistry")[:10]

[('physics', 0.8806302547454834),
 ('biology', 0.8794860243797302),
 ('psychology', 0.8791428208351135),
 ('mathematics', 0.8578429818153381),
 ('economics', 0.8430607914924622),
 ('medicine', 0.8429045081138611),
 ('astronomy', 0.8322444558143616),
 ('anthropology', 0.8304941058158875),
 ('sociology', 0.8193984627723694),
 ('engineering', 0.7988761067390442)]

In [71]:
quick_word_vectors.similar_by_word("football")[:10]

[('baseball', 0.9110628962516785),
 ('basketball', 0.9079024195671082),
 ('hockey', 0.8721810579299927),
 ('soccer', 0.8670200109481812),
 ('tennis', 0.8545448780059814),
 ('golf', 0.8269231915473938),
 ('chess', 0.8118621110916138),
 ('cricket', 0.8074628114700317),
 ('wrestling', 0.7804355621337891),
 ('team', 0.7640689015388489)]
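The scores in these lists are cosine similarities between word vectors. As a rough sketch of what happens under the hood, the code below ranks neighbors by cosine similarity over a few hand-made 3-dimensional vectors (the vectors and the helper `most_similar_toy` are assumptions for illustration, not model output or gensim code).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors, for illustration only.
toy_vectors = {
    "football": [0.9, 0.1, 0.0],
    "baseball": [0.8, 0.2, 0.1],
    "chemistry": [0.0, 0.2, 0.9],
}

def most_similar_toy(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar_toy("football", toy_vectors)
print(ranked)
```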

### Load the Word2Vec model trained on our Wikipedia corpus

Training a model on the full data can take 20-60 minutes so we've precomputed a model for you at `assets/wikipedia.100.word-vecs.kv`. Let's load these vectors using `gensim.models.KeyedVectors.load` and call this `full_word_vectors`. 

In [72]:
# YOUR CODE HERE
full_word_vectors = gensim.models.KeyedVectors.load('assets/wikipedia.100.word-vecs.kv')

As a quick sanity check, print out the 10 most similar words for "chemistry" and "football" again using these new word vectors

In [73]:
# NOTE: this cell queries "the"; use "chemistry" here to match the prompt above.
full_word_vectors.similar_by_word("the")[:10]

[('its', 0.8000362515449524),
 ('their', 0.6638365983963013),
 ('this', 0.6563833951950073),
 ('our', 0.6386459469795227),
 ('his', 0.62073814868927),
 ('The', 0.6133779883384705),
 ('a', 0.6116929054260254),
 ('another', 0.5941141247749329),
 ('her', 0.5800747275352478),
 ('every', 0.5692095160484314)]

In [74]:
full_word_vectors.similar_by_word("football")[:10]

[('soccer', 0.8991492390632629),
 ('basketball', 0.8783006072044373),
 ('baseball', 0.8465266227722168),
 ('hockey', 0.8135026097297668),
 ('lacrosse', 0.7592754364013672),
 ('volleyball', 0.7534387111663818),
 ('rugby', 0.7447830438613892),
 ('softball', 0.7415613532066345),
 ('tennis', 0.7191575169563293),
 ('futsal', 0.7188178300857544)]

## Similarity and Relatedness

Looks like both models are capturing a surprising amount of information. Let's test how well both our quickly-learned and fully-trained vectors reflect human judgments of similarity and relatedness. Here, we'll use the [SimLex-999](https://fh295.github.io/simlex.html) and [WordSim-353](http://alfonseca.org/eng/research/wordsim353.html) benchmarks. The WordSim-353 dataset is notable for including judgments of both similarity and relatedness. 

To test on these datasets, we'll use the `evaluate_word_pairs` function of the `KeyedVectors` class which knows how to read these files and score them using both Pearson and Spearman's correlations. The datasets are stored at:
* `assets/wordsim_similarity_goldstandard.txt`
* `assets/wordsim_relatedness_goldstandard.txt`
* `assets/SimLex-999.tsv`
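The two correlations answer slightly different questions: Pearson measures how linearly the model's scores track the human ratings, while Spearman only compares the rank orderings. The pure-Python sketch below illustrates the difference on made-up scores (the numbers are hypothetical, and the rank function assumes there are no ties).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up human ratings and model cosine similarities for four word pairs.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.95, 0.60, 0.55, 0.10]

pearson_r = pearson(human_scores, model_scores)
spearman_r = spearman(human_scores, model_scores)
print(round(pearson_r, 3), round(spearman_r, 3))
```

Here the model orders the pairs exactly as the humans do, so the rank correlation is perfect even though the raw scores are not linearly related.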

To start, let's compute similarities for `full_word_vectors` and `quick_word_vectors` on both SimLex999 and WordSim353. Save the results of each as variables named with the prefix and dataset, e.g., `full_simlex999` or `quick_wordsim353`, and print the results.

In [78]:
# YOUR CODE HERE
full_simlex999 = full_word_vectors.evaluate_word_pairs(pairs='assets/SimLex-999.tsv')
full_wordsim353 = full_word_vectors.evaluate_word_pairs(pairs='assets/wordsim_similarity_goldstandard.txt')

quick_simlex999 =  quick_word_vectors.evaluate_word_pairs(pairs='assets/SimLex-999.tsv')
quick_wordsim353 = quick_word_vectors.evaluate_word_pairs(pairs='assets/wordsim_similarity_goldstandard.txt')

In [None]:
#hidden tests are within this cell

Let's do the same for relatedness! Here, call your variables `full_relatedness353` and `quick_relatedness353`

In [79]:

# YOUR CODE HERE
full_relatedness353 = full_word_vectors.evaluate_word_pairs(pairs='assets/wordsim_relatedness_goldstandard.txt')
quick_relatedness353 = quick_word_vectors.evaluate_word_pairs(pairs='assets/wordsim_relatedness_goldstandard.txt')

In [80]:
#hidden tests are within this cell

## Word Analogies
Word vectors can capture not only similarity in meaning but relational structure as well. Here, we'll look for analogies of the form `a:b::c:d`, or "a is to b as c is to d". To solve these types of analogies, we'll use the `most_similar` function of `KeyedVectors`, which does the vector arithmetic described in the original word2vec paper.

In your first task, write a function `get_analogy` that takes arguments `a`, `b`, `c`, and uses the `most_similar` function to find and return the `d` item that is analogous to `b` in the `full_word_vectors` space.

*NOTE:* you should make sure to pass lists to `most_similar`, not tuples (the latter will work but give incorrect results)

In [116]:
# YOUR CODE HERE
def get_analogy(a,b,c):
    d = full_word_vectors.most_similar(positive=[b, c], negative=[a])
    return d[0][0]
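The idea behind the `positive`/`negative` arguments is roughly "take `b`, subtract `a`, add `c`, and find the nearest remaining word". The sketch below spells that out on hand-made 2-dimensional vectors. It is a simplification under stated assumptions: the vectors are invented, `toy_analogy` is a hypothetical helper, and gensim additionally normalizes the input vectors before combining them.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Invented toy vectors: first axis ~ "which country", second axis ~ "is a capital".
toy_vectors = {
    "UK": [1.0, 0.0],
    "London": [1.0, 1.0],
    "France": [2.0, 0.0],
    "Paris": [2.0, 1.0],
    "Berlin": [3.0, 1.0],
}

def toy_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

toy_answer = toy_analogy("UK", "London", "France", toy_vectors)
print(toy_answer)
```

Excluding the query words matters: without it, the nearest neighbor of the combined vector is often one of the inputs themselves.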

### Task: Find the appropriate analogy entities/concepts for the following relations

For all questions below, please return your answer as a single `str`. 

### Relation 1: Country::Capital

Find the capital of each of the following countries, given that London is the capital of the UK:
1. France
2. Germany
3. Italy
4. Austria
5. Denmark

In [117]:
answer = get_analogy('UK', 'London', 'France')
print(answer)
#hidden tests are within this cell

Paris


In [118]:
answer = get_analogy('UK', 'London', 'Italy')
print(answer)
#hidden tests are within this cell

Rome


In [119]:
answer = get_analogy('UK', 'London', 'Germany')
print(answer)
#hidden tests are within this cell

Berlin


In [120]:
answer = get_analogy('UK', 'London', 'Austria')
print(answer)
#hidden tests are within this cell

Vienna


In [121]:
answer = get_analogy('UK', 'London', 'Denmark')
print(answer)
#hidden tests are within this cell

Copenhagen


Well, pretty amazing! Considering that these are only trained on Wikipedia biographies, the model has captured quite a bit of information about places.

### Relation 2: Association::Sports

There are a lot of athletes in our dataset. Let's see if we can recover the relationship between sports associations and the sports played in them: NBA is to basketball as NHL is to ???
1. NFL - National ????? League
2. NHL - National ????? League
3. MLB - Major League ?????
4. MLS - Major League ?????
5. NLL - National ????? League

In [122]:
answer = get_analogy('NBA', 'basketball', 'NHL')
print(answer)
#hidden tests are within this cell

soccer


In [123]:
answer = get_analogy('NBA', 'basketball', 'MLB')
print(answer)
#hidden tests are within this cell

baseball


In [124]:
answer = get_analogy('NBA', 'basketball', 'MLS')
print(answer)
#hidden tests are within this cell

soccer


In [125]:
answer = get_analogy('NBA', 'basketball', 'NFL')
print(answer)
#hidden tests are within this cell

baseball


In [126]:
answer = get_analogy('NBA', 'basketball', 'NLL')
print(answer)
#hidden tests are within this cell

soccer


What if we try a different one to predict for the NFL?

In [127]:
answer = get_analogy('MLB', 'baseball', 'NFL')
print(answer)
#hidden tests are within this cell

basketball


What if we try the reverse?

In [128]:
answer = get_analogy('NFL', 'football', 'NBA')
print(answer)
#hidden tests are within this cell

soccer


Huh! This suggests that NFL in our dataset is occurring in different contexts than other sports leagues. But it can be tough to tell what's fully responsible! Perhaps the use of football as both "soccer" and "gridiron football" causes an issue? The model sure seems to like "soccer". 

In [129]:
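# Print a +/-2 token context around each "NFL" mention in the first 1,000 bios.
for bio in all_tokenized_bios[:1000]:
    for i, w in enumerate(bio):
        if w == 'NFL':
            # Clamp the start index so a mention near the start of a bio
            # does not produce a negative (wrap-around) slice.
            print(' '.join(bio[max(i - 2, 0):i + 3]))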

at the NFL Kickoff Live
NBA and NFL teams inspiring
two future NFL Hall of
of the NFL In 1977
in the NFL There were
to an NFL franchise but
let go NFL career Philadelphia
on the NFL had changed
by the NFL that a
of the NFL championship game
lowest in NFL history to
of the NFL rosters had
displace the NFL s sovereignty
its history NFL commissioner 1946
and the NFL schedule 1946
the first NFL commissioner in
among the NFL owners since
the next NFL owners meeting
ban any NFL associated personnel
of the NFL to investigate
scams AAFC NFL merger 1948
1950 The NFL s struggle
prevented the NFL for showing
actualized the NFL s first
an AAFC NFL merger was
of the NFL 1950 1956
into the NFL which forbid
for the NFL He negotiated
of the NFL s rise
declared the NFL was subject
that the NFL should be
the first NFL championship game
named interim NFL commissioner for
of the NFL franchises were
regional CBS NFL and CBS
shows for NFL broadcasts and
returned to NFL studio hosting
last hosted NFL telecas
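Reading raw context lines only goes so far; a complementary approach is to count which words appear near a target token and compare the tallies across leagues. The sketch below does this on two made-up token lists (the documents and the helper name `neighbor_counts` are assumptions for illustration).

```python
from collections import Counter

def neighbor_counts(tokenized_docs, target, window=2):
    """Count tokens that occur within +/-window positions of `target`."""
    counts = Counter()
    for doc in tokenized_docs:
        for i, token in enumerate(doc):
            if token == target:
                start = max(i - window, 0)
                neighbors = doc[start:i] + doc[i + 1:i + window + 1]
                counts.update(neighbors)
    return counts

# Made-up tokenized snippets, for illustration only.
toy_docs = [
    ["of", "the", "NFL", "championship", "game"],
    ["the", "first", "NFL", "commissioner", "in"],
]

nfl_neighbors = neighbor_counts(toy_docs, "NFL")
print(nfl_neighbors.most_common(3))
```

Running the same function with `all_tokenized_bios` for "NFL" and for "NBA" would let you compare the two leagues' typical contexts side by side.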

### Relation 3: Expert::Field

Find the fields in which the following experts have expertise, given that biologists have expertise in biology:
1. sociologists
2. psychologists
3. neuroscientists
4. zoologists
5. physiologists

In [130]:
answer = get_analogy('biologists', 'biology', 'chemists')
print(answer)
#hidden tests are within this cell

chemistry


In [131]:
answer = get_analogy('biologists', 'biology', 'oncologist')
print(answer)
#hidden tests are within this cell

oncology


In [132]:
answer = get_analogy('biologist', 'biology', 'artist')
print(answer)
#hidden tests are within this cell

art


In [133]:
answer = get_analogy('biologist', 'biology', 'zoologist')
print(answer)
#hidden tests are within this cell

botany


In [134]:
answer = get_analogy('biologists', 'biology', 'archaeologists')
print(answer)
#hidden tests are within this cell

archaeology


Not too bad! It seems to have learned something about what the fields and people do!

## Reflection

Which tasks surprised you more? These results show that our model knows a surprising amount. What else does it know? Feel free to try out new types of relationships and words. Keep in mind that it only knows about information in relation to people, since we've trained it on biographies.

One thing we didn't try was training our model with multi-word expressions (MWEs). Gensim actually supports finding these for us using its `Phrases` class when processing the corpus and learning word vectors. This would let us compare things like `United States` as a phrase. Which kinds of MWEs would be fun to test out? Feel free to retrain the model here and test out new analogies on MWEs; if you find interesting ones, post them on the course Slack.
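To get a feel for what phrase detection does before reaching for gensim, here is a deliberately simplified pure-Python sketch that joins adjacent word pairs occurring at least `min_count` times. It is a stand-in, not gensim's method: `Phrases` uses a statistical score rather than a raw count threshold, and the corpus and helper name below are made up.

```python
from collections import Counter

def merge_frequent_bigrams(tokenized_docs, min_count=2):
    """Join adjacent token pairs that occur at least min_count times in the corpus."""
    bigram_counts = Counter(
        (left, right)
        for doc in tokenized_docs
        for left, right in zip(doc, doc[1:])
    )
    merged_docs = []
    for doc in tokenized_docs:
        merged, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and bigram_counts[pair] >= min_count:
                merged.append("_".join(pair))
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

# Made-up tokenized snippets, for illustration only.
toy_docs = [
    ["born", "in", "the", "United", "States"],
    ["United", "States", "senator"],
    ["a", "United", "Kingdom", "team"],
]

merged = merge_frequent_bigrams(toy_docs, min_count=2)
print(merged)
```

Training word2vec on the merged token lists would then give `United_States` its own vector, which is what makes phrase-level analogies possible.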