# Lab 08: Word embeddings (1/2)

<br>

[<img width=900 src="ginsparg.png">](http://www.cs.cornell.edu/~ginsparg/arxiv/gmaps2.html)

Word embeddings represent categorical data, such as words, as vectors in a high-dimensional space. 
They are built by machine learning methods that construct the embedding vectors from cooccurrence statistics, expressed in terms of simple language models. Embeddings reveal surprising semantic relations encoded as linear relationships. But these methods are "data hungry": they require large corpora of text or other cooccurrence data to construct good embeddings. 
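
To make "cooccurrence statistics" concrete, here is a minimal sketch that counts how often pairs of words appear within a fixed window of each other. The toy corpus, the `cooccurrence_counts` helper, and the window size are assumptions made for illustration; GloVe and word2vec use statistics of this kind in more elaborate ways.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count unordered word pairs that appear within `window` positions of each other."""
    counts = Counter()
    for i, center in enumerate(tokens):
        # look only to the right so each pair of positions is counted once
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pair = tuple(sorted((center, tokens[j])))
            counts[pair] += 1
    return counts

# hypothetical toy corpus, for illustration only
tokens = "the cat sat on the mat the dog sat on the rug".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts.most_common(3))
```

Words that tend to share the same neighbors (here, "cat" and "dog" both appear near "the" and "sat") end up with similar count profiles, which is the signal that embedding methods exploit.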

In this lab we will explore some of the basics of word embeddings, playing around with two types of embeddings constructed on large amounts of text extracted from [Wikipedia](https://en.wikipedia.org/wiki/Main_Page). There are several tutorials on the web for this material; one is [here](https://medium.com/swlh/playing-with-word-vectors-308ab2faa519).


To begin, load in the usual modules.

In [None]:
from datascience import *
import numpy as np
import re
import gensim

import os
# this turns off some pesky warnings
os.environ["TF_CPP_MIN_LOG_LEVEL"]="3"

import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=Warning)

# direct plots to appear within the cell, and set their style
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

We will use the [gensim package](https://radimrehurek.com/gensim/index.html), already familiar to us from our foray into topic models. The following bit of code reads in 100-dimensional embedding vectors, trained using the [GloVe](https://nlp.stanford.edu/projects/glove/) algorithm on Wikipedia and Gigaword newswire text. Specifically, the training corpus has 6 billion tokens, with a 400,000-word vocabulary. You can find other precompiled embeddings [here](https://www.diycode.cc/projects/RaRe-Technologies/gensim-data).


In [None]:
import gensim
import gensim.downloader as gdl
from gensim.models import KeyedVectors
glove = gdl.load("glove-wiki-gigaword-100")

Let's explore these embeddings a bit. Here is the vector for 'yale'. Pretty interesting, huh?


In [None]:
glove['yale']

But now let's see which vectors are closest to the 'yale' vector. This is a little more interesting!

In [None]:
glove.most_similar('yale')
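
`most_similar` ranks words by the cosine similarity between their vectors. As a rough sketch of that computation, the following uses made-up 3-dimensional vectors; the `toy` dictionary and the `cosine` helper are illustrative assumptions, not part of gensim or the GloVe data.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# made-up vectors, for illustration only
toy = {
    "yale":    np.array([0.9, 0.8, 0.1]),
    "harvard": np.array([0.8, 0.9, 0.2]),
    "banana":  np.array([0.1, 0.0, 0.9]),
}

query = "yale"
ranked = sorted(
    ((w, cosine(toy[query], v)) for w, v in toy.items() if w != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # most similar word first
```

With the real vectors, `glove.similarity('yale', 'harvard')` returns the same kind of score for a single pair of words.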

### 1. Explore word similarity using embeddings.  

Now, create a few cells where you use the `most_similar` function to find words similar to a few words that you select. Add some markdown to describe your findings, and explain why they do or do not seem to make sense.

In [None]:
# your code and markdown go here ...

Now, let's look at some of the components of the embedding vectors. What do the distributions of values look like?
We'll first pull out the vocabulary.

In [None]:
# gensim >= 4.0 exposes the vocabulary as key_to_index (older versions used glove.vocab)
vocab = set(glove.key_to_index)
len(vocab)

In [None]:
i = 50 # we'll look at this component

x = [] # this will be a list of all 400,000 values, one for each word in the vocabulary
for w in vocab:
    x.append(glove[w][i])

# histplot replaces the deprecated distplot in seaborn >= 0.11
ax = sns.histplot(x, kde=True, stat="density")


### 2. Generate a scatter plot

Now, generate a scatter plot of a few <i>pairs</i> of components. For example, you could extract the first and second components of all the embedding vectors. What do you see? Describe the distributions of points. Do they look random?


In [None]:
# Your code and markdown go here


Ok, next we are going to construct a new set of embedding vectors, on a smaller subset of Wikipedia. The following code runs the "word2vec" algorithm to make embeddings. It takes a while to run.

In [None]:
import gensim
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)
# !wget http://mattmahoney.net/dc/text8.zip
# !unzip text8.zip
sentences = word2vec.Text8Corpus('text8')
# gensim >= 4.0 calls this parameter vector_size (older versions used size)
w2v = word2vec.Word2Vec(sentences, vector_size=100, window=10, min_count=10)


In [None]:
# the trained word vectors live in the model's .wv attribute
w2v.wv.most_similar('yale')

### 3. Compare the embeddings

Choose five words, and for each word look at the most similar words according to the `glove` and the `w2v` embeddings. What do you find? Describe at least two reasons why one set of embeddings might be better than the other.



In [None]:
# your code and markdown go here ...


### 4. Exploring analogies

Now we'll explore how analogies are "solved" using the embeddings. Here is the canonical example:


In [None]:
glove.most_similar(positive=['king', 'woman'], negative=['man'])

Note that to get the top choice, you can just do this:

In [None]:
glove.most_similar(positive=['king', 'woman'], negative=['man'])[0]


Or this:

In [None]:
glove.most_similar(positive=['king', 'woman'], negative=['man'])[0][0]
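
Roughly speaking, `most_similar` with `positive` and `negative` lists adds the length-normalized positive vectors, subtracts the negative ones, and returns the words whose vectors have the highest cosine similarity to the result, skipping the input words. Here is a minimal sketch of that arithmetic with made-up 3-dimensional vectors; the `toy` vocabulary and the `unit` and `analogy` helpers are assumptions for illustration, not gensim's implementation.

```python
import numpy as np

def unit(v):
    """Scale v to length 1."""
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    target = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    exclude = set(positive) | set(negative)  # never return one of the inputs
    scores = {w: float(np.dot(unit(v), target))
              for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

# made-up vectors: the components loosely mean (royalty, male, female)
toy = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([1.0, 1.0, 1.0]),
}

print(analogy(toy, positive=["king", "woman"], negative=["man"]))
```

Excluding the input words matters: without that step, the nearest vector to the target is often one of the query words themselves.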


Now choose at least five analogies. For each one, try to complete it with both the `glove` and the `w2v` model. It is best to write some code that loops over a list of triples of words, like ('king', 'man', 'woman').

Which of the analogies do the models "get right"? Which are clearly wrong? Describe your findings and speculate on some reasons that the models might miss some of the analogies.



In [None]:
# your code and markdown go here ...