# The purpose

In any language there is structure in a sentence; verbs communicate action, nouns represent entities, and adjectives describe.

For a single word, synonyms are other words with the same or nearly the same meaning that can serve as substitutes. These relationships are curated and stored manually in thesauruses.

However, there are relationships between words that are not synonyms. For example, both a car and a horse can be means of conveyance (although a horse also exists on its own, unlike a car). What we want is a framework that lets us identify similarities in words and their meanings that extend beyond substitutable words.

This project would be far too large to be undertaken by humans, so the question is how we could tackle it quantitatively.

# Exploiting structure

Fortunately, languages are constructed such that sentence structure and word usage are similar when we express similar concepts.

*I rode in the car into town.* 

*I rode the horse into town.*

Not **exactly** the same, which makes this task difficult, but sufficiently similar to make it feasible.

# Vectors and matrices - verbal similarity to numeric distance

The basic concept: take words and, according to how they are used in bodies of text, encode them in an n-dimensional space. Since a word has no natural Cartesian coordinate to start from, we have to situate all of the words based on their co-occurrences in sentences. Words that never appear together shouldn't sit right next to each other, but if they are used similarly in sentences then the direction and magnitude of their relationships to other words should be similar (even if the words themselves are distant in vector space).
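To make "co-occurrence" concrete, here is a minimal sketch that counts how many sentences each pair of words shares, using the two example sentences from the previous section. The helper name `shared` is made up for this illustration, and real embeddings are learned from far richer statistics than these raw counts.

```python
from collections import Counter
from itertools import combinations

# Two toy sentences (the examples from the previous section), already tokenized.
sentences = [
    ["i", "rode", "in", "the", "car", "into", "town"],
    ["i", "rode", "the", "horse", "into", "town"],
]

# Count how often each unordered pair of distinct words shares a sentence.
cooccurrence = Counter()
for tokens in sentences:
    for a, b in combinations(sorted(set(tokens)), 2):
        cooccurrence[(a, b)] += 1

def shared(word_a, word_b):
    """Number of sentences in which both words appear."""
    return cooccurrence[tuple(sorted((word_a, word_b)))]

print(shared("car", "town"))    # 'car' and 'town' share one sentence
print(shared("horse", "town"))  # so do 'horse' and 'town'
print(shared("car", "horse"))   # 'car' and 'horse' never appear together
```

Even though `car` and `horse` never co-occur, they have the same relationship to `rode` and `town`, which is exactly the kind of similarity we want the vector space to capture.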

In [None]:
8 + 4

<img src='../../images/word_matrix.jpg' width='400px'></img>

# Converting context into numbers

There are multiple approaches for encoding verbal context into numeric distances. We are going to explore one of them, *k-skip n-grams*.

# k-skip n-grams

Once we detail it, the name gives it away. A k-skip n-gram specifies that you create all n-grams in a sentence whose words are up to k skips apart. Similar to the window of a rolling average, k sets how far the window stretches. So if we take the sentence:

*I ran to the store to pick up milk.*

and said that we should do **bi-grams**, then we would have

`[('I', 'ran'), ('ran', 'to'), ('to', 'the'), ('the', 'store'), ('store', 'to'), ('to', 'pick'), ('pick', 'up'), ('up', 'milk')]`

which is 8 bi-grams.

If I then asked for **1-skip bi-grams**, then we would have:

`[(I, ran), (I, to), (ran, to), (ran, the), (to, the), (to, store), (the, store), (the, to), (store, to), (store, pick), (to, pick), (to, up), (pick, up), (pick, milk), (up, milk)]`

which is 15 bi-grams. 

You can use any combination of `n` and `k` that you desire, but it is already easy to see how the number of word combinations explodes as `n` and `k` increase, even for a single sentence.

In [None]:
from nltk.util import skipgrams

sentence = ['I', 'ran', 'to', 'the', 'store', 'to', 'pick', 'up', 'milk']
list( skipgrams(sentence, 2, 0) )

In [None]:
list( skipgrams(sentence, 2, 1) )

In [None]:
list( skipgrams(sentence, 2, 2) )

In [None]:
list( skipgrams(sentence, 3, 2) )

Relatively simple to calculate quickly with nltk. Writing our own function wouldn't be awful either; we would just want to make use of the `itertools` library and its combination/permutation functions.
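As a rough sketch of what such a hand-written function might look like, the version below keeps the first word of each window fixed and picks the remaining `n - 1` words, in order, from the next `n - 1 + k` tokens. The name `simple_skipgrams` is invented here; this is an illustration of the idea, not a copy of nltk's code.

```python
from itertools import combinations

def simple_skipgrams(tokens, n, k):
    """Return all n-grams of `tokens` that skip at most k tokens in total."""
    grams = []
    for start in range(len(tokens)):
        head = tokens[start]
        # The remaining n - 1 words are drawn, in order, from the next n - 1 + k tokens.
        tail_window = tokens[start + 1 : start + n + k]
        for tail in combinations(tail_window, n - 1):
            grams.append((head, *tail))
    return grams

sentence = ['I', 'ran', 'to', 'the', 'store', 'to', 'pick', 'up', 'milk']
print(len(simple_skipgrams(sentence, 2, 0)))  # 8 plain bi-grams
print(len(simple_skipgrams(sentence, 2, 1)))  # 15 one-skip bi-grams
```

Because each window contributes one n-gram per way of choosing the tail words, the count grows combinatorially with `n` and `k`.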

As a quick test, I want you to load the WoS data and calculate how many skip-grams there are for one journal (take your pick of `n` and `k`).

In [None]:
#Exercise
import json

# Load the WoS data; the context manager closes the file once it has been read.
with open('../data/wos_topic_doc.json') as f:
    wosdata = json.load(f)

wosdata.keys()


In [None]:
for skipn in [0, 1, 2, 3, 4]:
    skip_count = 0
    for entry in wosdata['AMERICAN_ECONOMIC_REVIEW']:
        skip_count += len( list(skipgrams(entry, 4, skipn)) )
    print(skipn, skip_count)

In [None]:
skip_count

In [None]:
len(wosdata['AMERICAN_ECONOMIC_REVIEW'])

# How does this fit in?

The trick is that we turn this into a prediction problem. When we look at the sentence

`I ran to the store to pick up milk.`

and we construct a skip-gram

`['ran', 'to', 'store']`

there is the skipped word

`'the'`

which we could set up as a `(context, target)` pair

`(['ran', 'to', 'store'], 'the')`

We could set this up as a problem where we input `['ran', 'to', 'store']` to predict `the`, or where we use `the` to predict its context words.
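A small sketch of how such pairs could be generated for a whole sentence is shown below. It assumes a symmetric window of `window` words on each side of the target, which is one common choice and slightly wider than the single example above; the function name `context_target_pairs` is hypothetical.

```python
def context_target_pairs(tokens, window):
    """Pair every word with the words up to `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        before = tokens[max(0, i - window) : i]
        after = tokens[i + 1 : i + 1 + window]
        pairs.append((before + after, target))
    return pairs

tokens = ['I', 'ran', 'to', 'the', 'store', 'to', 'pick', 'up', 'milk']
pairs = context_target_pairs(tokens, window=2)
print(pairs[3])  # (['ran', 'to', 'store', 'to'], 'the')

# Skip-gram direction: one (input, output) example per context word.
skipgram_examples = [(target, word) for context, target in pairs for word in context]
print(skipgram_examples[:3])  # [('I', 'ran'), ('I', 'to'), ('ran', 'I')]
```

The same pairs serve both directions: read them as context to target, or flip them so the target predicts each context word.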

# Setting up the prediction problem

There are two ways that we can go about this. One is to use the skip-gram model we already set up:

<img src='../../images/skipgram.png'></img>

With this approach, we use the target word to predict its context words. The embedding is then the learned weights in the projection layer. This works best when we have a very large dataset.

The alternative is the continuous bag of words (CBOW) model:

<img src='../../images/cbow.png'></img>

With this approach, we use the context to predict the target. This averages out the noise in individual words' contributions to the context, which makes it workable with smaller datasets.
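To show where the averaging happens, here is a minimal sketch of a single CBOW forward pass with random, untrained weights. The tiny vocabulary, the embedding size and the names `projection`, `output` and `cbow_probabilities` are all assumptions made for this illustration. It also uses a full softmax for simplicity, whereas practical word2vec implementations rely on cheaper approximations such as negative sampling or hierarchical softmax.

```python
import math
import random

random.seed(0)
vocab = ['i', 'ran', 'to', 'the', 'store']
index = {word: i for i, word in enumerate(vocab)}
dim = 3  # embedding size (tiny, for illustration only)

# Projection layer: one row (the word's embedding) per vocabulary word.
projection = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
# Output layer: one weight vector per vocabulary word.
output = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def cbow_probabilities(context_words):
    """Average the context embeddings, then score every word with a softmax."""
    rows = [projection[index[w]] for w in context_words]
    hidden = [sum(col) / len(rows) for col in zip(*rows)]
    scores = [sum(h * w for h, w in zip(hidden, weights)) for weights in output]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_probabilities(['ran', 'to', 'store'])
for word, p in zip(vocab, probs):
    print(word, round(p, 3))
```

Training would adjust `projection` and `output` so that the probability of the true target (`the` in this case) goes up; the rows of `projection` are what we keep as the embedding.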

# A primer on neural nets

<img src='../../images/l9_neuralnet_-1.png'></img>

<img src='../../images/l9_neuralnet_0.png'></img>

<img src='../../images/l9_neuralnet_1.png' width='600px'></img>

<img src='../../images/l9_neuralnet_2.png' width='600px'></img>

<img src='../../images/l9_neuralnet_3.png' width='600px'></img>

<img src='../../images/l9_neuralnet_4.png' width='600px'></img>

# Calculating error in a neural network

In [None]:

from IPython.display import HTML

# Youtube
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Ilg3gGewQ5U?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

# Word2Vec in Gensim

We could build and train our own word2vec model in TensorFlow, but gensim also has a built-in implementation that requires less boilerplate code and still retains most options.

In [None]:
sentences = [x.lower().split() for x in ["Human machine interface for lab abc computer applications",
                                         "A survey of user opinion of computer system response time",
                                         "The EPS user interface management system",
                                         "System and human system engineering testing of EPS",              
                                         "Relation of user perceived response time to error measurement",
                                         "The generation of random binary unordered trees",
                                         "The intersection graph of paths in trees",
                                         "Graph minors IV Widths of trees and well quasi ordering",
                                         "Graph minors A survey"]]
sentences

In [None]:
import gensim as gs
#and we can create a model in one shot if we wanted to.
model = gs.models.Word2Vec(sentences, min_count=1, sg=0, seed=1)

And that's it to get up and running. `sg=1` sets Word2Vec to use skip-grams, while `sg=0` sets Word2Vec to use continuous bag of words. 

We could double check what the vector size is (although this is set when we initialize the model)

In [None]:
model.vector_size

And check on the vocabulary constructed from the dataset.

In [None]:
# gensim < 4.0; in gensim 4.0 and later, use model.wv.key_to_index instead
model.wv.vocab

And using one of the words, pull its vector out.

In [None]:
model.wv['trees']

And ask it to predict a word given context.

In [None]:
model.predict_output_word(['graph', 'trees'])

We could also pull out the vector weights for all of the vocabulary words and look at them.

Pull out all of the words and plot the first two dimensions with the word labels.

In [None]:
model.wv.vocab

# Visualizing the maximum variance

Picking two dimensions at random isn't a good way to visualize the embedding, since the algorithm is relying on all 100 dimensions to describe the words (i.e. all dimensions are working to create the optimal embedding).

Since we lack the ability to plot/view 100 dimensions simultaneously our best bet is to employ **dimensionality reduction** and create optimally weighted axes across dimensions to capture the variance within the data.

# Principal Component Analysis (PCA)

PCA is a commonly used dimensionality reduction algorithm. It works by transforming a set of data observations into a set of linearly uncorrelated variables (the components). A key difference between PCA and Factor Analysis is that the components are orthogonal (i.e. components are at 90 degree angles to one another). The components are calculated as the eigenvectors of the covariance matrix of the data; because that matrix is symmetric, its eigenvectors are guaranteed to be orthogonal.
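To illustrate the eigenvector idea without any library, the sketch below finds only the first component of a small, made-up 2-D dataset by power iteration on the covariance matrix. This is a teaching aid under stated assumptions (hypothetical data, a single component, a fixed number of iterations); scikit-learn's `PCA` computes all components with an SVD and is what you would use in practice.

```python
import math

def first_principal_component(points, iterations=200):
    """Approximate the top eigenvector of the covariance matrix by power iteration."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    vec = [1.0] + [0.0] * (dim - 1)  # starting guess
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec

# Hypothetical 2-D points scattered around the line y = x.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
means, component = first_principal_component(points)
# One coordinate per point along the direction of greatest variance.
projected = [sum((p[j] - means[j]) * component[j] for j in range(2)) for p in points]
print(component)
print(projected)
```

The component points roughly along the diagonal, so a single projected coordinate preserves almost all of the spread in the original two dimensions.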

We can use the scikit learn implementation of PCA

In [None]:
from sklearn.decomposition import PCA




# Moving forward

And this is as far as we can possibly take this example text. To **actually** train a word embedding, you need a massive dataset. 

In lieu of that, we can load pretrained representations that someone else has already built. We will use the Stanford Global Vectors for Word Representation (GloVe), since the file is smaller than Google's trained word2vec.

https://nlp.stanford.edu/projects/glove/

You can move the downloaded folder into our `data/` folder.

In [None]:
ls ../data/glove/

GloVe supplies four files; they were all trained on the same corpus but with different vector sizes (the vector size is marked in the filename).

We will keep working with the 100-dimension vectors. Unlike the Google dataset (which is in the standard word2vec format), the GloVe dataset needs to be converted to the word2vec format.

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = '../data/glove/glove.6B.100d.txt'
word2vec_output_file = '../data/glove/glove.6B.100d.word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)

And now we can read this word2vec file the same as we could one that we downloaded directly from Google.

In [None]:
from gensim.models import KeyedVectors

glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file)

With a large, well-trained model we can start pulling out the famous examples.

`king - man + woman = ?`

In [None]:
result = glove_model.most_similar(positive=['woman', 'king'], negative=['man'])
result

## How does this work?

Everything is based off of vectors. Given two data points, they each have a vector from the origin.

<img src='../../images/l19_vector_origin.jpg' width='300px'></img>

# Vector addition works by stacking the vectors

If we add `King+Man`, the resultant vector is found by placing the tail of one vector at the head of the other: it runs from the origin to the end of the stacked pair.

<img src='../../images/l19_vector_addition.jpg' width='300px'></img>

# Vector subtraction requires 'flipping' the subtracted vector

<img src='../../images/l19_vector_subtraction.jpg' width='300px'></img>
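Numerically, both operations are carried out element by element. A minimal sketch follows, using hypothetical 2-D vectors chosen only to keep the arithmetic easy to follow (they are not real embedding values).

```python
def add(u, v):
    """Element-wise sum: place the tail of v at the head of u."""
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    """Element-wise difference: flip v, then add it to u."""
    return add(u, [-b for b in v])

# Hypothetical 2-D vectors, chosen only to make the arithmetic easy to follow.
king = [3.0, 4.0]
man = [1.0, 2.0]

print(add(king, man))       # [4.0, 6.0]
print(subtract(king, man))  # [2.0, 2.0]
```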

# Similarity between two vectors is based off of the angle between them

<img src='../../images/l19_cosine_1.jpg' width='300px'></img>
<img src='../../images/l19_cosine_2.jpg' width='300px'></img>
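The angle-based similarity is usually computed as the cosine of the angle between the two vectors. The sketch below applies it to the `king - man + woman` query over a handful of hypothetical 2-D embeddings, invented so that the answer is easy to verify by hand. It mimics the idea behind `most_similar`, including leaving the query words out of the candidates, but it is not gensim's exact procedure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = perpendicular."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D embeddings, invented for this illustration (not GloVe values).
toy = {
    'king':  [0.9, 0.9],
    'queen': [0.9, -0.9],
    'man':   [0.1, 0.9],
    'woman': [0.1, -0.9],
    'apple': [-0.5, 0.1],
}

# king - man + woman
query = [k - m + w for k, m, w in zip(toy['king'], toy['man'], toy['woman'])]

# Rank every word that was not part of the query by its similarity to the result.
candidates = [w for w in toy if w not in ('king', 'man', 'woman')]
best = max(candidates, key=lambda w: cosine_similarity(query, toy[w]))
print(best)  # queen
```

Because cosine similarity depends only on direction, two vectors of very different lengths can still be rated as nearly identical.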

In [None]:
glove_model.most_similar(positive=['cat', 'puppy'], negative=['kitten'])

In [None]:
glove_model.most_similar(positive=['dog', 'kitten'], negative=['puppy'])

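When testing these relationships, the arrangement of the lists matters. The word in the `negative` list is subtracted, so a query with `positive=[a, b]` and `negative=[c]` asks for the word nearest to `a + b - c`. Placing the same words in the lists a different way asks a different question and will not necessarily complete the analogy.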

In [None]:
glove_model.most_similar(positive=['dog', 'puppy'], negative=['cat'])

We can further extend this 'play' with other examples. If we enter in one word then we are effectively asking for the nearest neighbors.

In [None]:
glove_model.most_similar(positive=['frog'])

In [None]:
glove_model.most_similar(positive=['puppy'])

And there can be interesting matches given a multi-word concept.

In [None]:
glove_model.most_similar(positive = ['slow', 'slower', 'slowest'])

# How do you test the 'goodness'?

Despite the fact that we have set this problem up as a *prediction* one to fit the optimal weights in the neural network, this is still an unsupervised problem (i.e. there are no external labels; we are the ones choosing that the nearby context words serve as either the input or the output).

The **best** test that we have is to see whether our trained weights can replicate already agreed-upon relationships.

As a part of the original word2vec project, the authors supplied `questions-words.txt`, which has a large number of these analogies.

In [None]:
headers = [line.strip() for line in open('../data/questions-words.txt').readlines() if ':' in line]
headers

The types of relationships are encoded in the lines that start with a ':'.

In [None]:
phrases = [l.lower().strip().split() for l in open('../data/questions-words.txt').readlines() if ':' not in l]

In [None]:
phrases[:10]
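If you want to keep each analogy attached to its section instead of holding two separate lists, one possible approach is sketched below. The function name `parse_analogy_sections` is hypothetical, and the sample lines are written in the same layout as the file purely for illustration.

```python
def parse_analogy_sections(lines):
    """Group four-word analogy lines under the most recent ': section' header."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            current = line[1:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line.lower().split())
    return sections

# A few sample lines written in the same layout as the file (illustrative only).
sample = [
    ': capital-common-countries',
    'Athens Greece Baghdad Iraq',
    'Athens Greece Bangkok Thailand',
    ': family',
    'boy girl brother sister',
]
sections = parse_analogy_sections(sample)
print({name: len(rows) for name, rows in sections.items()})
# {'capital-common-countries': 2, 'family': 1}
```

Passing the open file object in place of `sample` should work the same way, since the function only iterates over lines.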

And then we can test to see if the trained embeddings can reproduce the expected relationships.

All of the lines are set up as 

`x2 - x1 + x3 = x4`

In [None]:
for p in phrases[:10]:
    print(glove_model.most_similar(positive=[p[1], p[2]], negative=[p[0]], topn=1), p[3])

We can run this test over the whole dataset using the `accuracy` function of a word2vec model. It will automatically separate the test set by section.

In [None]:
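# gensim < 4.0; gensim 4.0 and later replaces this with evaluate_word_analogies, which returns (score, sections)
results = glove_model.accuracy('../data/questions-words.txt')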

In [None]:
for h, section in zip(headers, results):
    print(len(section['correct'])/(len(section['correct']) + len(section['incorrect'])), h)

To test a non-standard embedding, you would need to establish these analogies yourself. Given the number of basic parameters you can tweak (including the window size and training speed), there is a distinct need to test the goodness of fit.

# Expanding beyond words

Moving from a word to a sentence is a more complex task than it may appear, since the representation must also encode the structure of the words within the sentence.

Using word2vec, we could create the average vector over all words in a sentence and then search for similar average vectors from other sentences. However, there is an extension of `word2vec`, `doc2vec`, that is included in gensim and works much better at the phrase, paragraph, and document level.
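The averaging approach, and its main weakness, can be seen in a few lines. This sketch uses hypothetical 2-D word vectors made up for the example; `average_vector` is an illustrative helper, not a gensim function.

```python
def average_vector(tokens, vectors):
    """Mean of the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError('no token in the sentence has a vector')
    return [sum(col) / len(known) for col in zip(*known)]

# Hypothetical 2-D word vectors, made up for illustration.
vectors = {
    'dog':   [1.0, 0.0],
    'bites': [0.0, 1.0],
    'man':   [1.0, 1.0],
}

print(average_vector(['dog', 'bites', 'man'], vectors))
print(average_vector(['man', 'bites', 'dog'], vectors))
```

Both sentences produce the same average even though they mean different things, which is the loss of word structure described above.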