In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords

## Intro to word2vec

The most common unsupervised neural network approach for NLP is word2vec, a shallow neural network model that converts words to vectors using a distributed representation: each word is represented by many neurons, and each neuron is involved in representing many words. At the highest level of abstraction, word2vec starts by assigning a vector of random values to each word. For a word W, it looks at the words that are near W in the sentence and shifts the values in the word vectors so that the vectors for words near W move closer to the W vector, and the vectors for words not near W move farther away from it. With a large enough corpus, this eventually results in words that often appear together having vectors that are near one another, and words that rarely or never appear together having vectors that are far apart. Then, using the vectors, a similarity score can be computed for each pair of words by taking the cosine of the angle between their vectors.
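To make the "cosine of the vectors" idea concrete, here is a minimal sketch in plain Python. The three-dimensional `toy_vectors` are invented for illustration (real word2vec vectors have hundreds of dimensions and are learned from a corpus), and `cosine_similarity` is a helper name made up for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, hand-picked so the first two point the same way.
toy_vectors = {
    "silverware": [0.9, 0.1, 0.3],
    "utensils": [0.8, 0.2, 0.4],
    "hedge": [-0.2, 0.9, -0.5],
}

close_pair = cosine_similarity(toy_vectors["silverware"], toy_vectors["utensils"])
far_pair = cosine_similarity(toy_vectors["silverware"], toy_vectors["hedge"])
print(close_pair, far_pair)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score near 0 or below.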

This may sound quite similar to the Latent Semantic Analysis approach you just learned.  The conceptual difference is that LSA creates vector representations of sentences based on the words in them, while word2vec creates representations of individual words, based on the words around them.


## What is it good for?

Word2vec is useful any time computers need to parse requests written by humans. The problem with human communication is that there are so many different ways to communicate the same concept. It's easy for us, as humans, to know that "the silverware" and "the utensils" can refer to the same thing. Computers can't do that unless we teach them, and this can be a real chokepoint for human/computer interactions. If you've ever played a text adventure game (think _Colossal Cave Adventure_ or _Zork_), you may have encountered the following scenario:


GAME: You are on a forest path north of the field. A cave leads into a granite butte to the north.
A thick hedge blocks the way to the west.
A hefty stick lies on the ground.

YOU: pick up stick  

GAME: You don't know how to do that.  

YOU: lift stick  

GAME: You don't know how to do that.  

YOU: take stick  

GAME: You don't know how to do that.  

YOU: grab stick  

GAME: You grab the stick from the ground and put it in your bag.  



And your brain explodes from frustration. A text adventure game that incorporates a properly trained word2vec model would have vectors for "pick up", "lift", and "take" that are close to the vector for "grab" and therefore could accept those other verbs as synonyms so you could move ahead faster. In more practical applications, word2vec and other similar algorithms are what help a search engine return the best results for your query and not just the ones that contain the exact words you used. In fact, search is a better example, because not only does the search engine need to understand your request, it also needs to match it to web pages that were _also written by humans_ and therefore _also use idiosyncratic language_.

Humans, man.  

So how does it work?

## Generating vectors: Multiple algorithms

In considering the relationship between a word and its surrounding words, word2vec has two options that are the inverse of one another:

 * _Continuous Bag of Words_ (CBOW): the identity of a word is predicted using the words near it in a sentence.
 * _Skip-gram_: the identities of nearby words are predicted from the word they surround. Skip-gram seems to work better for larger corpora.

For the sentence "Terry Gilliam is a better comedian than a director", if we focus on the word "comedian" then CBOW will try to predict "comedian" using "is", "a", "better", "than", "a", and "director".  Skip-gram will try to predict "is", "a", "better", "than", "a", and "director" using the word "comedian". In practice, for CBOW the vector for "comedian" will be pulled closer to the other words, while for skip-gram the vectors for the other words will be pulled closer to "comedian".  
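The difference between the two setups is easiest to see by listing the training examples each one would build from that sentence. This is only a sketch of the bookkeeping, not of the training itself; `context_words` is a hypothetical helper, and the window of 3 is chosen so the context matches the six words listed above.

```python
def context_words(tokens, center_index, window):
    """Return the words within `window` positions of the center word."""
    start = max(0, center_index - window)
    left = tokens[start:center_index]
    right = tokens[center_index + 1:center_index + 1 + window]
    return left + right

tokens = "terry gilliam is a better comedian than a director".split()
center_index = tokens.index("comedian")
context = context_words(tokens, center_index, window=3)

# CBOW: a single example, many context words in, one target word out.
cbow_example = (context, tokens[center_index])

# Skip-gram: one example per context word, one word in, one word out.
skipgram_examples = [(tokens[center_index], word) for word in context]

print(cbow_example)
print(skipgram_examples)
```

CBOW gets one example per position in the text, while skip-gram gets one per (center, neighbor) pair, which is part of why skip-gram does more work on the same corpus.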

In addition to moving the vectors for nearby words closer together, each time a word is processed some vectors are moved farther away. Word2vec has two approaches to "pushing" vectors apart:
 
 * _Negative sampling_: Like it says on the tin, each time a word is pulled toward some neighbors, the vectors for a randomly chosen small set of other words are pushed away.
 * _Hierarchical softmax_: Every neighboring word is pulled closer to or pushed farther from a subset of words chosen based on a tree of probabilities.
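Here is a rough sketch of what one negative-sampling update could look like, assuming a logistic loss and plain gradient descent. It simplifies a lot: real word2vec keeps separate input and output vectors for every word and draws the negative words at random from the vocabulary, whereas here the three vectors and the learning rate `lr` are hand-picked, and `negative_sampling_step` is a name invented for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_step(center, neighbor, negatives, lr=0.1):
    """One gradient step: pull `neighbor` toward `center`, push `negatives` away.

    All vectors are lists of floats and are updated in place.
    """
    center_grad = [0.0] * len(center)
    # Label 1.0 marks a real neighbor, 0.0 marks a sampled non-neighbor.
    pairs = [(neighbor, 1.0)] + [(neg, 0.0) for neg in negatives]
    for vec, label in pairs:
        error = sigmoid(dot(center, vec)) - label
        for i in range(len(center)):
            center_grad[i] += error * vec[i]   # uses the old value of vec[i]
            vec[i] -= lr * error * center[i]
    for i in range(len(center)):
        center[i] -= lr * center_grad[i]

center = [0.5, 0.1]    # vector for the word being processed
neighbor = [0.1, 0.4]  # vector for a word seen near it
negative = [0.3, 0.2]  # vector for a sampled word not seen near it

before = (dot(center, neighbor), dot(center, negative))
negative_sampling_step(center, neighbor, [negative])
after = (dot(center, neighbor), dot(center, negative))
print(before, after)
```

After the step, the dot product with the neighbor goes up and the dot product with the negative sample goes down, which is the "pull" and "push" described above.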

## What is similarity? Word2vec strengths and weaknesses

Keep in mind that word2vec operates on the assumption that frequent proximity indicates similarity, but words can be "similar" in various ways. They may be conceptually similar ("royal", "king", and "throne"), but they may also be functionally similar ("tremendous" and "negligible" are both common modifiers of "size"). Here is a more detailed exploration, [with examples](https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/), of what "similarity" means in word2vec.

One cool thing about word2vec is that it can identify similarities between words _that never occur near one another in the corpus_. For example, consider these sentences:

"The dog played with an elastic ball."
"Babies prefer the ball that is bouncy."
"I wanted to find a ball that's elastic."
"Tracy threw a bouncy ball."

"Elastic" and "bouncy" are similar in meaning in the text but don't appear in the same sentence. However, both appear near "ball". In the process of nudging the vectors around so that "elastic" and "bouncy" are both near the vector for "ball", the words also become nearer to one another and their similarity can be detected.

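You can check the "never in the same sentence, but sharing neighbors" claim directly on the four example sentences. This sketch just collects sentence-level neighbors with sets; it is not how word2vec represents context internally, and `neighbors_of` and `toy_sentences` are names made up for the illustration.

```python
toy_sentences = [
    "the dog played with an elastic ball",
    "babies prefer the ball that is bouncy",
    "i wanted to find a ball that's elastic",
    "tracy threw a bouncy ball",
]

def neighbors_of(word, sentences):
    """Collect every other word that shares a sentence with `word`."""
    found = set()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            found.update(t for t in tokens if t != word)
    return found

elastic_context = neighbors_of("elastic", toy_sentences)
bouncy_context = neighbors_of("bouncy", toy_sentences)

# Words that appear alongside both "elastic" and "bouncy".
shared = elastic_context & bouncy_context
print(shared)
```

Once stop words like "the" and "a" are filtered out, "ball" is the only neighbor the two words share, and that shared neighbor is what drags their vectors together.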
For a while after it was introduced, [no one was really sure why word2vec worked as well as it did](https://arxiv.org/pdf/1402.3722v1.pdf) (see last paragraph of the linked paper). A few years later, some additional math was developed to explain word2vec and similar models. If you are comfortable with both math and "academese", have a lot of time on your hands, and want to take a deep dive into the inner workings of word2vec, [check out this paper](https://arxiv.org/pdf/1502.03520v7.pdf) from 2016.  

One of the draws of word2vec when it first came out was that the vectors could be used to convert analogies ("king" is to "queen" as "man" is to "woman", for example) into mathematical expressions ("king" + "woman" - "man" = ?) and solve for the missing element ("queen"). This is kinda nifty.
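The arithmetic can be sketched with hand-made two-dimensional vectors where one axis loosely stands for "royalty" and the other for "gender". These `toy_vectors` are invented so the analogy works out cleanly, and `solve_analogy` is a hypothetical helper; gensim's `most_similar` works along similar lines but, among other differences, normalizes the vectors first.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "throne": [0.8, 0.0],
}

def solve_analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Skip the query words themselves, otherwise they tend to win.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = solve_analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(answer)
```

With these hand-picked values the answer is "queen". On vectors learned from a real corpus the result depends heavily on how much data the model saw.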

A drawback of word2vec is that it works best with a corpus that is at least several billion words long. Even though the word2vec algorithm is speedy, this is a lot of data and takes a long time! Our example dataset is far smaller (the cleaning step below truncates it to 900,000 characters), which allows us to run it in the notebook without overwhelming the kernel, but probably won't give great results. Still, let's try it!

There are a few word2vec implementations in Python, but the general consensus is that the easiest one to use is in [gensim](https://radimrehurek.com/gensim/models/word2vec.html). Now is a good time to `pip install gensim` if you don't have it yet.




In [2]:
# Utility function to clean text.
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
    
    # Get rid of headings in square brackets (raw string avoids invalid escapes).
    text = re.sub(r'\[.*?\]', '', text)
    
    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+','',text)
    
    # Get rid of extra whitespace.
    text = ' '.join(text.split())
    
    # Keep only the first 900,000 characters so parsing stays manageable.
    return text[0:900000]


# Import three Austen novels from the Project Gutenberg corpus.
austen = ""
for novel in ['persuasion','emma','sense']:
    work = gutenberg.raw('austen-' + novel + '.txt')
    austen = austen + work

# Clean the data.
austen_clean = text_cleaner(austen)

In [4]:
# Parse the data. This can take some time.
# Note: 'en' is the spaCy 2.x shortcut; on spaCy 3+ load 'en_core_web_sm' instead.
nlp = spacy.load('en')
austen_doc = nlp(austen_clean)

In [5]:
# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
sentences = []
for sentence in austen_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)


print(sentences[20])
# len(austen_clean) counts characters in the cleaned text, not tokens.
print('We have {} sentences and {} characters.'.format(len(sentences), len(austen_clean)))

['lady', 'russell', 'steady', 'age', 'character', 'extremely', 'provide', 'thought', 'second', 'marriage', 'need', 'apology', 'public', 'apt', 'unreasonably', 'discontent', 'woman', 'marry', 'sir', 'walter', 'continue', 'singleness', 'require', 'explanation']
We have 9298 sentences and 900000 characters.


In [6]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length (renamed vector_size in gensim 4).
    hs=1           # Use hierarchical softmax.
)

print('done!')

  "C extension not loaded, training will be slow. "


done!


In [7]:
# List of words in model (gensim 3.x; gensim 4 uses model.wv.key_to_index).
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is the cosine between the two word vectors: values near 1 mean
# very similar, values near 0 mean unrelated (it can also be negative).
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()))

[('benwick', 0.9529677629470825), ('musgrove', 0.9454778432846069), ('goddard', 0.941842257976532), ('harville', 0.9345360994338989), ('clay', 0.9270812273025513), ('wentworth', 0.9134777784347534), ('charles', 0.8856989145278931), ('weston', 0.8675721883773804), ('colonel', 0.863426685333252), ('croft', 0.8356888294219971)]
0.9353262




marriage


Clearly this model is not great – while some words given above might possibly fill in the analogy woman:lady::man:?, most answers likely make little sense. You'll notice as well that re-running the model likely gives you different results, indicating random chance plays a large role here.

We do, however, get a nice result on "marriage" being dissimilar to "breakfast", "lunch", and "dinner". 
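If you are curious how an "odd one out" can be picked from vectors alone, one simple approach is to average the (length-normalized) vectors of the group and return the word least similar to that average. The sketch below assumes that approach and uses invented two-dimensional `toy_vectors`; `odd_one_out` is a made-up name, not a gensim function.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group's mean direction."""
    unit_vectors = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(unit_vectors.values())))
    mean = unit([sum(v[i] for v in unit_vectors.values()) / len(unit_vectors)
                 for i in range(dims)])
    # For unit vectors, the dot product is the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(unit_vectors[w], mean)))

# Hypothetical vectors: the three meals point one way, "marriage" another.
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "marriage": [0.1, 0.9],
}

result = odd_one_out("breakfast marriage dinner lunch".split(), toy_vectors)
print(result)
```

Because three of the four words point in roughly the same direction, the average lands near them and "marriage" ends up the farthest away.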

## Drill 0

Take a few minutes to modify the hyperparameters of this model and see how its answers change. Can you wrangle any improvements?

In [8]:
# Let's reduce a few things
model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=8,  # Minimum word count threshold.
    window=5,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=200,      # Word vector length.
    hs=1           # Use hierarchical softmax.
)

print('done!')

  "C extension not loaded, training will be slow. "


done!


In [9]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))
print(model.wv.most_similar(positive=['lady', 'woman'], negative=['man']))

# Similarity is the cosine between the two word vectors: values near 1 mean
# very similar, values near 0 mean unrelated (it can also be negative).
print(model.wv.similarity('mr', 'mrs'))


[('goddard', 0.890511691570282), ('hall', 0.852965235710144), ('croft', 0.8232969045639038), ('weston', 0.8204270005226135), ('smith', 0.8192675113677979), ('clay', 0.8150319457054138), ('musgrove', 0.8028014898300171), ('cole', 0.7975066304206848), ('capable', 0.7952784299850464), ('bates', 0.7942761182785034)]
[('goddard', 0.8989530801773071), ('dalrymple', 0.892135739326477), ('clay', 0.8681119680404663), ('removal', 0.859772801399231), ('wallis', 0.852104902267456), ('hall', 0.8512575626373291), ('elizabeth', 0.847594141960144), ('stir', 0.8447807431221008), ('self', 0.8437947630882263), ('remove', 0.8425743579864502)]
0.6601817


In [10]:
# That was a poor result
# Let's increase a few things this time
model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=20,  # Minimum word count threshold.
    window=7,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=500,      # Word vector length.
    hs=1           # Use hierarchical softmax.
)

print('done!')

  "C extension not loaded, training will be slow. "


done!


In [11]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))
print(model.wv.most_similar(positive=['lady', 'woman'], negative=['man']))

# Similarity is the cosine between the two word vectors: values near 1 mean
# very similar, values near 0 mean unrelated (it can also be negative).
print(model.wv.similarity('mr', 'mrs'))


[('benwick', 0.9614158272743225), ('clay', 0.9547545909881592), ('musgrove', 0.9520049095153809), ('charles', 0.9471226930618286), ('colonel', 0.946020245552063), ('goddard', 0.9451919794082642), ('smith', 0.944114089012146), ('harville', 0.9383304119110107), ('croft', 0.9375532865524292), ('room', 0.937420129776001)]
[('highly', 0.9715672731399536), ('door', 0.9699399471282959), ('bates', 0.9697773456573486), ('tell', 0.9693322777748108), ('nurse', 0.9682864546775818), ('quit', 0.9680571556091309), ('yes', 0.9680072069168091), ('forward', 0.9677501320838928), ('married', 0.9674652814865112), ('manner', 0.9672720432281494)]
0.9368215


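So much better on one measure: the 'mr'/'mrs' similarity climbs back to about 0.94 with the longer vectors and higher word count threshold. The analogy results are still mostly surnames and unrelated words, though, so the improvement is modest at best.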


## Example word2vec applications

You can use the vectors from word2vec as features in other models, or try to gain insight from the vector compositions themselves.
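One common and simple way to turn word vectors into features for another model is to average the vectors of the words in a document. The sketch below assumes that approach; `document_vector` is a hypothetical helper and the two-dimensional `toy_vectors` stand in for a trained model's vectors.

```python
def document_vector(tokens, vectors):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token had a vector
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

# Hypothetical word vectors standing in for a trained model.
toy_vectors = {"lady": [0.2, 0.8], "marry": [0.4, 0.6], "dinner": [0.9, 0.1]}

# Words missing from the vocabulary are skipped.
features = document_vector(["lady", "marry", "unknownword"], toy_vectors)
print(features)
```

The resulting fixed-length list can be fed to a classifier or clustering model like any other numeric feature row. Averaging throws away word order, so treat it as a baseline rather than the last word on the subject.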

Here are some neat things people have done with word2vec:

 * [Visualizing word embeddings in Jane Austen's Pride and Prejudice](http://blogger.ghostweather.com/2014/11/visualizing-word-embeddings-in-pride.html). Skip to the bottom to see a _truly honest_ account of this data scientist's process.

 * [Tracking changes in Dutch Newspapers' associations with words like 'propaganda' and 'alien' from 1950 to 1990](https://www.slideshare.net/MelvinWevers/concepts-through-time-tracing-concepts-in-dutch-newspaper-discourse-using-sequential-word-vector-spaces).

 * [Helping customers find clothing items similar to a given item but differing on one or more characteristics](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/).

## Drill 1: Word2Vec on 100B+ words

As we mentioned, word2vec really works best on a big corpus, but it can take half a day to clean such a corpus and run word2vec on it.  Fortunately, there are word2vec models available that have already been trained on _really_ big corpora. They are big files, but you can download a [pretrained model of your choice here](https://github.com/3Top/word2vec-api). At minimum, the ones built with word2vec (check the "Architecture" column) should load smoothly using an appropriately modified version of the code below, and you can play to your heart's content.

Because the models are so large, however, you may run into memory problems or crash the kernel. If you can't get a pretrained model to run locally, check out this [interactive web app of the Google News model](https://rare-technologies.com/word2vec-tutorial/#bonus_app) instead.

However you access it, play around with a pretrained model. Is there anything interesting you're able to pull out about analogies, similar words, or words that don't match? Write up a quick note about your tinkering and discuss it with your mentor during your next session.

In [12]:
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [18]:
# List of words in model. `model` is already a KeyedVectors object, so no `.wv` is needed.
vocab = model.vocab.keys()

print(model.most_similar(positive=['lady', 'man'], negative=['woman']))
print(model.most_similar(positive=['lady', 'woman'], negative=['man']))

# Similarity is the cosine between the two word vectors: values near 1 mean
# very similar, values near 0 mean unrelated (it can also be negative).
print(model.similarity('mr', 'mrs'))


  


[('fella', 0.6031545400619507), ('gentleman', 0.5849649906158447), ('chap', 0.5543248653411865), ('gent', 0.543907880783081), ('guy', 0.5265033841133118), ('lad', 0.5139425992965698), ('feller', 0.5072450041770935), ('bloke', 0.49030160903930664), ('rascal', 0.4873698949813843), ('ladies', 0.47617611289024353)]
[('she', 0.6053774952888489), ('her', 0.561516284942627), ('ladies', 0.5504916906356812), ('actress', 0.5367344617843628), ('Bernadette_Chirac', 0.527031660079956), ('lady_Laura', 0.5055365562438965), ('businesswoman', 0.5041493773460388), ('herself', 0.5009844899177551), ('housewife', 0.49794477224349976), ('beauty_queen', 0.4962754249572754)]
0.66098833


In [21]:
print(model.most_similar(positive=['good', 'bad']))

[('terrible', 0.6704098582267761), ('lousy', 0.6693953275680542), ('horrible', 0.6417661309242249), ('great', 0.6051317453384399), ('decent', 0.5867295265197754), ('nice', 0.5843954086303711), ('Bad', 0.5827116370201111), ('terrific', 0.5770761966705322), ('crummy', 0.5693639516830444), ('tough', 0.567923903465271)]


In [25]:
print(model.most_similar(positive=['good', 'helpful', 'outstanding']))
print(model.most_similar(positive=['good', 'helpful', 'outstanding'], negative=['evil']))
print(model.most_similar(positive=['good', 'helpful', 'outstanding'], negative=['bad']))

[('excellent', 0.6911307573318481), ('terrific', 0.6883692741394043), ('great', 0.6443691253662109), ('fantastic', 0.6145073175430298), ('nice', 0.6141899824142456), ('wonderful', 0.6014581918716431), ('useful', 0.5964241027832031), ('beneficial', 0.5943193435668945), ('oustanding', 0.5792140960693359), ('important', 0.5531494617462158)]
[('excellent', 0.6342078447341919), ('terrific', 0.6201268434524536), ('nice', 0.5538926124572754), ('oustanding', 0.5491364002227783), ('great', 0.5491347908973694), ('useful', 0.5406490564346313), ('beneficial', 0.5310587882995605), ('fantastic', 0.5294796228408813), ('invaluable', 0.5175262689590454), ('valuable', 0.49921607971191406)]
[('excellent', 0.6606455445289612), ('terrific', 0.6122527122497559), ('invaluable', 0.6005567312240601), ('oustanding', 0.5828166007995605), ('useful', 0.5809886455535889), ('beneficial', 0.5546332597732544), ('great', 0.5533972382545471), ('wonderful', 0.5523094534873962), ('fantastic', 0.5364223718643188), ('except

In [24]:
print(model.similarity('good', 'bad'))
print(model.similarity('good', 'poor'))
print(model.similarity('pass', 'fail'))

0.7190051
0.4598992
0.20800292
