In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg, stopwords

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\nagad\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


## Intro to word2vec

The most common unsupervised neural network approach for NLP is word2vec, a shallow neural network model for converting words to vectors using distributed representation: each word is represented by many neurons, and each neuron is involved in representing many words. At the highest level of abstraction, word2vec starts by assigning a vector of random values to each word. For a word W, it looks at the words that are near W in the sentence and shifts the values in the word vectors so that the vectors for words near W move closer to the W vector, and the vectors for words not near W move farther away from it. With a large enough corpus, this eventually results in words that often appear together having vectors that are near one another, and words that rarely or never appear together having vectors that are far apart. Then, using the vectors, a similarity score can be computed for each pair of words by taking the cosine of the angle between their vectors.

This may sound quite similar to the Latent Semantic Analysis approach you just learned.  The conceptual difference is that LSA creates vector representations of sentences based on the words in them, while word2vec creates representations of individual words, based on the words around them.
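As a minimal sketch of the similarity score described above, cosine similarity can be written in a few lines of plain Python. The function name and the two-dimensional toy vectors are illustrative assumptions; real word vectors have hundreds of dimensions, and gensim computes this for you.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (made up for illustration only).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1

print(same_direction, orthogonal, opposite)
```

Because only the angle matters, a vector and a scaled copy of it count as identical, which is why the score ignores vector length.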

## What is it good for?

Word2vec is useful for any time when computers need to parse requests written by humans. The problem with human communication is that there are so many different ways to communicate the same concept. It's easy for us, as humans, to know that "the silverware" and "the utensils" can refer to the same thing. Computers can't do that unless we teach them, and this can be a real chokepoint for human/computer interactions. If you've ever played a text adventure game (think _Colossal Cave Adventure_ or _Zork_), you may have encountered the following scenario:

And your brain explodes from frustration. A text adventure game that incorporates a properly trained word2vec model would have vectors for "pick up", "lift", and "take" that are close to the vector for "grab" and therefore could accept those other verbs as synonyms so you could move ahead faster. In more practical applications, word2vec and other similar algorithms are what help a search engine return the best results for your query and not just the ones that contain the exact words you used. In fact, search is a better example, because not only does the search engine need to understand your request, it also needs to match it to web pages that were _also written by humans_ and therefore _also use idiosyncratic language_.

Humans, man.  

So how does it work?

## Generating vectors: Multiple algorithms

In considering the relationship between a word and its surrounding words, word2vec has two options that are the inverse of one another:

 * _Continuous Bag of Words_ (CBOW): the identity of a word is predicted using the words near it in a sentence.
 * _Skip-gram_: the identities of the surrounding words are predicted from the word they surround. Skip-gram seems to work better for larger corpora.

For the sentence "Terry Gilliam is a better comedian than a director", if we focus on the word "comedian" then CBOW will try to predict "comedian" using "is", "a", "better", "than", "a", and "director".  Skip-gram will try to predict "is", "a", "better", "than", "a", and "director" using the word "comedian". In practice, for CBOW the vector for "comedian" will be pulled closer to the other words, while for skip-gram the vectors for the other words will be pulled closer to "comedian".  
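The "comedian" example can be sketched as follows. This only shows how the training examples are framed for each option, not how the network is trained; the helper name `context_window` and the window size of 3 are assumptions chosen to reproduce the six context words listed above.

```python
def context_window(tokens, index, window):
    """Return the words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


tokens = "Terry Gilliam is a better comedian than a director".split()
target_index = tokens.index("comedian")
context = context_window(tokens, target_index, window=3)

# CBOW: one example, many context words in, one target word out.
cbow_example = (context, tokens[target_index])

# Skip-gram: one example per context word, target word in, context word out.
skipgram_examples = [(tokens[target_index], word) for word in context]

print(cbow_example)
print(skipgram_examples)
```

The same sentence therefore yields one CBOW example but six skip-gram examples for this target word.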

In addition to moving the vectors for nearby words closer together, each time a word is processed some vectors are moved farther away. Word2vec has two approaches to "pushing" vectors apart:
 
 * _Negative sampling_: Like it says on the tin, each time a word is pulled toward some neighbors, the vectors for a randomly chosen small set of other words are pushed away.
 * _Hierarchical softmax_: Every neighboring word is pulled closer or farther from a subset of words chosen based on a tree of probabilities.
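To make the "pull together, push apart" idea concrete, here is a heavily simplified sketch of a single negative-sampling update in plain Python. It is not gensim's implementation: it assumes one vector per word (word2vec keeps separate input and output vectors), draws negatives uniformly (word2vec uses a frequency-based distribution), and the toy vocabulary, dimension, and learning rate are made up.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss(center, positive, negatives):
    """Negative-sampling loss: low when center is near positive and far from negatives."""
    total = -math.log(sigmoid(dot(center, positive)))
    for neg in negatives:
        total -= math.log(sigmoid(-dot(center, neg)))
    return total


def sgd_step(center, positive, negatives, lr=0.1):
    """One gradient step; returns updated copies of all vectors."""
    g_pos = sigmoid(dot(center, positive)) - 1.0               # < 0: pulls the pair together
    g_negs = [sigmoid(dot(center, neg)) for neg in negatives]  # > 0: pushes the pairs apart

    grad_center = [g_pos * p for p in positive]
    for g, neg in zip(g_negs, negatives):
        grad_center = [gc + g * n for gc, n in zip(grad_center, neg)]

    new_positive = [p - lr * g_pos * c for p, c in zip(positive, center)]
    new_negatives = [[n - lr * g * c for n, c in zip(neg, center)]
                     for g, neg in zip(g_negs, negatives)]
    new_center = [c - lr * gc for c, gc in zip(center, grad_center)]
    return new_center, new_positive, new_negatives


rng = random.Random(0)
vocab = ["comedian", "better", "director", "ball", "throne", "breakfast", "sofa"]
vectors = {word: [rng.uniform(-0.5, 0.5) for _ in range(8)] for word in vocab}

center_word, context_word = "comedian", "better"
candidates = [w for w in vocab if w not in (center_word, context_word)]
negative_words = rng.sample(candidates, 3)  # the randomly chosen "other words"

center = vectors[center_word]
positive = vectors[context_word]
negatives = [vectors[w] for w in negative_words]

loss_before = loss(center, positive, negatives)
center, positive, negatives = sgd_step(center, positive, negatives)
loss_after = loss(center, positive, negatives)

print(loss_before, loss_after)
```

Only the handful of sampled words are touched per step, which is what makes negative sampling much cheaper than updating every word in the vocabulary.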

## What is similarity? Word2vec strengths and weaknesses

Keep in mind that word2vec operates on the assumption that frequent proximity indicates similarity, but words can be "similar" in various ways. They may be conceptually similar ("royal", "king", and "throne"), but they may also be functionally similar ("tremendous" and "negligible" are both common modifiers of "size"). Here is a more detailed exploration, [with examples](https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/), of what "similarity" means in word2vec.

One cool thing about word2vec is that it can identify similarities between words _that never occur near one another in the corpus_. For example, consider these sentences:

"The dog played with an elastic ball."
"Babies prefer the ball that is bouncy."
"I wanted to find a ball that's elastic."
"Tracy threw a bouncy ball."

"Elastic" and "bouncy" are similar in meaning in the text but don't appear in the same sentence. However, both appear near "ball". In the process of nudging the vectors around so that "elastic" and "bouncy" are both near the vector for "ball", the words also become nearer to one another and their similarity can be detected.
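A small check of the four sentences above shows the pattern word2vec exploits: "elastic" and "bouncy" never share a sentence, yet they share a neighbor. This sketch only counts sentence-level neighbors; it does not train any vectors, and the tiny stop-word list is an assumption made for this example.

```python
from collections import defaultdict

sentences = [
    "The dog played with an elastic ball.",
    "Babies prefer the ball that is bouncy.",
    "I wanted to find a ball that's elastic.",
    "Tracy threw a bouncy ball.",
]

# Hand-picked stop words for this toy example only.
stop_words = {"the", "a", "an", "with", "that", "that's", "is", "to", "i"}

# Map each word to the set of words it shares a sentence with.
neighbors = defaultdict(set)
for sentence in sentences:
    tokens = [t for t in sentence.lower().rstrip(".").split() if t not in stop_words]
    for word in tokens:
        neighbors[word].update(t for t in tokens if t != word)

co_occur = "bouncy" in neighbors["elastic"]
shared = neighbors["elastic"] & neighbors["bouncy"]

print(co_occur)
print(shared)
```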

For a while after it was introduced, [no one was really sure why word2vec worked as well as it did](https://arxiv.org/pdf/1402.3722v1.pdf) (see last paragraph of the linked paper). A few years later, some additional math was developed to explain word2vec and similar models. If you are comfortable with both math and "academese", have a lot of time on your hands, and want to take a deep dive into the inner workings of word2vec, [check out this paper](https://arxiv.org/pdf/1502.03520v7.pdf) from 2016.  

One of the draws of word2vec when it first came out was that the vectors could be used to convert analogies ("king" is to "queen" as "man" is to "woman", for example) into mathematical expressions ("king" + "woman" - "man" = ?) and solve for the missing element ("queen"). This is kinda nifty.
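The analogy arithmetic can be illustrated with hand-made vectors. Everything here is hypothetical: the three dimensions (royalty, gender, fruit) and their values are invented so the arithmetic is easy to follow, and the nearest word is found by Euclidean distance for simplicity, whereas gensim's `most_similar` ranks by cosine similarity.

```python
import math

# Hypothetical toy vectors: (royalty, gender, fruit).
vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def analogy(positive, negative, vectors):
    """Return the word nearest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))


answer = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(answer)
```

Subtracting "man" removes the gender component of "king" and adding "woman" puts the opposite one back, leaving a vector that still carries the royalty component.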

A drawback of word2vec is that it works best with a corpus that is at least several billion words long. Even though the word2vec algorithm is speedy, this is a lot of data and takes a long time! Our example dataset is only two million words long, which allows us to run it in the notebook without overwhelming the kernel, but probably won't give great results. Still, let's try it!

There are a few word2vec implementations in Python, but the general consensus is that the easiest one to use is in [gensim](https://radimrehurek.com/gensim/models/word2vec.html). Now is a good time to `pip install gensim` if you don't have it yet.

In [2]:
# Utility function to clean text.
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
    
    # Get rid of headings in square brackets.
    # A raw string avoids invalid-escape warnings for the backslashes.
    text = re.sub(r"\[.*?\]", "", text)
    
    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+','',text)
    
    # Get rid of extra whitespace.
    text = ' '.join(text.split())
    
    return text[0:900000]


# Import all the Austen in the Project Gutenberg corpus.
austen = ""
for novel in ['persuasion','emma','sense']:
    work = gutenberg.raw('austen-' + novel + '.txt')
    austen = austen + work

# Clean the data.
austen_clean = text_cleaner(austen)

In [3]:
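# Parse the data. This can take some time.
# The 'en' shortcut works in spaCy 2.x; spaCy 3+ needs the full model
# name ('en_core_web_sm') both to download and to load.
#!python -m spacy download en
nlp = spacy.load('en')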
austen_doc = nlp(austen_clean)

[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
symbolic link created for C:\Users\nagad\Anaconda3\lib\site-packages\spacy\data\en <<===>> C:\Users\nagad\Anaconda3\lib\site-packages\en_core_web_sm
[+] Linking successful
C:\Users\nagad\Anaconda3\lib\site-packages\en_core_web_sm -->
C:\Users\nagad\Anaconda3\lib\site-packages\spacy\data\en
You can now load the model via spacy.load('en')


In [4]:
# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
sentences = []
for sentence in austen_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)


print(sentences[20])
# len(austen_clean) counts characters in the cleaned text, not tokens.
print('We have {} sentences and {} characters.'.format(len(sentences), len(austen_clean)))

['lady', 'russell', 'steady', 'age', 'character', 'extremely', 'provide', 'thought', 'second', 'marriage', 'need', 'apology', 'public', 'apt', 'unreasonably', 'discontent', 'woman', 'marry', 'sir', 'walter', 'continue', 'singleness', 'require', 'explanation']
We have 9298 sentences and 900000 characters.


In [5]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length (renamed `vector_size` in gensim 4+).
    hs=1           # Use hierarchical softmax.
)

print('done!')

  "C extension not loaded, training will be slow. "


done!


In [7]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))
print ('+++')
# Similarity is the cosine of the angle between the two word vectors:
# 1 means they point the same way, 0 means they are unrelated.
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
# Call doesnt_match on model.wv; calling it on the model itself is deprecated.
print(model.wv.doesnt_match("car train bus sister".split()))

[('benwick', 0.9454872012138367), ('goddard', 0.9453417062759399), ('musgrove', 0.9420642852783203), ('harville', 0.9413917660713196), ('charles', 0.9165124297142029), ('clay', 0.9125531315803528), ('croft', 0.9056294560432434), ('wentworth', 0.9022759199142456), ('colonel', 0.854128360748291), ('room', 0.8507972955703735)]
+++
0.92158103




sister


Clearly this model is not great – while some words given above might possibly fill in the analogy woman:lady::man:?, most answers likely make little sense. You'll notice as well that re-running the model likely gives you different results, indicating random chance plays a large role here.

We do, however, get a sensible answer from `doesnt_match`: "sister" is picked as the odd one out among "car", "train", and "bus".

## Drill 0

Take a few minutes to modify the hyperparameters of this model and see how its answers change. Can you wrangle any improvements?

In [19]:
def model_fn(model):
    """Print a fixed set of sanity checks for a trained word2vec model."""
    print('done!')

    # Analogy check: woman is to lady as man is to ...?
    print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

    # Similarity is the cosine of the angle between two word vectors:
    # 1 means they point the same way, 0 means they are unrelated.
    print('\nThe similarity score of mr. and mrs. ', model.wv.similarity('mr', 'mrs'))
    print('The similarity score of sister and brother ', model.wv.similarity('sister', 'brother'))
    print('The similarity score of sofa and forget ', model.wv.similarity('sofa', 'forget'))
    print('The similarity score of breakfast and marriage ', model.wv.similarity('breakfast', 'marriage'))
    print('The similarity score of preserve and hall ', model.wv.similarity('preserve', 'hall'))

    # Nearest neighbors of two closely related words.
    print('\nBelow are the words similar to "husband"')
    for item in model.wv.most_similar(['husband']):
        print(item)
    print('\nBelow are the words similar to "Wife"')
    for item in model.wv.most_similar(['wife']):
        print(item)

In [20]:
# Tinker with hyperparameters here.
model = word2vec.Word2Vec(
        sentences,
        workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
        min_count=10,  # Minimum word count threshold.
        window=20,      # Number of words around target word to consider.
        sg=1,          # Use skip-gram (sg=0 would select CBOW).
        sample=1e-3 ,  # Penalize frequent words.
        size=300,      # Word vector length.
        hs=1,           # Use hierarchical softmax.

)
model_fn(model)


  "C extension not loaded, training will be slow. "


done!
[('solicitude', 0.7335948944091797), ('event', 0.6837687492370605), ('removal', 0.6537907123565674), ('remove', 0.6537649035453796), ('change', 0.6516532897949219), ('endeavour', 0.6466081142425537), ('advantage', 0.6412881016731262), ('plan', 0.6308327913284302), ('compare', 0.6262389421463013), ('independence', 0.6251139640808105)]

The similarity score of mr. and mrs.  0.71513325
The similarity score of sister and brother  0.8267086
The similarity score of sofa and forget  0.6389605
The similarity score of breakfast and marriage  0.10403478
The similarity score of preserve and hall  0.4787202

Below are the words similar to "husband"
('wife', 0.9193407893180847)
('unhappy', 0.8721081018447876)
('especially', 0.8526115417480469)
('luck', 0.8504742383956909)
('want', 0.8488104343414307)
('thousand', 0.8440971374511719)
('oppose', 0.8435943722724915)
('easy', 0.8431147336959839)
('odd', 0.8384135365486145)
('anybody', 0.8360109329223633)

Below are the words similar to "Wife"
('h

In [21]:
# Tinker with hyperparameters here.
model = word2vec.Word2Vec(
        sentences,
        workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
        min_count=10,  # Minimum word count threshold.
        window=20,      # Number of words around target word to consider.
        sg=0,          # Use CBOW because our corpus is small.
        sample=1e-3 ,  # Penalize frequent words.
        size=300,      # Word vector length.
        hs=1,           # Use hierarchical softmax.

)
model_fn(model)


  "C extension not loaded, training will be slow. "


done!
[('benwick', 0.9540441036224365), ('shirley', 0.9408626556396484), ('mary', 0.9350240230560303), ('louisa', 0.9296693205833435), ('henrietta', 0.9270414710044861), ('navy', 0.9266721606254578), ('prefer', 0.9250074625015259), ('harville', 0.916487991809845), ('musgrove', 0.9151268601417542), ('sufficient', 0.9132134914398193)]

The similarity score of mr. and mrs.  0.973672
The similarity score of sister and brother  0.95952964
The similarity score of sofa and forget  0.9532353
The similarity score of breakfast and marriage  0.98132324
The similarity score of preserve and hall  0.9061974

Below are the words similar to "husband"
('chuse', 0.9985018968582153)
('complaint', 0.9980168342590332)
('sort', 0.9979889988899231)
('hear', 0.9979228377342224)
('voice', 0.997841477394104)
('relief', 0.9977630376815796)
('fit', 0.9972444176673889)
('ah', 0.9971776008605957)
('reserve', 0.9971638917922974)
('small', 0.9970738887786865)

Below are the words similar to "Wife"
('danger', 0.998863

In [22]:
# Tinker with hyperparameters here.
model = word2vec.Word2Vec(
        sentences,
        workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
        min_count=10,  # Minimum word count threshold.
        window=25,      # Number of words around target word to consider.
        sg=1,          # Use skip-gram (sg=0 would select CBOW).
        sample=1e-3 ,  # Penalize frequent words.
        size=300,      # Word vector length.
        hs=1,           # Use hierarchical softmax.

)
model_fn(model)


  "C extension not loaded, training will be slow. "


done!
[('question', 0.6951432228088379), ('meet', 0.6486757397651672), ('ask', 0.6413214206695557), ('report', 0.6333310604095459), ('removal', 0.623195469379425), ('christmas', 0.619205117225647), ('set', 0.6134544014930725), ('tired', 0.6099216938018799), ('business', 0.607487142086029), ('crofts', 0.6070114374160767)]

The similarity score of mr. and mrs.  0.72404826
The similarity score of sister and brother  0.8388392
The similarity score of sofa and forget  0.68120646
The similarity score of breakfast and marriage  0.08157115
The similarity score of preserve and hall  0.5351209

Below are the words similar to "husband"
('wife', 0.9165480136871338)
('unhappy', 0.8616642355918884)
('generally', 0.8479317426681519)
('fellow', 0.84563148021698)
('humour', 0.8435673713684082)
('preserve', 0.842993438243866)
('luck', 0.8324763774871826)
('inform', 0.8306639790534973)
('consider', 0.8305859565734863)
('capable', 0.8303912878036499)

Below are the words similar to "Wife"
('husband', 0.91

Changing the `window` parameter to 20 raised the similarity score for "mr" and "mrs" to about 0.84, up from about 0.75.

Lowering the `sample` threshold to 1e-4 pushed the "mr"/"mrs" similarity to about 0.99.

That pair alone is not a reliable yardstick, though. "Mr" and "mrs" are not all that similar, since they differ in gender, so any change needs to be checked against other word pairs as well.

As the output above shows, high similarity scores of roughly 0.89 to 0.99 also appear for unrelated pairs such as "sofa"/"forget" and "breakfast"/"marriage".

I tried different values for the other parameters and found that they made the results worse: "sofa"/"forget", "breakfast"/"marriage", and "preserve"/"hall" all came out as highly similar, which we know they are not. Those values are better left unchanged.

The other parameter worth tweaking is `sg`, which selects skip-gram when set to 1 and CBOW when set to 0. Switching to 0 (CBOW) made the scores worse: distantly related words showed high similarity.

The best settings are the ones in the last cell above, with `window=25`. I call them best because unrelated pairs such as "sofa"/"forget" and "breakfast"/"marriage" score comparatively low there, while "mr"/"mrs" and "sister"/"brother" score higher.

The words returned for "husband" and "wife" also make more sense with this final set of hyperparameters.

# Example word2vec applications

You can use the vectors from word2vec as features in other models, or try to gain insight from the vector compositions themselves.

Here are some neat things people have done with word2vec:

 * [Visualizing word embeddings in Jane Austen's Pride and Prejudice](http://blogger.ghostweather.com/2014/11/visualizing-word-embeddings-in-pride.html). Skip to the bottom to see a _truly honest_ account of this data scientist's process.

 * [Tracking changes in Dutch Newspapers' associations with words like 'propaganda' and 'alien' from 1950 to 1990](https://www.slideshare.net/MelvinWevers/concepts-through-time-tracing-concepts-in-dutch-newspaper-discourse-using-sequential-word-vector-spaces).

 * [Helping customers find clothing items similar to a given item but differing on one or more characteristics](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/).

## Drill 1: Word2Vec on 100B+ words

As we mentioned, word2vec really works best on a big corpus, but it can take half a day to clean such a corpus and run word2vec on it.  Fortunately, there are word2vec models available that have already been trained on _really_ big corpora. They are big files, but you can download a [pretrained model of your choice here](https://github.com/3Top/word2vec-api). At minimum, the ones built with word2vec (check the "Architecture" column) should load smoothly using an appropriately modified version of the code below, and you can play to your heart's content.

Because the models are so large, however, you may run into memory problems or crash the kernel. If you can't get a pretrained model to run locally, check out this [interactive web app of the Google News model](https://rare-technologies.com/word2vec-tutorial/#bonus_app) instead.

However you access it, play around with a pretrained model. Is there anything interesting you're able to pull out about analogies, similar words, or words that don't match? Write up a quick note about your tinkering and discuss it with your mentor during your next session.

In [23]:
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format ('https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True)



In [30]:
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data





In [32]:
# Play around with your pretrained model here.
word_vectors.most_similar(['husband'])

[('wife', 0.9219692945480347),
 ('mother', 0.8470973968505859),
 ('daughter', 0.8451575636863708),
 ('father', 0.8430609107017517),
 ('friend', 0.8267569541931152),
 ('son', 0.7971027493476868),
 ('brother', 0.7917457222938538),
 ('married', 0.7847107648849487),
 ('girlfriend', 0.7784738540649414),
 ('boyfriend', 0.7674957513809204)]

In [33]:
word_vectors.most_similar(['wife'])

[('daughter', 0.9242234230041504),
 ('husband', 0.9219692945480347),
 ('mother', 0.902587890625),
 ('father', 0.8440753817558289),
 ('sister', 0.8372182846069336),
 ('friend', 0.8299500942230225),
 ('married', 0.8200123310089111),
 ('niece', 0.8161543011665344),
 ('widow', 0.810448169708252),
 ('son', 0.8092067837715149)]

We can observe here that "husband" and "wife" return very different neighbors from the pretrained vectors than from the Austen model, even though both are word-embedding models queried through the same gensim interface. Note that `glove-wiki-gigaword-100` holds GloVe vectors rather than word2vec vectors.

The lesson is that an embedding model can only return the associations it learned from its training data.
Because these pretrained vectors come from a far larger and broader corpus (Wikipedia and Gigaword) than three Austen novels, the relationships they capture are more general.

In [34]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7698541283607483),
 ('monarch', 0.6843380928039551),
 ('throne', 0.6755736470222473),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6520534157752991),
 ('prince', 0.6517034769058228),
 ('elizabeth', 0.6464517712593079),
 ('mother', 0.631171703338623),
 ('emperor', 0.6106470227241516),
 ('wife', 0.6098655462265015)]

Here "queen" is ranked first, followed by words related to it. The ranking is by cosine similarity, that is, the cosine of the angle between each candidate vector and the vector produced by the arithmetic.

In [42]:
word_vectors.most_similar(positive=['woman', 'prince'], negative=['man'])

[('princess', 0.7514394521713257),
 ('daughter', 0.6816167831420898),
 ('queen', 0.6580643653869629),
 ('duchess', 0.6560284495353699),
 ('niece', 0.6351829767227173),
 ('wife', 0.632085919380188),
 ('eldest', 0.6206661462783813),
 ('married', 0.6112802028656006),
 ('cousin', 0.611089825630188),
 ('throne', 0.6054269075393677)]

I was expecting "princess" here, and that is what the model returned.

In [43]:
word_vectors.similarity('woman', 'man')

0.8323494

In [45]:
word_vectors.similarity('forget', 'sofa')

0.19185662

In [46]:
word_vectors.distance("media", "media")  # Cosine distance (1 - similarity): practically zero, i.e. identical.

5.960464477539063e-08

In [47]:
word_vectors.distance("cat", "dog")  # These two are also close to each other.

0.1201925277709961

In [67]:
word_vectors.distance("king", "stool")  # Distance near 1: the vectors are nearly orthogonal, so the words are unrelated.

1.0028903558850288

In [53]:
word_vectors.distance("king", "rich")  # "king" is somewhat closer to "rich" (compare with "poor" below).

0.6529471278190613

In [54]:
word_vectors.distance("king", "poor")  # "king" is slightly farther from "poor" than from "rich".

0.7091540098190308

In [80]:
word_vectors['python']  # The vector the model stores for the word "python".

array([ 0.24934  ,  0.68318  , -0.044711 , -1.3842   , -0.0073079,
        0.651    , -0.33958  , -0.19785  , -0.33925  ,  0.26691  ,
       -0.033062 ,  0.15915  ,  0.89547  ,  0.53999  , -0.55817  ,
        0.46245  ,  0.36722  ,  0.1889   ,  0.83189  ,  0.81421  ,
       -0.11835  , -0.53463  ,  0.24158  , -0.038864 ,  1.1907   ,
        0.79353  , -0.12308  ,  0.6642   , -0.77619  , -0.45713  ,
       -1.054    , -0.20557  , -0.13296  ,  0.12239  ,  0.88458  ,
        1.024    ,  0.32288  ,  0.82105  , -0.069367 ,  0.024211 ,
       -0.51418  ,  0.8727   ,  0.25759  ,  0.91526  , -0.64221  ,
        0.041159 , -0.60208  ,  0.54631  ,  0.66076  ,  0.19796  ,
       -1.1393   ,  0.79514  ,  0.45966  , -0.18463  , -0.64131  ,
       -0.24929  , -0.40194  , -0.50786  ,  0.80579  ,  0.53365  ,
        0.52732  ,  0.39247  , -0.29884  ,  0.009585 ,  0.99953  ,
       -0.061279 ,  0.71936  ,  0.32901  , -0.052772 ,  0.67135  ,
       -0.80251  , -0.25789  ,  0.49615  ,  0.48081  , -0.6840

These word vectors can be incorporated into a dataset as features, and a downstream model can then use them to classify sentences as similar or dissimilar, positive or negative, and so on, depending on the question being asked.

For example, one can use distances like those computed above to measure how close each word is to a "positive" keyword versus a "negative" one and then, using a threshold, classify words, sentences, or texts as related to a particular word or vector.
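One common way to turn word vectors into features for a sentence is to average the vectors of its words. The sketch below shows that idea in plain Python; the function name `average_vector` and the two-dimensional toy vectors are assumptions for illustration. With real pretrained vectors you would look words up in the loaded model instead of a hand-written dictionary.

```python
def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy vectors; out-of-vocabulary words are simply skipped.
toy_vectors = {
    "good":  [1.0, 0.0],
    "great": [0.8, 0.2],
    "bad":   [-1.0, 0.0],
}

features = average_vector(["a", "good", "great", "film"], toy_vectors)
no_features = average_vector(["unknown", "words"], toy_vectors)

print(features)
print(no_features)
```

The resulting fixed-length list can be added to a feature table as one column per dimension, regardless of how long the original sentence was.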