In [1]:
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:.4f}'.format
import re
from nltk.corpus import gutenberg, stopwords
import spacy
from spacy.lang.en import English

In [2]:
# Utility function to clean text.
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
    
    # Get rid of headings in square brackets.
    text = re.sub(r"\[.*?\]", "", text)
    
    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+','',text)
    
    # Get rid of extra whitespace.
    text = ' '.join(text.split())
    
    return text[0:900000]


# Import all the Austen in the Project Gutenberg corpus.
austen = ""
for novel in ['persuasion','emma','sense']:
    work = gutenberg.raw('austen-' + novel + '.txt')
    austen = austen + work

# Clean the data.
austen_clean = text_cleaner(austen)

In [3]:
# Parse the data. This can take some time.
nlp = spacy.load("en_core_web_sm")
austen_doc = nlp(austen_clean)

In [4]:
# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
sentences = []
for sentence in austen_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)


print(sentences[20])
# len(austen_clean) counts characters in the cleaned text, not tokens.
print('We have {} sentences and {} characters.'.format(len(sentences), len(austen_clean)))

['lady', 'russell', 'steady', 'age', 'character', 'extremely', 'provide', 'thought', 'second', 'marriage', 'need', 'apology', 'public', 'apt', 'unreasonably', 'discontent', 'woman', 'marry', 'sir', 'walter', 'continue', 'singleness', 'require', 'explanation']
We have 9299 sentences and 900000 characters.


In [5]:
import gensim
from gensim.models import word2vec
from sklearn.model_selection import ParameterGrid




Clearly this model is not great – while some words given above might possibly fill in the analogy woman:lady::man:?, most answers likely make little sense. You'll notice as well that re-running the model likely gives you different results, indicating random chance plays a large role here.

We do, however, get a nice result on "marriage" being dissimilar to "breakfast", "lunch", and "dinner". 
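
To make the analogy query woman:lady::man:? concrete, here is a minimal sketch of the vector arithmetic behind it, using hand-made two-dimensional vectors. The vectors, the `toy_vectors` dictionary, and the `analogy` helper are all assumptions for illustration; they are not taken from the trained model, and gensim's own `most_similar` differs in its details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors: axis 0 loosely encodes gender, axis 1 formality.
toy_vectors = {
    "woman": [1.0, 0.0],
    "lady": [1.0, 1.0],
    "man": [-1.0, 0.0],
    "gentleman": [-1.0, 1.0],
    "breakfast": [0.0, -1.0],
}


def analogy(a, b, c, vectors):
    """Answer a:b::c:? by ranking words by cosine similarity to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = analogy("woman", "lady", "man", toy_vectors)
print(answer)
```

With real embeddings the same idea applies, but the answer is only as good as the vectors, which is why a small corpus tends to give noisy results.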

## Drill 0

Take a few minutes to modify the hyperparameters of this model and see how its answers change. Can you wrangle any improvements?

#### Answer

I started by checking the queries above. "lunch" is not actually in `wv.vocab`: it was not a word in common usage in the era when the novels were written, so I used "dining" instead, but my results were not great.
 
I checked the words ["husband", "marriage", "wife", "breakfast"] and did get the expected result of "breakfast" being dissimilar.
 
The `most_similar` query did not make much sense to me, as I would guess that "man" and "woman" are very similar words. I changed mine, but my results were not much better.
 
The similarity result for "mr" and "mrs" was consistent across the various models and significantly better than the original.

In [24]:
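# Train one model per (window, size) combination.
# Note: gensim 4 uses `vector_size` and `wv.key_to_index` instead of `size` and `wv.vocab`.
prm = list(ParameterGrid({'window': [3, 6, 9], 'size': [100, 200, 300]}))
mdl = [word2vec.Word2Vec(sentences, size=p['size'], window=p['window']) for p in prm]
vcb = mdl[0].wv.vocab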

In [41]:
# Confirm that every query word is in the vocabulary before using it.
check = "breakfast marriage dinner lunch".split()
check1 = ['lady', 'woman', 'man']
check2 = ['mr', 'mrs']
print(all(c in vcb for c in check), all(c in vcb for c in check1), all(c in vcb for c in check2))
# "lunch" is missing, so use a different word list for doesnt_match.
check = ["husband", "marriage", "wife", "breakfast"]

False True True
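
The `False` above says that at least one word is missing but not which one. A minimal sketch of a helper that lists the absent words follows; `missing_words` and `toy_vocab` are hypothetical names, and the small dictionary only stands in for a real vocabulary such as `mdl[0].wv.vocab`.

```python
def missing_words(words, vocabulary):
    """Return the words that are absent from the vocabulary, in input order."""
    return [w for w in words if w not in vocabulary]


# Hypothetical stand-in for a trained model's vocabulary mapping.
toy_vocab = {"breakfast": 0, "marriage": 1, "dinner": 2, "lady": 3}

absent = missing_words("breakfast marriage dinner lunch".split(), toy_vocab)
print(absent)
```

Any container that supports `in` works as the `vocabulary` argument, so the same helper can be pointed at a model's vocabulary directly.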


In [47]:
# Compare the nine models on three queries.
# A, B, C: top three words from most_similar for 'lady', 'woman', 'man'.
# D: similarity between 'mr' and 'mrs'.
# E: the word doesnt_match picks out of `check`.
top3 = [[w for w, _ in m.wv.most_similar(positive=['lady', 'woman', 'man'], topn=3)] for m in mdl]
dfs = pd.DataFrame(top3, columns=list('ABC'))
dfs['D'] = [m.wv.similarity('mr', 'mrs') for m in mdl]
dfs['E'] = [m.wv.doesnt_match(check) for m in mdl]

In [48]:
dfs

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | eye | mind | evening | 0.9989 | breakfast |
| 1 | mind | eye | short | 0.9994 | breakfast |
| 2 | eye | mind | short | 0.9996 | breakfast |
| 3 | wish | mind | sort | 0.9995 | breakfast |
| 4 | short | mind | acquaintance | 0.9997 | breakfast |
| 5 | short | receive | feeling | 0.9997 | breakfast |
| 6 | perfectly | sort | wish | 0.9996 | breakfast |
| 7 | people | appear | wish | 0.9998 | breakfast |
| 8 | feeling | appear | perfectly | 0.9998 | breakfast |


## Drill 1: Word2Vec on 100B+ words

As we mentioned, word2vec really works best on a big corpus, but it can take half a day to clean such a corpus and run word2vec on it.  Fortunately, there are word2vec models available that have already been trained on _really_ big corpora. They are big files, but you can download a [pretrained model of your choice here](https://github.com/3Top/word2vec-api). At minimum, the ones built with word2vec (check the "Architecture" column) should load smoothly using an appropriately modified version of the code below, and you can play to your heart's content.

Because the models are so large, however, you may run into memory problems or crash the kernel. If you can't get a pretrained model to run locally, check out this [interactive web app of the Google News model](https://rare-technologies.com/word2vec-tutorial/#bonus_app) instead.
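
One way to reduce the memory footprint is to load only part of a pretrained file. The sketch below assumes gensim is installed and that the download is in the binary word2vec format; the `load_pretrained_vectors` helper, the `max_words` default, and the file name in the usage comment are hypothetical choices, not something prescribed by the course material.

```python
def load_pretrained_vectors(path, max_words=500_000):
    """Load word2vec-format vectors, reading at most `max_words` entries."""
    # Imported lazily so the helper can be defined without gensim installed.
    from gensim.models import KeyedVectors

    # binary=True is for .bin files; use binary=False for text-format vectors.
    # limit caps how many word vectors are read from the start of the file.
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=max_words)


# Example usage (the path is hypothetical; point it at the file you downloaded):
# vectors = load_pretrained_vectors("GoogleNews-vectors-negative300.bin")
# print(vectors.most_similar(positive=["lady", "man"], negative=["woman"], topn=3))
```

Because `limit` keeps only the entries at the start of the file, rarer words that appear later may be missing, so it is worth checking the vocabulary before querying.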

However you access it, play around with a pretrained model. Is there anything interesting you're able to pull out about analogies, similar words, or words that don't match? Write up a quick note about your tinkering and discuss it with your mentor during your next session.

I used the pretrained *interactive web app of the Google News model* and was disappointed that my name was not in the vocabulary. Not really; obviously the web app is quite "intelligent".