
In [0]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords

In [0]:
# Utility function to clean text.
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
    
    # Get rid of headings in square brackets.
    text = re.sub(r'\[.*?\]', '', text)
    
    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+','',text)
    
    # Get rid of extra whitespace.
    text = ' '.join(text.split())
    
    return text[0:900000]


# Import all the Austen in the Project Gutenberg corpus.
austen = ""
for novel in ['persuasion','emma','sense']:
    work = gutenberg.raw('austen-' + novel + '.txt')
    austen = austen + work

# Clean the data.
austen_clean = text_cleaner(austen)
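
To see what the cleaning steps do, it can help to run them on a short string. The sketch below restates the same four substitutions in a standalone function so it runs on its own; the function name `clean_sample` and the sample sentence are made up for illustration and are not part of the Gutenberg text.

```python
import re

def clean_sample(text):
    # Same steps as text_cleaner above, restated so this block is self-contained.
    text = re.sub(r'--', ' ', text)           # double dashes become spaces
    text = re.sub(r'\[.*?\]', '', text)       # drop headings in square brackets
    text = re.sub(r'Chapter \d+', '', text)   # drop chapter titles
    return ' '.join(text.split())             # collapse extra whitespace

# Hypothetical sample string, written only to exercise each rule.
sample = "[Persuasion by Jane Austen 1818] Chapter 1 Sir Walter--of Kellynch Hall,   was a man."
cleaned = clean_sample(sample)
print(cleaned)  # Sir Walter of Kellynch Hall, was a man.
```

Note that the chapter pattern only matches the literal form `Chapter` followed by digits, so headings written in another style would pass through unchanged.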

In [0]:
# Parse the data. This can take some time.
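# Note: the 'en' shortcut was removed in spaCy 3; on newer installs,
# load a named model such as 'en_core_web_sm' instead.
nlp = spacy.load('en')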
austen_doc = nlp(austen_clean)

In [0]:
# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
sentences = []
for sentence in austen_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)


print(sentences[20])
print('We have {} sentences and {} characters.'.format(len(sentences), len(austen_clean)))

['lady', 'russell', 'steady', 'age', 'character', 'extremely', 'provide', 'thought', 'second', 'marriage', 'need', 'apology', 'public', 'apt', 'unreasonably', 'discontent', 'woman', 'marry', 'sir', 'walter', 'continue', 'singleness', 'require', 'explanation']
We have 9298 sentences and 900000 characters.


In [0]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3,   # Downsample very frequent words.
    size=300,      # Word vector length (named vector_size in gensim 4+).
    hs=1           # Use hierarchical softmax.
)

print('done!')

  "C extension not loaded, training will be slow. "


done!


In [0]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is the cosine between the two word vectors, so 1 is total
# similarity and 0 is no similarity (negative values are also possible).
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()))

[('goddard', 0.9636046886444092), ('musgrove', 0.9479912519454956), ('harville', 0.926499605178833), ('clay', 0.9178221225738525), ('benwick', 0.9170636534690857), ('weston', 0.9140954613685608), ('smith', 0.9035476446151733), ('croft', 0.8997906446456909), ('charles', 0.8898350596427917), ('room', 0.8839731216430664)]
0.8924593


  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)


marriage


Clearly this model is not great. A few of the words above might plausibly complete the analogy woman:lady::man:?, but most of them make little sense. Re-running the model will also likely give you different results, which indicates that random chance plays a large role here.

We do, however, get a nice result on "marriage" being dissimilar to "breakfast", "lunch", and "dinner". 
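
The analogy and similarity queries both reduce to vector arithmetic plus cosine similarity. The sketch below shows the idea with NumPy on hand-picked 3-dimensional vectors. The vectors, the `toy_vectors` dictionary, and the helper names are all invented for illustration; gensim's own implementation differs in details such as normalization.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, -1 = opposite)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "embeddings", chosen by hand so the analogy works out.
toy_vectors = {
    'woman':     np.array([1.0, 0.0, 0.0]),
    'lady':      np.array([1.0, 1.0, 0.0]),
    'man':       np.array([0.0, 0.0, 1.0]),
    'gentleman': np.array([0.0, 1.0, 1.0]),
    'breakfast': np.array([-1.0, 0.0, 0.0]),
}

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

# woman:lady::man:? is computed as lady - woman + man.
best = analogy(positive=['lady', 'man'], negative=['woman'], vectors=toy_vectors)
print(best)  # gentleman
```

In a trained model the vectors are learned rather than chosen, so whether the arithmetic lands near a sensible word depends entirely on how well the embedding captured the corpus.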

## Drill 0

Take a few minutes to modify the hyperparameters of this model and see how its answers change. Can you wrangle any improvements?

### Default Parameters

In [0]:
# run with default parameters
model = word2vec.Word2Vec(sentences)

  "C extension not loaded, training will be slow. "


In [0]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is the cosine between the two word vectors, so 1 is total
# similarity and 0 is no similarity (negative values are also possible).
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()))

[('see', 0.9985136985778809), ('highly', 0.9984995126724243), ('character', 0.9984800219535828), ('wish', 0.9984713792800903), ('view', 0.9984613656997681), ('feel', 0.9984591007232666), ('fancy', 0.9984588027000427), ('ill', 0.9984580278396606), ('have', 0.9984573125839233), ('new', 0.9984554052352905)]
0.9991336
breakfast


### Parameter Optimization

In [0]:
# set variable ranges
min_counts = [1,5,10]
# iterate over variable ranges
for min_count in min_counts:
    print('min_count:', min_count)
    # Train with default parameters, varying only min_count.
    model = word2vec.Word2Vec(sentences, min_count=min_count)
    # List of words in model.
    vocab = model.wv.vocab.keys()
    print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))
    # Similarity is the cosine between the two word vectors, so 1 is total
    # similarity and 0 is no similarity (negative values are also possible).
    print(model.wv.similarity('mr', 'mrs'))
    # One of these things is not like the other...
    print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()), '\n')

min_count: 1


  "C extension not loaded, training will be slow. "


[('see', 0.9993274807929993), ('feel', 0.9992974400520325), ('wish', 0.9992825388908386), ('friend', 0.9992712140083313), ('acquaintance', 0.9992703199386597), ('ill', 0.999269962310791), ('live', 0.9992690682411194), ('like', 0.9992680549621582), ('doubt', 0.9992673993110657), ('have', 0.9992637038230896)]
0.9994407
breakfast 

min_count: 5


  "C extension not loaded, training will be slow. "


[('highly', 0.9989069700241089), ('have', 0.9988794922828674), ('feel', 0.9988778233528137), ('wish', 0.9988672733306885), ('live', 0.9988651275634766), ('regular', 0.9988642334938049), ('look', 0.9988552927970886), ('word', 0.9988548159599304), ('see', 0.9988521933555603), ('acquaintance', 0.9988520741462708)]
0.9990156
breakfast 

min_count: 10


  "C extension not loaded, training will be slow. "


[('highly', 0.9980338215827942), ('see', 0.997999906539917), ('acquaintance', 0.9979976415634155), ('wish', 0.9979618787765503), ('character', 0.9979562759399414), ('convince', 0.99793940782547), ('fancy', 0.9979298114776611), ('friend', 0.997927725315094), ('look', 0.9979273676872253), ('have', 0.9979252815246582)]
0.9988124
breakfast 



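As `min_count` increases, the similarity scores drop slightly (from about 0.9993 at `min_count=1` to about 0.998 at `min_count=10`), but the answers do not improve. Every run returns similarities very close to 1 for unrelated words and picks "breakfast" as the odd word out, which suggests the default settings leave the vectors barely differentiated on a corpus this small.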
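
The loop above varies only `min_count`. To explore several hyperparameters together, one option is to generate every combination up front. The sketch below does that with `itertools.product`; the helper `param_grid` and the particular value lists are assumptions made for illustration, and the training call is left as a comment because it needs gensim and the `sentences` list from earlier.

```python
from itertools import product

def param_grid(**options):
    """Yield one dict per combination of the supplied option lists."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))

# Illustrative ranges only; pick values that suit your corpus.
grid = list(param_grid(min_count=[1, 5, 10], window=[3, 6, 10], sg=[0, 1]))
print(len(grid))  # 18
print(grid[0])    # {'min_count': 1, 'window': 3, 'sg': 0}

# With the notebook's objects, each combination could be trained like this:
# for params in grid:
#     model = word2vec.Word2Vec(sentences, **params)
#     print(params, model.wv.similarity('mr', 'mrs'))
```

Because training is partly random, comparing a single run per setting can be misleading; repeating each setting a few times gives a steadier picture.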

# Example word2vec applications

You can use the vectors from word2vec as features in other models, or try to gain insight from the vector compositions themselves.
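
One common way to turn word vectors into features is to average the vectors of the words in a document. The sketch below shows this with a plain dictionary standing in for a trained model; the function name `document_vector` and the two-dimensional toy vectors are invented for illustration. With a trained gensim model, `model.wv` could be passed in place of the dictionary, since it also supports `in` checks and `[]` lookups.

```python
import numpy as np

def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical toy lookup table standing in for trained word vectors.
toy = {'lady': np.array([1.0, 0.0]), 'russell': np.array([0.0, 1.0])}

features = document_vector(['lady', 'russell', 'unknown'], toy, dim=2)
print(features)  # [0.5 0.5]
```

The resulting fixed-length vector can be used as a row in a feature matrix for a classifier or clustering model. Averaging discards word order, so it is a simple baseline rather than a complete representation.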

Here are some neat things people have done with word2vec:

 * [Visualizing word embeddings in Jane Austen's Pride and Prejudice](http://blogger.ghostweather.com/2014/11/visualizing-word-embeddings-in-pride.html). Skip to the bottom to see a _truly honest_ account of this data scientist's process.

 * [Tracking changes in Dutch Newspapers' associations with words like 'propaganda' and 'alien' from 1950 to 1990](https://www.slideshare.net/MelvinWevers/concepts-through-time-tracing-concepts-in-dutch-newspaper-discourse-using-sequential-word-vector-spaces).

 * [Helping customers find clothing items similar to a given item but differing on one or more characteristics](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/).

## Drill 1: Word2Vec on 100B+ words

As we mentioned, word2vec really works best on a big corpus, but it can take half a day to clean such a corpus and run word2vec on it.  Fortunately, there are word2vec models available that have already been trained on _really_ big corpora. They are big files, but you can download a [pretrained model of your choice here](https://github.com/3Top/word2vec-api). At minimum, the ones built with word2vec (check the "Architecture" column) should load smoothly using an appropriately modified version of the code below, and you can play to your heart's content.

Because the models are so large, however, you may run into memory problems or crash the kernel. If you can't get a pretrained model to run locally, check out this [interactive web app of the Google News model](https://rare-technologies.com/word2vec-tutorial/#bonus_app) instead.
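
If memory is the problem, one thing to try is the `limit` argument of `load_word2vec_format`, which reads only the first N vectors from the file. The sketch below wraps that call; the function name `load_limited_vectors`, the default of 200,000 words, and the local file name in the commented usage are assumptions for illustration. Pretrained files typically list more frequent words first, so a limit usually keeps the most common vocabulary, but that is worth checking for whichever model you download.

```python
def load_limited_vectors(path, limit=200000):
    """Load only the first `limit` vectors from a binary word2vec-format file."""
    # Imported inside the function so it can be defined without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)

# Example call; the file name assumes you saved the download locally.
# vectors = load_limited_vectors('GoogleNews-vectors-negative300.bin.gz')
# print(vectors.most_similar(positive=['lady', 'man'], negative=['woman']))
```

A smaller limit trades vocabulary coverage for memory, so rare words may be missing from the loaded vectors.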

However you access it, play around with a pretrained model. Is there anything interesting you're able to pull out about analogies, similar words, or words that don't match? Write up a quick note about your tinkering and discuss it with your mentor during your next session.

In [0]:
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [0]:
# Play around with your pretrained model here.


