# Word vectors for reasoning

In this notebook I experiment with word vectors. Two alternative methods are tested:
- Word2vec from Google
- FastText from Facebook

In [1]:
# Load libraries
import numpy as np
import pandas as pd
pd.options.display.width=120
#pd.set_option('display.width',75)
#pd.options.display.max_columns=8
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
nltk.download('punkt')
#import nlpia
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.casual import casual_tokenize
from collections import Counter
from collections import OrderedDict
import copy
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## 1A) Loading gensim.word2vec module (from Google)

Alternatively, you can download the original pretrained model (binary format) yourself; search for "Word2vec models pretrained on Google News documents".
- Put the file in a local path and load it from there; the script below instead fetches the model through gensim's downloader.

In [0]:
import gensim.downloader as api

In [3]:
wv=api.load('word2vec-google-news-300')





In [0]:
import warnings
warnings.filterwarnings("ignore")

### a) Experimenting with word2vec values

Note: call these methods on the word-vector object (wv here, or model.wv for a model you train yourself); calling them directly on the model is deprecated.

most_similar() method:
- finds the nearest neighbors of any given word vector.
- argument positive: words whose vectors are added together
- argument negative: words whose vectors are subtracted (excluded)
- argument topn: number of top matches to return

doesnt_match() method:
- determines the most unrelated term (the one with the highest distance to all other terms in the input list).

similarity() method:
- calculates the cosine similarity between two words

Getting the word vector for a word (300 floats):
- wv['word']
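
To make the positive/negative arguments more concrete, here is a minimal sketch of the vector arithmetic behind a most_similar()-style query. The 3-dimensional toy_vectors and the helper toy_most_similar() are invented for illustration; they are not gensim's implementation and not real Word2Vec values.

```python
import numpy as np

# Toy 3-dimensional vectors, made up for this sketch (not real Word2Vec values).
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.3, 0.3]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_most_similar(positive, negative=(), topn=3):
    """Hypothetical helper: add positive vectors, subtract negative ones,
    then rank the remaining words by cosine similarity to the result."""
    query = sum(unit(toy_vectors[w]) for w in positive)
    for w in negative:
        query = query - unit(toy_vectors[w])
    query = unit(query)
    scores = {
        w: float(np.dot(query, unit(v)))
        for w, v in toy_vectors.items()
        if w not in positive and w not in negative  # skip the query words
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

best_match = toy_most_similar(["king", "woman"], negative=["man"], topn=1)
print(best_match)
```

With these toy values the top match for king + woman - man is queen, mirroring the real query further down.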

In [5]:
wv.most_similar('cooking')

[('cook', 0.7584654092788696),
 ('Cooking', 0.7552591562271118),
 ('baking', 0.6751805543899536),
 ('cookery', 0.6722506284713745),
 ('humongous_belly', 0.6695600748062134),
 ('cooks', 0.6584445834159851),
 ('sauteeing', 0.6277279853820801),
 ('COOKING_DEADLINE_LOGO_Logo', 0.6251790523529053),
 ('About_Dishing', 0.6237301826477051),
 ('caramelizing_onions', 0.6213988065719604)]

In [6]:
wv.most_similar(positive=['cooking','potatoes'],topn=5)

[('cook', 0.6973531246185303),
 ('oven_roasting', 0.6754531860351562),
 ('Slow_cooker', 0.6742031574249268),
 ('sweet_potatoes', 0.6600280404090881),
 ('stir_fry_vegetables', 0.6548759341239929)]

In [7]:
wv.doesnt_match("potatoes milk cake computer".split())

'computer'
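
One plausible way to find the odd word out is to average the unit vectors of all words and pick the word least similar to that average. The sketch below does this on made-up 2-dimensional vectors; toy_vectors and toy_doesnt_match() are hypothetical names, and I am not claiming this is exactly what gensim does internally.

```python
import numpy as np

# Made-up 2-dimensional vectors: first axis ~ "food", second axis ~ "technology".
toy_vectors = {
    "potatoes": np.array([0.90, 0.10]),
    "milk":     np.array([0.80, 0.20]),
    "cake":     np.array([0.85, 0.15]),
    "computer": np.array([0.10, 0.90]),
}

def toy_doesnt_match(words):
    """Hypothetical helper: return the word least similar to the mean direction."""
    unit_vectors = np.array(
        [toy_vectors[w] / np.linalg.norm(toy_vectors[w]) for w in words]
    )
    mean = unit_vectors.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    similarities = unit_vectors @ mean  # cosine similarity to the mean direction
    return words[int(np.argmin(similarities))]

odd_one = toy_doesnt_match("potatoes milk cake computer".split())
print(odd_one)
```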

In [8]:
# Perform calculations (e.g. king+woman-man=queen)
wv.most_similar(positive=['king','woman'],negative=['man'],topn=2)

[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827)]

In [9]:
wv.similarity('princess','queen')

0.7070532

## 1B) Generating your own word vector representations (with Word2Vec from Google)

This may be needed if you use a specific domain (e.g. technical vocabulary), not present in Google News.

### a) Preprocess your documents

gensim's Word2Vec model expects a list of sentences, where each sentence is broken up into tokens.
- Detector Morse is a sentence segmenter that improves on the accuracy of the segmenter available in NLTK (https://github.com/cslu-nlp/DetectorMorse).
- Here we use nltk's sent_tokenize() method.

In [0]:
doc="""To provide early intervention/early childhood special education services to eligible chidren and their families. Essential job functions. \
Participate as a transdisciplinary team member to complete educational assessments for.
"""

In [11]:
# note: sent_tokenize requires nltk.download('punkt') in the import section.
sentences = nltk.tokenize.sent_tokenize(doc)
sentences

['To provide early intervention/early childhood special education services to eligible chidren and their families.',
 'Essential job functions.',
 'Participate as a transdisciplinary team member to complete educational assessments for.']

In [12]:
tokenizer=TreebankWordTokenizer()
token_list=[]
#punctuation=['.','!','?']
for sentence in sentences:
    tokens=tokenizer.tokenize(sentence.lower())
    token_list.append(tokens)
token_list

[['to',
  'provide',
  'early',
  'intervention/early',
  'childhood',
  'special',
  'education',
  'services',
  'to',
  'eligible',
  'chidren',
  'and',
  'their',
  'families',
  '.'],
 ['essential', 'job', 'functions', '.'],
 ['participate',
  'as',
  'a',
  'transdisciplinary',
  'team',
  'member',
  'to',
  'complete',
  'educational',
  'assessments',
  'for',
  '.']]

### b) Train your domain-specific Word2Vec model

In [0]:
from gensim.models.word2vec import Word2Vec

In [0]:
num_features=300 # number of vector elements (dimensions) in each word vector. 300 is used in the Google News model.
min_word_count=1 # minimum count for a word to be included. Use a low value for a small corpus (could be e.g. 3 for a larger corpus)
num_workers=2 # number of CPU cores to use for training
window_size=6 # context window size for the skip-gram or continuous bag-of-words approach
subsampling=1e-3 # subsampling rate for frequent words
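
To give a feeling for what the subsampling rate does, the sketch below uses the discard formula from the original Word2Vec paper (Mikolov et al., 2013), P(discard) = 1 - sqrt(t / f), where t is the threshold and f is the word's relative frequency in the corpus. The function name discard_probability is my own, and actual implementations may use a slightly different variant of the formula.

```python
import math

def discard_probability(word_frequency, threshold=1e-3):
    """Probability of dropping one occurrence of a word during training,
    following the formula in the Word2Vec paper.
    word_frequency is the word's share of all tokens (between 0 and 1)."""
    return max(0.0, 1.0 - math.sqrt(threshold / word_frequency))

# A very frequent word (4% of all tokens) is dropped most of the time,
# while a word at or below the threshold frequency is always kept.
p_frequent = discard_probability(0.04)
p_rare = discard_probability(0.0001)
print(round(p_frequent, 3), p_rare)
```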

In [0]:
# Perform training for your domain-specific Word2Vec model
# (gensim 3.x argument names; in gensim 4 the size argument is called vector_size)
model=Word2Vec(token_list,workers=num_workers,size=num_features,min_count=min_word_count,window=window_size,sample=subsampling)

Word2Vec models can consume quite a lot of memory, but only the weight matrix holding the word vectors is of interest.
- You can reduce the memory footprint by freezing the model and discarding the unnecessary information.
- The following command discards the unneeded output weights of your neural network and replaces the word vectors with their L2-normalized versions; afterwards the model cannot be trained further:

In [0]:
# Freeze the model: remove unneeded output weights, only the (normalized) hidden weights are needed.
model.init_sims(replace=True)
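
As far as I understand the gensim 3.x behaviour, init_sims(replace=True) also overwrites the word vectors with their L2-normalized versions. The sketch below shows on a random stand-in matrix (not a trained model) why that is convenient: once every row has length 1, a plain dot product already equals the cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 5))  # stand-in for a small word-vector matrix

# L2-normalize each row so that every word vector has length 1.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / norms

# For unit vectors the dot product is the cosine similarity.
cos_matrix = unit_vectors @ unit_vectors.T
print(np.round(cos_matrix, 3))
```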

### c) Save the model - load it back

In [0]:
# Save the trained model for later use.
model_name="my_domain_specific_word2vec_model"
model.save(model_name)

In [0]:
# Load the model
from gensim.models.word2vec import Word2Vec
model_name="my_domain_specific_word2vec_model"
model=Word2Vec.load(model_name)

### d) Test the trained Word2Vec model

In [18]:
# Experiment with your new model (query through model.wv; calling the model directly is deprecated)
model.wv.most_similar('childhood',topn=3)

[('to', 0.10461775958538055),
 ('education', 0.10317566990852356),
 ('educational', 0.07376974076032639)]

In [19]:
model.wv.most_similar(positive=['childhood','education'],topn=3)

[('participate', 0.10716139525175095),
 ('educational', 0.10102303326129913),
 ('intervention/early', 0.08609788119792938)]

In [20]:
# Which word in the list doesn't belong there
model.wv.doesnt_match("participate childhood education complete".split())

'complete'

In [21]:
model.wv.similarity('participate','participate')

1.0

In [22]:
model.wv.similarity('participate','complete')

0.040114988

In [23]:
#Look at the vector values
model.wv['participate']

array([ 2.45387983e-02, -2.14709640e-02, -5.51058054e-02, -2.36385334e-02,
        8.53215158e-02, -6.39206124e-03, -1.00440048e-01,  5.49790077e-02,
        8.93002301e-02,  8.78949761e-02,  8.77754390e-02, -7.74674416e-02,
        1.98067855e-02, -6.23886893e-03, -2.56746337e-02,  6.62396848e-02,
        8.54819641e-02, -7.59971812e-02, -5.17119132e-02,  6.56401813e-02,
       -3.15501355e-02, -9.94994566e-02,  4.85849828e-02, -5.00031561e-02,
       -9.37737375e-02, -4.30966467e-02, -5.49378991e-02, -9.36845839e-02,
       -8.78868178e-02, -6.88809305e-02, -5.06178774e-02,  7.10132122e-02,
       -6.62240535e-02, -2.52168216e-02, -4.92332457e-03, -6.67347759e-02,
        6.06727302e-02,  9.75582749e-03,  2.43449733e-02,  5.60075603e-02,
       -1.53844031e-02, -1.02046043e-01, -6.71053603e-02, -8.19390938e-02,
        1.07317166e-02, -7.80324489e-02, -3.14557999e-02, -7.44090900e-02,
       -1.39754508e-02, -3.46920307e-04,  3.73141393e-02, -7.92203620e-02,
       -1.80251114e-02, -

### e) Comparison of Word2Vec vs. LSA topic vectors

Topic-document vectors: 
- with LSA : sum of the topic-word vectors for all the words in the documents
- with Word2Vec: sum of all Word2Vec word vectors in each document. This is quite close to how Doc2Vec document vectors work (discussed later)
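
A minimal sketch of the "sum of word vectors" idea, using made-up 3-dimensional vectors instead of a trained model. The names toy_word_vectors and document_vector() are hypothetical, and silently skipping out-of-vocabulary tokens is just one possible choice.

```python
import numpy as np

# Made-up word vectors standing in for a trained Word2Vec model.
toy_word_vectors = {
    "early":     np.array([0.2, 0.1, 0.0]),
    "childhood": np.array([0.4, 0.3, 0.1]),
    "education": np.array([0.1, 0.5, 0.2]),
}

def document_vector(tokens, word_vectors):
    """Sum the vectors of all tokens that are in the vocabulary."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.sum(known, axis=0)

# "services" is not in the toy vocabulary and is skipped.
doc_vec = document_vector(["early", "childhood", "education", "services"], toy_word_vectors)
print(doc_vec)
```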

Word2Vec gets more use out of the same number of words in the documents because it slides a window over the text
- thus each word is reused about five times (as the target word or as a context word) before the window slides past it.
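
The sliding window can be sketched as follows: every word is paired with each neighbor inside the window, so the same word shows up in several training pairs. skipgram_pairs() is a hypothetical helper written for this note; the real training loop in gensim is more involved (for example, it can shrink the window randomly).

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position of a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["essential", "job", "functions", "."], window=1)
for pair in pairs:
    print(pair)
```

Even with a window of 1, the word job appears in four of the six pairs, twice as the center word and twice as context.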
    
Comparison:
- LSA topic vectors: faster training, better discrimination between longer documents
- Word2Vec and GloVe: more efficient use of large corpora, more accurate reasoning with words (analogy questions)

### f) Visualizing word relationships

Note: quick visualization of your word model can be done with TensorBoard's word embedding visualization functionality (see chapter 13)

Let's look at the word vectors more closely, with a 2D visualization in mind, to see whether any patterns emerge.
- First load all word vectors from Google Word2Vec model of Google News corpus.

In [24]:
# It was loaded already in this notebook. Let's check its length
len(wv.vocab)

3000000

In [25]:
# See pages 208-213 from the book for 2D visualisation, using PCA.
import pandas as pd
vocab=pd.Series(wv.vocab)
vocab.iloc[100000:100006]

distinctiveness    Vocab(count:2900000, index:100000)
Namco_Bandai       Vocab(count:2899999, index:100001)
ramparts           Vocab(count:2899998, index:100002)
Linden_Lab         Vocab(count:2899997, index:100003)
Revolutions        Vocab(count:2899996, index:100004)
Henderson_Nev.     Vocab(count:2899995, index:100005)
dtype: object

In [26]:
wv['distinctiveness'][0:100]

array([ 0.14453125,  0.04223633,  0.12304688,  0.07421875, -0.13671875,
        0.17871094,  0.06933594, -0.14257812,  0.2265625 ,  0.1640625 ,
       -0.3125    ,  0.10986328, -0.09960938,  0.38671875, -0.30078125,
       -0.13183594, -0.24316406,  0.30859375, -0.12792969, -0.16015625,
       -0.39453125, -0.1484375 , -0.08691406,  0.26367188, -0.13085938,
       -0.27539062,  0.11328125, -0.15234375,  0.35742188,  0.07617188,
        0.04711914, -0.15332031,  0.07128906,  0.24121094,  0.16601562,
       -0.04858398,  0.17578125, -0.19824219, -0.13574219, -0.16699219,
       -0.17578125, -0.14257812, -0.09033203,  0.33007812, -0.07373047,
       -0.13867188, -0.05541992,  0.37695312,  0.08398438,  0.0859375 ,
       -0.14648438,  0.31445312,  0.02905273, -0.14746094,  0.05493164,
        0.18652344, -0.26367188, -0.19433594,  0.06787109, -0.05932617,
        0.30273438,  0.05737305,  0.14453125, -0.625     , -0.42578125,
        0.01165771, -0.10449219, -0.0625    ,  0.14160156, -0.06

In [27]:
# Distance between distinctiveness and distinctive
import numpy as np
np.linalg.norm(wv['distinctiveness']-wv['distinctive']) # Euclidean distance

3.1394858

In [28]:
# Cosine similarity is the normalized dot product
cos_similarity=np.dot(wv['distinctiveness'],wv['distinctive'])/(
    np.linalg.norm(wv['distinctiveness'])*np.linalg.norm(wv['distinctive']))
cos_similarity

0.5303259

In [29]:
# Cosine distance
1 - cos_similarity

0.46967411041259766
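
To project word vectors down to 2D, as suggested at the start of this section, one option is PCA. The sketch below runs PCA through NumPy's SVD on random stand-in vectors, so it works without downloading the Google News model; with the real model, the rows would be wv[word] for a handful of chosen words, and the resulting coordinates could be passed to plt.scatter.

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(6, 300))  # stand-in for six 300-dimensional word vectors

# PCA via SVD: center the data, then project onto the top two principal directions.
centered = vectors - vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T  # shape (6, 2): one x/y point per word

print(coords_2d.shape)
```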

### g) Document similarity with Doc2Vec

If you are running low on RAM and you know the number of documents ahead of time, you can use a preallocated numpy array instead of a Python list for training_corpus.
- training_corpus=np.empty(len(corpus),dtype=object)
- ... training_corpus[i]=...
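
The preallocation idea above could look roughly like this. To keep the sketch runnable without gensim, TaggedDoc is a stand-in namedtuple with the same two fields (words, tags) as gensim's TaggedDocument, the three-sentence corpus is made up, and lower().split() replaces a real tokenizer.

```python
import numpy as np
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (same field names: words, tags).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

corpus = [
    "Essential job functions",
    "Participate as a team member",
    "Complete educational assessments",
]

# Allocate the array once; each slot holds one Python object.
training_corpus = np.empty(len(corpus), dtype=object)
for i, text in enumerate(corpus):
    training_corpus[i] = TaggedDoc(text.lower().split(), [i])

print(training_corpus[1])
```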

In [0]:
# Use gensim package to train document vector
import multiprocessing
num_cores=multiprocessing.cpu_count()  # calculates how many CPU cores are available
from gensim.models.doc2vec import TaggedDocument, Doc2Vec   # Doc2Vec contains both word embeddings as well as doc vectors for each doc in your corpus.
from gensim.utils import simple_preprocess # crude tokenizer that ignores one-letter words and punctuation. Also other tokenizers work fine.
training_corpus=[]
for i,text in enumerate(sentences):   # here the corpus is doc variable, where each sentence is considered one doc
    tagged_doc=TaggedDocument(simple_preprocess(text),[i])  # documents are annotated with strings or integer tags or whatever info you want to tag your docs
    training_corpus.append(tagged_doc)

In [0]:
model=Doc2Vec(size=100,min_count=1,workers=num_cores,iter=10)  # 100 dimensions, 10 training epochs (iter=10); window size is left at its default
model.build_vocab(training_corpus)
model.train(training_corpus,total_examples=model.corpus_count,epochs=model.epochs)

In [32]:
# Infer a document vector for an unseen document with the trained Doc2Vec model.
# steps=10: the new vector is refined over 10 iterations while the trained model weights stay fixed.
model.infer_vector(simple_preprocess('This is a completely unseen document'),steps=10)  

array([ 5.1482586e-04, -8.7723223e-04, -1.0287223e-03, -1.6943730e-03,
       -1.0099850e-03, -1.8903159e-03,  4.4856564e-04, -4.7034067e-03,
       -1.1170713e-03, -2.5134217e-03,  2.0423906e-03, -2.5497428e-03,
        2.8503904e-04,  4.7316640e-03, -3.8181783e-03, -1.6194887e-03,
       -4.3283659e-03,  2.0670765e-03,  4.0842486e-03,  1.3520226e-03,
       -1.8441958e-03,  3.7693719e-03, -3.8640937e-03,  2.0029738e-03,
        8.4973167e-04,  6.4203358e-04, -6.7998620e-04, -4.0455558e-03,
        2.9018931e-03,  3.3808856e-03,  1.2826681e-03,  3.9480086e-03,
        2.0543612e-03,  1.7407783e-03,  3.5341850e-03,  4.2284536e-03,
       -2.3516305e-03,  4.7414396e-03,  4.2300774e-03,  3.2893044e-03,
        3.9262362e-03, -1.0382166e-03,  3.3581874e-03,  1.9825818e-03,
        3.6963215e-03, -1.8682105e-04, -1.3601539e-03,  2.2856826e-03,
        1.1717859e-03, -4.4503110e-03, -7.2348089e-04, -1.6727307e-03,
       -3.4445480e-03, -1.0842012e-03, -4.5640469e-03, -3.4773993e-04,
      

Common tasks with Doc2Vec
- find similar docs (by calculating the cosine distance between document vectors)
- cluster the document vectors with something like k-means to create a document classifier
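
The first task, finding the most similar document, can be sketched with a cosine similarity matrix. The doc_vectors below are made up and stand in for vectors taken from a trained Doc2Vec model.

```python
import numpy as np

# Made-up document vectors (one row per document).
doc_vectors = np.array([
    [1.0, 0.0, 0.2],
    [0.9, 0.1, 0.3],
    [0.0, 1.0, 0.1],
])

# Normalize the rows, then all pairwise cosine similarities are one matrix product.
unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
similarity = unit @ unit.T

# Ignore each document's similarity to itself, then pick the best match per row.
np.fill_diagonal(similarity, -np.inf)
nearest = similarity.argmax(axis=1)
print(nearest)
```

Here nearest[i] is the index of the document most similar to document i.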

In [33]:
# Look at document vector
model[1]

array([ 2.99791526e-03, -1.51098924e-04, -3.53953009e-03,  2.31808261e-03,
       -2.16730661e-03, -1.08175015e-03,  8.61137931e-04,  2.15832074e-03,
       -4.17979294e-03,  4.41799127e-03, -4.75926092e-03, -3.18557001e-03,
       -1.61611976e-03, -6.72612689e-04,  2.24564318e-03,  2.33253697e-03,
        1.75195048e-03, -1.52676913e-03, -3.85199487e-03,  8.75639438e-04,
       -3.35985771e-03, -3.02240187e-05, -2.12348555e-03,  2.86724488e-03,
       -2.91198958e-03, -2.91085127e-03, -5.50444587e-04, -1.72330614e-03,
        2.32299106e-04,  5.16094966e-04, -3.83294566e-04,  4.76076640e-03,
       -4.44446784e-03, -1.75583863e-03,  2.84307782e-04,  1.54030378e-04,
       -6.21339539e-04,  8.51287987e-05, -4.86089120e-05,  1.84070843e-03,
        1.72342139e-03,  2.93154945e-03,  1.17985730e-03,  2.01370241e-03,
       -2.21741945e-03,  1.22255494e-03, -3.88861750e-04, -1.35775271e-03,
        2.50758021e-03, -2.39759684e-03, -3.52050731e-04, -4.25555278e-03,
        8.25439056e-04,  

In [34]:
# Look at document vector
model[2]

array([-4.38587973e-03, -3.57072474e-03,  3.88431968e-03, -3.47585743e-03,
        4.75853402e-03,  2.79352395e-03,  1.26587390e-03, -3.00626713e-03,
        3.41877574e-03, -4.91604907e-03, -2.69041071e-03,  3.25385598e-03,
        3.46013950e-03, -5.85328904e-04, -3.93068977e-03, -4.48601600e-03,
        3.22551560e-03, -3.64639168e-03, -2.90193758e-03,  4.85022413e-03,
        3.61483335e-03,  6.08189090e-04, -3.75569012e-04,  1.58393080e-03,
       -2.18574191e-03,  4.77644242e-03, -1.14011962e-03, -3.77471698e-03,
        3.03882430e-03,  3.97146167e-03,  1.23964564e-03,  2.08758726e-03,
        4.11385344e-03, -5.96546393e-04, -8.63870548e-04, -3.46013415e-03,
       -3.88592342e-03,  2.72085005e-03, -3.88643844e-03,  1.16048905e-04,
       -3.56198126e-03, -3.09477118e-03, -4.99721849e-03,  3.15565430e-03,
       -1.67173077e-03, -4.89549898e-03,  4.84334212e-03,  4.26977361e-03,
        4.39984491e-03, -3.46464664e-03, -1.88309746e-03,  9.30987007e-04,
        4.11939668e-03, -

In [35]:
# Calculate the Euclidean distance between document vectors
np.linalg.norm(model[0]-model[1]), np.linalg.norm(model[0]-model[2]), np.linalg.norm(model[1]-model[2])

(0.036447875, 0.036241837, 0.041457236)

In [36]:
# Calculate cosine similarities between document vectors
cos_similarity0=np.dot(model[0],model[1]) / (np.linalg.norm(model[0])*np.linalg.norm(model[1]))
cos_similarity1=np.dot(model[0],model[2]) / (np.linalg.norm(model[0])*np.linalg.norm(model[2]))
cos_similarity2=np.dot(model[1],model[2]) / (np.linalg.norm(model[1])*np.linalg.norm(model[2]))
cos_similarity0,cos_similarity1,cos_similarity2
# They are all rather different

(0.04435811, 0.16789521, -0.07092797)

In [37]:
# Calculate cosine distances between document vectors
1-cos_similarity0,1-cos_similarity1,1-cos_similarity2
# Shortest distance is between doc0 and doc2.

(0.9556418918073177, 0.8321047872304916, 1.07092797011137)

## 2A) Load the FastText model from Facebook

Pretrained models are available for many languages.

Loading works much like Google's Word2Vec model:
- download the bin+text model for your language of choice to your local MODEL_PATH
- unzip the binary language file (note: the en.wiki.zip file is 9.6GB)
- then load it into gensim with the following script

In [0]:
# from gensim.models.fasttext import FastText
# ft_model=FastText.load_fasttext_format(model_file=MODEL_PATH)
# ft_model.wv.most_similar('soccer')

## 3A) GloVe (Global Vectors from Stanford)

Stanford GloVe project: https://nlp.stanford.edu/projects/glove/
- publication: GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/pubs/glove.pdf
- method: factorization of the word co-occurrence matrix, in the spirit of SVD (singular value decomposition), splitting it into the same two weight matrices that Word2Vec produces. Note that the paper itself fits a weighted least-squares model to the log co-occurrence counts rather than running a literal SVD.
- the key was to normalize the co-occurrence matrix the same way.
- sometimes Word2Vec was not able to converge to a global optimum, while GloVe did.

Thus the main idea in GloVe is direct optimization of global vectors of word co-occurrences (across the entire corpus), which gives it its name.
- Word2Vec relies on backpropagation to update the weights, which is less efficient.
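
The global co-occurrence statistics that GloVe starts from can be sketched as a simple count over the whole corpus. cooccurrence_counts() is a hypothetical helper and the three token lists are made up. As a simplification, every pair inside the window counts as 1, whereas the GloVe paper down-weights pairs that are further apart.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair occurs within the window,
    accumulated over all sentences of the corpus."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

toy_corpus = [
    ["early", "childhood", "education"],
    ["special", "education", "services"],
    ["early", "education"],
]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("early", "education")])
```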

Thus it is recommended nowadays to use GloVe to train new word vector representations.
- faster training, better RAM/CPU efficiency
- more efficient use of data (better for smaller corpora)
- more accurate for the same amount of training.