# First steps with word embeddings
Big Data and Automated Content Analysis

Damian Trilling


This notebook shows you some first steps in working with word embeddings in Gensim.


**NB: Some of these things may take a lot of memory and/or computing power.**

## Training word embeddings
Training word embeddings with gensim is very simple (see the example code below, although in practice you will need to specify some extra options). However, you need a massive dataset for that (think of millions of documents). We are therefore not going to do this in class.

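```
import gensim

# `sentences` must be an iterable of tokenized documents (lists of strings)
model = gensim.models.Word2Vec()
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
```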
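
The `sentences` object is not defined in the snippet above. Gensim expects an iterable of tokenized documents, that is, a list of lists of strings. Here is a minimal sketch of how such an input could be built, assuming a naive regex tokenizer and a made-up toy corpus (`raw_documents` and `tokenize` are illustrative names, not part of Gensim):

```python
import re

# Toy corpus, made up for illustration; real training needs far more text.
raw_documents = [
    "The cat sat on the mat.",
    "Dogs and cats are popular pets!",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# One list of tokens per document.
sentences = [tokenize(doc) for doc in raw_documents]
print(sentences)
```

For a real project you would probably swap in a proper tokenizer, but the shape of the result (one token list per document) stays the same.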

## Using pre-trained models
We can use pre-trained embeddings instead. You can download them yourself and then read them from a file, which you probably want to do if you want to use, for instance, the Dutch embeddings that we talked about.
Alternatively, we can use some of the embeddings that gensim can download for us.
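
If you read embeddings from a file yourself, it helps to know what such a file looks like. The sketch below assumes the plain-text word2vec format (a header line with the vocabulary size and the number of dimensions, then one word per line followed by its vector); whether a given set of embeddings uses this format is something you need to check. In practice, Gensim's `KeyedVectors.load_word2vec_format` reads such files for you. The function `read_word2vec_text` and the toy data are only illustrations:

```python
import io

def read_word2vec_text(handle):
    """Parse embeddings in word2vec text format into a {word: vector} dict."""
    n_words, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # skip empty lines
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} dimensions for {word!r}")
        vectors[word] = values
    if len(vectors) != n_words:
        raise ValueError("header does not match the number of rows")
    return vectors

# Tiny made-up example; a real file would be opened with open(path, encoding="utf-8").
toy_file = io.StringIO("2 3\nkat 0.1 0.2 0.3\nhond 0.2 0.1 0.4\n")
embeddings = read_word2vec_text(toy_file)
print(embeddings["kat"])
```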

In [1]:
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")

# alternative that is 16 times as big:
# model = api.load("word2vec-google-news-300")





In [2]:
api.info()['models'].keys()

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])


## Word similarities
Let's play around with word similarities.

In [3]:
model.most_similar('cat')

[('dog', 0.8798074722290039),
 ('rabbit', 0.7424427270889282),
 ('cats', 0.732300341129303),
 ('monkey', 0.7288709878921509),
 ('pet', 0.719014048576355),
 ('dogs', 0.7163872718811035),
 ('mouse', 0.6915250420570374),
 ('puppy', 0.6800068020820618),
 ('rat', 0.6641027331352234),
 ('spider', 0.6501135230064392)]

In [4]:
animals = ['cat', 'dog', 'horse', 'goldfish', 'lion']
for animal in animals:
    print("A {} is almost the same as a {}.".format(animal, model.most_similar(animal)[0][0]))

A cat is almost the same as a dog.
A dog is almost the same as a cat.
A horse is almost the same as a horses.
A goldfish is almost the same as a crackers.
A lion is almost the same as a dragon.


In [5]:
model.closer_than('man','boy')

['woman']

In [6]:
print(model.distance('man','boy'))
print(model.distance('man','woman'))

0.20851284265518188
0.1676505208015442
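
The numbers above are cosine distances, that is, one minus the cosine similarity of the two word vectors. To make this concrete, here is a sketch in plain Python with made-up three-dimensional vectors (the real ones have 100 dimensions). The helper `closer_than` imitates the idea behind the Gensim method of the same name; it is not Gensim's implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

# Made-up toy vectors, for illustration only.
toy_vectors = {
    "man": [1.0, 0.2, 0.0],
    "woman": [0.9, 0.4, 0.0],
    "boy": [0.8, 0.1, 0.5],
    "car": [0.0, 0.1, 1.0],
}

def closer_than(vectors, word, reference):
    """All words that are closer to `word` than `reference` is."""
    threshold = cosine_distance(vectors[word], vectors[reference])
    return [w for w in vectors
            if w != word and cosine_distance(vectors[word], vectors[w]) < threshold]

print(cosine_distance(toy_vectors["man"], toy_vectors["boy"]))
print(closer_than(toy_vectors, "man", "boy"))
```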


And, as we discussed in the lecture, we can literally calculate with the embeddings:

In [7]:
model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7698540687561035),
 ('monarch', 0.6843381524085999),
 ('throne', 0.6755736470222473),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6520534157752991),
 ('prince', 0.6517034769058228),
 ('elizabeth', 0.6464517712593079),
 ('mother', 0.631171703338623),
 ('emperor', 0.6106470823287964),
 ('wife', 0.6098655462265015)]

In [8]:
# We can do the same by hand, but would need to do some manual cleanup afterwards
# (e.g., removing 'king' itself from the results)
model.similar_by_vector(model.get_vector('king') - 
                        model.get_vector('man') + 
                        model.get_vector('woman') )

[('king', 0.8551837205886841),
 ('queen', 0.783441424369812),
 ('monarch', 0.6933802366256714),
 ('throne', 0.6833109855651855),
 ('daughter', 0.6809081435203552),
 ('prince', 0.6713141798973083),
 ('princess', 0.664408266544342),
 ('mother', 0.6579325795173645),
 ('elizabeth', 0.6563301086425781),
 ('father', 0.6392418742179871)]
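
The manual cleanup mentioned in the comment can be sketched as follows. This is a toy version with made-up three-dimensional vectors, not what Gensim does internally: as far as I know, `most_similar` also normalizes the vectors to unit length before combining them, which would explain why the scores of the two cells above differ slightly. The function name `analogy` is just an illustration:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up toy vectors: think of the dimensions as (royal, male, female).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.7, 0.7, 0.1],
}

def analogy(vectors, positive, negative, topn=3):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    inputs = set(positive) | set(negative)  # the cleanup: drop the query words
    scored = [(w, cosine_similarity(target, vec))
              for w, vec in vectors.items() if w not in inputs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

results = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(results)
```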

In [9]:
model.get_vector('king')

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

### TRY YOURSELF

In [10]:
model.most_similar('pythons')

[('iguanas', 0.6652647852897644),
 ('tortoises', 0.6609669923782349),
 ('rattlesnakes', 0.6575519442558289),
 ('crocodiles', 0.6308841109275818),
 ('snakes', 0.6242752075195312),
 ('toads', 0.6229562163352966),
 ('alligators', 0.6116349697113037),
 ('lizards', 0.6112721562385559),
 ('turtles', 0.582252025604248),
 ('salamanders', 0.5808339715003967)]

In [11]:
model.most_similar('python')

[('monty', 0.6886237263679504),
 ('php', 0.586538553237915),
 ('perl', 0.5784407258033752),
 ('cleese', 0.5446676015853882),
 ('flipper', 0.5112984776496887),
 ('ruby', 0.5066928267478943),
 ('spamalot', 0.505638837814331),
 ('javascript', 0.5030569434165955),
 ('reticulated', 0.4983375668525696),
 ('monkey', 0.49764129519462585)]

## Using word embeddings in supervised machine learning 

We need to `sudo pip3 install embeddingvectorizer` first.

In [12]:
from glob import glob 
from collections import defaultdict
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline
import embeddingvectorizer

In [13]:
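reviews = []
test = []
datadir = '/home/damian/Downloads/aclImdb'  # adjust to where you unpacked the dataset
for sent in ['pos', 'neg']:
    # training reviews, stored as (text, label) tuples
    for file in glob(f"{datadir}/train/{sent}/*.txt"):
        with open(file, encoding='utf-8') as fi:
            reviews.append((fi.read(), sent))

    # test reviews, same structure
    for file in glob(f"{datadir}/test/{sent}/*.txt"):
        with open(file, encoding='utf-8') as fi:
            test.append((fi.read(), sent))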

First, let's look at our old model: a tf-idf vectorizer combined with a linear SVM.

In [14]:
mypipe = Pipeline([('vectorizer', TfidfVectorizer()),
                    ('svm', 
                     SGDClassifier(loss='hinge', penalty='l2', tol=1e-4, alpha=1e-6, max_iter=1000, random_state=42))])

# Fit the pipeline (tf-idf vectorizer + classifier) on the training data, then predict the test labels
mypipe.fit([e[0] for e in reviews], [e[1] for e in reviews])
predictions = mypipe.predict([e[0] for e in test])

print('Precision:')
print(metrics.precision_score([e[1] for e in test],predictions,pos_label='pos', labels = ['pos','neg']))
print('Recall:')
print(metrics.recall_score([e[1] for e in test],predictions,pos_label='pos', labels = ['pos','neg']))

Precision:
0.8593267197918361
Recall:
0.84544


Then, let's compare it with the new one.
First, we need to convert our word embedding model into a dictionary that maps each word to its vector, because that is what we pass to the vectorizer below.

In [22]:
model = api.load("glove-wiki-gigaword-100")
w2vmodel = dict(zip(model.index_to_key, model.vectors))
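
To get an intuition for what a vectorizer with `operator='mean'` does with such a dictionary, here is a rough sketch: every document is represented by the average of the vectors of its words. This is an unweighted toy version under my own assumptions (made-up two-dimensional vectors, unknown words are skipped); the actual `embeddingvectorizer` package may, for instance, weight the words by their tf-idf scores:

```python
def mean_embedding(tokens, word_vectors, dim):
    """Average the vectors of all tokens that are in the vocabulary."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(column) / len(known) for column in zip(*known)]

# Made-up toy vectors, for illustration only.
toy_vectors = {
    "great": [1.0, 0.0],
    "film": [0.0, 1.0],
    "boring": [-1.0, 0.0],
}

doc = "a great great film".split()
doc_vector = mean_embedding(doc, toy_vectors, dim=2)
print(doc_vector)
```

Every document ends up as a vector of the same length, no matter how many words it contains, which is what the classifier needs.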

In [23]:
mypipe = Pipeline([('vectorizer', embeddingvectorizer.EmbeddingTfidfVectorizer(w2vmodel, operator='mean')),
                    ('svm', 
                     SGDClassifier(loss='hinge', penalty='l2', tol=1e-4, alpha=1e-6, max_iter=1000, random_state=42))])

# Fit the pipeline (mean-embedding vectorizer + classifier) on the training data, then predict the test labels
mypipe.fit([e[0] for e in reviews], [e[1] for e in reviews])
predictions = mypipe.predict([e[0] for e in test])

print('Precision:')
print(metrics.precision_score([e[1] for e in test],predictions,pos_label='pos', labels = ['neg','pos']))
print('Recall:')
print(metrics.recall_score([e[1] for e in test],predictions,pos_label='pos', labels = ['neg','pos']))

Precision:
0.7312174389909515
Recall:
0.85336


Well, these embeddings do not seem to work well here: precision is clearly lower than with the tf-idf model. Let's try the bigger ones.

In [19]:
bigmodel = api.load("word2vec-google-news-300")
# w2vmodel = dict(zip(bigmodel.index2word, bigmodel.syn0)) # GENSIM3
w2vmodel = dict(zip(bigmodel.index_to_key, bigmodel.vectors))    # GENSIM4

  


In [20]:
mypipe = Pipeline([('vectorizer', embeddingvectorizer.EmbeddingTfidfVectorizer(w2vmodel, operator='mean')),
                    ('svm', 
                     SGDClassifier(loss='hinge', penalty='l2', tol=1e-4, alpha=1e-6, max_iter=1000, random_state=42))])

# Fit the pipeline (mean-embedding vectorizer + classifier) on the training data, then predict the test labels
mypipe.fit([e[0] for e in reviews], [e[1] for e in reviews])
predictions = mypipe.predict([e[0] for e in test])

print('Precision:')
print(metrics.precision_score([e[1] for e in test],predictions,pos_label='pos', labels = ['neg','pos']))
print('Recall:')
print(metrics.recall_score([e[1] for e in test],predictions,pos_label='pos', labels = ['neg','pos']))

Precision:
0.7773083475298126
Recall:
0.91256


## Soft cosine similarity

Finally, let's explore soft cosine similarity by re-using our model `model` from above, and our movie reviews `reviews`.
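
As a reminder of the idea: ordinary cosine similarity treats all words as unrelated, while soft cosine similarity also takes into account how similar the words themselves are, via a term similarity matrix S: sim(a, b) = aᵀSb / (√(aᵀSa) · √(bᵀSb)). Here is a minimal sketch in plain Python with a made-up vocabulary of three words and a made-up matrix (in the real code, the matrix is derived from the embeddings):

```python
import math

def bilinear(a, S, b):
    """Compute a^T S b for lists a, b and a nested-list matrix S."""
    return sum(a[i] * S[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b, S):
    return bilinear(a, S, b) / math.sqrt(bilinear(a, S, a) * bilinear(b, S, b))

vocabulary = ["film", "movie", "banana"]

# Made-up term similarities: 'film' and 'movie' are similar, 'banana' is unrelated.
S = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

doc_a = [1, 0, 0]  # bag-of-words vector for "film"
doc_b = [0, 1, 0]  # bag-of-words vector for "movie"

print(soft_cosine(doc_a, doc_b, S))         # 0.8
print(soft_cosine(doc_a, doc_b, identity))  # 0.0 (this is the ordinary cosine)
```

The two documents do not share a single word, so their ordinary cosine similarity is zero, but the soft cosine picks up that "film" and "movie" mean almost the same.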

In [21]:
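from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
from gensim.similarities import WordEmbeddingSimilarityIndex  # GENSIM4 (GENSIM3: from gensim.models import ...)
from gensim.corpora import Dictionary

# `model` is already a KeyedVectors object, so we can pass it directly
termsim_index = WordEmbeddingSimilarityIndex(model)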
documents = [e[0].lower().split() for e in reviews[:100]]

id2word = Dictionary(documents)
bow_corpus = [id2word.doc2bow(document) for document in documents]
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, id2word)  # construct similarity matrix
docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)


In [22]:
query = '''Pulp Fiction may be the single best film ever made, and quite appropriately it is by one of the most 
creative directors of all time, Quentin Tarantino. This movie is amazing from the beginning definition of pulp to
the end credits and boasts one of the best casts ever assembled with the likes of Bruce Willis, Samuel L. Jackson, 
John Travolta, Uma Thurman, Harvey Keitel, Tim Roth and Christopher Walken. The dialog is surprisingly humorous for
this type of film, and I think that's what has made it so successful. Wrongfully denied the many Oscars it was 
nominated for, Pulp Fiction is by far the best film of the 90s and no Tarantino film has surpassed the quality of
this movie (although Kill Bill came close). As far as I'm concerned this is the top film of all-time and definitely 
deserves a watch if you haven't seen it.
'''.lower().split()
sims = docsim_index[id2word.doc2bow(query)]  

In [23]:
sims

[(20, 0.9642696380615234),
 (37, 0.9639226198196411),
 (41, 0.9627991318702698),
 (1, 0.9624872803688049),
 (31, 0.9612997770309448),
 (50, 0.9604517221450806),
 (85, 0.96035236120224),
 (13, 0.9603282809257507),
 (17, 0.9570714235305786),
 (71, 0.956860363483429)]

In [24]:
" ".join(documents[50])

"i was blown away by the re-imagined battlestar galactica, a show that always kept me guessing and brought me to tears on more than one occasion. a hardened sci-fi fan, i like to think i can pick out the good stuff from the bs, and this was good stuff.<br /><br />as such, when i first heard about the prospect of a prequel series some months ago i got a sick feeling in my gut. i was afraid that the formula that made battlestar so successful would be reused in caprica, which wouldn't work at all. bsg's story, of a mournful ragged band of survivors, trapped aboard decaying star ships and guided by prophetic vision and a sequence of pseudo-miracles, was perfectly complimented by extraordinary music and a better cast of actors.<br /><br />caprica feels different. where bsg takes place after the fall of a great civilization, caprica portrays that civilization in it's cold and decadent heyday. the overall vibe i got from caprica was similar to that of minority report, minus excessive and coun