# First steps with word embeddings
Big Data and Automated Content Analysis

Damian Trilling


This notebook shows you some first steps of how to work with word embeddings in Gensim. 


**NB: Some of these things may take a lot of memory and/or computing power.**

## Training word embeddings
Training word embeddings with gensim is very simple (see the example code below, although you will need to specify some extra options). However, you need a massive dataset for that (think of millions of documents). We're therefore not going to do this in class.

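```
import gensim

# `sentences` is an iterable of tokenized documents (each one a list of strings)
model = gensim.models.Word2Vec()
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
```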

## Using pre-trained models
We can use pre-trained embeddings instead. You can download them yourself and then read them from a file (which you probably want to do if you want to use, for instance, the Dutch embeddings that we talked about).
Alternatively, we can use embeddings that gensim downloads for us.

In [28]:
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")

# alternative that is 16 times as big:
# model = api.load("word2vec-google-news-300")
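If you download embeddings yourself, they often come as a plain-text file in the word2vec layout: a header line with the vocabulary size and the number of dimensions, followed by one word and its vector per line. The sketch below writes a tiny, made-up file of that kind and reads it back into a dictionary. The file content and the helper `read_word2vec_text` are hypothetical and only illustrate the layout; with gensim you would normally use `KeyedVectors.load_word2vec_format` instead.

```python
import os
import tempfile

# A tiny, made-up embedding file in the plain-text word2vec layout:
# header "<vocabulary size> <dimensions>", then one word per line.
TOY_FILE = """3 4
cat 0.1 0.2 0.3 0.4
dog 0.1 0.2 0.3 0.5
car 0.9 0.1 0.0 0.2
"""


def read_word2vec_text(path):
    """Read a plain-text word2vec file into a {word: [floats]} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header and file content disagree")
    return vectors


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_vectors.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(TOY_FILE)
    toy = read_word2vec_text(path)
    # With gensim installed, the equivalent call would be:
    # from gensim.models import KeyedVectors
    # toy = KeyedVectors.load_word2vec_format(path, binary=False)

print(sorted(toy))
print(len(toy["cat"]))
```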



## Word similarities
Let's play around with word similarities.

In [13]:
model.most_similar('cat')

[('dog', 0.8798074722290039),
 ('rabbit', 0.7424426674842834),
 ('cats', 0.7323004007339478),
 ('monkey', 0.7288709878921509),
 ('pet', 0.7190139889717102),
 ('dogs', 0.7163872718811035),
 ('mouse', 0.6915250420570374),
 ('puppy', 0.6800068020820618),
 ('rat', 0.6641027331352234),
 ('spider', 0.6501135230064392)]
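The scores in this list are cosine similarities between the vector for 'cat' and the vectors of the other words. To make that concrete, here is a minimal sketch with made-up three-dimensional vectors (real embeddings have 100 or 300 dimensions). The names `toy_vectors` and `most_similar` are illustrative and not part of gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: animals point in one direction, vehicles in another
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
    "truck": [0.2, 0.1, 0.8],
}


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# 'dog' ranks first for 'cat' in this toy data
print(most_similar("cat", toy_vectors))
```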

In [35]:
animals = ['cat', 'dog', 'horse', 'goldfish', 'lion']
for animal in animals:
    print("A {} is almost the same as a {}.".format(animal, model.most_similar(animal)[0][0]))

A cat is almost the same as a dog.
A dog is almost the same as a cat.
A horse is almost the same as a horses.
A goldfish is almost the same as a crackers.
A lion is almost the same as a dragon.


In [17]:
model.closer_than('man','boy')

['woman']

In [21]:
print(model.distance('man','boy'))
print(model.distance('man','woman'))

0.20851284265518188
0.1676505208015442
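The two numbers explain the `closer_than` result: `distance` is the cosine distance (one minus the cosine similarity), and 'woman' (0.168) is closer to 'man' than 'boy' (0.209) is. A rough sketch of that logic with made-up vectors follows; the function `closer_than` below is a simplified stand-in, not gensim's implementation.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: 0 means same direction, larger means further apart."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


# Made-up vectors for illustration only
toy = {
    "man": [0.9, 0.1, 0.3],
    "woman": [0.8, 0.3, 0.3],
    "boy": [0.7, 0.1, 0.7],
    "car": [0.0, 0.9, 0.1],
}


def closer_than(word, reference, vectors):
    """All words that are closer to `word` than `reference` is."""
    limit = cosine_distance(vectors[word], vectors[reference])
    return [
        other
        for other, vec in vectors.items()
        if other not in (word, reference)
        and cosine_distance(vectors[word], vec) < limit
    ]


print(closer_than("man", "boy", toy))
```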


And, as we discussed in the lecture, we can literally calculate with the embeddings:

In [27]:
model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7698541283607483),
 ('monarch', 0.6843380928039551),
 ('throne', 0.6755735874176025),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6520534753799438),
 ('prince', 0.6517034769058228),
 ('elizabeth', 0.6464517712593079),
 ('mother', 0.6311717629432678),
 ('emperor', 0.6106470823287964),
 ('wife', 0.6098655462265015)]

In [51]:
# We can do the same by hand, but would need to do some manual cleanup afterwards
# (e.g., removing 'king' itself from the results)
model.similar_by_vector(model.get_vector('king') - model.get_vector('man') + model.get_vector('woman') )

[('king', 0.8551837205886841),
 ('queen', 0.783441424369812),
 ('monarch', 0.6933802366256714),
 ('throne', 0.6833109855651855),
 ('daughter', 0.680908203125),
 ('prince', 0.6713142395019531),
 ('princess', 0.664408266544342),
 ('mother', 0.6579325199127197),
 ('elizabeth', 0.6563301086425781),
 ('father', 0.6392419338226318)]
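The manual cleanup mentioned above mostly means dropping the input words from the ranking. Here is a minimal sketch of the whole calculation with made-up vectors, where the dimensions loosely stand for "royalty", "male", and "female". The helper `analogy` is hypothetical; it works on the raw vectors, so its scores would not match gensim's exactly.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


# Made-up vectors: (royalty, male, female)
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "prince": [0.8, 0.9, 0.1],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}


def analogy(positive, negative, vectors, topn=3):
    """Add the `positive` vectors, subtract the `negative` ones, rank the rest."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    inputs = set(positive) | set(negative)  # the manual cleanup step
    scores = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in inputs
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# 'queen' ranks first in this toy data
print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy))
```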

### TRY YOURSELF

## Using word embeddings in supervised machine learning 

We first need to install the `embeddingvectorizer` package, for instance with `pip3 install embeddingvectorizer`.

In [53]:
from glob import glob 
from collections import defaultdict
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline
import embeddingvectorizer

In [5]:
reviews=[]
test=[]

for file in glob ("/home/damian/Downloads/aclImdb/train/pos/*.txt"):
    with open(file) as fi:
        reviews.append((fi.read(),"1"))
nopostr=len(reviews)
print ("Added",nopostr,"positive reviews")  

for file in glob ("/home/damian/Downloads/aclImdb/train/neg/*.txt"):
    with open(file) as fi:
        reviews.append((fi.read(),"-1"))
nonegtr=len(reviews)-nopostr
print ("Added",nonegtr,"negative reviews")  

for file in glob ("/home/damian/Downloads/aclImdb/test/pos/*.txt"):
    with open(file) as fi:
        test.append((fi.read(),"1"))
noposte=len(test)
print ("Added",noposte,"positive reviews")  

for file in glob ("/home/damian/Downloads/aclImdb/test/neg/*.txt"):
    with open(file) as fi:
        test.append((fi.read(),"-1"))
nonegte=len(test)-noposte
print ("Added",nonegte,"negative reviews")  

Added 12500 positive reviews
Added 12500 negative reviews
Added 12500 positive reviews
Added 12500 negative reviews


First, let's look at our old model:

In [13]:
mypipe = Pipeline([('vectorizer', TfidfVectorizer()),
                    ('svm', 
                     SGDClassifier(loss='hinge', penalty='l2', tol=1e-4, alpha=1e-6, max_iter=1000, random_state=42))])

# Fit the tf-idf + SVM pipeline on the training reviews, then predict the test labels
mypipe.fit([e[0] for e in reviews], [e[1] for e in reviews])
predictions = mypipe.predict([e[0] for e in test])

print('Precision:')
print(metrics.precision_score([e[1] for e in test],predictions,pos_label='1', labels = ['-1','1']))
print('Recall:')
print(metrics.recall_score([e[1] for e in test],predictions,pos_label='1', labels = ['-1','1']))

Precision:
0.8598565139409751
Recall:
0.84376
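As a reminder of what these two numbers mean with `pos_label='1'`: precision is the share of reviews predicted as positive that really are positive, and recall is the share of truly positive reviews that the classifier found. A small sketch with made-up labels (the helper `precision_recall` is illustrative, not scikit-learn's implementation):

```python
def precision_recall(y_true, y_pred, pos_label="1"):
    """Compute precision and recall for the class `pos_label`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: four positive and four negative reviews
y_true = ["1", "1", "1", "1", "-1", "-1", "-1", "-1"]
y_pred = ["1", "1", "-1", "-1", "1", "-1", "-1", "-1"]

# 2 true positives, 1 false positive, 2 false negatives
precision, recall = precision_recall(y_true, y_pred)
print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
```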


Then, let's compare with the new one.
First, we need to convert our word embedding model into a dictionary that maps each word to its vector (the vectorizer below takes such a dictionary as input):

In [54]:
# bigmodel = api.load("word2vec-google-news-300")
# w2vmodel = dict(zip(bigmodel.index_to_key, bigmodel.vectors))

# Map each word to its vector (gensim >= 4.0; older versions use .index2word and .syn0)
w2vmodel = dict(zip(model.index_to_key, model.vectors))

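To give an intuition for how a word-to-vector dictionary can turn a whole review into a single feature vector, here is a minimal sketch that averages the vectors of all known words in a text. It is an illustration only: the toy vectors and the helpers `tokenize` and `mean_embedding` are made up, and `EmbeddingTfidfVectorizer` presumably also applies tf-idf weighting (as its name suggests), so its internals will differ.

```python
def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def mean_embedding(text, vectors):
    """Average the vectors of all words in `text` that are in `vectors`."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[tok] for tok in tokenize(text) if tok in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(column) / len(known) for column in zip(*known)]


# Made-up two-dimensional vectors: (positive-ish, negative-ish)
toy = {
    "great": [0.9, 0.1],
    "good": [0.7, 0.2],
    "awful": [0.1, 0.9],
    "boring": [0.2, 0.8],
}

docs = ["A great and good movie", "Awful boring plot", "Unknown words only"]
features = [mean_embedding(doc, toy) for doc in docs]

for doc, vec in zip(docs, features):
    print(doc, [round(x, 2) for x in vec])
```

Each document ends up with the same number of features as the embeddings have dimensions, no matter how long the text is, which is what allows a classifier to be trained on top.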


In [55]:
mypipe = Pipeline([('vectorizer', embeddingvectorizer.EmbeddingTfidfVectorizer(w2vmodel, operator='mean')),
                    ('svm', 
                     SGDClassifier(loss='hinge', penalty='l2', tol=1e-4, alpha=1e-6, max_iter=1000, random_state=42))])

# Fit the embedding-based pipeline on the training reviews, then predict the test labels
mypipe.fit([e[0] for e in reviews], [e[1] for e in reviews])
predictions = mypipe.predict([e[0] for e in test])

print('Precision:')
print(metrics.precision_score([e[1] for e in test],predictions,pos_label='1', labels = ['-1','1']))
print('Recall:')
print(metrics.recall_score([e[1] for e in test],predictions,pos_label='1', labels = ['-1','1']))

Precision:
0.8468904556455856
Recall:
0.64384


Well, these embeddings don't seem to work very well: precision is about the same as before, but recall drops from 0.84 to 0.64. Let's try the bigger ones.

In [58]:
bigmodel = api.load("word2vec-google-news-300")
# gensim >= 4.0 attribute names; older versions use .index2word and .syn0
w2vmodel = dict(zip(bigmodel.index_to_key, bigmodel.vectors))

  


In [59]:
mypipe = Pipeline([('vectorizer', embeddingvectorizer.EmbeddingTfidfVectorizer(w2vmodel, operator='mean')),
                    ('svm', 
                     SGDClassifier(loss='hinge', penalty='l2', tol=1e-4, alpha=1e-6, max_iter=1000, random_state=42))])

# Fit the embedding-based pipeline on the training reviews, then predict the test labels
mypipe.fit([e[0] for e in reviews], [e[1] for e in reviews])
predictions = mypipe.predict([e[0] for e in test])

print('Precision:')
print(metrics.precision_score([e[1] for e in test],predictions,pos_label='1', labels = ['-1','1']))
print('Recall:')
print(metrics.recall_score([e[1] for e in test],predictions,pos_label='1', labels = ['-1','1']))

Precision:
0.9180983252296057
Recall:
0.67976
