# Part IV. Word Embeddings

---

## What is word embedding?

Word embedding is a method for mapping words to continuous vectors. The resulting vectors serve as semantic representations of the words: words that occur in similar contexts end up close to each other in the vector space. There are many different algorithms; the most commonly used one is `word2vec`.

<img src="resources/w2v-context-words.png">
<img src="resources/w2v-king-queen-vectors.png" align="left" width="400px">
<img src="resources/w2v-king-queen-composition.png" align="right" width="400px">

<div style="clear:both">
    <small><a href="https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/">Source</a></small>
</div>
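
To make the vector arithmetic from the figures concrete, here is a minimal sketch using only the standard library. The three-dimensional vectors are hand-made, hypothetical values (not trained embeddings), and the helper names `cosine` and `nearest` are introduced only for illustration.

```python
import math

# Hand-made toy vectors (hypothetical values, NOT trained embeddings).
# The dimensions can loosely be read as: royalty, masculinity, femininity.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, exclude=()):
    """Return the vocabulary word whose vector is most similar to `query`."""
    candidates = [w for w in toy_vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(toy_vectors[w], query))


# king - man + woman, computed component by component
composed = [k - m + w for k, m, w in zip(toy_vectors["king"],
                                         toy_vectors["man"],
                                         toy_vectors["woman"])]

# The query words themselves are excluded from the candidates.
analogy = nearest(composed, exclude=("king", "man", "woman"))
print(analogy)
```

With these toy values the composed vector lies closest to `queen`; real embeddings have many more dimensions, and none of them is individually interpretable like this.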

---

## Word embeddings in practice

One of the most accessible libraries for training word2vec (a type of word embedding) is `gensim`. The word2vec algorithm is trained on preprocessed sentences, each given as a list of tokens.

### 1. Corpus preparation

- Acquire the data
- Split into sentences
- Tokenize the sentences
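
The three steps above can be sketched with the standard library alone, to show the shape of the data the training step expects: a list of sentences, each a list of tokens. The `raw_text` string and the regular expressions are deliberately naive stand-ins, assumed here for illustration only; the cells below use Punkt and spaCy for the real work.

```python
import re

# Step 1: "acquire" the data (a hypothetical in-memory example).
raw_text = "Frodo took the Ring. Sam followed him! Where did Gandalf go?"

# Step 2: naive sentence splitting, breaking after '.', '!' or '?'
# when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())

# Step 3: naive tokenization, lowercasing and keeping runs of word characters.
corpus = [re.findall(r"\w+", sentence.lower()) for sentence in sentences]

print(corpus)
```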

In [None]:
import tqdm
import spacy
import requests

from bs4 import BeautifulSoup
from nltk.tokenize import PunktSentenceTokenizer

In [None]:
# 'en' is the spaCy 2.x shortcut name; on spaCy 3+ load 'en_core_web_sm' instead
nlp = spacy.load('en')
tokenizer = PunktSentenceTokenizer()

In [None]:
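def tokenize(sentences):
    """Lemmatize each sentence, dropping stop words, punctuation, whitespace and pronouns."""
    return [[token.lemma_ for token in nlp(sent)
             if not token.is_stop
             and not token.is_punct
             and not token.is_space
             and token.lemma_ != '-PRON-']  # spaCy 2.x lemma for pronouns
            for sent in tqdm.tqdm(sentences)]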

In [None]:
LOTR = {}
# The three volumes share one URL pattern; {book} is the 1-based volume number.
url = 'http://ae-lib.org.ua/texts-c/tolkien__the_lord_of_the_rings_{book}__en.htm'
for i in range(3):
    resp = requests.get(url.format(book=i + 1)).content
    text = BeautifulSoup(resp, "html.parser").get_text()
    sentences = tokenizer.tokenize(text)
    LOTR[i] = tokenize(sentences)

### 2. Training
Now that we have transformed the raw data into the desired format, we can train the model.

In [None]:
import gensim

In [None]:
# gensim 3.x parameter names; gensim 4.x renamed size -> vector_size and iter -> epochs
model = gensim.models.Word2Vec(LOTR[0] + LOTR[1] + LOTR[2],
                               size=50, window=5, min_count=5, iter=5, seed=42)
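
To give a feel for what the `window` parameter controls, the sketch below lists the (center, context) pairs that a symmetric window produces for one sentence. This is a simplified illustration of the skip-gram view of the training data, not gensim's implementation: the function name `context_pairs` and the example sentence are made up, and the real algorithm applies further steps (for example subsampling of frequent words). Note also that gensim's default architecture is CBOW, which predicts the center word from its context words together.

```python
def context_pairs(sentence, window):
    """Return all (center, context) pairs within `window` tokens of each other."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)                   # left edge of the window
        hi = min(len(sentence), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                            # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical, already lemmatized sentence
sentence = ["frodo", "carry", "ring", "mordor"]
pairs = context_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

A larger `window` pairs each word with more distant neighbours, which tends to capture broader topical similarity at the cost of more training pairs.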

### 3. Using the trained model
The trained model can be used to:
- search for similar items

In [None]:
model.wv.most_similar('hobbit')

- search for analogies

In [None]:
model.wv.most_similar(positive=['sméagol', 'hobbit'], negative=['ring'])

In [None]:
vector = model.wv['sméagol'] - model.wv['ring'] + model.wv['hobbit']
model.wv.most_similar([vector])

- find the word in a given list that doesn't go with the others

In [None]:
model.wv.doesnt_match(['gandalf', 'frodo', 'saruman', 'aragorn'])
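
The idea behind such an "odd one out" query can be sketched as follows: normalize every vector to unit length, average them, and pick the word least similar to that average. This is a rough illustration under that assumption, not gensim's code; the function names `unit` and `odd_one_out` and the two-dimensional toy vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    # Mean of the unit vectors, itself normalized to unit length
    mean = unit([sum(units[w][d] for w in words) / len(words)
                 for d in range(dim)])
    # For unit vectors the dot product equals the cosine similarity
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hand-made toy vectors (hypothetical values, not trained embeddings)
toy = {
    "frodo":  [0.90, 0.10],
    "sam":    [0.80, 0.20],
    "pippin": [0.85, 0.15],
    "sauron": [0.10, 0.90],
}

odd = odd_one_out(list(toy), toy)
print(odd)
```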

- visualize them

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
from sklearn.manifold import TSNE

In [None]:
ringwords = ['ring'] + [w for w, s in model.wv.most_similar(['ring'], topn=20)]
# Boolean mask over the vocabulary: True for 'ring' and its 20 nearest neighbours
ring = np.array([word in ringwords for word in model.wv.vocab])

In [None]:
X = model.wv[model.wv.vocab]

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(X_tsne[~ring][:, 0], X_tsne[~ring][:, 1], c='b', alpha=.3)
ax.scatter(X_tsne[ring][:, 0], X_tsne[ring][:, 1], c='r')

### 4. Saving models

In [None]:
model.save('data/lotr_w2v.model')

### 5. Loading pre-trained models
- gensim models

In [None]:
model = gensim.models.Word2Vec.load('data/lotr_w2v.model')

- other, binary models

In [None]:
# binary=True reads the binary word2vec format (binary=False reads the plain-text format);
# gzip-compressed files are decompressed transparently
model = gensim.models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [None]:
model.most_similar('Frodo')
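
For orientation, the plain-text variant of the word2vec format is simple enough to read by hand: a header line with the vocabulary size and the vector dimension, followed by one word and its vector components per line. The sketch below parses a tiny made-up example from memory; `read_text_vectors` is a hypothetical helper written for illustration, not part of gensim, and the binary variant stores the same information in a more compact encoding.

```python
import io

# A tiny, made-up model in the textual word2vec format:
# the first line is "<vocabulary size> <vector dimension>".
text_model = io.StringIO(
    "3 2\n"
    "frodo 0.1 0.2\n"
    "ring 0.3 0.4\n"
    "mordor 0.5 0.6\n"
)


def read_text_vectors(handle):
    """Parse the textual word2vec format into a {word: vector} dict."""
    vocab_size, dim = (int(n) for n in handle.readline().split())
    table = {}
    for line in handle:
        word, *values = line.rstrip("\n").split(" ")
        if len(values) != dim:
            raise ValueError("unexpected vector length for %r" % word)
        table[word] = [float(v) for v in values]
    if len(table) != vocab_size:
        raise ValueError("header announces %d words, found %d"
                         % (vocab_size, len(table)))
    return table


loaded = read_text_vectors(text_model)
print(loaded["ring"])
```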

## Further reading

- [gensim word2vec tutorial](https://rare-technologies.com/word2vec-tutorial/)
- [demystifying word2vec](http://www.deeplearningweekly.com/blog/demystifying-word2vec)
- [word2vec by hand blogpost](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/)
- [visualizing embeddings](https://github.com/anvaka/word2vec-graph)