# Part IV. Word Embeddings

---

## What is word embedding?

Word embedding is a method for mapping words to continuous vectors. The resulting vectors serve as semantic representations of the words: words that appear in similar contexts end up close to each other in the vector space. There are many different algorithms; the most commonly used one is `word2vec`.

<img src="resources/w2v-context-words.png">
<img src="resources/w2v-king-queen-vectors.png" align="left" width="400px">
<img src="resources/w2v-king-queen-composition.png" align="right" width="400px">

<div style="clear:both">
    <small><a href="https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/">Source</a></small>
</div>
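
To make "close in the vector space" concrete, here is a minimal sketch that compares a few hand-made vectors with cosine similarity. The four-dimensional vectors and the `cosine_similarity` helper are invented for illustration only; real word2vec vectors are learned from data and usually have a hundred or more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings (not trained); the dimensions loosely stand for
# royalty, maleness, femaleness and food.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(king_queen > king_apple)  # related words score higher than unrelated ones
```

Cosine similarity ignores vector length and only compares direction, which is why it is the usual choice for comparing word vectors.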

---

## Word embeddings in practice

One of the most accessible libraries for training word2vec, a type of word embedding, is `gensim`. The word2vec algorithm is trained on preprocessed sentences, each given as a list of tokens.
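
Why sentences? word2vec learns from words together with the words that surround them inside a small window. The sketch below only illustrates that idea and is not gensim's implementation; `context_pairs` and its `window` parameter are hypothetical names chosen for this example.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["frodo", "carried", "the", "ring"]
pairs = context_pairs(sentence, window=1)
print(pairs)
```

Because the window never crosses a sentence boundary, the quality of the sentence splitting and tokenization directly affects which pairs the model sees.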

### 1. Corpus preparation

- Acquire the data
- Split into sentences
- Tokenize the sentences
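
The end result of these steps is a list of sentences, where each sentence is a list of token strings. The dependency-free sketch below shows that target shape on a two-sentence example. The regex-based `split_sentences` and `simple_tokenize` helpers are deliberately naive stand-ins, assumed here for illustration; the cells that follow use NLTK and spaCy for the real work.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def simple_tokenize(sentence):
    # Lowercase the sentence and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", sentence.lower())

text = "Frodo left the Shire. Sam followed him!"
corpus = [simple_tokenize(s) for s in split_sentences(text)]
print(corpus)
```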

In [None]:
import tqdm
import spacy
import requests

from bs4 import BeautifulSoup
from nltk.tokenize import PunktSentenceTokenizer

In [None]:
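# 'en_core_web_sm' is the small English pipeline; the 'en' shortcut no longer works in spaCy 3
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
tokenizer = PunktSentenceTokenizer()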

In [None]:
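def tokenize(sentences):
    """Lemmatize each sentence, dropping stop words, punctuation and whitespace."""
    return [[token.lemma_ for token in nlp(sent)
             if not token.is_stop
             and not token.is_punct
             and not token.is_space
             # spaCy 2.x lemmatizes every pronoun to the placeholder '-PRON-'
             and token.lemma_ != '-PRON-']
            for sent in tqdm.tqdm(sentences)]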

In [None]:
# Download the three volumes and store each one as a list of tokenized sentences
LOTR = {}
url = 'http://ae-lib.org.ua/texts-c/tolkien__the_lord_of_the_rings_{book}__en.htm'
for i in range(3):
    resp = requests.get(url.format(book=i+1)).content
    text = BeautifulSoup(resp, "html.parser").get_text()
    sentences = tokenizer.tokenize(text)
    LOTR[i] = tokenize(sentences)

### 2. Training

In [None]:
import gensim

In [None]:
model = gensim.models.Word2Vec(LOTR[0] + LOTR[1] + LOTR[2], seed=42)

### 3. Using the trained model

In [None]:
# Similarity queries live on the model's KeyedVectors (model.wv)
model.wv.most_similar('hobbit')

In [None]:
model.wv.most_similar(positive=['frodo', 'dwarf'], negative=['hobbit'])
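
The `positive`/`negative` query is vector arithmetic followed by a similarity ranking: add the positive vectors, subtract the negative ones, and return the words closest to the result. The sketch below reproduces that idea on invented three-dimensional vectors. The `analogy` helper and all the numbers are assumptions made for this example, not gensim's code or the output of the trained model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words
    scores = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors; the dimensions loosely stand for
# "hobbit-ness", "dwarf-ness" and "named character".
toy_vectors = {
    "hobbit":  [1.0, 0.0, 0.0],
    "dwarf":   [0.0, 1.0, 0.0],
    "frodo":   [1.0, 0.0, 1.0],
    "sam":     [1.0, 0.0, 0.8],
    "gimli":   [0.0, 1.0, 1.0],
    "legolas": [0.0, 0.2, 1.0],
}

result = analogy(toy_vectors, positive=["frodo", "dwarf"], negative=["hobbit"])
print(result[0][0])  # the top-ranked word for "frodo - hobbit + dwarf"
```

On a small corpus such as three novels, the real model's answers to this kind of query can be noisy, so treat them as exploratory rather than definitive.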

### 4. Saving models

In [None]:
model.save('data/lotr_w2v.model')

### 5. Loading pre-trained models
- gensim models

In [None]:
model = gensim.models.Word2Vec.load('data/lotr_w2v.model')

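- other models in the original word2vec format

In [None]:
# binary=True reads the binary word2vec format (use binary=False for the text format);
# gzip-compressed files are decompressed automatically.
# The result is a KeyedVectors object, so query it directly: vectors.most_similar(...)
vectors = gensim.models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)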

### 6. Using semantic vectors

## Further reading

- [gensim word2vec tutorial](https://rare-technologies.com/word2vec-tutorial/)