# Lab 4: Word Embeddings

Welcome to lab 4! In today's lab we will look at how to represent a word as a dense vector. In the homework you will learn more about long sparse vectors.

Word embeddings are a popular way of representing text data in problems that are solved by deep learning algorithms. An embedding provides a short, dense representation of a word as a vector of floating-point numbers. The hypothesis behind word embeddings is simple: words that occur in the same contexts tend to have similar meanings.
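
To make the hypothesis concrete, here is a minimal sketch that collects the context words of each word in a tiny made-up corpus and compares the context sets. The corpus, the helper names `context_sets` and `jaccard`, and the window size are assumptions chosen for illustration; real embedding models learn from such co-occurrences rather than comparing sets directly.

```python
from collections import defaultdict


def context_sets(sentences, window=1):
    """Map every word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts[word].update(left + right)
    return contexts


def jaccard(a, b):
    """Overlap of two sets, between 0 (disjoint) and 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy corpus: "cat" and "dog" appear in the same contexts.
corpus = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["the", "car", "drives"],
]

ctx = context_sets(corpus)
cat_dog = jaccard(ctx["cat"], ctx["dog"])  # identical contexts
cat_car = jaccard(ctx["cat"], ctx["car"])  # only "the" is shared
print(cat_dog, cat_car)
```

In this toy corpus "cat" and "dog" share all of their contexts, while "cat" and "car" share only one, which is the kind of signal an embedding model turns into nearby vectors.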


## Initializing Embeddings

One way to create word embeddings is to start with dense vectors for each token containing random numbers, and then train a model such as a document classifier. After training you will end up with the trained embeddings and model. 


PyTorch has a class for that called `Embedding`, which is a simple lookup table that stores embeddings of a fixed dictionary and size. You can initialize them randomly or from pretrained embeddings.

To initialize the embedding we need to define the dimension of the vector. Usually the dimension varies according to the vocabulary size. It is quite common to use word embeddings of dimension 50, 100, 256 or 300, and sometimes 1000. As the dimension is a hyper-parameter, we need to experiment with it during the training phase.



In [None]:
import torch
import torch.nn as nn

torch.manual_seed(1234)

<torch._C.Generator at 0x7f1e54fa8170>

In [None]:
word_to_ix = {"natural":0, "language":1, "processing":2}
word_to_ix

{'language': 1, 'natural': 0, 'processing': 2}

In [None]:
embeddings = nn.Embedding(3, 5) # three words in vocab, 5 dimensional embeddings
embeddings 

Embedding(3, 5)

In [None]:
# let's create a lookup tensor for the word "natural"
lookup_tensor = torch.tensor(word_to_ix["natural"], dtype=torch.long)
lookup_tensor

tensor(0)

The following looks up the embedding for that index:

In [None]:
emb_layer = embeddings(lookup_tensor)
emb_layer

tensor([ 0.0461,  0.4024, -1.0115,  0.2167, -0.6123],
       grad_fn=<EmbeddingBackward>)
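
The idea from the start of this section, random embeddings that are trained together with a document classifier, can be sketched as follows. The class name `BowClassifier`, the toy documents, the labels and all hyper-parameters are assumptions made up for this example; it only shows that the embedding table is an ordinary trainable parameter that changes during training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class BowClassifier(nn.Module):
    """Hypothetical classifier: average the word embeddings, then a linear layer."""

    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # random init
        self.linear = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        # token_ids has shape (batch, seq_len); the lookup adds an embed_dim axis
        doc_vectors = self.embeddings(token_ids).mean(dim=1)
        return self.linear(doc_vectors)


model = BowClassifier(vocab_size=6, embed_dim=4, num_classes=2)
before = model.embeddings.weight.detach().clone()

# Two made-up documents of three token ids each, with made-up labels.
docs = torch.tensor([[0, 1, 2], [3, 4, 5]], dtype=torch.long)
labels = torch.tensor([0, 1])

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(docs), labels)
    loss.backward()
    optimizer.step()

after = model.embeddings.weight.detach()
changed = not torch.equal(before, after)
print(changed)
```

After training, `model.embeddings.weight` holds the trained embeddings, one row per vocabulary entry.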

### Loading pretrained word embeddings 

Training your own word embeddings is useful when we work in specific domains such as medicine or manufacturing and have a lot of data to train the embeddings on. When we have too little data to train meaningful embeddings, we can use embeddings that were trained on other corpora such as Wikipedia or Google News.

There are many pretrained word embeddings available: Word2Vec, fastText, GloVe, ELMo. We can use these embeddings to initialize the weights instead of initializing them randomly. 

When you download the pretrained word embeddings, they usually look like this: 

say_VERB -0.008861 0.097097 0.100236 0.070044 -0.079279 0.000923 -0.012829 0.064301 -0.029405 -0.009858 ...<br>
go_VERB 0.010490 0.094733 0.143699 0.040344 -0.103710 -0.000016 -0.014351 0.019653 0.069472 -0.046938 ...<br>
make_VERB -0.013029 0.038892 0.008581 0.056925 -0.100181 0.011566 -0.072478 0.156239 0.038442 -0.073817 ... <br>
thirty-six_NUM 0.058545 0.089598 0.052056 0.013421 -0.022304 -0.056648 -0.017670 0.095910 -0.028729 ...
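
A file in this format can be read with a few lines of plain Python. The sketch below assumes one entry per line, fields separated by single spaces, and no header line (some formats start with a "vocabulary size, dimension" line, which this sketch does not handle). The function name `parse_embedding_lines` is hypothetical, and the sample lines reuse only the first three values of the entries shown above.

```python
def parse_embedding_lines(lines):
    """Turn lines of the form 'word v1 v2 ...' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = [float(v) for v in values]
    return vectors


# Shortened sample lines, three values each instead of the full vector.
sample = [
    "say_VERB -0.008861 0.097097 0.100236",
    "go_VERB 0.010490 0.094733 0.143699",
]

parsed = parse_embedding_lines(sample)
print(parsed["go_VERB"])
```

The resulting lists can be stacked into a tensor, with a word-to-row-index dictionary kept alongside it, before handing the weights to an embedding layer.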

In [None]:
weight = torch.FloatTensor([[1, 1.2, 3,], [5,1.4,3.2]])
embedding = nn.Embedding.from_pretrained(weight)
input = torch.LongTensor([1]) # get for index 1
embedding(input)

tensor([[5.0000, 1.4000, 3.2000]])
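
One detail worth knowing: `nn.Embedding.from_pretrained` freezes the weights by default (`freeze=True`), so they are not updated during training. The sketch below reuses the toy weights from the cell above and contrasts the default with `freeze=False`; the variable names `frozen` and `trainable` are only illustrative.

```python
import torch
import torch.nn as nn

weight = torch.FloatTensor([[1, 1.2, 3], [5, 1.4, 3.2]])

# Default: the pretrained vectors stay fixed during training.
frozen = nn.Embedding.from_pretrained(weight)

# freeze=False: the pretrained vectors are only a starting point and get fine-tuned.
trainable = nn.Embedding.from_pretrained(weight, freeze=False)

print(frozen.weight.requires_grad, trainable.weight.requires_grad)
```

Whether to fine-tune is a judgment call: frozen vectors are a common choice when there is little task data, fine-tuning when there is more.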

We can download embeddings with torchtext.vocab: 

In [None]:
from torchtext.vocab import GloVe 
vectors = GloVe(name='6B', dim=100)

.vector_cache/glove.6B.zip: 862MB [02:41, 5.34MB/s]                           
100%|█████████▉| 399055/400000 [00:18<00:00, 21792.62it/s]

Let's look inside: 

In [None]:
vectors["chicken"]

tensor([-0.3194,  0.6435,  0.0617, -0.2347, -0.4667,  0.4594,  0.8097,  0.2657,
         0.1744, -0.2897, -0.7720,  0.2944,  1.1188,  0.5489, -0.2323,  0.6268,
        -0.1981, -0.3967,  0.0751,  0.1399,  0.3052,  0.8838, -0.0324, -0.9825,
         0.6157,  1.6974,  0.1439, -0.1822, -0.5754,  0.5123, -0.0438,  0.9043,
         0.5499, -0.2778, -0.0383,  0.8688,  0.0274, -0.0621, -0.1154, -1.1948,
         0.9122, -1.3764, -0.6007, -1.2390,  0.7174,  0.0060, -1.2784, -0.6036,
         0.0875, -0.9329, -0.3817,  0.1532, -0.0295,  0.5951, -1.3351, -0.8525,
        -0.2539,  0.1549,  0.6360,  0.4603,  0.1127,  0.7312,  0.7924,  0.6403,
         0.8722, -0.1492, -0.3729, -0.0899, -0.3083,  0.1444, -0.2168,  0.4361,
         0.2724,  1.1278,  0.2743,  0.5571, -0.9089,  0.2880,  0.4200,  0.9972,
         0.6990, -0.3730, -0.4469,  0.7007, -0.4779, -0.3068, -0.1777,  0.7048,
         0.0186,  0.2088,  0.1604,  0.1789, -0.3458, -0.5430, -1.3805, -0.8760,
         0.3000, -0.6880,  0.7075, -0.05

In [None]:
vectors["Chicken"]

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])
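
The all-zero tensor above is what torchtext returns by default for a token it cannot find: the GloVe 6B vectors are uncased, so "Chicken" has no entry while "chicken" does. A minimal sketch of a lookup that falls back to the lowercased token is shown below. The helper name `lookup_with_lowercase_fallback` and the toy data are assumptions; with torchtext one would presumably pass `vectors.stoi` and `vectors.vectors` as the first two arguments.

```python
def lookup_with_lowercase_fallback(stoi, table, token):
    """Return (vector, found): try the token as written, then lowercased."""
    for candidate in (token, token.lower()):
        if candidate in stoi:
            return table[stoi[candidate]], True
    return None, False


# Toy stand-ins for a string-to-index mapping and an embedding table.
toy_stoi = {"chicken": 0}
toy_table = [[-0.3194, 0.6435, 0.0617]]

vec, found = lookup_with_lowercase_fallback(toy_stoi, toy_table, "Chicken")
missing_vec, missing_found = lookup_with_lowercase_fallback(toy_stoi, toy_table, "Qwzx")
print(found, missing_found)
```

Returning an explicit `found` flag makes missing words visible instead of silently mixing zero vectors into later computations.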

## Training Word Embeddings

We can train our own word embeddings using different algorithms. Word2Vec provides two different algorithms: Continuous Bag-Of-Words and Skip-Gram. Both are shown graphically in the image below. 
<img src="https://miro.medium.com/max/2400/1*cuOmGT7NevP9oJFJfVpRKA.png">
Continuous Bag-Of-Words predicts the center word given the context. Skip-Gram predicts the context words given the center word as an input.
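
The difference between the two setups is easiest to see in the training examples they generate. The sketch below builds both kinds of examples from one tokenized sentence; the function names and the fixed window are assumptions for illustration, and details of the real Word2Vec implementation such as subsampling of frequent words are left out.

```python
def context_window(tokens, i, window):
    """Words within `window` positions to the left and right of position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def skipgram_pairs(tokens, window=2):
    """Skip-Gram: one (center, context_word) pair per context word."""
    return [
        (center, ctx)
        for i, center in enumerate(tokens)
        for ctx in context_window(tokens, i, window)
    ]


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, center) example per position."""
    return [
        (context_window(tokens, i, window), center)
        for i, center in enumerate(tokens)
    ]


tokens = ["i", "am", "drinking", "coffee"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
print(sg)
print(cbow)
```

With `window=1`, Skip-Gram yields pairs such as `("drinking", "am")` and `("drinking", "coffee")`, while CBOW yields `(["am", "coffee"], "drinking")` for the same position.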

You can read more about these algorithms in the original articles: Tomas Mikolov et al., *Efficient Estimation of Word Representations in Vector Space*, and [Tomas Mikolov et al., *Distributed Representations of Words and Phrases and their Compositionality*](https://arxiv.org/abs/1310.4546).

We can use Gensim to train Word2Vec embeddings. 

In [None]:
from gensim.models import Word2Vec
import spacy
nlp = spacy.load('en_core_web_sm')

# read the whole book; the context manager closes the file afterwards
with open("structure_and_functions_of_the_body.txt", 'r', encoding='utf-8') as f:
    data = f.read()
doc = nlp(data)


In [None]:
texts = [ [token.lemma_ for token in sent if token.lemma_!='-PRON-'] for sent in doc.sents]

In [None]:
texts[2]

['Most',
 'of',
 ',',
 'moreover',
 ',',
 'separate',
 'the',
 'anatomy',
 'from',
 'the',
 '\n',
 'physiology',
 'and',
 'all',
 'treat',
 'the',
 'different',
 'system',
 'of',
 'tissue',
 'separately',
 ',',
 '\n']

In [None]:
import multiprocessing
cores = multiprocessing.cpu_count()

# argument names follow gensim 3.x (gensim 4 renames `size` to `vector_size`)
model = Word2Vec(size=100, window=2, min_count=5, workers=cores-1, negative=10)

Now we need to build the vocabulary table.

In [None]:
model.build_vocab(texts)

Let's train the model. 

In [None]:
model.train(texts, total_examples=model.corpus_count, epochs=100, report_delay=1)

(4516570, 8388900)

In [None]:
model.init_sims(replace=True) # keeps only the L2-normalized vectors to save memory (training cannot continue afterwards)

Let's explore this model. We can ask the model which words are most similar to a given word:

In [None]:
model.wv.most_similar(positive=["body"])

[('air', 0.34284690022468567),
 ('same', 0.3393501341342926),
 ('intestine', 0.3342682421207428),
 ('case', 0.3135792911052704),
 ('other', 0.3043676018714905),
 ('place', 0.3041819930076599),
 ('state', 0.2922523319721222),
 ('oxygen', 0.2879457473754883),
 ('fetus', 0.28457996249198914),
 ('development', 0.2835215926170349)]

In [None]:
model.wv.most_similar(positive=["body", "muscle"])

[('intestine', 0.3504788875579834),
 ('hand', 0.34708523750305176),
 ('case', 0.3384869396686554),
 ('femur', 0.3340495526790619),
 ('bicep', 0.331558495759964),
 ('latter', 0.3288252055644989),
 ('sphincter', 0.32769477367401123),
 ('head', 0.32016849517822266),
 ('fiber', 0.31714192032814026),
 ('same', 0.3093774914741516)]
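
The scores in these lists are cosine similarities. With several positive words, the query is the mean of their unit-length vectors. The sketch below reproduces that ranking on a made-up two-dimensional vocabulary; the vectors and helper names are assumptions, and it is a simplified stand-in for what gensim does, not its implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def most_similar(vectors, positive, topn=3):
    """Rank all other words by cosine similarity to the mean of the query words."""
    units = [unit(vectors[w]) for w in positive]
    query = [sum(col) / len(units) for col in zip(*units)]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in positive]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-dimensional embeddings.
toy = {
    "body": [1.0, 0.0],
    "muscle": [0.8, 0.6],
    "bicep": [0.9, 0.3],
    "oxygen": [0.0, 1.0],
}

result = most_similar(toy, ["body", "muscle"])
print(result)
```

Here "bicep" points in exactly the direction of the averaged query and therefore ranks first.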

In [None]:
# odd one out
model.wv.doesnt_match(['fiber', 'motion', 'muscle'])

'motion'
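
The "odd one out" query can be understood the same way: average the unit vectors of the given words and return the word that is least similar to that average. The following is a minimal sketch under that assumption, with made-up two-dimensional vectors; the function is a simplified stand-in, not gensim's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    mean = unit([sum(col) / len(units) for col in zip(*units.values())])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical embeddings: "fiber" and "muscle" point in a similar direction.
toy = {
    "fiber": [1.0, 0.1],
    "muscle": [0.9, 0.2],
    "motion": [0.1, 1.0],
}

odd = doesnt_match(toy, ["fiber", "motion", "muscle"])
print(odd)
```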

## Issues with Word2Vec embeddings

As Word2Vec creates an embedding for each word seen in the training data, it cannot handle words that it did not encounter during training. This leaves us with many out-of-vocabulary words.

In [None]:
# "chicken" does not exist in the embeddings
model["chicken"]

  


KeyError: ignored
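
To judge how serious the problem is for a given text, one can measure the share of tokens that have no embedding. The sketch below does this for a plain Python set; the function name `oov_rate` and the toy data are assumptions. With a trained gensim model, a membership test such as `token in model.wv` should play the role of the vocabulary check.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens that are not in the vocabulary (0.0 for empty input)."""
    if not tokens:
        return 0.0
    missing = [t for t in tokens if t not in vocabulary]
    return len(missing) / len(tokens)


# Toy vocabulary and token list.
vocab = {"body", "muscle", "fiber"}
rate = oov_rate(["body", "chicken", "muscle", "chicken"], vocab)
print(rate)
```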

Words can have many senses: depending on the context, a word can take on different meanings. Let's consider the word *nail*. It could be the upper surface of the tip of a finger or a small metal spike. Word2Vec learns only one representation for this word.

**What can we use instead of that?**

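Another issue is with morphologically rich languages: Word2Vec treats every inflected form as a separate token, so rare forms of a word get poor embeddings or none at all.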



## FastText

To address these issues, in particular out-of-vocabulary words and rich morphology, Bojanowski et al. proposed a new embedding method called FastText. The main idea of FastText is to use the internal structure of a word to improve the vector representations obtained from skip-gram.

Let's go over a simple example. We have a sentence "I am drinking coffee." and we need to predict the context words "am" and "coffee" from the center word "drinking". 
1. The center word is split into character n-grams. The embedding for the center word is the sum of the embeddings of its character n-grams and of the word itself.
2. The context word embeddings are taken directly from the embedding table (no n-grams are added).
3. Negative samples are collected.
4. The dot product between the center and context word embeddings is taken, and the sigmoid function is applied to it to get a match score between 0 and 1.
5. The embeddings are updated based on the loss. This brings the actual context words closer to the center word and pushes the negative samples further away.
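
Steps 1, 2 and 4 above can be sketched in plain Python. This is a simplified illustration under several assumptions: n-grams of length 3 to 6 with `<` and `>` as word boundary markers, a plain dictionary instead of the hashed n-gram table used in practice, random toy vectors, and no negative sampling or parameter update. All function and variable names are hypothetical.

```python
import math
import random


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def center_vector(word, ngram_table, dim):
    """Step 1: sum the vectors of the word itself and of its known n-grams."""
    total = [0.0] * dim
    for key in [word] + char_ngrams(word):
        vec = ngram_table.get(key)
        if vec is not None:
            total = [t + x for t, x in zip(total, vec)]
    return total


def match_score(center, context):
    """Step 4: sigmoid of the dot product, a value between 0 and 1."""
    dot = sum(c * x for c, x in zip(center, context))
    return 1.0 / (1.0 + math.exp(-dot))


rng = random.Random(0)
dim = 4

# Toy tables: vectors for "drinking" and its n-grams, and for the context words.
ngram_table = {
    key: [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    for key in ["drinking"] + char_ngrams("drinking")
}
context_table = {
    w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in ["am", "coffee"]
}

# Step 2: the context vector is read directly from its table.
score = match_score(center_vector("drinking", ngram_table, dim), context_table["coffee"])

# An unseen word still gets a vector from the n-grams it shares with "drinking".
oov_vector = center_vector("drinks", ngram_table, dim)
print(score, oov_vector)
```

The last two lines show why this approach helps with out-of-vocabulary words: "drinks" was never stored as a word, yet it shares n-grams such as `<dr` and `ink` with "drinking" and therefore receives a non-zero vector.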

In [None]:
from gensim.models.fasttext import FastText
model_ft = FastText(size=100)

# build the vocabulary
model_ft.build_vocab(texts)

# train the model

model_ft.train(sentences=texts, epochs=100, total_examples=model_ft.corpus_count, total_words=model_ft.corpus_total_words)

In [None]:
wv = model_ft.wv
wv['chicken']

array([ 5.84540404e-02, -1.71275243e-01, -4.16463196e-01, -2.62860116e-02,
       -4.49597865e-01,  1.43712914e+00,  8.19210410e-01,  1.10958207e+00,
        1.45769030e-01, -2.78505474e-01, -3.66446495e-01,  5.04720628e-01,
       -6.00762904e-01, -6.55672073e-01, -2.27815628e-01,  1.39593139e-01,
        7.63777554e-01, -4.29498941e-01,  7.92346776e-01, -6.36529401e-02,
       -6.27341270e-01, -1.01682234e+00, -3.43093306e-01,  3.37230295e-01,
        7.19035923e-01,  4.84142214e-01, -4.50526446e-01, -3.02567095e-01,
       -5.82112193e-01, -1.29428244e+00,  6.22869074e-01,  9.75813627e-01,
       -1.37357628e-02, -7.38758981e-01,  1.18457162e+00,  2.62065291e-01,
       -8.81629512e-02,  1.59774661e-01, -1.22859657e+00, -1.31248736e+00,
       -9.62204397e-01, -1.30527115e+00, -2.00505972e-01,  6.23708010e-01,
        3.11680466e-01,  1.23568738e+00,  1.17132699e+00,  4.75794345e-01,
        1.90103307e-01,  4.72816199e-01, -2.50215977e-01,  3.78694266e-01,
        8.06592047e-01, -

In [None]:
wv.most_similar(positive=["body", "muscle"])

[('corpuscle', 0.5593864321708679),
 ('Muscles', 0.506779134273529),
 ('numerous', 0.4341539740562439),
 ('manner', 0.418964147567749),
 ('contraction', 0.4000856280326843),
 ('constriction', 0.39654988050460815),
 ('action', 0.38593339920043945),
 ('portion', 0.38329970836639404),
 ('hand', 0.382585346698761),
 ('particle', 0.38190367817878723)]

In [None]:
wv.most_similar(positive=["body"])

[('activity', 0.37730681896209717),
 ('absorption', 0.37398838996887207),
 ('nitrogenous', 0.3710705041885376),
 ('organ', 0.36998969316482544),
 ('nervous', 0.3642290234565735),
 ('numerous', 0.36146074533462524),
 ('plasma', 0.3611193001270294),
 ('active', 0.3418109118938446),
 ('formation', 0.341249942779541),
 ('every', 0.3340113162994385)]