# Word to vec

How do we understand the meaning of a previously unseen word? We look at the contexts in which it appears.

## Theory

There are two variants of word to vec: Skipgram and CBOW.

### Skipgram

Predicting an outside word $o$ from the central word $c$. We have two embedding matrices: $u$ for outside words and $v$ for central words.

$P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}$.
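
As a minimal sketch of the formula above, the snippet below evaluates $P(w \mid c)$ for every word of a tiny vocabulary. The matrices `U` and `V` and their values are made up for illustration; only the softmax over dot products comes from the formula.

```python
import math

# Toy embeddings (made-up numbers): 4-word vocabulary, 2-dimensional vectors.
U = [[0.2, 0.1], [0.0, -0.3], [0.5, 0.4], [-0.1, 0.2]]   # outside vectors u_w
V = [[0.3, -0.2], [0.1, 0.1], [0.4, 0.0], [-0.2, 0.5]]   # central vectors v_w


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def p_outside_given_center(c):
    """Return the list [P(w | c) for every word w in the vocabulary]."""
    scores = [dot(u_w, V[c]) for u_w in U]
    max_score = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = p_outside_given_center(c=2)
print(probs)       # four probabilities, one per vocabulary word
print(sum(probs))  # sums to 1 up to floating-point error
```

Note that the denominator runs over the whole vocabulary, which is what makes the exact softmax expensive for realistic vocabulary sizes.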

![](https://i.ibb.co/xgT4k8b/2020-10-02-10-10-21.png)

More formally, we need to maximize the likelihood:
$$
L(\theta)=\prod_{t=1}^{T} \prod_{-m \leq j \leq m, j \neq 0} P\left(w_{t+j} \mid w_{t}, \theta\right)
$$

$$
L_{\log}(\theta) = \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log P\left(w_{t+j} \mid w_{t}, \theta\right) = \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log \frac{\exp \left(u_{t+j}^{T} v_{t}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{t}\right)} = \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \left( u_{t+j}^{T} v_{t} - \log \sum_{w \in V} \exp \left(u_{w}^{T} v_{t}\right) \right)
$$

$$
loss = -L_{\log}
$$

Let's compute the derivative!

**Reminder**

$$\frac{\partial x^T y}{\partial y} = x$$

Let $o = t+j$, for one step:

$$
\frac{\partial L_{\log}(\theta)}{\partial v_t} = u_o - \dfrac{1}{\sum_w\exp(u_w^T v_t)}\cdot\sum_x \exp(u_x^T v_t) u_x = u_o - \sum_x \frac{\exp(u_x^T v_t)}{\sum_w \exp(u_w^T v_t)} u_x = u_o - \sum_x P(x \mid w_t) u_x
$$
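
One way to sanity-check the derivative is to compare it with finite differences. The sketch below does this for a single (central, outside) pair; the vectors are toy values chosen for illustration, and `numeric_grad` is a helper introduced here, not part of the lecture.

```python
import math

U = [[0.2, 0.1], [0.0, -0.3], [0.5, 0.4], [-0.1, 0.2]]  # outside vectors u_w (toy values)
v_c = [0.4, -0.1]                                        # central vector v_t (toy values)
o = 1                                                    # index of the observed outside word


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_prob(v):
    """log P(o | c) = u_o^T v - log sum_w exp(u_w^T v)."""
    return dot(U[o], v) - math.log(sum(math.exp(dot(u, v)) for u in U))


def analytic_grad(v):
    """u_o - sum_x P(x | c) u_x, the closed form derived above."""
    exps = [math.exp(dot(u, v)) for u in U]
    total = sum(exps)
    expected_u = [sum(e / total * u[k] for e, u in zip(exps, U)) for k in range(len(v))]
    return [U[o][k] - expected_u[k] for k in range(len(v))]


def numeric_grad(v, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for k in range(len(v)):
        plus = v[:k] + [v[k] + eps] + v[k + 1:]
        minus = v[:k] + [v[k] - eps] + v[k + 1:]
        grad.append((log_prob(plus) - log_prob(minus)) / (2 * eps))
    return grad


grad_a = analytic_grad(v_c)
grad_n = numeric_grad(v_c)
print(grad_a)
print(grad_n)  # should agree with the analytic gradient to several decimals
```

The gradient reads as "observed outside vector minus expected outside vector", so it vanishes when the model's expectation matches the data.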

### CBOW

![](https://lena-voita.github.io/resources/lectures/word_emb/w2v/cbow_skip-min.png)
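
CBOW goes in the opposite direction to Skipgram: the context vectors are combined and used to predict the central word. A minimal sketch, assuming the context vectors are averaged (the usual choice) and using made-up toy matrices `U` and `V`:

```python
import math

U = [[0.2, 0.1], [0.0, -0.3], [0.5, 0.4], [-0.1, 0.2]]   # output vectors u_w (toy values)
V = [[0.3, -0.2], [0.1, 0.1], [0.4, 0.0], [-0.2, 0.5]]   # input vectors v_w (toy values)


def cbow_probs(context_ids):
    """P(central = w | context) for every w, using the mean of the context vectors."""
    dim = len(V[0])
    hidden = [sum(V[i][k] for i in context_ids) / len(context_ids) for k in range(dim)]
    scores = [sum(u[k] * hidden[k] for k in range(dim)) for u in U]
    max_score = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


cbow = cbow_probs(context_ids=[0, 1, 3])
print(cbow)  # one probability per candidate central word
```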

## Practice

### Variant 1

```python
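import torch
from torch import nn


class Model(nn.Module):
    def __init__(self, voc_size, emb_dim):
        super().__init__()  # required before registering submodules
        self.u = nn.Embedding(voc_size, emb_dim)  # outside (context) vectors
        self.v = nn.Embedding(voc_size, emb_dim)  # central vectors

w2v = Model(...)

def step(word, context):
    # word: index tensor of the central word; context: index tensors of outside words
    for c_word in context:
        v_c = w2v.v(word)
        # numerator of the softmax: -u_o^T v_c
        loss = -w2v.u(c_word).dot(v_c)
        # denominator: sum over the whole vocabulary (the outside word is included)
        cum_exp = 0
        for i in range(w2v.u.num_embeddings):
            cum_exp += w2v.u(torch.tensor(i)).dot(v_c).exp()
        loss += torch.log(cum_exp)
        loss.backward()
        ...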
```

### Variant 2

![](https://i.ibb.co/qydjBbv/2020-10-02-12-16-33.png)

```python
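class Model(nn.Module):
    def __init__(self, voc_size, emb_dim):
        super().__init__()  # required before registering submodules
        self.u = nn.Embedding(voc_size, emb_dim)
        # the rows of this weight matrix act as the second embedding matrix
        self.v = nn.Linear(emb_dim, voc_size, bias=False)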

    def forward(self, x):
        return self.v(self.u(x))

w2v = Model(...)
criterion = nn.CrossEntropyLoss()

def step(word, context):
    for c_word in context:
        preds = w2v(word)
        loss = criterion(preds, c_word)
        loss.backward()
        ...
```


In [1]:
!curl -O http://mattmahoney.net/dc/text8.zip



In [2]:
!unzip text8.zip

Archive:  text8.zip
  inflating: text8                   


In [3]:
!pip install -q catalyst

In [4]:
import re
from collections import Counter
from tqdm.notebook import tqdm
import numpy as np

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

from catalyst import dl

In [5]:


class W2VCorpus:
    def __init__(
        self, path, voc_max_size: int = 40000, min_word_freq: int = 20, max_corp_size=5e6
    ):
        corpus = []
        sentences = []
        with open(path, "r") as inp:
            for line in inp:
                corpus.append(line.split())
                sentences.append(line)
        corpus = np.array(corpus)
        self.corpus = corpus
        most_freq_word = \
            Counter(' '.join(sentences).split()).most_common(voc_max_size)
        most_freq_word = np.array(most_freq_word)
        most_freq_word = \
            most_freq_word[most_freq_word[:, 1].astype(int) > min_word_freq]
        
        print('Vocabulary size is:' + str(len(most_freq_word)))
        self.vocabulary = set(most_freq_word[:, 0])
        self.vocabulary.update(["<PAD>"])
        self.vocabulary.update(["<UNK>"])
        self.word_freq = most_freq_word
        self.idx_to_word = dict(list(enumerate(self.vocabulary)))
        self.word_to_idx = \
            dict([(i[1], i[0]) for i in enumerate(self.vocabulary)])
        self.W = None
        self.P = None
        self.positive_pairs = None
        
    def make_positive_dataset(self, window_size=2):
        """take corpus and make positive examples for skipgram or CBOW
           like: [1234], [[3333, 1111, 2222, 4444]]"""
        if self.W is not None:
            return self.W, self.P
        W = []
        P = []
        pbar = tqdm(self.corpus)
        pbar.set_description('Creating context dataset')
        for message in pbar:

            if len(self.corpus) == 1:
                iter_ = tqdm(enumerate(message), total=len(message))
            else:
                iter_ = enumerate(message)
            
            for idx, word in iter_:
                if word not in self.vocabulary:
                    word = "<UNK>"
                start_idx = max(0, idx - window_size)
                end_idx = min(len(message), idx+window_size+1)
                pos_in_window = window_size
                if idx - window_size < 0:  # start of the sentence
                    pos_in_window += idx - window_size
                    
                co_words = message[start_idx:end_idx]
                co_words = np.delete(co_words, pos_in_window)
                filtered_co_words = []
                
                for co_word in co_words:
                    if co_word in self.vocabulary:
                        filtered_co_words.append(co_word)
                    else:
                        filtered_co_words.append("<UNK>")
                while len(filtered_co_words) < 2*window_size:
                    filtered_co_words.append("<PAD>")
                W.append(self.word_to_idx[word])
                co_word_idx = [self.word_to_idx[co_word] for co_word in filtered_co_words]
                P.append(co_word_idx)
        self.W = W
        self.P = P
        del self.corpus
        return W, P
    
    def make_positive_pairs(self):
        if self.positive_pairs is not None:
            return self.positive_pairs
        if self.W is None:
            self.make_positive_dataset()
        pairs = []
        pbar = tqdm(zip(self.W, self.P), total=len(self.W))
        pbar.set_description('Creating positive pairs')
        for w, p in pbar:
            for cur_p in p:
                if cur_p != self.word_to_idx["<PAD>"]:  # pad
                    pairs.append([w, cur_p])
        self.positive_pairs = pairs
        return pairs


In [6]:
corp = W2VCorpus("text8")

pairs = corp.make_positive_pairs()

Vocabulary size is:30964



In [7]:
pairs

[[24216, 1926],
 [24216, 26577],
 [1926, 24216],
 [1926, 26577],
 [1926, 29852],
 [26577, 24216],
 [26577, 1926],
 [26577, 29852],
 [26577, 18045],
 [29852, 1926],
 [29852, 26577],
 [29852, 18045],
 [29852, 27486],
 [18045, 26577],
 [18045, 29852],
 [18045, 27486],
 [18045, 23762],
 [27486, 29852],
 [27486, 18045],
 [27486, 23762],
 [27486, 13736],
 [23762, 18045],
 [23762, 27486],
 [23762, 13736],
 [23762, 28787],
 [13736, 27486],
 [13736, 23762],
 [13736, 28787],
 [13736, 28243],
 [28787, 23762],
 [28787, 13736],
 [28787, 28243],
 [28787, 7288],
 [28243, 13736],
 [28243, 28787],
 [28243, 7288],
 [28243, 20891],
 [7288, 28787],
 [7288, 28243],
 [7288, 20891],
 [7288, 5271],
 [20891, 28243],
 [20891, 7288],
 [20891, 5271],
 [20891, 2822],
 [5271, 7288],
 [5271, 20891],
 [5271, 2822],
 [5271, 18239],
 [2822, 20891],
 [2822, 5271],
 [2822, 18239],
 [2822, 6131],
 [18239, 5271],
 [18239, 2822],
 [18239, 6131],
 [18239, 16752],
 [6131, 2822],
 [6131, 18239],
 [6131, 16752],
 [6131, 27486],
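
The list above has two pairs for the first word, three for the second and four from then on, which is what a window of size 2 produces at the start of a text. A stripped-down sketch of that windowing, without the `<UNK>`/`<PAD>` handling of `W2VCorpus` (the function name `window_pairs` is introduced here for illustration):

```python
def window_pairs(tokens, window_size=2):
    """Return (central, context) pairs for every token and each neighbour within the window."""
    result = []
    for idx, center in enumerate(tokens):
        start = max(0, idx - window_size)
        end = min(len(tokens), idx + window_size + 1)
        for j in range(start, end):
            if j != idx:  # the central word is not its own context
                result.append((center, tokens[j]))
    return result


toy_pairs = window_pairs(["a", "b", "c", "d"], window_size=2)
print(toy_pairs[:5])
# [('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c'), ('b', 'd')]
```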

In [None]:
class W2VDataset(Dataset):
    def __init__(self, pairs):
        self.pairs = pairs

    def __getitem__(self, idx):
        return {
            "word": torch.tensor(self.pairs[idx][0]),
            "context": torch.tensor(self.pairs[idx][1])
        }

    def __len__(self):
        return len(self.pairs)

In [None]:
train_ds = W2VDataset(pairs)
train_dl = DataLoader(train_ds, batch_size=2048)
loaders = {"train": train_dl}

In [None]:
class W2VModel(nn.Module):
    def __init__(self, voc_size, emb_dim):
        super().__init__()
        self.encoder = nn.Embedding(voc_size, emb_dim)
        self.decoder = nn.Linear(emb_dim, voc_size, bias=False)
        self.voc_size = voc_size
        self.emb_dim = emb_dim
        self.init_emb()

    def forward(self, word):
        return self.decoder(self.encoder(word))

    def init_emb(self):
        """
        Initialize the weights as the original word2vec implementation does.
        """
        initrange = 0.5 / self.emb_dim
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.weight.data.uniform_(0, 0)

In [None]:
%reload_ext tensorboard
%tensorboard --logdir .

In [None]:
from catalyst import dl

model = W2VModel(len(corp.vocabulary), 300)
runner = dl.SupervisedRunner(
    input_key=["word"], input_target_key=["context"]
)

In [None]:
runner.train(
    model=model,
    optimizer=torch.optim.Adam(model.parameters()),
    loaders=loaders,
    criterion=nn.CrossEntropyLoss(),
    callbacks = [dl.CriterionCallback(input_key="context")],
    num_epochs=1,
    logdir="simple_w2v_1",
    verbose=False
)

## More tricks

### Negative sampling

Instead of updating all context vectors at every step, we can update only the true context word and a small sample of $K$ negative words (5-10).

$$
loss = -\log \sigma\left(u_{context}^{T} v_{center}\right)-\sum_{w \in\left\{w_{i_{1}}, \ldots, w_{i_{K}}\right\}} \log \sigma\left(-u_{w}^{T} v_{center}\right)
$$
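
A minimal sketch of this loss for one central word, one true context word and $K = 2$ negatives. All vectors are toy values, and `log_sigmoid` is a small helper written here to keep the computation numerically stable.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_center, u_context, u_negatives):
    """-log sigma(u_context^T v_center) - sum_k log sigma(-u_k^T v_center)."""
    loss = -log_sigmoid(dot(u_context, v_center))
    for u_neg in u_negatives:
        loss -= log_sigmoid(-dot(u_neg, v_center))
    return loss


v_center = [0.4, -0.1]
u_context = [0.5, 0.4]
u_negatives = [[0.0, -0.3], [-0.1, 0.2]]  # K = 2 sampled negative words (toy values)
ns_loss = negative_sampling_loss(v_center, u_context, u_negatives)
print(ns_loss)
```

Each step now touches only $K + 1$ outside vectors instead of the whole vocabulary.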

**How can we sample negative words?**

According to their probability: $p_{sample} (w) = p_{word} (w)$? Why not?

Information $= -\log(P)$
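
Sampling by raw frequency would return mostly very frequent, low-information words. The original word2vec paper instead samples from the unigram distribution raised to the power $3/4$, which shifts probability mass towards rarer words. A sketch with made-up counts (the dictionary `counts` is purely illustrative):

```python
import random

counts = {"the": 1000, "cat": 50, "sat": 20, "zygote": 1}  # toy counts


def smoothed_distribution(word_counts, power=0.75):
    """Normalize count ** power into a probability distribution."""
    weights = {w: c ** power for w, c in word_counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


unigram = smoothed_distribution(counts, power=1.0)
smoothed = smoothed_distribution(counts, power=0.75)
print(unigram["the"], smoothed["the"])        # the frequent word loses probability mass
print(unigram["zygote"], smoothed["zygote"])  # the rare word gains probability mass

rng = random.Random(0)  # fixed seed so the draw is reproducible
words = list(smoothed)
negatives = rng.choices(words, weights=[smoothed[w] for w in words], k=5)
print(negatives)
```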


### Not all of the words are equally important

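Very frequent words such as "the" carry little information, so each occurrence of a word $w_i$ is dropped from the training text with probability

$$
P\left(w_{i}\right)=1-\sqrt{\frac{thr}{f\left(w_{i}\right)}}
$$

where $f(w_i)$ is the relative frequency of the word and $thr \approx 10^{-5}$.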
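
A small sketch of the subsampling rule, where `freq` is the word's relative frequency in the corpus. Clipping at zero for words rarer than the threshold is an assumption added here so the result is always a valid probability.

```python
import math


def discard_probability(freq, thr=1e-5):
    """P(w) = 1 - sqrt(thr / f(w)), clipped at 0 for words rarer than thr."""
    return max(0.0, 1.0 - math.sqrt(thr / freq))


print(discard_probability(0.05))  # a very frequent word is dropped almost always
print(discard_probability(1e-4))  # a moderately frequent word is dropped often
print(discard_probability(1e-6))  # a rare word is never dropped
```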

### Distance between center and context

![](https://lena-voita.github.io/resources/lectures/word_emb/research/w2v_position-min.png)


### Default settings

**Method:** Skipgram

**Negative sample size**: about 5 for big dataset, 10-20 for small

**Embedding space dim:** 300 (the quality remains the same for higher dims)

**Window size**: 5-10
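
For reference, these defaults could be written as keyword arguments for `gensim.models.Word2Vec`. The parameter names below are those of gensim 4.x (older releases used `size` instead of `vector_size`); the concrete values are simply the ones listed above, so treat this as a starting point rather than a tuned configuration.

```python
# Keyword arguments for gensim.models.Word2Vec (parameter names as in gensim 4.x).
w2v_kwargs = {
    "sg": 1,             # 1 = Skipgram, 0 = CBOW
    "negative": 5,       # number of negative samples (10-20 for small datasets)
    "vector_size": 300,  # embedding dimension
    "window": 5,         # maximum distance between central and context word
}

# Usage (requires gensim; `sentences` is an iterable of token lists):
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=sentences, **w2v_kwargs)
print(w2v_kwargs)
```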

In [None]:
# in case you want to try to create embeddings yourself

all_embeddings = []
all_words = []

# W2VModel keeps its input embeddings in `encoder`; reuse the device the model lives on
device = next(model.parameters()).device

for word, idx in corp.word_to_idx.items():
    with torch.no_grad():
        current_emb = model.encoder(torch.tensor(idx).to(device))
        current_emb = current_emb.cpu().numpy()
        all_embeddings.append(current_emb)
        all_words.append(word)
all_embeddings = np.array(all_embeddings)
all_words = np.array(all_words).astype(str)
np.savetxt("embeddings_t8.tsv", all_embeddings[:5000], delimiter="\t")

with open("words_t8.tsv", 'w') as out:
    for word in all_words[:5000]:
        out.write(word + '\n')

## Embedding space

[Pre-trained model embedding space](https://projector.tensorflow.org/?config=https://gist.githubusercontent.com/elephantmipt/4a46fe320b4eadf6ba47f0c073968244/raw/2489cf5c83032a19ecb4e38c8e51b779a1e5dc59/configs_t8.json)

In [None]:
import gensim.downloader as api

model = api.load('word2vec-google-news-300')

In [None]:
model.most_similar("door")

### Syntactic

$v_{kings} - v_{king} + v_{queen} \approx v_{queens}$

In [None]:
model.most_similar(positive=['kings', 'queen'], negative=['king'], topn=1)
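
To see what such a query does, here is a rough re-implementation on a handful of made-up 2-d vectors: add the unit-normalized `positive` vectors, subtract the `negative` ones, and rank the remaining words by cosine similarity to the result. This is a simplified sketch of the idea, not gensim's actual code, and the toy `vectors` are hypothetical.

```python
import math

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "plural".
vectors = {
    "king": [1.0, 0.0],
    "kings": [1.0, 1.0],
    "queen": [0.9, 0.1],
    "queens": [0.9, 1.1],
    "door": [-1.0, 0.2],
}


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(positive, negative, topn=1):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are never returned
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


result = most_similar(positive=["kings", "queen"], negative=["king"])
print(result)  # the top word is 'queens' for these toy vectors
```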

### Semantic
$v_{king} - v_{man} + v_{woman} \approx v_{queen}$

In [None]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

In [None]:
model.most_similar(positive=['USA', 'vodka'], negative=['Russia'], topn=1)

## Bias

In [None]:
model.most_similar(positive=['woman', 'director'], negative=['man'])

President

In [None]:
model.similarity("president", "man")

In [None]:
model.similarity("president", "woman")

Crime

In [None]:
string = "Similarity to word `crime`:\n Black: "
string += str(model.similarity("crime", "black"))
string += "\n Latino: " + str(model.similarity("crime", "latino"))
string += "\n Asian: " + str(model.similarity("crime", "asian"))
string += "\n White: " + str(model.similarity("crime", "white"))
print(string)

***What can we do?***



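1.   Train a linear classifier that predicts the protected attribute (e.g. gender) from the embeddings
2.   Project the embeddings onto the classifier's decision boundary, removing the direction it relies on
3.   Go to 1 and repeat until the attribute can no longer be predicted linearly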



<img src="https://lena-voita.github.io/resources/lectures/word_emb/papers/null_it_out-min.png" alt="drawing" width="600"/>
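
The projection in step 2 can be sketched for a single direction as follows. Here `w` stands for the weight vector of an already trained linear classifier and `x` for a word embedding; both are hypothetical values, and training the classifier and repeating the loop are left out.

```python
def project_out(x, w):
    """Remove from x its component along w: x - (x.w / w.w) * w."""
    scale = sum(a * b for a, b in zip(x, w)) / sum(b * b for b in w)
    return [a - scale * b for a, b in zip(x, w)]


w = [1.0, -1.0, 0.0]  # hypothetical weight vector of a trained linear classifier
x = [0.5, 0.2, 0.7]   # hypothetical word embedding
x_clean = project_out(x, w)
score = sum(a * b for a, b in zip(x_clean, w))
print(x_clean)
print(score)  # approximately 0: this classifier can no longer separate the projected points
```

After the projection a new classifier is trained on the cleaned embeddings, which may find a different direction; that is why the procedure is iterated.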

## Credits and References

1. YDS NLP Course. [Lecture about word embeddings.](https://lena-voita.github.io/nlp_course/word_embeddings.html)
2. [Paper about bias.](https://www.aclweb.org/anthology/2020.acl-main.647.pdf)