# word2vec

Consider a sentence:

`the general [MASK] the troops`

We want to predict the `[MASK]` token using the context words: `the`, `general`, `the`, `troops`.



In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from collections import namedtuple

## CBOW

The CBOW model for word2vec stands for Continuous Bag of Words.

We take the average (mean) of the word vectors for the context words and use it to predict the target word (the word that can be viewed as the `[MASK]` token). 

This is an example of self-supervised learning because each `[MASK]` token is created from some text and so we know the right output word in each training example.

The original model used two different embeddings: one for the target words as they appear as `[MASK]` tokens and another for the context words used to predict the target word. 

PyTorch provides a simple way to go from word indices (equivalently, one-hot vectors) to word embeddings (aka word vectors) using the `nn.Embedding` class.

We create a hidden layer in this simple neural network by taking the mean of the context word embeddings. PyTorch encourages you to think in terms of a batch of training/test examples at a time, and the batch is typically the first dimension of the tensors we create.

To take the mean of the input embeddings we start with the embeddings for all the inputs as a tensor of size `B x N x E`, where `B` is the batch size, `N` is the number of context words and `E` is the embedding size. Taking the mean over dimension 1 (the `N` dimension) gives a tensor of size `B x E`, because `mean` drops the reduced dimension by default. (With `keepdim=True` the result would be `B x 1 x E`, which the PyTorch `squeeze` function would convert to `B x E`.)
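
A quick shape check of the averaging step may help here. This is only a sketch: the sizes `B`, `N`, `E` and `V` below are toy values chosen for illustration, not the ones used in training. It shows that `mean(1)` drops the averaged dimension by default, and that the `B x 1 x E` intermediate only appears when `keepdim=True` is passed.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration.
B, N, E, V = 3, 4, 5, 11  # batch, context words, embedding size, vocabulary

embed = nn.Embedding(V, E)
context = torch.randint(0, V, (B, N))          # B x N word indices

vectors = embed(context)                       # B x N x E
mean_default = vectors.mean(1)                 # B x E (dimension 1 is dropped)
mean_keepdim = vectors.mean(1, keepdim=True)   # B x 1 x E
squeezed = mean_keepdim.squeeze(1)             # back to B x E

print(vectors.shape, mean_default.shape, mean_keepdim.shape, squeezed.shape)
```

Both routes produce the same `B x E` tensor, so either form can feed the output layer.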

This mean vector is the hidden representation. It is then used to predict the target word by mapping it up to the vocabulary size with a single fully connected linear layer, `self.expand` in this implementation.

In [2]:
class CBOW(nn.Module):
    def __init__(self, embedding_size=100, vocab_size=-1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.expand = nn.Linear(embedding_size, vocab_size)

    def forward(self, inputs):
        hidden = self.embed(inputs).mean(1)  # mean over context words: batch size x embedding_size
        return self.expand(hidden)

### Create the dataset

The dataset creation reads the raw text data and creates the word to index and index to word dictionaries.

With a window size of 2, each training instance has a `context` (the list of indices of the two words before and the two words after the target) and a `mask_token` (the index of the target word, which is the answer the neural network should predict).
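
The windowing can be seen on a single short sentence. The sketch below uses a made-up five-word sentence in which `commands` plays the role of `[MASK]`; it works on the words themselves rather than on indices to keep the output readable.

```python
# A hypothetical five-word sentence; "commands" plays the role of [MASK].
tokens = "the general commands the troops".split()
window = 2

instances = []
# Only positions with a full window on both sides produce an instance.
for i in range(window, len(tokens) - window):
    context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
    instances.append((context, tokens[i]))

print(instances)
# [(['the', 'general', 'the', 'troops'], 'commands')]
```

Words closer than `window` positions to either end of the text never become targets in this scheme, although they still appear as context words.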


In [3]:
from nltk.corpus import brown

brown.words()[:10]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [4]:
Instance = namedtuple('Instance', ['context', 'mask_token'])

def create_dataset():
    tokenized_text = [w.lower() for w in brown.words()]
    vocab = set(tokenized_text)
    word_to_idx = dict()
    idx_to_word = dict()

    # create word to index mapping and vice versa
    for i, word in enumerate(vocab):
        word_to_idx[word] = i
        idx_to_word[i] = word

    # generate training data using two words of context before and after [MASK] 
    data = []
    for i in range(2, len(tokenized_text)-2):
        context = [
            tokenized_text[i-2],
            tokenized_text[i-1],
            tokenized_text[i+1],
            tokenized_text[i+2],
        ]
        masked_token = tokenized_text[i]
        context_idxs = [word_to_idx[w] for w in context]
        mask_token_idx = word_to_idx[masked_token]
        # data is a list of context, mask_token tuples
        data.append(Instance(context=context_idxs, mask_token=mask_token_idx))
    
    return data, word_to_idx, idx_to_word

### Training the model

Training the model is a fairly standard loop in PyTorch, which doesn't change much once the model is defined and knows how to compute the forward pass. The backward pass is computed automatically by autograd for any `nn.Module` whose `forward` is built from differentiable PyTorch tensor operations, so you don't need to specify a `backward` function yourself. You only need to write `backward` by hand when you implement a custom `torch.autograd.Function`, for example to wrap an operation that autograd cannot differentiate.
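
As a small check of the automatic backward pass, here is a sketch of a module written directly with tensor operations. `ScaledTanh` is a hypothetical module invented for this example; it defines only `forward`, and autograd still produces the gradient of its parameter.

```python
import torch
import torch.nn as nn

class ScaledTanh(nn.Module):
    """Hypothetical module built directly from tensor operations."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, x):
        return self.scale * torch.tanh(x)   # no backward() defined anywhere

layer = ScaledTanh()
x = torch.tensor([0.0, 1.0])
loss = layer(x).sum()
loss.backward()                  # autograd derives the gradient
print(layer.scale.grad)          # d(loss)/d(scale) = tanh(0) + tanh(1)
```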

We use the PyTorch `DataLoader` class to create batches of training examples for the core training loop.

Typically introductory training loops run for a number of epochs (each epoch is one pass over the training data). Instead, the loop below runs for a fixed number of updates to the model and reports the loss on the current batch every `show_loss` updates.
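
One way to picture a step-based loop is as an endless stream of batches that is cut off after a fixed number of steps. The sketch below is an alternative formulation, not the loop used in this notebook; `cycle`, `toy_batches`, `max_steps` and `report_every` are hypothetical names, and a plain list stands in for the `DataLoader`.

```python
from itertools import islice

def cycle(make_batches):
    """Yield batches forever, starting a new pass each time the data runs out."""
    while True:
        yield from make_batches()

toy_batches = ["batch_a", "batch_b", "batch_c"]   # stand-in for a DataLoader
max_steps = 7        # total number of updates
report_every = 2     # how often to report progress

seen = []
stream = islice(cycle(lambda: iter(toy_batches)), max_steps)
for step, batch in enumerate(stream, start=1):
    seen.append(batch)             # a real loop would run forward/backward here
    if step % report_every == 0:
        print(f"Step: {step}, batch: {batch}")
```

Because `make_batches` is called again at the start of every pass, a shuffling data source would be reshuffled on each pass.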

In [5]:
num_updates = 10000
show_loss = 100

def train():
    data, word_to_idx, idx_to_word = create_dataset()
    print("finished reading dataset")
    loss_func = nn.CrossEntropyLoss()
    model = CBOW(embedding_size=100, vocab_size=len(word_to_idx))
    optimizer = optim.Adam(model.parameters(), lr=1e-4) # also try SGD instead of Adam

    context_data = torch.tensor([instance.context for instance in data])
    output = torch.tensor([instance.mask_token for instance in data])

    # create dataset using the pytorch dataloader
    dataset = torch.utils.data.TensorDataset(context_data, output)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

    step = 0
    while step < num_updates:
        for context, true_output in dataloader:
            if step >= num_updates:
                break  # stop mid-epoch once the update budget is used up
            step += 1
            logits = model(context)  # batch size x vocab size scores
            loss = loss_func(logits, true_output)
            if step % show_loss == 0:
                print(f"Step: {step}, Loss: {loss.item()}")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return model, word_to_idx, idx_to_word

In [6]:
model, word_to_idx, idx_to_word = train()

finished reading dataset
Step: 100, Loss: 10.788529396057129
Step: 200, Loss: 10.80363941192627
Step: 300, Loss: 10.650333404541016
Step: 400, Loss: 10.618274688720703
Step: 500, Loss: 10.608075141906738
Step: 600, Loss: 10.478731155395508
Step: 700, Loss: 10.36941909790039
Step: 800, Loss: 10.465622901916504
Step: 900, Loss: 10.168929100036621
Step: 1000, Loss: 10.153223037719727
Step: 1100, Loss: 9.851362228393555
Step: 1200, Loss: 10.029329299926758
Step: 1300, Loss: 9.815364837646484
Step: 1400, Loss: 10.155341148376465
Step: 1500, Loss: 9.832701683044434
Step: 1600, Loss: 9.546302795410156
Step: 1700, Loss: 9.517840385437012
Step: 1800, Loss: 9.585627555847168
Step: 1900, Loss: 9.542594909667969
Step: 2000, Loss: 9.099220275878906
Step: 2100, Loss: 8.74914836883545
Step: 2200, Loss: 9.940362930297852
Step: 2300, Loss: 9.469222068786621
Step: 2400, Loss: 8.785650253295898
Step: 2500, Loss: 8.965746879577637
Step: 2600, Loss: 9.574028968811035
Step: 2700, Loss: 8.924840927124023
Ste

### Using the model

The model once trained can be used for the word embeddings it has learned. In the following function `get_k_closest_words` we return the words with the closest cosine similarity to some word we are interested in learning about.

Because the training data is small and the number of model updates is limited, the list of similar words is not as compelling as with pre-trained word embeddings such as word2vec or GloVe.
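
Cosine similarity compares the direction of two vectors and ignores their length. The sketch below ranks words by cosine similarity to a query word, using made-up 2-dimensional vectors; `toy_vectors` and its values are assumptions for illustration and are not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional "embeddings", made up for this example.
toy_vectors = {
    "say":  [1.0, 0.0],
    "tell": [0.9, 0.1],
    "walk": [0.0, 1.0],
    "run":  [-1.0, 0.5],
}

query = "say"
# Rank every other word by its similarity to the query, most similar first.
ranked = sorted(
    (w for w in toy_vectors if w != query),
    key=lambda w: cosine(toy_vectors[query], toy_vectors[w]),
    reverse=True,
)
print(ranked)
# ['tell', 'walk', 'run']
```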

In [7]:
import torch.nn.functional as F

def get_k_closest_words(word, k=10):
    embeddings = model.embed.weight.data
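    # look up the query vector; the leading dimension lets it broadcast over all rows
    word_embedding = embeddings[word_to_idx[word]].unsqueeze(0)  # 1 x E
    similarities = F.cosine_similarity(embeddings, word_embedding, dim=1)  # one score per vocabulary word
    sorted_indices = torch.argsort(similarities, descending=True)
    # position 0 is the query word itself (similarity 1), so skip it
    top_k_idx = [i.item() for i in sorted_indices[1:k+1]]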
    top_k = [idx_to_word[i] for i in top_k_idx]
    similarity_scores = [similarities[i].item() for i in top_k_idx]
    return top_k, similarity_scores

In [8]:
get_k_closest_words('say')

(['ter.',
  'suey',
  'bosch',
  'furrowed',
  'unrepentant',
  'diddling',
  'fatal',
  'pace',
  'motherhood',
  'footpath'],
 [0.3677217364311218,
  0.35909032821655273,
  0.35837218165397644,
  0.3529747426509857,
  0.35134974122047424,
  0.34932151436805725,
  0.34758907556533813,
  0.3465083837509155,
  0.3449290692806244,
  0.3435311019420624])

## Using gensim for word2vec

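`gensim` implements both CBOW and SkipGram training for word2vec. The `sg` parameter of `Word2Vec` selects between them: the default, `sg=0`, is CBOW, and `sg=1` is SkipGram. With negative sampling, both can be trained efficiently on CPUs, because they avoid the full softmax over the vocabulary used in the PyTorch CBOW model above. Note that the `Word2Vec` calls below do not set `sg`, so they use the CBOW default.

SkipGram training with negative sampling only requires a dataset of positive and negative word pairs: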

```
(target_word, context_word) TRUE/FALSE
```

The `TRUE` cases occur in the data, with the `context_word` appearing in the context window of the `target_word`. The `FALSE` cases are constructed using "negative sampling", which means that for each `TRUE` pair we sample a few words at random from the vocabulary (word2vec uses a smoothed unigram frequency distribution) and pair them with the `target_word` as if they were context words. Since a randomly sampled word is unlikely to be a real context word, these pairs serve as negative examples.

The training data is constructed in this way so that all we need to do is train a binary classifier on this self-supervised dataset. The end result is still the "hidden" embeddings learned while training this classifier.
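
A rough sketch of how such labelled pairs could be built from one toy sentence follows. It makes several simplifying assumptions: the sentence is made up, negatives are drawn uniformly at random from the vocabulary (word2vec implementations typically draw from a smoothed unigram frequency distribution instead), and a sampled negative is not checked against the real context words.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

tokens = "the general commands the troops".split()
vocab = sorted(set(tokens))
window = 1
num_negative = 2   # negatives drawn per positive pair (an arbitrary choice)

pairs = []
for i, target in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j == i:
            continue
        pairs.append((target, tokens[j], True))    # observed in the window
        for _ in range(num_negative):
            # uniform random draw, labelled FALSE
            pairs.append((target, random.choice(vocab), False))

for p in pairs[:6]:
    print(p)
```

Each tuple has the same `(target_word, context_word) TRUE/FALSE` layout as above, so the list can be fed to a binary classifier directly.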

In [9]:
import gensim.downloader as api
from gensim.models import Word2Vec
import multiprocessing

In [10]:
sentences = brown.sents()
print("\n".join([ " ".join(s) for s in sentences[:10]]))

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .
It recommended that Fulton legislators act `` to have th

In [11]:
d = 100 # dimension of the word vectors
w2v_win2 = Word2Vec(sentences, vector_size=d, window=2, min_count=5, negative=15, epochs=10, workers=multiprocessing.cpu_count())
print("done training w2v_win2")

done training w2v_win2


In [12]:
w2v_win5 = Word2Vec(sentences, vector_size=d, window=5, min_count=5, negative=15, epochs=10, workers=multiprocessing.cpu_count())
print("done training w2v_win5")

done training w2v_win5


In [13]:
vecs = w2v_win2.wv
print(vecs.similar_by_word("Saturday")[:7])
print(vecs.similar_by_word("money")[:3])
print(vecs.similar_by_word("child")[:3])
print(vecs.similar_by_word("of")[:3])

[('Friday', 0.9059168100357056), ('Wednesday', 0.9003214240074158), ('Monday', 0.8943390846252441), ('Sunday', 0.8794779777526855), ('Tuesday', 0.8593041896820068), ('Thursday', 0.8391276001930237), ('November', 0.7774156928062439)]
[('fun', 0.7162140607833862), ('opportunity', 0.7057722210884094), ('ability', 0.7052724361419678)]
[('person', 0.810829758644104), ('patient', 0.7919821739196777), ('woman', 0.7917277812957764)]
[('plus', 0.5191760659217834), ('historical', 0.4734072983264923), ('concerning', 0.466562956571579)]


In [14]:
print(vecs.most_similar(positive=['12', '8'], negative=['5']))

[('13', 0.8990644216537476), ('21', 0.8850885033607483), ('11', 0.8766713738441467), ('17', 0.8641999363899231), ('31', 0.8622169494628906), ('22', 0.8568882942199707), ('September', 0.8564836382865906), ('9', 0.8564708828926086), ('24', 0.8557211756706238), ('16', 0.8555043339729309)]


In [58]:
model_gigaword = api.load("glove-wiki-gigaword-100")

In [59]:
for i, key in enumerate(model_gigaword.index_to_key):
    if i > 10:
        break
    print(key, model_gigaword[key])

the [-0.038194 -0.24487   0.72812  -0.39961   0.083172  0.043953 -0.39141
  0.3344   -0.57545   0.087459  0.28787  -0.06731   0.30906  -0.26384
 -0.13231  -0.20757   0.33395  -0.33848  -0.31743  -0.48336   0.1464
 -0.37304   0.34577   0.052041  0.44946  -0.46971   0.02628  -0.54155
 -0.15518  -0.14107  -0.039722  0.28277   0.14393   0.23464  -0.31021
  0.086173  0.20397   0.52624   0.17164  -0.082378 -0.71787  -0.41531
  0.20335  -0.12763   0.41367   0.55187   0.57908  -0.33477  -0.36559
 -0.54857  -0.062892  0.26584   0.30205   0.99775  -0.80481  -3.0243
  0.01254  -0.36942   2.2167    0.72201  -0.24978   0.92136   0.034514
  0.46745   1.1079   -0.19358  -0.074575  0.23353  -0.052062 -0.22044
  0.057162 -0.15806  -0.30798  -0.41625   0.37972   0.15006  -0.53212
 -0.2055   -1.2526    0.071624  0.70565   0.49744  -0.42063   0.26148
 -1.538    -0.30223  -0.073438 -0.28312   0.37104  -0.25217   0.016215
 -0.017099 -0.38984   0.87424  -0.72569  -0.51058  -0.52028  -0.1459
  0.8278    0.270

In [60]:
def print_wvec(k):
    print(k, model_gigaword[k])

print_wvec("australia")

australia [-0.51162    0.30543    0.90287   -0.2359     0.054233  -0.46955
 -0.58622    0.92663   -1.3481    -0.56388    0.16413   -0.2485
 -0.37083    0.68697   -0.010261  -0.36701    0.60389    0.26097
 -0.82303    0.3169     1.0727     0.26557    1.2684     0.40778
  0.59175    0.10629    1.046     -0.041425   0.21911    0.34432
 -0.93871    0.92486   -0.10104   -0.0258     0.62055   -0.2259
 -0.61513    0.076517  -1.3253    -0.47917   -1.536     -0.29047
  0.66274   -0.0037029  0.95209    0.35437    0.7889    -0.041788
  0.58202   -1.0529     0.087392   0.68862    0.11947    0.70129
 -0.25365   -2.2167    -0.80783   -0.75544    0.98496    0.15103
 -0.087487  -0.36205   -0.14925    0.24151    0.62069    0.38488
 -0.43094    0.68631    0.8661    -0.21534   -0.064854   0.52811
 -0.67989    0.074314   0.35535   -0.10512    0.27233   -0.15258
 -0.53039    0.11699    0.47877   -0.016485   0.41161    0.31436
 -0.2386    -0.79124   -0.28234    0.28075   -0.22179    0.60348
 -0.028632  -0.2

In [98]:
from sklearn.metrics.pairwise import cosine_similarity

def diff_wvec(a, b, c, d):
    # analogy test "a is to b as c is to d": cosine similarity between a - b + d and c
    c_vec = model_gigaword[c]
    diff_vec = model_gigaword[a] - model_gigaword[b] + model_gigaword[d]
    return cosine_similarity(diff_vec.reshape(1, -1), c_vec.reshape(1, -1))[0][0]

In [101]:
print('iraq-baghdad', diff_wvec('paris', 'france', 'baghdad', 'iraq'))
print('iraq-mosul', diff_wvec('paris', 'france', 'baghdad', 'mosul'))

iraq-baghdad 0.80485666
iraq-mosul 0.6276542


In [102]:
print('aus-sydney', diff_wvec('paris', 'france', 'sydney', 'australia'))
print('aus-canberra', diff_wvec('paris', 'france', 'canberra', 'australia'))

aus-sydney 0.8201804
aus-canberra 0.63979816


In [37]:
model_gigaword.most_similar("man", topn=10)

[('woman', 0.6998662352561951),
 ('person', 0.6443442106246948),
 ('boy', 0.620827853679657),
 ('he', 0.5926738977432251),
 ('men', 0.5819568634033203),
 ('himself', 0.5810033082962036),
 ('one', 0.5779522061347961),
 ('another', 0.5721587538719177),
 ('who', 0.5703631639480591),
 ('him', 0.5670831203460693)]

In [17]:
print(model_gigaword.most_similar(positive=['king', 'woman'], negative=['man']))

[('queen', 0.7698541283607483), ('monarch', 0.6843380331993103), ('throne', 0.6755736470222473), ('daughter', 0.6594556570053101), ('princess', 0.6520534157752991), ('prince', 0.6517034769058228), ('elizabeth', 0.6464518308639526), ('mother', 0.631171703338623), ('emperor', 0.6106470823287964), ('wife', 0.6098655462265015)]


In [18]:
print(model_gigaword.most_similar(positive=['12', '8'], negative=['5']))

[('16', 0.9776545763015747), ('14', 0.9762073755264282), ('13', 0.9723331332206726), ('17', 0.9680772423744202), ('19', 0.9613232016563416), ('22', 0.9609626531600952), ('15', 0.9585664868354797), ('21', 0.9549593925476074), ('11', 0.9540944695472717), ('23', 0.9537526965141296)]

