# word2vec

Consider a sentence:

`the general [MASK] the troops`

We want to predict the `[MASK]` token using the context words: `the`, `general`, `the`, `troops`.



In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from collections import namedtuple

## CBOW

The CBOW model for word2vec stands for Continuous Bag of Words.

We take the average (mean) of the word vectors for the context words and use it to predict the target word (the word that can be viewed as the `[MASK]` token). 

This is an example of self-supervised learning because each `[MASK]` token is created from some text and so we know the right output word in each training example.

The original model used two different embeddings: one for the target words as they appear as `[MASK]` tokens and another for the context words used to predict the target word. 

PyTorch has a simple way to go from word indices to word embeddings (aka word vectors) using the `nn.Embedding` class. The lookup is equivalent to multiplying a one-hot vector by an embedding matrix, but it is implemented as a table lookup.
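As a small check of that equivalence, the sketch below compares an `nn.Embedding` lookup with an explicit one-hot matrix product. The sizes and the indices are toy values chosen for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_size = 5, 3          # toy sizes, chosen for illustration
embed = nn.Embedding(vocab_size, embedding_size)

word_idxs = torch.tensor([0, 3, 3])        # three (hypothetical) word indices

# Direct lookup: one row of the weight matrix per index.
looked_up = embed(word_idxs)               # shape: 3 x embedding_size

# The same rows obtained via explicit one-hot vectors and a matrix product.
one_hot = F.one_hot(word_idxs, num_classes=vocab_size).float()
via_matmul = one_hot @ embed.weight        # shape: 3 x embedding_size

same = torch.allclose(looked_up, via_matmul)
print(looked_up.shape, same)
```

The lookup form avoids ever building the `vocab_size`-wide one-hot vectors, which matters once the vocabulary has tens of thousands of words.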

We create a hidden layer in this simple neural network by taking the mean of the context word embeddings. PyTorch expects you to work with a batch of training or test examples at a time, and the batch is typically the first dimension of the tensors we create.

To take the mean of the input embeddings we start with the embeddings for all the inputs as a tensor of size `B x N x E`, where `B` is the batch size, `N` is the number of context words and `E` is the embedding size. Taking the mean over dimension 1 (the `N` dimension) gives a tensor of size `B x E` directly, because `mean` drops the reduced dimension by default. Only with `keepdim=True` is the result of size `B x 1 x E`, which the PyTorch `squeeze` function can then convert to `B x E`.
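A quick shape check of the averaging step, as a sketch with toy sizes (the values of `B`, `N` and `E` here are arbitrary). The `keepdim` argument of `mean` controls whether the averaged dimension is kept with size 1.

```python
import torch

B, N, E = 2, 4, 3   # toy batch size, context size, embedding size
context_embeddings = torch.arange(B * N * E, dtype=torch.float32).reshape(B, N, E)

mean_default = context_embeddings.mean(1)                # B x E
mean_keepdim = context_embeddings.mean(1, keepdim=True)  # B x 1 x E
squeezed = mean_keepdim.squeeze(1)                       # B x E again

print(mean_default.shape, mean_keepdim.shape, squeezed.shape)
```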

The mean vector serves as the hidden representation. A single fully connected layer (`self.expand` in this implementation) maps it back up to the vocabulary size, producing one score per vocabulary word. These scores are used to predict the target word.
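The computation can be written out as follows. The notation is introduced here only for illustration: $c_1, \dots, c_N$ are the context word indices, $\mathbf{e}_{c_i} \in \mathbb{R}^{E}$ is the embedding of context word $c_i$, $W \in \mathbb{R}^{V \times E}$ and $b \in \mathbb{R}^{V}$ are the parameters of the linear layer, and $V$ is the vocabulary size.

$$
h = \frac{1}{N} \sum_{i=1}^{N} \mathbf{e}_{c_i}, \qquad
z = W h + b, \qquad
P(w \mid c_1, \dots, c_N) = \frac{\exp(z_w)}{\sum_{v=1}^{V} \exp(z_v)}
$$

Training with a cross-entropy loss on the scores $z$ minimizes $-\log P(w^{*} \mid c_1, \dots, c_N)$, where $w^{*}$ is the true target word.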

In [2]:
class CBOW(nn.Module):
    def __init__(self, embedding_size=100, vocab_size=-1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.expand = nn.Linear(embedding_size, vocab_size)

    def forward(self, inputs):
        # mean over the context dimension: B x N x E -> B x E (no squeeze needed)
        hidden = self.embed(inputs).mean(1)
        return self.expand(hidden)

### Create the dataset

The dataset creation reads the raw text data and creates the word to index and index to word dictionaries.

With a window size of 2, each training instance consists of a `context` (the two words before and the two words after the target) and a `mask_token` (the target word, which is the answer the neural network should predict to get zero loss).
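To make the instance format concrete, here is a minimal sketch on a toy sentence. The helper `make_instances` and the word `commanded` (standing in for `[MASK]`) are made up for this illustration, and the sketch keeps words rather than indices for readability.

```python
from collections import namedtuple

Instance = namedtuple('Instance', ['context', 'mask_token'])

def make_instances(tokens, window=2):
    """Pair each token that has a full window on both sides with its context."""
    instances = []
    for i in range(window, len(tokens) - window):
        # `window` words before position i plus `window` words after it
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        instances.append(Instance(context=context, mask_token=tokens[i]))
    return instances

# hypothetical toy sentence; 'commanded' plays the role of [MASK]
toy_tokens = "the general commanded the troops".split()
toy_instances = make_instances(toy_tokens)
for instance in toy_instances:
    print(instance)
```

With five tokens and a window of 2, only the middle word has a full context on both sides, so the sketch produces a single instance.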


In [3]:
from nltk.corpus import brown

brown.words()[:10]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [4]:
Instance = namedtuple('Instance', ['context', 'mask_token'])

def create_dataset():
    tokenized_text = [w.lower() for w in brown.words()]
    vocab = set(tokenized_text)
    word_to_idx = dict()
    idx_to_word = dict()

    # create word to index mapping and vice versa
    # (sorting makes the index assignment reproducible across runs)
    for i, word in enumerate(sorted(vocab)):
        word_to_idx[word] = i
        idx_to_word[i] = word

    # generate training data using two words of context before and after [MASK] 
    data = []
    for i in range(2, len(tokenized_text)-2):
        context = [
            tokenized_text[i-2],
            tokenized_text[i-1],
            tokenized_text[i+1],
            tokenized_text[i+2],
        ]
        masked_token = tokenized_text[i]
        context_idxs = [word_to_idx[w] for w in context]
        mask_token_idx = word_to_idx[masked_token]
        # data is a list of context, mask_token tuples
        data.append(Instance(context=context_idxs, mask_token=mask_token_idx))
    
    return data, word_to_idx, idx_to_word

### Training the model

Training the model is a fairly standard loop in PyTorch that doesn't change much once the model is defined and knows how to compute its forward pass. The backward pass is derived automatically by autograd from the tensor operations used in `forward`, so you don't need to write a `backward` function yourself. This also holds for a novel `nn.Module` component, as long as it is built from differentiable PyTorch tensor operations. A hand-written `backward` is only needed when you define a custom `torch.autograd.Function`, for example to wrap an operation that autograd cannot differentiate.

We are using the PyTorch `DataLoader` class to create batches of training examples for the core training loop.

Introductory training loops typically run for a number of epochs (each epoch is one pass over the training data). The loop below instead runs for a fixed number of model updates (`num_updates`) and reports the loss every `show_loss` updates.
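The update-counting pattern can be isolated from the model code. The sketch below uses a hypothetical helper, `batches_for_updates`, and string stand-ins for the batches a `DataLoader` would produce; it assumes the batch collection is non-empty.

```python
def batches_for_updates(batches, num_updates):
    """Yield (step, batch) pairs, restarting the pass over `batches` as needed.

    Assumes `batches` is non-empty and can be iterated more than once.
    """
    step = 0
    while step < num_updates:
        for batch in batches:
            if step >= num_updates:
                return
            step += 1
            yield step, batch

toy_batches = ["batch_a", "batch_b", "batch_c"]   # stand-ins for DataLoader batches
schedule = list(batches_for_updates(toy_batches, num_updates=7))
print(schedule)
```

Seven updates over three batches means two full passes plus one extra batch, so the count of updates, not the number of epochs, decides when training stops.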

In [5]:
num_updates = 10000
show_loss = 100

def train():
    data, word_to_idx, idx_to_word = create_dataset()
    print("finished reading dataset")
    loss_func = nn.CrossEntropyLoss()
    model = CBOW(embedding_size=100, vocab_size=len(word_to_idx))
    optimizer = optim.Adam(model.parameters(), lr=1e-4) # also try SGD instead of Adam

    context_data = torch.tensor([instance.context for instance in data])
    output = torch.tensor([instance.mask_token for instance in data])

    # create dataset using the pytorch dataloader
    dataset = torch.utils.data.TensorDataset(context_data, output)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

    step = 0
    while step < num_updates:
        for context, true_output in dataloader:
            # stop before doing any work once the update budget is used up
            if step >= num_updates:
                break
            step += 1
            logits = model(context)  # batch size x vocab size
            loss = loss_func(logits, true_output)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % show_loss == 0:
                print(f"Step: {step}, Loss: {loss.item()}")

    return model, word_to_idx, idx_to_word

In [6]:
model, word_to_idx, idx_to_word = train()

finished reading dataset
Step: 100, Loss: 10.779905319213867
Step: 200, Loss: 10.759285926818848
Step: 300, Loss: 10.76062297821045
Step: 400, Loss: 10.60704517364502
Step: 500, Loss: 10.522197723388672
Step: 600, Loss: 10.408906936645508
Step: 700, Loss: 10.431928634643555
Step: 800, Loss: 10.2299165725708
Step: 900, Loss: 10.154336929321289
Step: 1000, Loss: 10.144543647766113
Step: 1100, Loss: 10.317975044250488
Step: 1200, Loss: 9.921103477478027
Step: 1300, Loss: 9.951505661010742
Step: 1400, Loss: 9.974474906921387
Step: 1500, Loss: 9.831195831298828
Step: 1600, Loss: 9.706671714782715
Step: 1700, Loss: 9.94244384765625
Step: 1800, Loss: 9.624051094055176
Step: 1900, Loss: 9.440933227539062
Step: 2000, Loss: 9.238410949707031
Step: 2100, Loss: 9.599311828613281
Step: 2200, Loss: 8.912700653076172
Step: 2300, Loss: 9.152775764465332
Step: 2400, Loss: 9.381567001342773
Step: 2500, Loss: 8.859772682189941
Step: 2600, Loss: 9.069343566894531
Step: 2700, Loss: 9.21623706817627
Step: 2

### Using the model

Once trained, the model is useful for the word embeddings it has learned. The function `get_k_closest_words` below returns the `k` words whose embeddings have the highest cosine similarity to the embedding of a word we are interested in.
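Cosine similarity itself is simple to compute by hand. The sketch below uses plain Python and made-up two-dimensional vectors with placeholder names (`a` to `d`), not embeddings from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up toy vectors, for illustration only
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [2.0, 0.0],    # same direction as "a", different length
    "c": [0.0, 3.0],    # orthogonal to "a"
    "d": [-1.0, 0.0],   # opposite direction to "a"
}

def closest(word, k=2):
    """Return the k other words ranked by cosine similarity to `word`."""
    scores = {
        other: cosine_similarity(toy_vectors[word], vec)
        for other, vec in toy_vectors.items()
        if other != word
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

ranking = closest("a", k=3)
print(ranking)
```

Because the measure depends only on direction, `b` is a perfect match for `a` despite being twice as long.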

Because the training data is small and the number of updates is limited, the list of similar words is not as compelling as with pre-trained word embeddings such as word2vec or GloVe.

In [7]:
import torch.nn.functional as F

def get_k_closest_words(word, k=10):
    embeddings = model.embed.weight.data
    word_idx = torch.tensor(word_to_idx[word], dtype=torch.int)
    word_embedding = embeddings[word_idx]
    similarities = F.cosine_similarity(embeddings, word_embedding)
    sorted_indices = torch.argsort(similarities, descending=True)
    top_k_idx = [i.item() for i in sorted_indices[1:k+1]]
    top_k = [idx_to_word[i] for i in top_k_idx]
    similarity_scores = [similarities[i].item() for i in top_k_idx]
    return top_k, similarity_scores

In [8]:
get_k_closest_words('say')

  'monosyllables',
  'flowering',
  'great-grandmother',
  'tyrosine',
  'guy',
  'addict',
  '43',
  'intensifier',
  'unshelled'],
 [0.39793434739112854,
  0.3897356688976288,
  0.37338927388191223,
  0.372646301984787,
  0.3687784671783447,
  0.36492371559143066,
  0.3563559353351593,
  0.3554399013519287,
  0.3554075360298157,
  0.3512455224990845])

## Using gensim for word2vec

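`gensim` implements word2vec with both CBOW and skip-gram training, selected by the `sg` parameter of `Word2Vec` (`sg=0`, the default, is CBOW and `sg=1` is skip-gram). Combined with negative sampling, either variant trains efficiently on CPUs because it avoids the softmax over the whole vocabulary used in the PyTorch CBOW model above.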

Skip-gram training with negative sampling only requires a dataset of positive and negative word pairs:

```
(target_word, context_word) TRUE/FALSE
```

The `TRUE` cases occur in the data, with the `context_word` appearing in the context window of the `target_word`. The `FALSE` cases are constructed using "negative sampling": for each `TRUE` pair we draw a few words at random from the vocabulary (word2vec samples from the unigram distribution raised to the 3/4 power) and pair them with the `target_word`. These randomly drawn words act as distractors, since most of them do not occur in the context of the `target_word`.

The training data is constructed in this way so that all we need to do is train a binary classifier on this self-supervised dataset. The end result is still the "hidden" embeddings learned while training this classifier.
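The pair construction can be sketched in a few lines of plain Python. This is a simplified illustration, not what `gensim` does internally: the helper `skipgram_pairs` is hypothetical, the toy sentence is made up, and negatives are drawn uniformly from the vocabulary rather than from a frequency-based distribution.

```python
import random

def skipgram_pairs(tokens, window=2, num_negative=2, seed=0):
    """Build (target_word, context_word, label) triples with random negatives."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j == i:
                continue
            # positive pair: the word really occurs in the target's window
            pairs.append((target, tokens[j], True))
            # negative pairs: uniformly random words (a simplifying assumption);
            # a draw may occasionally be a true context word, tolerated here as noise
            for _ in range(num_negative):
                pairs.append((target, rng.choice(vocab), False))
    return pairs

toy_tokens = "the general commanded the troops".split()   # made-up example sentence
pairs = skipgram_pairs(toy_tokens)
num_positive = sum(1 for _, _, label in pairs if label)
num_sampled = sum(1 for _, _, label in pairs if not label)
print(num_positive, num_sampled)
```

Each positive pair is followed by `num_negative` sampled pairs, so the classifier sees a fixed ratio of `FALSE` to `TRUE` examples.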

In [9]:
import gensim.downloader as api
from gensim.models import Word2Vec
import multiprocessing

In [10]:
sentences = brown.sents()
print("\n".join([ " ".join(s) for s in sentences[:10]]))

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .
It recommended that Fulton legislators act `` to have th

In [11]:
d = 100 # dimension of the word vectors
w2v_win2 = Word2Vec(sentences, vector_size=d, window=2, min_count=5, negative=15, epochs=10, workers=multiprocessing.cpu_count())
print("done training w2v_win2")

done training w2v_win2


In [12]:
w2v_win5 = Word2Vec(sentences, vector_size=d, window=5, min_count=5, negative=15, epochs=10, workers=multiprocessing.cpu_count())
print("done training w2v_win5")

done training w2v_win5


In [13]:
vecs = w2v_win2.wv
print(vecs.similar_by_word("Saturday")[:7])
print(vecs.similar_by_word("money")[:3])
print(vecs.similar_by_word("child")[:3])
print(vecs.similar_by_word("of")[:3])

[('Friday', 0.9059444665908813), ('Wednesday', 0.8997581601142883), ('Monday', 0.8954061269760132), ('Sunday', 0.8842747807502747), ('Tuesday', 0.8583642244338989), ('Thursday', 0.8384281992912292), ('afternoon', 0.7763940095901489)]
[('fun', 0.7283743619918823), ('ability', 0.7089091539382935), ('opportunity', 0.7075211405754089)]
[('person', 0.803625762462616), ('woman', 0.7930218577384949), ('patient', 0.7916847467422485)]
[('plus', 0.5232049822807312), ('historical', 0.47552353143692017), ('concerning', 0.4683312773704529)]


In [14]:
print(vecs.most_similar(positive=['12', '8'], negative=['5']))

[('13', 0.9029800891876221), ('21', 0.880962610244751), ('11', 0.8787495493888855), ('31', 0.8624194264411926), ('17', 0.8619294166564941), ('16', 0.8615667223930359), ('22', 0.8604256510734558), ('24', 0.8600680828094482), ('9', 0.8574585914611816), ('18', 0.856285035610199)]


In [15]:
model_gigaword = api.load("glove-wiki-gigaword-100")

In [16]:
model_gigaword.most_similar("man", topn=10)

[('woman', 0.832349419593811),
 ('boy', 0.7914870977401733),
 ('one', 0.7788748741149902),
 ('person', 0.7526816725730896),
 ('another', 0.752223551273346),
 ('old', 0.7409117221832275),
 ('life', 0.7371697425842285),
 ('father', 0.7370322942733765),
 ('turned', 0.7347694635391235),
 ('who', 0.734551191329956)]

In [17]:
print(model_gigaword.most_similar(positive=['king', 'woman'], negative=['man']))

[('queen', 0.7698541283607483), ('monarch', 0.6843380331993103), ('throne', 0.6755736470222473), ('daughter', 0.6594556570053101), ('princess', 0.6520534157752991), ('prince', 0.6517034769058228), ('elizabeth', 0.6464518308639526), ('mother', 0.631171703338623), ('emperor', 0.6106470823287964), ('wife', 0.6098655462265015)]


In [18]:
print(model_gigaword.most_similar(positive=['12', '8'], negative=['5']))

[('16', 0.9776545763015747), ('14', 0.9762073755264282), ('13', 0.9723331332206726), ('17', 0.9680772423744202), ('19', 0.9613232016563416), ('22', 0.9609626531600952), ('15', 0.9585664868354797), ('21', 0.9549593925476074), ('11', 0.9540944695472717), ('23', 0.9537526965141296)]

