<img src='data/images/section-notebook-header.png' />

# Word2Vec: CBOW

CBOW (Continuous Bag-of-Words) is a model for training word embeddings in the Word2Vec framework. CBOW aims to predict a target word based on its context words. It is a "bag-of-words" approach because the context words are treated as an unordered set. In CBOW, the model architecture consists of a hidden layer that represents the word embeddings and an output layer that predicts the target word. The input to the model is a set of context words, and the output is the target word. The image below is taken from the lecture slides and shows the basic setup and intuition behind CBOW.

<img src='data/images/lecture-slide-08.png' width='80%' />

This example assumes a window size of 2. This means that the context of a given center word consists of the 2 words before and the 2 words after the center word. In the image above, these 4 context words are *"watching"*, *"funny"*, *"on"*, and *"netflix"* -- keep in mind that the order does not matter. On the right, the image shows some example words, with the color indicating which word is intuitively the most likely center word (green = high probability; red = low probability). Of course, the actual most likely word will depend on the training data; here it is only about the basic intuition behind CBOW.
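
To make the notion of a context window concrete, here is a minimal sketch that extracts (context, center) pairs from a tokenized sentence. The helper `context_windows()` and the example sentence (including the center word *"movies"*) are made up for illustration; the Data Preprocessing notebook may handle details such as sentence boundaries differently.

```python
def context_windows(tokens, m=2):
    """Yield (context, center) pairs for every position with a full window."""
    for i in range(m, len(tokens) - m):
        # m words before and m words after the center word
        context = tokens[i - m:i] + tokens[i + 1:i + m + 1]
        yield context, tokens[i]

# Hypothetical example sentence (not taken from the dataset)
tokens = "i like watching funny movies on netflix".split()
pairs = list(context_windows(tokens, m=2))

for context, center in pairs:
    print(context, "->", center)
```

With a window size of 2, only positions that have 2 words on both sides produce a sample in this sketch, so the 7-word sentence yields 3 pairs.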

In this notebook, we will train a CBOW model from scratch. Since we already prepared the data in the accompanying notebook, there's actually not much more to do. We implement and train this model using PyTorch. The model should train with or without a GPU, although having a GPU significantly speeds up the process. Note that the goal here is not to maximize accuracy but to convey the basic idea behind CBOW.

## Setting up the Notebook

### Import Required Packages

In [None]:
import numpy as np
from tqdm import tqdm

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

We utilize some utility methods from PyTorch, so we need to import the `torch` package.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

We also need the PyTorch implementation of the CBOW model. While the code is very short, having the implementation in separate files makes it easier to re-use.

In [None]:
from src.word2vec import CBOW
from src.utils import tsne_plot_similar_words

### Checking/Setting the Computation Device

PyTorch allows you to train neural networks on supported GPUs to significantly speed up the training process. If you have a supported GPU, feel free to utilize it.

In [None]:
use_cuda = torch.cuda.is_available()

# Use this line below to enforce the use of the CPU 
#use_cuda = False

device = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(device))

---

## Load all the Data

### Load Vocabulary

In the Data Preprocessing notebook we created the vocabulary to map from words to their indices, and vice versa. We naturally need this vocabulary.

In [None]:
vocabulary = torch.load('data/corpora/imdb-reviews/vectorized-word2vec/imdb-word2vec-20000.vocab')

vocab_size = len(vocabulary)

print('Size of vocabulary:\t{}'.format(vocab_size))

### Load Dataset

Of course, we need the training data. Recall that each data sample is an array of word indices, not the words themselves. Depending on your size $m$ for the context (cf. Data Preprocessing notebook), a data sample contains $(2m + 1)$ indices, where the first $2m$ indices represent the context words and the last index represents the center word.

In [None]:
data = np.load('data/corpora/imdb-reviews/vectorized-word2vec/imdb-dataset-cbow.npy')

num_samples, num_indices = data.shape

print('Number of samples: {}'.format(num_samples))

### Split Dataset into Inputs & Targets

The input features `X` are the contexts (i.e., the first $2m$ entries), and the targets are the last entry in each data sample array. We also directly convert the Numpy arrays into PyTorch tensors to serve as input for the CBOW model.

In [None]:
# Convert directly to integer tensors (avoids a detour via 32-bit floats)
X = torch.tensor(data[:, :-1], dtype=torch.long)
y = torch.tensor(data[:, -1], dtype=torch.long)

print(X.shape)
print(y.shape)
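
As a quick illustration of the sample layout, the sketch below uses a tiny made-up array (the index values are arbitrary) and assumes a context size of $m = 2$. It also shows that the context size can be recovered from the number of columns.

```python
import numpy as np

# Toy data: each row is [2m context indices..., center index], here with m = 2
toy = np.array([
    [11, 12, 14, 15, 13],
    [12, 13, 15, 16, 14],
])

# Same slicing as for the real dataset
contexts, centers = toy[:, :-1], toy[:, -1]

# Each sample has 2m + 1 entries, so m = (number of columns - 1) / 2
inferred_m = (toy.shape[1] - 1) // 2

print(contexts.shape, centers.shape, inferred_m)
```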

### Create `Dataset` and `DataLoader`

PyTorch comes with different `Dataset` classes and a `DataLoader` class that make working with batches of different sizes very easy.

In [None]:
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
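
To see what the `DataLoader` actually yields, here is a small sketch with random toy tensors standing in for `X` and `y` (the sizes and the names `toy_X`, `toy_y`, and `toy_loader` are arbitrary choices for illustration).

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy stand-ins for X and y: 10 samples with 4 context indices each
toy_X = torch.randint(0, 100, (10, 4))
toy_y = torch.randint(0, 100, (10,))

toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=4, shuffle=True)

# Each iteration yields one batch of inputs and the matching targets
batch_shapes = [(xb.shape, yb.shape) for xb, yb in toy_loader]

for x_shape, y_shape in batch_shapes:
    print(x_shape, y_shape)
```

Since 10 is not a multiple of 4, the last batch contains only 2 samples. If that is undesirable, `DataLoader` accepts `drop_last=True`.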

## Create and Train CBOW Model

### Create Model

CBOW belongs to the family of shallow neural network models and is used to represent words as dense vectors in a continuous vector space. CBOW aims to predict a target word based on the context words surrounding it. During training, CBOW constructs a hidden layer that represents the aggregated information from the context words. This hidden layer acts as a continuous vector representation of the context. The figure below is taken from the lecture slides and visualizes the basic shallow architecture of CBOW by means of an example input and output.

<img src='data/images/lecture-slide-09.png' width='80%' />
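
The aggregation and prediction steps can also be written down compactly. The following is a sketch under the assumption that the context vectors are averaged (summing them is an equally common choice). It uses $u_w$ for the row of the input embedding matrix $U$ and $v_w$ for the row of the output matrix $V$ belonging to word $w$, matching the matrix names used later in this notebook; $\mathcal{W}$ denotes the vocabulary, $m$ the context size, and $w_c$ the center word.

$$
h = \frac{1}{2m} \sum_{\substack{-m \le j \le m \\ j \ne 0}} u_{w_{c+j}}, \qquad
P(w_c \mid \text{context}) = \frac{\exp(v_{w_c}^\top h)}{\sum_{w \in \mathcal{W}} \exp(v_w^\top h)}, \qquad
\mathcal{L} = -\log P(w_c \mid \text{context})
$$

Minimizing $\mathcal{L}$ over all training samples is the same as maximizing the probability of the correct center word given its context.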

The word embeddings are learned by updating the weights of the neural network using backpropagation and gradient descent. The code for the CBOW model can be found in `src/word2vec.py` (the module imported above). Have a look at how simple the model is. It directly implements the model visualized in the image above -- with some very minor tweaks to improve the training. As the size of the word embeddings we go with 300 by default -- feel free to change it -- as it is the common embedding size of pretrained Word2Vec models you can download.

In [None]:
embed_dim = 300

# Create model
model = CBOW(vocab_size, embed_dim)
# Define loss function
criterion = nn.CrossEntropyLoss()
# Define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Move the model to GPU, if available (by default it "stays" on the CPU)
model.to(device)
# Print model
print(model)
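
If you do not have the source file at hand, the following is a minimal sketch of what such a model could look like. It is **not** the implementation from the `src` package: the class name `CBOWSketch` is made up, the context vectors are assumed to be averaged, and the "minor tweaks" mentioned above are left out. Only the attribute name `U` for the input embeddings is taken from the way the model is used later in this notebook.

```python
import torch
import torch.nn as nn

class CBOWSketch(nn.Module):
    """Hypothetical minimal CBOW model; not the code from the `src` package."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # U: input embeddings (one row per word), V: output projection
        self.U = nn.Embedding(vocab_size, embed_dim)
        self.V = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, contexts):
        # contexts: (batch_size, 2m) word indices
        h = self.U(contexts).mean(dim=1)   # (batch_size, embed_dim)
        return self.V(h)                   # (batch_size, vocab_size) logits

# Tiny example: vocabulary of 50 words, 8-dimensional embeddings
sketch = CBOWSketch(vocab_size=50, embed_dim=8)
sketch_logits = sketch(torch.randint(0, 50, (3, 4)))

print(sketch_logits.shape)
```

The model returns raw scores (logits), one per vocabulary word and sample, which is the format `nn.CrossEntropyLoss` expects.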

### Train Model

The code cell below shows the most basic structure for training a model. The outer loop determines how many epochs we want to train. An epoch describes the processing of all data samples in the dataset. For each epoch, we then loop over the dataset in the form of batches (this is where the `DataLoader` comes in handy). Instead of (Mini-Batch) Gradient Descent, we use the more sophisticated `Adam` optimizer, but feel free to change it to plain Stochastic Gradient Descent (`optim.SGD`), which PyTorch also provides. The loss function -- often called "criterion" in PyTorch lingo -- is the Cross-Entropy Loss. Note that the model has **no Softmax layer**, as this is handled by the `CrossEntropyLoss` class of PyTorch. There is nothing special about it, but it does save 1 or 2 lines of code.

In [None]:
num_epochs = 10

for epoch in range(num_epochs):
    
    epoch_loss = 0.0
    
    # Use `targets` as the loop variable so the global tensor `y` is not overwritten
    for contexts, targets in tqdm(dataloader):
        # Move current batch to GPU, if available
        contexts, targets = contexts.to(device), targets.to(device)
        
        # Calculate output of the model
        logits = model(contexts)
        
        # Calculate loss
        loss = criterion(logits, targets)
        
        # Reset the gradients from previous iteration
        model.zero_grad()
        
        # Calculate new Gradients using backpropagation
        loss.backward()

        # Update all trainable parameters (i.e., the theta values of the model)
        optimizer.step()

        # Keep track of the overall loss of the epoch
        epoch_loss += loss.item()
            
    print('[Epoch {}] Loss: {}'.format((epoch+1), epoch_loss))
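
The remark about the missing Softmax layer can be checked with a small sketch: `nn.CrossEntropyLoss` applied to raw logits gives the same value as applying Log-Softmax first and then the negative log-likelihood loss. The tensors below are random toy values, not model outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

toy_logits = torch.randn(4, 10)           # raw scores for 4 samples and 10 "words"
toy_targets = torch.tensor([1, 0, 7, 3])  # index of the correct word per sample

# Variant 1: cross-entropy directly on the logits
loss_ce = nn.CrossEntropyLoss()(toy_logits, toy_targets)

# Variant 2: explicit Log-Softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(toy_logits, dim=1), toy_targets)

print(torch.allclose(loss_ce, loss_manual))
```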

### Save/Load Model

As retraining the model all the time can be tedious, we can save and load our model.

In [None]:
action = 'save'
#action = 'load'
#action = 'none'

if action == 'save':
    torch.save(model.state_dict(), 'data/models/word2vec/model-cbow.pt')
elif action == 'load':
    model = CBOW(vocab_size, embed_dim)
    model.to(device)
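    # map_location ensures a model saved on a GPU can also be loaded on a CPU-only machine
    model.load_state_dict(torch.load('data/models/word2vec/model-cbow.pt', map_location=device))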
else:
    pass
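
The save/load round trip can be tried out in isolation with a sketch like the one below. It uses a small `nn.Embedding` layer as a stand-in for the CBOW model and a temporary directory instead of the path above; both are choices made only for this illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Embedding(10, 4)   # stand-in for the CBOW model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-model.pt")
    torch.save(toy_model.state_dict(), path)

    # A fresh instance has different random weights until the state is loaded
    restored = nn.Embedding(10, 4)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

same = torch.equal(toy_model.weight, restored.weight)
print(same)
```

Note that only the parameters are stored, so the model has to be created with the same sizes before the saved state can be loaded into it.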

---

## Visualization

The following code is purely to visualize the results. Of course, depending on how much of the training data you used and how long you have trained your model, the resulting plots might differ greatly from the ones in the lecture slides.

### Auxiliary Method

The method `get_most_similar()` below returns for a given word the k-most similar words w.r.t. the word embeddings. Note that we only use matrix `U` for the word embeddings, and completely ignore matrix `V`, just to keep it simple.

In [None]:
def get_most_similar(word, k=5):
    # Get the index for the input word
    idx = vocabulary.lookup_indices([word])[0]
    # Get the word vector of the input word
    reference = model.U.weight[idx]
    # Calculate the cosine similarities between the input word vector and all word vectors
    dist = F.cosine_similarity(model.U.weight, reference)
    # Sort the similarities in descending order and keep the top-k most similar word vectors
    # Note that the top-k contains the input word vector itself, which is fine here
    index_sorted = torch.argsort(dist, descending=True)
    indices = index_sorted[:k]
    # Convert the top-k nearest word vectors into their corresponding words
    return [ vocabulary.lookup_token(n.item()) for n in indices ]    
    
#Example
get_most_similar('music')
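
The same idea on a tiny, hand-made embedding matrix makes it easy to verify the result by eye. The words and vectors below are invented for illustration, and `torch.topk` is used here as a shortcut for sorting and slicing.

```python
import torch
import torch.nn.functional as F

# Toy embedding matrix: 4 "words" in 3 dimensions (values made up)
toy_words = ["music", "song", "tune", "car"]
E = torch.tensor([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 1.0, 0.0],
])

query = E[0]  # vector for "music"

# Compare the query against every row; unsqueeze makes the broadcast explicit
sims = F.cosine_similarity(E, query.unsqueeze(0), dim=1)

# Indices of the 3 largest similarities (includes the query word itself)
top = torch.topk(sims, k=3)
neighbors = [toy_words[i] for i in top.indices.tolist()]

print(neighbors)
```

The vector for *"car"* is orthogonal to the query, so its cosine similarity is 0 and it does not make it into the top 3.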

### Visualization of Results

We start by creating a list of seed words. For each seed word, we will get the top-k nearest words and later show them together in a 2d plot (see below). Feel free to change the list of seed words. Just note that each seed word and its resulting cluster will be assigned its own color. So the more seed words you use, the harder it becomes to tell some of the colors apart in the final plot. You might also want to ensure that the seed words themselves are not semantically very similar.

In [None]:
seed_words = ['movie', 'actor', 'scene', 'music', 'dvd', 'story', 'horror', 'funny', 'laugh', 'love', 'director']

#### Create Word Embedding Clusters

Here, a cluster is simply the seed word and all its top-k nearest words. This helps us later to plot each cluster in a different color.

In [None]:
clusters = {}

embedding_clusters = []
word_clusters = []

for word in seed_words:
    embeddings = []
    words = []
    for neighbor in get_most_similar(word):
        words.append(neighbor)
        embeddings.append(model.U.weight[vocabulary.lookup_indices([neighbor])[0]].detach().cpu().numpy())
    embedding_clusters.append(embeddings)
    word_clusters.append(words)
    
embedding_clusters = np.array(embedding_clusters)

#### Dimensionality Reduction

Our word embeddings are of size 300 (by default). This makes plotting them a bit tricky :). We therefore use a dimensionality reduction technique called t-SNE to map the word embeddings from the 300d space to a 2d space. A deeper discussion of t-SNE is beyond our scope here, but feel free to explore for yourself how t-SNE works.

In [None]:
%%time

n, m, k = embedding_clusters.shape

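# Note: recent scikit-learn releases rename `n_iter` to `max_iter`; adjust the argument if you get an error
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, n_iter=3500, random_state=32)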
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)
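
The reshaping around the dimensionality reduction is easy to get wrong, so here is a sketch of just that bookkeeping on random toy data. To keep it fast and free of extra dependencies, it uses a plain PCA via SVD as a stand-in for t-SNE; the resulting coordinates therefore differ from what t-SNE would produce, but the shapes are handled in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 3, 5, 16                       # clusters, words per cluster, embedding size
toy_clusters = rng.normal(size=(n, m, k))

# Flatten to one row per word, as expected by dimensionality reduction methods
flat = toy_clusters.reshape(n * m, k)

# Stand-in for t-SNE: project onto the first two principal components
centered = flat - flat.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
flat_2d = centered @ vt[:2].T            # shape (n * m, 2)

# Restore the cluster structure: one block of m points per seed word
toy_2d = flat_2d.reshape(n, m, 2)

print(toy_2d.shape)
```

Because `reshape` keeps the row order, the point at `toy_2d[i, j]` still belongs to the `j`-th word of the `i`-th cluster.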

#### Plot Results

Lastly, the method `tsne_plot_similar_words()` implemented in the file `src/utils.py` plots our clusters of word embeddings that are now all in the 2d space. Again, the results very much depend on how much training data you used and how long you trained the model.

In [None]:
tsne_plot_similar_words('', seed_words, embeddings_en_2d, word_clusters, 1.0)

Assuming you have trained over the complete dataset for at least 10 epochs, the plot above should show intuitive results where words with embeddings of the same color are indeed semantically related (e.g., the region/cluster containing the words *"music"*, *"tune"*, *"soundtrack"*, etc.). In general, the longer the training, the better the results, but even a few epochs should suffice here to get meaningful word embeddings.

---

## Summary

CBOW (Continuous Bag-of-Words) is a popular word embedding technique used in NLP. It belongs to the family of shallow neural network models and is used to represent words as dense vectors in a continuous vector space. CBOW aims to predict a target word based on the context words surrounding it. In CBOW, the training process involves creating a sliding window over a text corpus. The window size determines the number of context words considered. The model takes the context words as input and predicts the target word. This prediction is based on the learned word embeddings and the weights of the neural network. The objective of CBOW is to maximize the probability of predicting the correct target word given the context words.

During training, CBOW constructs a hidden layer that represents the aggregated information from the context words. This hidden layer acts as a continuous vector representation of the context. The word embeddings are learned by updating the weights of the neural network using backpropagation and gradient descent. CBOW has several advantages. It is computationally cheaper than Skip-gram, which treats each word as a center word and predicts each of its context words separately, whereas CBOW makes a single prediction per context window. CBOW is also known to work well for frequent words. However, it may struggle with rare words, whose vectors receive few updates and are averaged together with those of their neighbors, and with words that have multiple meanings, since every word is assigned a single vector regardless of its sense.

In summary, CBOW is a shallow neural network model used for creating word embeddings. It predicts a target word based on the context words, learns dense vector representations for words, and aims to maximize the probability of correct predictions. CBOW is computationally efficient and works well for frequent words, but may not capture the nuances of rare or polysemous words. After completing this notebook, you now know how to train word embeddings using CBOW yourself.