<img src='data/images/section-notebook-header.png' />

# Word2Vec: Skip-gram

Skip-gram is a model for training word embeddings in the Word2Vec framework. Unlike the CBOW model, which predicts a target word from its context words, the Skip-gram model predicts the context words given a target (center) word. As with CBOW, training slides a window over a text corpus; the objective is to maximize the probability of the correct context words given the target word. The image below, taken from the lecture slides, shows the basic setup and intuition behind Skip-gram.

<img src='data/images/lecture-slide-10.png' width='80%' />

This example assumes a window size of 2. This means that given a center word (here: *"movies"*), we want to predict 4 context words, 2 before and 2 after the center word. On the right, the image shows some example words, with the color indicating which words are intuitively the most likely context words (green = high probability; red = low probability). Of course, the actual most likely words will depend on the training data; here it is only about the basic intuition behind Skip-gram.
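
To make the window idea concrete, here is a minimal sketch of how (center, context) pairs could be generated for a window size of 2. The function name `skipgram_pairs` and the toy sentence are assumptions for illustration only; the actual pair generation lives in the Data Preprocessing notebook and may differ, for example in how it treats sentence boundaries.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return all (center, context) pairs for a list of tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, not taken from the IMDb corpus
sentence = ["i", "like", "watching", "movies", "with", "my", "friends"]
pairs = skipgram_pairs(sentence, window_size=2)

# Pairs generated for the center word "movies": 2 words before, 2 words after
movie_pairs = [p for p in pairs if p[0] == "movies"]
print(movie_pairs)
```

Words near the start or end of the sentence produce fewer than 4 pairs in this sketch because the window is clipped there.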

In this notebook, we will train a Skip-gram model from scratch. Since we already prepared the data in the accompanying notebook, there's actually not much more to do. We implement and train this model using PyTorch. The model trains with or without a GPU, although a GPU significantly speeds up the process. Here, we care less about accuracy than about the basic idea behind Skip-gram.

## Setting up the Notebook

### Import Required Packages

In [None]:
import numpy as np
from tqdm import tqdm

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

We use several modules and utility classes from PyTorch, so we need to import the `torch` package.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

We also need the PyTorch implementation of the Skip-gram model. While the code is very short, having the implementation in separate files makes it easier to re-use.

In [None]:
from src.word2vec import Skipgram
from src.utils import tsne_plot_similar_words

### Checking/Setting the Computation Device

PyTorch allows training neural networks on supported GPUs to significantly speed up the training process. If you have a supported GPU, feel free to utilize it.

In [None]:
use_cuda = torch.cuda.is_available()

# Use this line below to enforce the use of the CPU 
#use_cuda = False

device = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(device))

---

## Load all the Data

### Load Vocabulary

In the Data Preprocessing notebook we created the vocabulary to map from words to their indices, and vice versa. We naturally need this vocabulary.

In [None]:
vocabulary = torch.load('data/corpora/imdb-reviews/vectorized-word2vec/imdb-word2vec-20000.vocab')

vocab_size = len(vocabulary)

print('Size of vocabulary:\t{}'.format(vocab_size))

### Load Dataset

Of course, we need the training data. For a context size $m$ (cf. Data Preprocessing notebook), there are $2m$ (center_word, context_word) pairs for each center word, one for each word in its context window.

In [None]:
data = np.load('data/corpora/imdb-reviews/vectorized-word2vec/imdb-dataset-skipgram.npy')

num_samples, num_indices = data.shape

print('Number of samples: {}'.format(num_samples))

### Split Dataset into Inputs & Targets

Since Skip-gram predicts a context word from the center word, the input features `X` are the center word indices, and the targets `y` are the context word indices. We also directly convert the NumPy arrays into PyTorch tensors to serve as input for the Skip-gram model.

In [None]:
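# The first column holds the input word index, the last column the target word index.
# Creating integer tensors directly avoids the detour via float tensors.
X = torch.tensor(data[:, 0], dtype=torch.long)
y = torch.tensor(data[:, -1], dtype=torch.long)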

print(X.shape)
print(y.shape)

### Create `Dataset` and `DataLoader`

PyTorch comes with different `Dataset` classes and a `DataLoader` class that make working with batches of different sizes very easy.

In [None]:
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=1024, shuffle=True)
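
As a small illustration of what the `DataLoader` yields, the sketch below iterates over a toy dataset. The tensors `X_toy` and `y_toy` and the batch size of 4 are made up for this example; the notebook itself uses the real data with a batch size of 1024.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical toy data: 10 (input, target) index pairs
X_toy = torch.arange(10)
y_toy = torch.arange(10) + 100

toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=4, shuffle=False)

batch_sizes = []
for x_batch, y_batch in toy_loader:
    # Each batch is a pair of tensors with the same first dimension
    batch_sizes.append(len(x_batch))

# The last batch only holds the samples that are left over
print(batch_sizes)
```

With 10 samples and a batch size of 4, the final batch is smaller than the others; pass `drop_last=True` to the `DataLoader` if all batches must have the same size.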

## Create and Train Skip-gram Model

### Create Model

Like CBOW, the Skip-gram model uses a shallow neural network with a single hidden layer. The input layer represents the target word, and the output layer represents the context words. The hidden layer acts as a continuous vector representation of the target word and is known as the word embedding layer. The word embeddings are learned by updating the weights of the neural network using backpropagation and gradient descent. The figure below, taken from the lecture slides, visualizes the basic shallow architecture of Skip-gram by means of an example input and output.

<img src='data/images/lecture-slide-11.png' width='80%' />


In general, the Skip-gram model can be computationally more expensive compared to CBOW since it needs to predict multiple context words for each target word. It also requires a larger amount of training data to achieve robust word embeddings.
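
For reference, the training objective described above is commonly written as shown below. The notation is an assumption chosen to match the matrices `U` and `V` used later in this notebook: $u_c$ is the embedding of center word $c$ (a row of `U`), $v_o$ is the output vector of context word $o$ (from `V`), $T$ is the number of words in the corpus, $m$ is the window size, and $\mathcal{W}$ is the vocabulary.

$$
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t),
\qquad
P(o \mid c) = \frac{\exp(v_o^\top u_c)}{\sum_{w \in \mathcal{W}} \exp(v_w^\top u_c)}
$$

The inner sum runs over the $2m$ context positions of each center word, and the denominator sums over the whole vocabulary, which is where much of the computational cost comes from.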

The `Skipgram` class is imported from the `src.word2vec` module (see the imports above). Have a look at how simple the model is. It directly implements the model the way we introduced it in the lectures -- with some very minor tweaks to improve the training. As the size of the word embeddings, we go with 300 by default -- feel free to change it -- as it is the common embedding size of pretrained Word2Vec models you can download.

In [None]:
embed_dim = 300

# Create model
model = Skipgram(vocab_size, embed_dim)
# Define loss function
criterion = nn.CrossEntropyLoss()
# Define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Move the model to the GPU, if available (by default it "stays" on the CPU)
model.to(device)
# Print model
print(model)
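
For readers who do not have the source file at hand, the following is a minimal sketch of what such a model might look like. It is not the actual implementation: the class name `SkipgramSketch` is hypothetical, and the layer types are assumptions. Only the attribute name `U` for the input embeddings and the existence of a second matrix `V` are taken from this notebook.

```python
import torch
import torch.nn as nn


class SkipgramSketch(nn.Module):
    """Hypothetical stand-in for the Skipgram class; not the actual implementation."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # U: input (center word) embeddings, one row per vocabulary entry
        self.U = nn.Embedding(vocab_size, embed_dim)
        # V: output (context word) weights, mapping an embedding to vocabulary scores
        self.V = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, center_idx):
        # (batch,) -> (batch, embed_dim)
        hidden = self.U(center_idx)
        # (batch, embed_dim) -> (batch, vocab_size); raw logits, no softmax
        return self.V(hidden)


sketch = SkipgramSketch(vocab_size=50, embed_dim=8)
logits = sketch(torch.tensor([3, 7, 11]))
print(logits.shape)
```

In this layout, `U.weight` has shape `(vocab_size, embed_dim)`, so row `i` is the embedding of the word with index `i`.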

### Train Model

The code cell below shows the most basic structure for training a model -- basically identical to the one for CBOW. The outer loop determines how many epochs we want to train. An epoch describes the processing of all data samples in the dataset. For each epoch, we then loop over the dataset in the form of batches (this is where the `DataLoader` comes in handy). Instead of (Mini-Batch) Gradient Descent, we use the more sophisticated `Adam` optimizer, but feel free to change it to Gradient Descent, which PyTorch also provides (`optim.SGD`). The loss function -- often called "criterion" in PyTorch lingo -- is the Cross-Entropy Loss. Note that the model has **no Softmax layer**, as this is handled by the `CrossEntropyLoss` class of PyTorch, which expects the raw, unnormalized scores (logits) as input. There is nothing special about it, but it does save 1 or 2 lines of code.

In [None]:
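# Number of training epochs; with 0, the loop below is skipped entirely
# (e.g., when a saved model is loaded further down). Increase this value to actually train.
num_epochs = 0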

for epoch in range(num_epochs):
    
    epoch_loss = 0.0
    
    for idx, (x, y) in enumerate(tqdm(dataloader)):
        # Move current batch to GPU, if available
        x, y = x.to(device), y.to(device)
        
        # Calculate output of the model
        logits = model(x)
        
        # Calculate loss
        loss = criterion(logits, y)
        
        # Reset the gradients from previous iteration
        model.zero_grad()
        
        # Calculate new Gradients using backpropagation
        loss.backward()

        #nn.utils.clip_grad_norm_(model.parameters(), 1)
        
        # Update all trainable parameters (i.e., the theta values of the model)
        optimizer.step()

        # Keep track of the overall loss of the epoch
        epoch_loss += loss.item()
            
    print('[Epoch {}] Loss: {}'.format((epoch+1), epoch_loss))
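
To illustrate why no Softmax layer is needed in the model, the sketch below computes the cross-entropy directly from raw scores using NumPy. The function name `cross_entropy_from_logits` and the toy values are assumptions for illustration; the sketch mirrors the idea of averaging the negative log-probability of the correct class over the batch, not PyTorch's internal implementation.

```python
import numpy as np


def cross_entropy_from_logits(logits, targets):
    """Mean cross-entropy computed directly from raw scores (no softmax layer needed)."""
    # Subtract the row-wise maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    # Log-softmax: log(exp(z_i) / sum_j exp(z_j))
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class for each sample
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()


# Hypothetical toy batch: 2 samples, vocabulary of 3 words
toy_logits = np.array([[2.0, 0.5, -1.0],
                       [0.0, 0.0, 0.0]])
toy_targets = np.array([0, 2])

loss_value = cross_entropy_from_logits(toy_logits, toy_targets)
print(loss_value)
```

The second sample has identical scores for all 3 words, so its contribution is $\log 3$, the loss of a uniform guess.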

### Save/Load Model

As retraining the model all the time can be tedious, we can save and load our model.

In [None]:
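# Choose exactly one action by (un)commenting the respective line
#action = 'save'
action = 'load'
#action = 'none'

if action == 'save':
    torch.save(model.state_dict(), 'data/models/word2vec/model-skipgram.pt')
elif action == 'load':
    model = Skipgram(vocab_size, embed_dim)
    model.to(device)
    # map_location ensures that a model saved on a GPU can also be loaded on a CPU-only machine
    model.load_state_dict(torch.load('data/models/word2vec/model-skipgram.pt', map_location=device))
else:
    pass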

---

## Visualization

The following code is purely to visualize the results. Of course, depending on how much of the training data you used and how long you have trained your model, the resulting plots might differ greatly from the ones in the lecture slides.

### Auxiliary Method

The method `get_most_similar()` below returns the k most similar words for a given word, measured by the cosine similarity between their word embeddings. Note that we only use matrix `U` for the word embeddings, and completely ignore matrix `V`, just to keep it simple.

In [None]:
def get_most_similar(word, k=5):
    # Get the index for the input word
    idx = vocabulary.lookup_indices([word])[0]
    # Get the word vector of the input word
    reference = model.U.weight[idx]
    # Calculate the cosine similarities between the input word vector and all word vectors
    sims = F.cosine_similarity(model.U.weight, reference.unsqueeze(0), dim=1)
    # Sort the similarities in descending order and keep the top-k most similar word vectors
    # Note that the top-k contains the input word vector itself, which is fine here
    index_sorted = torch.argsort(sims, descending=True)
    indices = index_sorted[:k]
    # Convert the top-k nearest word vectors into their corresponding words
    return [ vocabulary.lookup_token(n.item()) for n in indices ]    
    
#Example
get_most_similar('dvd')
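
The same top-k lookup can be sketched without a trained model. The NumPy example below uses a tiny hand-made embedding matrix; the function name `top_k_cosine` and the toy vectors are assumptions for illustration only.

```python
import numpy as np


def top_k_cosine(embeddings, idx, k=3):
    """Indices of the k rows most similar to row `idx` by cosine similarity."""
    # Normalize each row to unit length so that a dot product equals the cosine
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sims = unit @ unit[idx]
    # Sort in descending order; the query row itself comes first
    return np.argsort(-sims)[:k]


# Hypothetical 2d "embeddings" for 4 words
toy_embeddings = np.array([
    [1.0, 0.0],   # word 0
    [0.9, 0.1],   # word 1: nearly the same direction as word 0
    [0.0, 1.0],   # word 2: orthogonal to word 0
    [-1.0, 0.0],  # word 3: opposite direction
])

print(top_k_cosine(toy_embeddings, idx=0, k=3))
```

Cosine similarity only depends on the direction of the vectors, not on their length, which is why the rows are normalized first.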

### Visualization of Results

We start by creating a list of seed words. For each seed word, we will get the top-k nearest words and later show them together in a 2d plot (see below). Feel free to change the list of seed words. Just note that each seed word and its resulting cluster will be assigned its own color, so the more seed words you use, the harder it becomes to tell some of the colors apart in the final plot. You might also want to ensure that the seed words themselves are not semantically very similar.

In [None]:
seed_words = ['movie', 'actor', 'scene', 'music', 'dvd', 'story', 'horror', 'funny', 'laugh']

#### Create Word Embedding Clusters

Here, a cluster is simply the seed word and all its top-k nearest words. This helps us later to plot each cluster in a different color.

In [None]:
clusters = {}

embedding_clusters = []
word_clusters = []

for word in seed_words:
    embeddings = []
    words = []
    for neighbor in get_most_similar(word):
        words.append(neighbor)
        embeddings.append(model.U.weight[vocabulary.lookup_indices([neighbor])[0]].detach().cpu().numpy())
    embedding_clusters.append(embeddings)
    word_clusters.append(words)
    
embedding_clusters = np.array(embedding_clusters)

#### Dimensionality Reduction

Our word embeddings are of size 300 (by default). This makes plotting them a bit tricky :). We therefore use a dimensionality reduction technique called t-SNE to map the word embeddings from the 300d space to a 2d space. A deeper discussion of t-SNE is beyond our scope here, but feel free to explore on your own how t-SNE works.

In [None]:
n, m, k = embedding_clusters.shape

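# Note: newer scikit-learn releases (1.5 and later) rename `n_iter` to `max_iter`;
# if this line raises an error about `n_iter`, use `max_iter=3500` instead
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, n_iter=3500, random_state=32)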
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)

#pca_model_en_2d = PCA(n_components=2).fit(embedding_clusters.reshape(n * m, k))
#embeddings_en_2d = np.array(pca_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)
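
The reshaping around the reducer can be hard to follow at first, so here is a minimal NumPy sketch of just that step. All names and sizes are hypothetical, and a simple slice stands in for t-SNE or PCA, since only the array shapes matter here.

```python
import numpy as np

# Hypothetical sizes: 3 clusters, 4 words each, 5-dimensional embeddings
n_clusters, n_words, dim = 3, 4, 5
rng = np.random.default_rng(0)
toy_clusters = rng.normal(size=(n_clusters, n_words, dim))

# Flatten the cluster axis so that each row is one word embedding
flat = toy_clusters.reshape(n_clusters * n_words, dim)

# Stand-in for the reducer: keep the first two coordinates of every row.
# A real reducer returns an array of the same shape (n_clusters * n_words, 2).
reduced = flat[:, :2]

# Restore the cluster axis; reshape preserves the row order
restored = reduced.reshape(n_clusters, n_words, 2)

print(flat.shape, reduced.shape, restored.shape)
```

Because the row order is preserved, the entry `restored[i, j]` still belongs to word `j` of cluster `i`, which is what the plotting step relies on to color the clusters.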

#### Plot Results

Lastly, the method `tsne_plot_similar_words()`, implemented in the module `src.utils`, plots our clusters of word embeddings, which are now all in the 2d space. Again, the results very much depend on how much training data you used and how long you trained the model.

In [None]:
tsne_plot_similar_words('', seed_words, embeddings_en_2d, word_clusters, 1.0)

Assuming you have trained over the complete dataset for at least 10 epochs, the plot above should show intuitive results, where words with embeddings of the same color are indeed semantically related (e.g., the region/cluster containing the words *"music"*, *"tune"*, *"soundtrack"*, etc.). In general, the longer the training, the better the results, but even a few epochs should suffice here to get meaningful word embeddings.

However, when CBOW and Skip-gram are trained on the same corpus for the same number of epochs, the CBOW result tends to look a bit more intuitive -- although this is difficult to quantify. This is in line with the earlier remark that Skip-gram generally requires more data and/or longer training.

---

## Summary 

The Skip-gram model is a popular word embedding technique used in NLP. It aims to learn continuous vector representations of words by predicting the context words given a target word. In this approach, a sliding window is used to create training instances from a text corpus, where the target word is the input and the context words are the output. The Skip-gram model utilizes a shallow neural network with a single hidden layer. The input layer represents the target word, and the output layer represents the context words. The hidden layer serves as the word embedding layer, capturing the semantic relationships between words. By learning the co-occurrence patterns of words, the Skip-gram model can identify words with similar meanings or those that frequently appear together.

One advantage of the Skip-gram model is its ability to handle rare words effectively. Since each word in the corpus is treated as a target word during training, even infrequent words can have meaningful word embeddings. Additionally, the Skip-gram model can capture fine-grained nuances and semantic relationships, making it suitable for tasks such as word similarity and analogy detection. However, the Skip-gram model can be computationally expensive and requires a large amount of training data to achieve robust word embeddings. Training the model involves predicting multiple context words for each target word, which can increase the complexity. Despite these challenges, the Skip-gram model remains widely used and has demonstrated success in various NLP applications, including language modeling, information retrieval, and machine translation.

In summary, the Skip-gram model is a word embedding technique in NLP that predicts context words given a target word. It captures semantic relationships, handles rare words well, and enables fine-grained analysis of word meanings. While it can be computationally expensive and requires abundant training data, the Skip-gram model has proven valuable in numerous NLP tasks, showcasing its effectiveness and versatility.