# A Neural Probabilistic Language Model

This notebook implements a simple neural probabilistic language model [1] using PyTorch. The model is trained on 10k song titles from the Million Song Dataset [2] to predict the next word, given a fixed context window of three preceding words.

### References

1. Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A Neural Probabilistic Language Model,” Advances in Neural Information Processing Systems, vol. 13, 2000.
2. T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere, “The Million Song Dataset,” in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

## Data

We will use a small sample of 10k songs from the [Million Song Dataset](http://millionsongdataset.com/pages/getting-dataset/). In particular, we use only the song titles and build a language model that can generate new, similar-sounding titles. For convenience, the titles have been extracted to the `data/song_titles.txt` file, which we will use throughout this notebook. The file contains one song title per line.

In [1]:
import random

In [2]:
from typing import List

In [3]:
from tokenizers import Tokenizer, normalizers
from tokenizers.trainers import WordLevelTrainer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import StripAccents, Lowercase, NFD

In [4]:
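with open("./data/song_titles.txt") as f:
    # One title per line; skip blank lines (e.g. the one left by a trailing newline).
    songs = [line.strip() for line in f if line.strip()]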

### Tokenizer

For this example, we will use a simple word-level tokenizer and apply a few basic data-cleaning transformations. We leave more advanced tokenizer architectures for future tutorials.

In [5]:
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

In [6]:
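# Register "[UNK]" as a special token so that it is part of the trained vocabulary.
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(songs, trainer=trainer)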
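
To make the normalization and pre-tokenization steps more concrete, here is a rough standard-library approximation of what the pipeline above does to a single title. The helpers `normalize` and `pre_tokenize` are hypothetical names used only for illustration, and the sketch assumes that `Whitespace` splits on the pattern `\w+|[^\w\s]+`; it is not the `tokenizers` implementation itself.

```python
import re
import unicodedata


def normalize(text):
    """Approximate NFD -> Lowercase -> StripAccents."""
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # Drop combining marks (Unicode categories starting with "M").
    return "".join(ch for ch in lowered if not unicodedata.category(ch).startswith("M"))


def pre_tokenize(text):
    """Approximate the Whitespace pre-tokenizer: words or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


words = pre_tokenize(normalize("Déjà Vu (Live)"))
print(words)
```

The word-level model then maps each resulting word to an integer id from the trained vocabulary.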

### Configuration

We collect a few configuration variables pertaining to the input data.

In [7]:
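seed = 6687945581
window_size = 3
train_chunk = 0.8
val_chunk = 0.1  # the remaining 0.1 of the titles is held out as the test set
vocab_size = tokenizer.get_vocab_size() + 1  # +1 for the extra empty token
empty_token_id = tokenizer.get_vocab_size()  # pads short contexts and marks the end of a title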

### Dataset

We create a torch dataset of context windows and their targets. Each context holds the `window_size` preceding words, and the associated target is the word that follows them in the title.

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [9]:
random.seed(seed)
torch.manual_seed(seed);

In [10]:
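def get_dataset(songs: List[str], window_size: int = 3) -> torch.utils.data.Dataset:
    """Build (context window, next token) pairs from a list of titles."""
    x, y = [], []

    songs_encoded = tokenizer.encode_batch(songs)
    # Append the empty token so the model learns where a title ends.
    songs_encoded = [song.ids + [empty_token_id] for song in songs_encoded]

    for song in songs_encoded:
        # Each title starts from a window filled with the empty token.
        window = [empty_token_id] * window_size
        for token in song:
            x += [window]
            y += [token]
            window = window[1:] + [token]  # slide the window by one token

    # Targets are stored as one-hot float vectors over the vocabulary.
    x, y = torch.tensor(x), F.one_hot(torch.tensor(y), vocab_size).float()
    return torch.utils.data.TensorDataset(x, y)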
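
As a worked example of the sliding window, the following pure-Python sketch applies the same logic to one made-up title. The token ids `5, 7, 9`, the empty id `0`, and the helper name `make_examples` are all hypothetical.

```python
def make_examples(token_ids, window_size, empty_id):
    """Return (context, target) pairs for one encoded title."""
    examples = []
    window = [empty_id] * window_size
    # The trailing empty id is the end-of-title target.
    for token in token_ids + [empty_id]:
        examples.append((window, token))
        window = window[1:] + [token]
    return examples


examples = make_examples([5, 7, 9], window_size=3, empty_id=0)
for context, target in examples:
    print(context, "->", target)
```

A title with three tokens therefore yields four training examples: one per token, plus one whose target is the end-of-title marker.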

In [11]:
random.shuffle(songs)
train_limit, val_limit = int(len(songs) * train_chunk), int(len(songs) * (train_chunk + val_chunk))
train_dataset = get_dataset(songs[:train_limit], window_size=window_size)
val_dataset = get_dataset(songs[train_limit:val_limit], window_size=window_size)
test_dataset = get_dataset(songs[val_limit:], window_size=window_size)
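
The slicing above gives the test set whatever remains after the training and validation fractions. A minimal sketch with ten placeholder items shows how the boundaries fall; `split_bounds` is a hypothetical helper and the fractions are illustrative.

```python
def split_bounds(n_items, train_frac, val_frac):
    """Return the slice boundaries used for the train/val/test split."""
    train_limit = int(n_items * train_frac)
    val_limit = int(n_items * (train_frac + val_frac))
    return train_limit, val_limit


items = list(range(10))
train_limit, val_limit = split_bounds(len(items), 0.8, 0.1)
train = items[:train_limit]
val = items[train_limit:val_limit]
test = items[val_limit:]
print(len(train), len(val), len(test))
```

If the two fractions add up to 1.0, `val_limit` equals the number of items and the test slice is empty, so the fractions should sum to less than one.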

## Model

In this section, we build a simple neural network that predicts the next word from its context. The architecture of the network is described in the original paper [1].

In [12]:
class NPLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int, hidden_dim: int, window_size: int, residual: bool = False):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.residual = residual
        self.window_size = window_size

        self.C = nn.Embedding(vocab_size, embedding_dim)
        self.H = nn.Linear(embedding_dim * window_size, hidden_dim)
        self.U = nn.Linear(hidden_dim, vocab_size)
        self.W = nn.Linear(embedding_dim * window_size, vocab_size) if residual else None
    
    def forward(self, x):
        x = self.C(x).view(-1, self.embedding_dim * self.window_size)
        embeddings = torch.tanh(self.H(x))
        return self.W(x) + self.U(embeddings) if self.residual else self.U(embeddings)
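
For reference, the forward pass above corresponds to the following formulation from [1], where $x$ is the concatenation of the embeddings $C(\cdot)$ of the context words. The bias symbols $b$ and $d$ follow the paper; in this implementation they are the biases held inside the `nn.Linear` layers.

$$
y = b + W x + U \tanh(d + H x), \qquad
P(w_t \mid \text{context}) = \frac{e^{y_{w_t}}}{\sum_{i} e^{y_i}}
$$

With `residual=False` the direct term $W x$ is dropped, which corresponds to fixing $W = 0$. The model returns the unnormalized scores $y$; the softmax is applied inside the loss during training and explicitly at generation time.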

## Training

Next, we implement the network training and evaluation pipelines.

In [13]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = NPLanguageModel(vocab_size, embedding_dim=32, hidden_dim=128, residual=True, window_size=window_size).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[25, 50, 75], gamma=0.1)
epochs = 100
batch_size = 64
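
To get a feel for where the parameters of this configuration sit, the following sketch counts weights and biases per layer. The vocabulary size of 5000 is a made-up placeholder (the real value depends on the trained tokenizer), and `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, embedding_dim, hidden_dim, window_size, residual):
    """Count parameters per layer, mirroring the layer shapes of NPLanguageModel."""
    context_dim = embedding_dim * window_size
    counts = {
        "C": vocab_size * embedding_dim,              # embedding table
        "H": context_dim * hidden_dim + hidden_dim,   # weights + bias
        "U": hidden_dim * vocab_size + vocab_size,    # weights + bias
    }
    if residual:
        counts["W"] = context_dim * vocab_size + vocab_size
    return counts


counts = count_parameters(
    vocab_size=5000,  # placeholder, not the notebook's actual vocabulary size
    embedding_dim=32,
    hidden_dim=128,
    window_size=3,
    residual=True,
)
total = sum(counts.values())
print(counts, total)
```

Under these assumptions the two output projections `U` and `W` dominate the count, since both scale with the vocabulary size.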

In [14]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=0)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=0)

In [15]:
def epoch_train():
    losses = []

    for x, y in train_loader:
        x, y = x.to(device), y.to(device)

        y_pred = model(x)
        loss = loss_fn(y_pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    
    return torch.tensor(losses).mean().item()
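
The targets passed to `loss_fn` here are one-hot float vectors. As a small check, assuming PyTorch 1.10 or later (where `nn.CrossEntropyLoss` also accepts class probabilities as targets), the sketch below compares that form with plain class indices on random logits. The tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 5
logits = torch.randn(4, num_classes)   # a batch of 4 unnormalized predictions
targets = torch.tensor([0, 2, 4, 1])   # class indices

loss_fn = nn.CrossEntropyLoss()
loss_from_indices = loss_fn(logits, targets)
loss_from_one_hot = loss_fn(logits, F.one_hot(targets, num_classes).float())
print(loss_from_indices.item(), loss_from_one_hot.item())
```

Both forms should give the same value, so class-index targets are a lighter-weight alternative when memory is a concern.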

In [16]:
@torch.no_grad()
def epoch_eval(dataloader):
    losses = []

    for x, y in dataloader:
        x, y = x.to(device), y.to(device)
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        losses.append(loss.item())
    
    return torch.tensor(losses).mean().item()
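
Both `epoch_train` and `epoch_eval` report the mean of the per-batch mean losses. When the last batch is smaller than the others, this differs slightly from the mean over all examples. The sketch below illustrates the gap with made-up batch losses and sizes; `weighted_mean` is a hypothetical helper.

```python
def weighted_mean(batch_means, batch_sizes):
    """Mean loss over all examples, given per-batch means and batch sizes."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


batch_means = [2.0, 2.0, 5.0]   # illustrative values, not measured losses
batch_sizes = [64, 64, 2]       # the final batch is much smaller

mean_of_means = sum(batch_means) / len(batch_means)
per_example_mean = weighted_mean(batch_means, batch_sizes)
print(mean_of_means, per_example_mean)
```

For large datasets and a batch size of 64 the difference is usually negligible, which is why the simpler average is often acceptable.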

In [None]:
for epoch in range(1, epochs + 1):
    train_loss = epoch_train()
    scheduler.step()  # advance the MultiStepLR schedule once per epoch
    print(f"epoch={epoch}; train_loss={train_loss:.5f}")

    if epoch % 10 == 0:
        val_loss = epoch_eval(val_loader)
        print(f"epoch={epoch}; val_loss={val_loss:.5f}")
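
The `MultiStepLR` scheduler configured earlier multiplies the learning rate by `gamma` at each milestone, provided `scheduler.step()` is called once per epoch. The following sketch reproduces that rule in plain Python; `lr_after` is a hypothetical helper, and its defaults copy the settings used in this notebook.

```python
from bisect import bisect_right


def lr_after(steps_done, base_lr=0.1, milestones=(25, 50, 75), gamma=0.1):
    """Learning rate in effect after `steps_done` calls to scheduler.step()."""
    return base_lr * gamma ** bisect_right(milestones, steps_done)


for steps in (0, 24, 25, 50, 75, 99):
    print(steps, lr_after(steps))
```

Under this assumption, the first 25 epochs train at 0.1, and the rate then drops by a factor of ten after epochs 25, 50, and 75.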

In [None]:
print(f"Test Loss: {epoch_eval(test_loader):.5f}")
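
Cross-entropy computed with the natural logarithm can be converted to perplexity by exponentiation, which is often easier to interpret for language models. The helper `perplexity` below is hypothetical, and the numbers are illustrative rather than results from this notebook.

```python
import math


def perplexity(cross_entropy_nats):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)


# A model that guesses uniformly over 1000 words has loss ln(1000).
uniform_loss = math.log(1000)
print(perplexity(uniform_loss))
```

A perplexity close to the vocabulary size means the model is doing little better than a uniform guess; lower values indicate sharper predictions.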

## Generation

Finally, we sample a few song titles from the trained model.

In [None]:
model = model.eval()
num_titles = 10

In [None]:
for i in range(num_titles):
    tokens = []
    generator = torch.Generator(device).manual_seed(seed + i)
    window = [empty_token_id] * window_size

    while True:
        # No gradients are needed for sampling.
        with torch.no_grad():
            logits = model(torch.tensor([window], device=device))
        probs = F.softmax(logits, dim=1)
        token = torch.multinomial(probs, num_samples=1, generator=generator).item()

        # The empty token doubles as the end-of-title marker.
        if token == empty_token_id:
            break
        tokens.append(token)
        window = window[1:] + [token]

    print(tokenizer.decode(tokens))
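
The sampling loop can also be studied in isolation from the network. The sketch below replaces the trained model with a tiny hand-written probability table (`next_token_probs`, purely hypothetical) and adds a `max_len` guard, which is an assumption of this sketch rather than part of the loop above.

```python
import random

EMPTY = 0  # stands in for empty_token_id: pads the window and ends a title
VOCAB = ["<empty>", "love", "me", "tonight"]


def next_token_probs(window):
    """Toy stand-in for softmax(model(window)); depends only on the last token."""
    table = {
        0: [0.0, 0.6, 0.2, 0.2],
        1: [0.2, 0.0, 0.6, 0.2],
        2: [0.4, 0.1, 0.0, 0.5],
        3: [0.7, 0.2, 0.1, 0.0],
    }
    return table[window[-1]]


def sample_title(seed, window_size=3, max_len=20):
    rng = random.Random(seed)
    window = [EMPTY] * window_size
    tokens = []
    while len(tokens) < max_len:  # guard against titles that never terminate
        probs = next_token_probs(window)
        token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        if token == EMPTY:
            break
        tokens.append(token)
        window = window[1:] + [token]
    return " ".join(VOCAB[t] for t in tokens)


title = sample_title(seed=0)
print(title)
```

Seeding the generator per title, as the notebook does with `seed + i`, makes each sampled title reproducible while still differing from the others.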