# A Neural Probabilistic Language Model

This notebook implements a simple neural probabilistic language model [1] using PyTorch. The model is trained on 10k song titles from the Million Song Dataset [2] to predict the next word, given a fixed context window of three preceding words.

### References

1. Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A Neural Probabilistic Language Model,” Advances in neural information processing systems, vol. 13, 2000.
2. Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere.  The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

## Data

We will use a small sample of 10k songs from the [Million Song Dataset](http://millionsongdataset.com/pages/getting-dataset/). In particular, we use only song titles and build a language model that can generate new, similar sounding titles. You can read more about the structure of the dataset [here](http://millionsongdataset.com/pages/example-track-description/), and there are also some [useful code snippets](https://github.com/tbertinmahieux/MSongsDB/tree/master/PythonSrc).

In [1]:
import os
import tables
import random

In [2]:
from typing import List

In [3]:
from tokenizers import Tokenizer, normalizers
from tokenizers.trainers import WordLevelTrainer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import StripAccents, Lowercase, NFD

In [4]:
def get_song_titles(dataset_path, limit=None):
    result = []

    for root, _, files in os.walk(dataset_path):
        for file_name in files:
            file = tables.open_file(os.path.join(root, file_name))
            result += [f.decode("utf-8") for f in file.root.metadata.songs.cols.title]
            file.close()
            if limit and len(result) >= limit: break
    result = result if not limit else result[:limit]
    
    return result

In [5]:
dataset_path = "/mnt/storage/Development/Data/million_songs/million_songs_10k" # change to your dataset path
songs = get_song_titles(dataset_path=dataset_path)

### Tokenizer

For this example, we will use a simple word-level tokenizer, applying few basic data cleaning transformations. We leave more advance tokenizer architectures for future tutorials.

In [47]:
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

In [48]:
trainer = WordLevelTrainer()
tokenizer.train_from_iterator(songs, trainer=trainer)

### Configuration

We collect few of the configuration variables pertaining to the input data.

In [61]:
seed = 6687945581
window_size = 3
train_chunk = 0.9
val_chunk = 0.1
vocab_size = tokenizer.get_vocab_size() + 1
empty_token_id = tokenizer.get_vocab_size()

### Dataset

We create a torch dataset comprising of the context words, number of which is specified by parameter `window_size`, and associated targets, representing the next word from the dataset.

In [62]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [63]:
random.seed(seed)
torch.manual_seed(seed);

In [64]:
def get_dataset(songs: List[str], window_size: int = 3) -> torch.utils.data.Dataset:
    x, y = [], []

    songs_encoded = tokenizer.encode_batch(songs)
    songs_encoded = [song.ids + [empty_token_id] for song in songs_encoded]

    for song in songs_encoded:
        window = [empty_token_id] * window_size
        for token in song:
            x += [window]
            y += [token]
            window = window[1:] + [token]

    x, y = torch.tensor(x), F.one_hot(torch.tensor(y), vocab_size).float()
    return torch.utils.data.TensorDataset(x, y)

In [65]:
random.shuffle(songs)
train_limit, val_limit = int(len(songs) * train_chunk), int(len(songs) * (train_chunk + val_chunk))
train_dataset = get_dataset(songs[:train_limit], window_size=window_size)
val_dataset = get_dataset(songs[train_limit:val_limit], window_size=window_size)

In [66]:
print(f"Vocab size: {vocab_size}")
print(f"Train dataset size: {len(train_dataset)}")
print(f"Val dataset size: {len(val_dataset)}")

Vocab size: 9805
Train dataset size: 47879
Val dataset size: 5370


## Model

In this section, we build a simple neural network to predict a subsequent word given context. The architecture of the network is described in the original paper [1].

In [67]:
class NPLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int, hidden_dim: int, window_size: int, residual: bool = False):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.residual = residual
        self.window_size = window_size

        self.C = nn.Embedding(vocab_size, embedding_dim)
        self.H = nn.Linear(embedding_dim * window_size, hidden_dim)
        self.U = nn.Linear(hidden_dim, vocab_size)
        self.W = nn.Linear(embedding_dim * window_size, vocab_size) if residual else None
    
    def forward(self, x):
        x = self.C(x).view(-1, self.embedding_dim * self.window_size)
        embeddings = torch.tanh(self.H(x))
        return self.W(x) + self.U(embeddings) if self.residual else self.U(embeddings)

## Training

Next, we implement the network training and evaluation pipelines.

In [68]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = NPLanguageModel(vocab_size, embedding_dim=32, hidden_dim=128, residual=True, window_size=window_size).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[25, 75, 150], gamma=0.1)
epochs = 200
batch_size = 64

In [69]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [70]:
def epoch_train():
    losses = []

    for x, y in train_loader:
        x, y = x.to(device), y.to(device)

        y_pred = model(x)
        loss = loss_fn(y_pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    
    return torch.tensor(losses).mean().item()

In [71]:
@torch.no_grad()
def epoch_eval(dataloader):
    losses = []

    for x, y in dataloader:
        x, y = x.to(device), y.to(device)
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        losses.append(loss.item())
    
    return torch.tensor(losses).mean().item()

In [72]:
for epoch in range(1, epochs + 1):
    train_loss = epoch_train()
    print(f"epoch={epoch}; train_loss={train_loss:.5f}")
    
    if epoch % 10 == 0:
        val_loss = epoch_eval(val_loader)
        print(f"epoch={epoch}; val_loss={train_loss:.5f}")

epoch=1; train_loss=6.64286
epoch=2; train_loss=5.86656
epoch=3; train_loss=5.64640
epoch=4; train_loss=5.50319
epoch=5; train_loss=5.38981
epoch=6; train_loss=5.29672
epoch=7; train_loss=5.21287
epoch=8; train_loss=5.13335
epoch=9; train_loss=5.06213
epoch=10; train_loss=4.98731
epoch=10; val_loss=4.98731
epoch=11; train_loss=4.92671
epoch=12; train_loss=4.86029
epoch=13; train_loss=4.80187
epoch=14; train_loss=4.74224
epoch=15; train_loss=4.69029
epoch=16; train_loss=4.63780
epoch=17; train_loss=4.58716
epoch=18; train_loss=4.53418
epoch=19; train_loss=4.48587
epoch=20; train_loss=4.43955
epoch=20; val_loss=4.43955
epoch=21; train_loss=4.39750
epoch=22; train_loss=4.34973
epoch=23; train_loss=4.30722
epoch=24; train_loss=4.26751
epoch=25; train_loss=4.22838
epoch=26; train_loss=4.19013
epoch=27; train_loss=4.15285
epoch=28; train_loss=4.11508
epoch=29; train_loss=4.08168
epoch=30; train_loss=4.04791
epoch=30; val_loss=4.04791
epoch=31; train_loss=4.01612
epoch=32; train_loss=3.98418


## Generation

In the end, we sample few song titles using our trained model.

In [75]:
model = model.eval()
num_titles = 10
generator = torch.Generator(device).manual_seed(seed)

In [76]:
for i in range(num_titles):
    tokens = []
    window = [empty_token_id] * window_size

    while True:
        window = torch.tensor([window]).to(device)

        logits = model(window)
        probs = F.softmax(logits, dim=1)
        token = torch.multinomial(probs, num_samples=1, generator=generator).item()

        if token == empty_token_id: break
        tokens.append(token)
        window = window.flatten().tolist()[1:] + [token]

    print(tokenizer.decode(tokens))

firehouse old la
i want an angel
price never to be lonely
reminisce look dial you
stand down
the river
broke train
medley : fred
traveling air i dream
working on the head
