# Seminar 1. Likelihood-based models.

This seminar is about likelihood-based models: autoregressive and flow-based. Agenda:
- Likelihood model in 1D: fitting a histogram using SGD (2 points)
- Deep Autoregressive model via Transformer on Shapes and Binarized MNIST (5 points)
- Conditional Autoregressive model via Transformer (3 points)



# Part 1. Fitting histogram.

In this part we will build our first likelihood-based model for 1D data and fit it with gradient-based optimization.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import torch
from torch import nn
from torch import optim
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
import math
from sklearn.model_selection import train_test_split
import random

Choose your device. If you are working in Colab, don't forget to switch to a GPU runtime so that CUDA is available.

In [None]:
# Fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

First, we define the data-generation procedure. It produces a dataset of discretized samples $x \in \{0, \dots, 99\}$.

In [None]:
def sample_data():
    count = 10000
    rand = np.random.RandomState(0)
    a = 0.3 + 0.1 * rand.randn(count)
    b = 0.8 + 0.05 * rand.randn(count)
    mask = rand.rand(count) < 0.5
    samples = np.clip(a * mask + b * (1 - mask), 0.0, 1.0)
    
    return np.digitize(samples, np.linspace(0.0, 1.0, 100))
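
To see what the binning in `sample_data` does at the edges, here is a small check of `np.digitize` on three probe values (the names `edges` and `probe` are only for illustration). The returned indices are 1-based, so 0.0 falls into bin 1 and a value clipped to exactly 1.0 would land in bin 100; whether such a value actually occurs in the dataset depends on the random draws.

```python
import numpy as np

# 100 evenly spaced bin edges on [0, 1], as in sample_data()
edges = np.linspace(0.0, 1.0, 100)

# np.digitize returns i such that edges[i - 1] <= x < edges[i]
probe = np.array([0.0, 0.5, 1.0])
bins = np.digitize(probe, edges)

print(bins.tolist())  # [1, 50, 100]
```

If your model indexes its parameters directly with the data, it is worth checking `data.min()` and `data.max()` once before training.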

We generate data and perform train/val/test split.

In [None]:
data = sample_data()
train_data, test_data = train_test_split(data, test_size = 0.3)
train_data, val_data = train_test_split(train_data, test_size = 0.3)

Let's plot and visualize the histogram of training data!

In [None]:
def plot_histogram(data):
    counts = Counter(data)
    keys = list(counts.keys())
    values = list(counts.values())
    plt.bar(keys, values)
    plt.show()

plot_histogram(train_data)

In the lecture we discussed how to build a histogram model. However, that model is not the best choice for high-dimensional data, so we suggest you implement the following parametrized model:

$$ p_\theta(x = i) = \frac{e^{\theta_i}}{\sum_j e^{\theta_j}} $$

where $\theta = (\theta_0, \dots, \theta_{99})$.
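
As a quick sanity check of this parametrization, the sketch below evaluates the softmax for a hypothetical zero-initialised `theta` using `torch.log_softmax`. It is not the required class, only an illustration that the probabilities sum to one and that adding a constant to every logit leaves the distribution unchanged.

```python
import torch

# Hypothetical parameter vector: one logit per value 0..99, initialised to zero
theta = torch.zeros(100)

# Log-probabilities via a numerically stable log-softmax
log_probs = torch.log_softmax(theta, dim=0)
probs = log_probs.exp()

print(probs.sum().item())  # close to 1.0
print(probs[0].item())     # close to 0.01, i.e. uniform over 100 values

# Shifting every logit by the same constant does not change the distribution
shifted_log_probs = torch.log_softmax(theta + 5.0, dim=0)
print(torch.allclose(log_probs, shifted_log_probs, atol=1e-6))
```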

Implement this model in the following class:

In [None]:
class SimpleProbabilityModel(nn.Module):
    # Store all parameters of your model as class fields in constructor
    def __init__(self,  num_elements=100):
        super(SimpleProbabilityModel, self).__init__()
        
        ################
        # YOUR CODE HERE
        ###############
        
    # Forward should return vector of log probabilities for each element
    def forward(self):
        ################
        # YOUR CODE HERE
        ###############
    
    # Should sample an element using the probabilities obtained from the parameters.
    # Return a single-element tensor with a value in 0..99 (the caller applies .cpu().item()).
    def sample(self):
        ################
        # YOUR CODE HERE
        ###############

We will train this model by minimizing the negative log-likelihood: $L_i = -\log p_\theta(x_i)$. Implement this loss for your model, given a batch of data samples (average over the batch, so that values are comparable across batch sizes).

In [None]:
# data: np.array of numbers from your training distribution
# model: instance of your SimpleProbabilityModel
# returns: negative log-likelihood of the data under the model (a scalar tensor), used for backpropagation
def calc_loss(data, model):
    ################
    # YOUR CODE HERE
    ###############

Finally, we can create an instance of our model and train it. Note that if you computed the loss above with the natural logarithm, the training loop rescales it to base 2, so that the likelihood is logged in bits (which is easier to interpret and compare).
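
For reference, the conversion from nats to bits is a division by $\ln 2$. The sketch below uses a hypothetical helper `nats_to_bits` and applies it to a uniform distribution over 100 values, which gives a useful baseline: a model that has learned nothing scores about 6.64 bits.

```python
import math


def nats_to_bits(nll_nats: float) -> float:
    """Hypothetical helper: convert a natural-log NLL into bits."""
    return nll_nats / math.log(2)


# NLL of a uniform distribution over 100 values, in nats
uniform_nll_nats = -math.log(1.0 / 100)
uniform_nll_bits = nats_to_bits(uniform_nll_nats)

print(round(uniform_nll_bits, 3))  # 6.644
```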

In [None]:
model = SimpleProbabilityModel().to(device)

In [None]:
def train_simple_model(model, train_data, val_data, num_epochs=20000, batch_size=4000, lr=0.01):
    optimizer = optim.SGD(model.parameters(), lr=lr)
    train_losses = []
    val_losses = []
    for i in range(num_epochs):
        for j in range(len(train_data) // batch_size):
            optimizer.zero_grad()
            batch = train_data[batch_size * j:batch_size * (j + 1)]
            l = calc_loss(batch, model)
            train_losses.append(l.item() / math.log(2))
            l.backward()
            optimizer.step()
        # Validation only: no gradients are needed here
        with torch.no_grad():
            l = calc_loss(val_data, model)
        val_losses.append(l.item() / math.log(2))
    
    print("Train NLL(bits)")
    plt.plot(train_losses, color='green')
    plt.show()

    print("Val NLL(bits)")
    plt.plot(val_losses, color='red')
    plt.show()
    
    print("Final validation NLL(bits): {}".format(val_losses[-1]))

In [None]:
train_simple_model(model, train_data, val_data)

You can also tune the training parameters (number of epochs, batch size, learning rate, optimizer) to improve the validation NLL. You should obtain a value below 6 bits.

Finally, let's sample values from our model and compare the histogram of the samples with the histogram of the test data.

In [None]:
sampled_data = [model.sample().cpu().item() for _ in range(len(test_data))]

In [None]:
plot_histogram(sampled_data)
plot_histogram(test_data)

# Part 2. Transformer as universal autoregressive model

In this part, you will implement a simple Transformer architecture to model binarized MNIST and Shapes images.

In [None]:
import pickle
from torchvision.utils import make_grid


def show_samples(samples, fname=None, nrow=10, title='Samples'):
    samples = torch.FloatTensor(samples)
    if len(samples.shape) == 3:
        samples = samples.unsqueeze(-1)
    samples = samples.permute(0, 3, 1, 2)
    grid_img = make_grid(samples, nrow=nrow)
    plt.figure()
    plt.title(title)
    plt.imshow(grid_img.permute(1, 2, 0))
    plt.axis('off')
    plt.show()
        

def load_data(fname, include_labels=False):
    with open(fname, 'rb') as data_file:
        data = pickle.load(data_file)

    train_data = (data['train'] > 127.5).astype(np.int32)
    test_data = (data['test'] > 127.5).astype(np.int32)
    if include_labels:
        return train_data, test_data, data['train_labels'], data['test_labels']
    
    return train_data, test_data

In [None]:
# For Colab users: download the file
# ! wget https://github.com/a4-edu/course_gmcv/raw/hw1/module1-likelihood/shapes.pkl

In [None]:
shapes_train, shapes_test, train_labels, test_labels = load_data('./shapes.pkl', True)
show_samples(shapes_train[:100, :, :])

In [None]:
shapes_train.shape

We recommend the following network design:

- Trainable positional embeddings
- N-layer Transformer encoder (with a causal mask)
- (!) `norm_first=True`
- Logits as the output
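
To make the causal-mask requirement concrete, here is a minimal sketch built from `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`. The sizes are arbitrary assumptions, embeddings and the output projection are left out, and the mask is built by hand with `torch.triu`. The check at the end perturbs only the last position and confirms that the outputs at earlier positions do not change.

```python
import torch
from torch import nn

# Illustrative sizes (assumptions, not the recommended hyperparameters)
d_model, n_heads, seq_len, batch = 32, 4, 6, 2

layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=64,
    dropout=0.0, norm_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()

# Boolean mask: True marks key positions a query is NOT allowed to attend to
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(seq_len, batch, d_model)      # [seq_len, batch, d_model]
x_changed = x.clone()
x_changed[-1] = torch.randn(batch, d_model)   # perturb only the last position

with torch.no_grad():
    out = encoder(x, mask=causal_mask)
    out_changed = encoder(x_changed, mask=causal_mask)

# Earlier positions must not see the change at the last position
print(torch.allclose(out[:-1], out_changed[:-1], atol=1e-5))
```

The same property is what lets a sampling loop feed the whole token buffer to the model at every step: positions that are not yet filled cannot influence the prediction for the current one.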

In [None]:
class TransformerModel(nn.Module):
    def __init__(self, n_layers, d_model, num_tokens, max_len):
        super().__init__()
        ################
        # YOUR CODE HERE
        ###############

    def forward(self, x: torch.Tensor):
        seq_size, batch_size = x.shape
        positions = torch.arange(0, seq_size, 1, dtype=torch.long, device=x.device)
        ################
        # YOUR CODE HERE
        ###############

    def loss(self, x: torch.Tensor):
        # Targets are the tokens shifted by one: [seq_len, bs] -> [bs, seq_len - 1]
        target = x[1:].transpose(0, 1)
        # Logits for the first seq_len - 1 positions: [seq_len - 1, bs, num_tokens] -> [bs, num_tokens, seq_len - 1]
        ################
        # YOUR CODE HERE
        ###############


def sample(model: nn.Module, n_samples, start_token, out_len):
    # Token buffer: [out_len + 1, n_samples], with the start token in row 0
    output = torch.zeros(out_len + 1, n_samples, dtype=torch.long, device=device)
    output[0] = start_token

    model.eval()
    with torch.no_grad():
        for t in range(out_len):
            # [seq_size, batch_size, num_tokens]; the causal mask makes position t
            # independent of the not-yet-sampled tokens after it
            logits = model(output)
            # [batch_size, num_tokens]
            probs = F.softmax(logits[t], dim=-1)
            next_token = torch.multinomial(probs, 1).squeeze(-1)
            output[t + 1] = next_token
    return output[1:]
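
The shape comments in `TransformerModel.loss` follow the layout that `F.cross_entropy` expects for sequence targets: class scores in dimension 1 and the sequence in the last dimension. The sketch below illustrates that layout with random stand-in logits instead of a real model; the sizes and the names `pred` and `manual` are assumptions for the example.

```python
import torch
from torch.nn import functional as F

seq_len, bs, num_tokens = 5, 2, 3                  # illustrative sizes
x = torch.randint(0, num_tokens, (seq_len, bs))    # token ids, [seq_len, bs]
logits = torch.randn(seq_len, bs, num_tokens)      # stand-in for the model output

# Position t predicts token t + 1: drop the first token and the last prediction
target = x[1:].transpose(0, 1)                     # [bs, seq_len - 1]
pred = logits[:-1].permute(1, 2, 0)                # [bs, num_tokens, seq_len - 1]

loss = F.cross_entropy(pred, target)               # mean NLL per token, in nats

# The same value computed by hand from log-probabilities
log_probs = torch.log_softmax(logits[:-1], dim=-1)
manual = -log_probs.gather(-1, x[1:].unsqueeze(-1)).mean()
print(torch.allclose(loss, manual, atol=1e-6))
```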

In [None]:
H, W, _ = shapes_train[0].shape
shapes_model = TransformerModel(2, 128, 3, H * W + 1).to(device=device)

In [None]:
out = sample(shapes_model, 8, 0, H * W).cpu().numpy()
assert out.shape == (H * W, 8)

## Unconditional tokenizer

Implement a simple unconditional image tokenizer: the first element of the sequence should be the BOS token, followed by the flattened image.

- `encode` accepts a single image and returns a sequence
- `decode` accepts a single sequence WITHOUT the leading BOS and returns an image
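
The sequence layout described above can be illustrated on a tiny 2x3 image. The helpers `encode_sketch` and `decode_sketch` are hypothetical standalone functions, not the class you are asked to write; the BOS id 2 is assumed to match `self.bos` in the `ImageTokenizer` class that follows.

```python
import numpy as np

BOS = 2  # assumed BOS id; pixel tokens are 0 and 1


def encode_sketch(image: np.ndarray) -> np.ndarray:
    """Hypothetical helper: [H, W] binary image -> [H * W + 1] token sequence."""
    return np.concatenate([[BOS], image.flatten()]).astype(np.int64)


def decode_sketch(tokens: np.ndarray, height: int, width: int) -> np.ndarray:
    """Hypothetical helper: [H * W] tokens (no BOS) -> [H, W] image."""
    return tokens.reshape(height, width)


image = np.array([[0, 1, 1], [1, 0, 0]])
tokens = encode_sketch(image)
restored = decode_sketch(tokens[1:], 2, 3)

print(tokens.tolist())  # [2, 0, 1, 1, 1, 0, 0]
```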

In [None]:
class ImageTokenizer:
    def __init__(self, height, width):
        self.bos = 2
        self.height = height
        self.width = width

    def encode(self, x: np.ndarray):
        ################
        # YOUR CODE HERE
        ###############

    def decode(self, x: np.ndarray):
        bos = (x == self.bos)
        if bos.sum() > 0:
            print("warning: poorly trained model, all BOS tokens will be replaced with the zero token")
            x[bos] = 0
        return # TODO

In [None]:
shapes_tokenizer = ImageTokenizer(H, W)
encoded = shapes_tokenizer.encode(shapes_train[10].squeeze(-1))
decoded = shapes_tokenizer.decode(encoded[1:])
assert np.allclose(shapes_train[10].squeeze(-1), decoded)

In [None]:
loss = shapes_model.loss(torch.tensor(encoded, dtype=torch.long, device=device).unsqueeze(1))

In [None]:
loss

In [None]:
class TokenizedDataset(Dataset):
    def __init__(self, X, _, tokenizer):
        super().__init__()
        self.X = X.squeeze(-1)
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        return self.tokenizer.encode(self.X[index])

In [None]:
def train(model, train_loader, optimizer):
    model.train()
    train_losses = []
    for x in train_loader:
        x = x.transpose(0, 1)
        x = x.to(device=device, dtype=torch.long)
        loss = model.loss(x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
    return train_losses


def eval_loss(model, data_loader):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for x in data_loader:
            x = x.transpose(0, 1)
            x = x.to(device=device, dtype=torch.long)
            loss = model.loss(x)
            total_loss += loss * x.shape[1]
        avg_loss = total_loss / len(data_loader.dataset)
    return avg_loss.item()


def train_epochs(model, train_loader, test_loader, train_args):
    epochs, lr = train_args['epochs'], train_args['lr']
    optimizer = optim.AdamW(model.parameters(), lr=lr)

    train_losses = []
    test_losses = [eval_loss(model, test_loader)]
    for epoch in range(epochs):
        print(f'epoch {epoch} started')
        model.train()
        train_losses.extend(train(model, train_loader, optimizer))
        test_loss = eval_loss(model, test_loader)
        test_losses.append(test_loss)
        print('train loss: {}, test_loss: {}'.format(np.mean(train_losses[-1000:]), 
                                                     test_losses[-1]))

    return train_losses, test_losses


def train_model(train_data, test_data, train_labels, test_labels, model, tokenizer, dataset_cls):
    """
    train_data: A (n_train, H, W, 1) int32 numpy array of binary images with values in {0, 1}
    test_data: A (n_test, H, W, 1) int32 numpy array of binary images with values in {0, 1}
    train_labels, test_labels: class labels for the train and test images
    model: nn.Module instance that defines a loss method
    tokenizer: ImageTokenizer or LabeledImageTokenizer instance
    dataset_cls: dataset constructor, should accept data, labels and tokenizer as arguments
    Returns
    - a (# of training iterations,) numpy array of train_losses evaluated every minibatch
    - a (# of epochs + 1,) numpy array of test_losses evaluated once at initialization and after each epoch
    - trained model
    """
    
    ################
    # YOUR CODE HERE
    ###############
    

In [None]:
H, W, _ = shapes_train[0].shape
tokenizer = # TODO
shapes_model = # TODO

train_losses, test_losses, shapes_model = train_model(shapes_train, shapes_test, train_labels, test_labels,
                                                      shapes_model, tokenizer, TokenizedDataset)

In [None]:
def show_train_plots(train_losses, test_losses, title):
    plt.figure()
    n_epochs = len(test_losses) - 1
    x_train = np.linspace(0, n_epochs, len(train_losses))
    x_test = np.arange(n_epochs + 1)

    plt.plot(x_train, train_losses, label='train loss')
    plt.plot(x_test, test_losses, label='test loss')
    plt.legend()
    plt.title(title)
    plt.xlabel('Epoch')
    plt.ylabel('NLL')
    plt.show()

In [None]:
show_train_plots(train_losses, test_losses, 'Shapes')

In [None]:
samples = sample(shapes_model, 100, tokenizer.bos, H * W)

In [None]:
decoded = np.zeros((100, H, W), dtype=np.int64)
################
# YOUR CODE HERE
###############

In [None]:
show_samples(decoded)

## Conditional generation

Let's train our autoregressive model with simple conditioning: instead of the BOS token, we'll put a class token at the start of the sequence.

There are two things we need to change: the tokenizer and the dataset.
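
One way to reserve token ids for classes is to place them right after the pixel tokens, which is consistent with the `first_label = num_tokens` field in the tokenizer. The sketch below is only an illustration of that id layout; `label_to_token` is a hypothetical helper and the number of classes is an assumed value.

```python
NUM_PIXEL_TOKENS = 2  # pixel values 0 and 1


def label_to_token(label: int, num_tokens: int = NUM_PIXEL_TOKENS) -> int:
    """Hypothetical mapping: class k -> token id num_tokens + k."""
    return num_tokens + label


n_labels = 10  # assumed number of classes for the illustration
vocab_size = NUM_PIXEL_TOKENS + n_labels
label_tokens = [label_to_token(k) for k in range(n_labels)]

print(label_tokens)  # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(vocab_size)    # 12
```

With this layout the model's vocabulary has to cover both pixel and class tokens, so its size grows with the number of classes.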

In [None]:
class LabeledImageTokenizer:
    def __init__(self, height, width, num_tokens=2):
        self.height = height
        self.width = width
        self.first_label = num_tokens

    def encode(self, x: np.ndarray, label: int):
        x = x.flatten()
        ################
        # YOUR CODE HERE
        ###############
        return out

    def encode_label(self, label):
        return # TODO

    def decode(self, x: np.ndarray):
        labels = (x > 1)
        if labels.sum() > 0:
            print("warning: poorly trained model, all label tokens will be replaced with the zero token")
            x[labels] = 0
        return x.reshape(self.height, self.width)


class TokenizedDatasetWithLabel(Dataset):
    def __init__(self, X, labels, tokenizer):
        super().__init__()
        self.X = X.squeeze(-1)
        self.labels = labels
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        return self.tokenizer.encode(self.X[index], self.labels[index])

In [None]:
n_labels = len(set(train_labels))

In [None]:
tokenizer = # TODO
H, W, _ = shapes_train[0].shape
shapes_model = # TODO

train_losses, test_losses, shapes_model = train_model(shapes_train, shapes_test, train_labels, test_labels, 
                                                      shapes_model, tokenizer, TokenizedDatasetWithLabel)

In [None]:
show_train_plots(train_losses, test_losses, 'Shapes-conditional')

In [None]:
samples = np.zeros((100, H, W))
n_samples = 100 // n_labels
for label in range(n_labels):
    first_token = tokenizer.encode_label(label)
    ################
    # YOUR CODE HERE
    ###############

In [None]:
show_samples(samples)

## Second dataset: MNIST

Make sure that your model and code also work on a more complex dataset.

In [None]:
# For Colab users: download the file
# ! wget https://github.com/a4-edu/course_gmcv/raw/hw1/module1-likelihood/mnist.pkl

In [None]:
mnist_train, mnist_test, train_labels, test_labels = load_data('./mnist.pkl', True)

In [None]:
show_samples(mnist_train[:100])

In [None]:
H, W, _ = mnist_train[0].shape
model = TransformerModel(2, 128, 3, H * W + 1).to(device=device)
tokenizer = ImageTokenizer(H, W)
train_losses, test_losses, model = train_model(mnist_train, mnist_test, train_labels, test_labels, 
                                               model, tokenizer, TokenizedDataset)

In [None]:
show_train_plots(train_losses, test_losses, 'MNIST')

In [None]:
samples = sample(model, 100, tokenizer.bos, H * W)
decoded = np.zeros((100, H, W), dtype=np.int64)
################
# YOUR CODE HERE
###############

In [None]:
show_samples(decoded)

Now do conditional generation on MNIST too.

In [None]:
n_labels = len(set(train_labels))
H, W, _ = mnist_train[0].shape
tokenizer = LabeledImageTokenizer(H, W)
model = TransformerModel(2, 128, 2 + n_labels, H * W + 1).to(device=device)

train_losses, test_losses, model = train_model(mnist_train, mnist_test, train_labels, test_labels, 
                                               model, tokenizer, TokenizedDatasetWithLabel)

In [None]:
show_train_plots(train_losses, test_losses, 'MNIST-conditional')

In [None]:
samples = np.zeros((100, H, W))
n_samples = 100 // n_labels
for label in range(n_labels):
    first_token = tokenizer.encode_label(label)
    ################
    # YOUR CODE HERE
    ###############

In [None]:
show_samples(samples)