In this task you will implement the ByteNet decoder (https://arxiv.org/abs/1610.10099) for a language modeling task. To do this, you will write a custom layer for masked convolutions and define the residual block used in the architecture.

We will use the Hutter Prize dataset (https://en.wikipedia.org/wiki/Hutter_Prize, http://prize.hutter1.net/), which contains the first 100 megabytes of an English Wikipedia dump.

In [0]:
import numpy as np
import torch
import math
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
from torch import nn

disable_cuda = True

if not disable_cuda and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# ByteNet Architecture

**Task 1.** Define a masked convolution layer. This layer must multiply the convolutional filter by a binary mask that prevents the network from looking at subsequent tokens.

In [0]:
class MaskedConv1d(nn.Conv1d):
    def __init__(self, *args, **kwargs):
        super(MaskedConv1d, self).__init__(*args, **kwargs)
        # TODO: add mask
        pass

    def forward(self, x):
        # TODO: multiply the weights
        pass

If the layer implementation is correct, then for kernel size 3 the gradient of the first output with respect to the input must be non-zero only for the first two tokens.

In [0]:
test_input = torch.rand(1, 3, 7, requires_grad=True)
masked_layer = MaskedConv1d(in_channels=3, out_channels=1, kernel_size=3)

In [0]:
if test_input.grad is not None:
    test_input.grad.data.zero_()
test_output = masked_layer(test_input)
test_output[0, 0, 0].backward()

In [0]:
print(test_input.grad)
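
The gradient check above can also be read programmatically by listing the time steps whose gradient is non-zero. The sketch below is only an illustration: `nonzero_grad_positions` is a hypothetical helper, and a plain `nn.Conv1d` whose last kernel tap is zeroed by hand stands in for the masked layer, which remains the subject of Task 1.

```python
import torch
from torch import nn


def nonzero_grad_positions(layer, n_channels=3, length=7, out_index=0):
    """Return the input time steps that influence output[out_index]."""
    x = torch.rand(1, n_channels, length, requires_grad=True)
    layer(x)[0, 0, out_index].backward()
    # A time step counts as "used" if any channel has a non-zero gradient.
    used = x.grad.abs().sum(dim=1)[0] > 0
    return used.nonzero().flatten().tolist()


conv = nn.Conv1d(in_channels=3, out_channels=1, kernel_size=3)
nn.init.constant_(conv.weight, 1.0)  # fixed weights make the pattern deterministic
unmasked_positions = nonzero_grad_positions(conv)

with torch.no_grad():
    conv.weight[:, :, -1] = 0.0  # hand-made stand-in for a mask on the last tap
masked_positions = nonzero_grad_positions(conv)

print(unmasked_positions)  # [0, 1, 2]
print(masked_positions)    # [0, 1]
```

The same helper can be applied to your own `MaskedConv1d` once it is implemented.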

The default implementation of LayerNorm normalizes over the last dimensions. In convolutional architectures the last dimension is time, so the default LayerNorm can leak information about future tokens.

To avoid this, we modify LayerNorm to normalize over the channel dimension (the second-to-last one).

In [0]:
class ModifiedLayerNorm(nn.LayerNorm):
    def __init__(self, *args, **kwargs):
        super(ModifiedLayerNorm, self).__init__(*args, **kwargs)
    
    def forward(self, x):
        output = super(ModifiedLayerNorm, self).forward(x.transpose(2, 1))
        return output.transpose(2, 1)
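
The leakage described above can be observed directly. The following sketch, which uses a small random tensor and illustrative names (`over_time`, `over_channels`, `channel_norm`) that are not part of the notebook, perturbs only the last time step and measures how much the output at the first time step changes under each normalization.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(1, 4, 6)            # (batch, channels, time)
x_future = x.clone()
x_future[0, :, -1] += 1.0          # perturb only the last time step

over_time = nn.LayerNorm(6)        # normalizes over the last (time) dimension
over_channels = nn.LayerNorm(4)    # applied after moving channels to the last axis


def channel_norm(t):
    # Same transposition trick as in ModifiedLayerNorm.
    return over_channels(t.transpose(2, 1)).transpose(2, 1)


# Change of the output at time step 0 caused by the perturbation at the end.
leak = (over_time(x) - over_time(x_future))[0, :, 0].abs().max().item()
no_leak = (channel_norm(x) - channel_norm(x_future))[0, :, 0].abs().max().item()

print(leak)     # clearly non-zero: the first step "sees" the future
print(no_leak)  # zero: the first step is unaffected
```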

**Task 2.** Use the masked convolution layer and the modified layer norm to define the residual block depicted in https://arxiv.org/abs/1610.10099, Figure 3, on the left.

Pay attention to the padding parameter of the masked convolution layer. Choose it so that the length of the input sequence is preserved. Note that the correct padding depends on the kernel size and the dilation.

In [0]:
class ResidualBlock(nn.Module):
    def __init__(self, d, dilation, kernel_size):
        super(ResidualBlock, self).__init__()
        # TODO
        pass
        
    def forward(self, x):
        pass

In [0]:
test_input = torch.rand(1, 6, 17)  # (batch, 2 * d channels, sequence length)
res_block = ResidualBlock(d=3, dilation=3, kernel_size=7)

In [0]:
# the output length must be 17 no matter what dilation and kernel_size we choose
res_block(test_input).size(-1)
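
The length requirement can be checked on paper before writing the block. The sketch below uses the output-length formula from the PyTorch `Conv1d` documentation; the helper names are hypothetical, and the padding rule assumes stride 1, an odd kernel size, and symmetric padding (a design that pads only on the left would need a different amount).

```python
def conv1d_output_length(length, kernel_size, dilation=1, padding=0, stride=1):
    """Output length of a 1-D convolution (formula from the PyTorch Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def same_length_padding(kernel_size, dilation):
    """Symmetric padding that keeps the length for stride 1 and odd kernel_size."""
    return dilation * (kernel_size - 1) // 2


padding = same_length_padding(kernel_size=7, dilation=3)
output_length = conv1d_output_length(17, kernel_size=7, dilation=3, padding=padding)

print(padding)        # 9
print(output_length)  # 17
```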

Now we combine the residual blocks into the decoder architecture. Section five of https://arxiv.org/abs/1610.10099 specifies the details of the architecture.

In [0]:
class ByteNetDecoder(nn.Module):
    def __init__(self, n_symbols, d, kernel_size, dilations, groupsize):
        super(ByteNetDecoder, self).__init__()
        dilations_in_groups = [d for d in dilations[::-1] for i in range(groupsize)]
        self.embedding = nn.Embedding(n_symbols, 2 * d)  # encodes tokens as vectors of suitable dimensionality
        self.residual_blocks = nn.Sequential(
            *[ResidualBlock(d, dilation, kernel_size)
              for dilation in dilations_in_groups],
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Conv1d(2 * d, n_symbols, 1))

        self.d = d
        self.n_symbols = n_symbols
        self.kernel_size = kernel_size
        self.dilations = dilations
        self.groupsize = groupsize

    def forward(self, x):
        x_length = x.size(-1)
        x_embedded = self.embedding(x)
        x_embedded = x_embedded.transpose(-1, -2).view(-1, 2 * self.d, x_length)
        bytenet = self.residual_blocks(x_embedded)
        return bytenet.transpose(-1, -2).squeeze()

In particular, the decoder stacks groups of residual blocks, and all blocks within a group share the same dilation:

In [0]:
groupsize = 5 # number of residual blocks in a group
dilations = [1, 2, 4, 8, 16] # dilation used by each group
dilations_in_groups = [d for d in dilations[::-1] for i in range(groupsize)]
print(dilations_in_groups)

In total, the model uses 25 residual blocks with 3 convolutional layers within each block.
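
To get a feeling for how far back such a stack can look, the sketch below sums the look-back of every block. It rests on assumptions that the notebook does not fix: each residual block contains exactly one masked convolution, that convolution uses symmetric padding with the future half of the kernel masked out (so it looks back `dilation * (kernel_size - 1) // 2` steps), and the 1x1 convolutions add no context. `causal_context` is a hypothetical helper.

```python
def causal_context(dilations, groupsize, kernel_size):
    """Number of past symbols visible to one output, under the assumptions above."""
    dilations_in_groups = [d for d in dilations[::-1] for _ in range(groupsize)]
    # Each block is assumed to look back dilation * (kernel_size - 1) // 2 steps.
    return sum(d * (kernel_size - 1) // 2 for d in dilations_in_groups)


context = causal_context(dilations=[1, 2, 4, 8, 16], groupsize=5, kernel_size=3)
print(context)  # 155
```

Under these assumptions, an output near the start of a sequence has access to far less context than one further along.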

# Applying the model to Hutter Prize dataset

From this point on, the notebook assumes that the unpacked enwik8 file is in the working directory. You can download and unpack the data with the following commands:
!wget http://mattmahoney.net/dc/enwik8.zip
!unzip enwik8.zip

In [0]:
class HutterPrizeDataset():
    def __init__(self, filename, device):
        self.f, self.symbols = self._getFileAndUniqueSymbols(filename)
        # this split is typically considered in the literature
        self.ranges = {
            'train' : (0, math.floor(0.90 * len(self.f))),
            'validation': (math.floor(0.90 * len(self.f)),
                           math.floor(0.95 * len(self.f))),
            'test': (math.floor(0.95 * len(self.f)), len(self.f))}
        self.device = device

    def _getFileAndUniqueSymbols(self, filename):
        # In utf-8 encoding there are 6064 unique symbols, but in iso-8859-1 there are 205
        with open(filename, 'r', encoding='iso-8859-1') as file:  # close the file after reading
            f = file.read()
        unique_symbols = ''.join(sorted(set(f)))
        return f, unique_symbols

    def intEncoding(self, text):
        for i, t in enumerate(text):
            index = self.symbols.find(t)
            # compare with ==; `is -1` tests object identity and is not reliable
            if index == -1:
                raise NameError('Cannot process the string. '
                                'Symbol "%s" is not in the dataset' % t)
            self.encoding_tensor[i] = index
        return self.encoding_tensor

    def intDecoding(self, tensor):
        text = ''
        for t in tensor:
            text += self.symbols[t]
        return text

    def iterate_batches(self, batchsize, type='train', encoding=True):
        n_steps = (self.ranges[type][1] - self.ranges[type][0]) // batchsize
        self.encoding_tensor = torch.zeros(batchsize, dtype=torch.long, device=self.device)
        for n in range(n_steps):
            if encoding:
                yield self.intEncoding(self.f[
                    self.ranges[type][0] + batchsize * n:
                    self.ranges[type][0] + batchsize * (n + 1)])
            else:
                yield self.f[self.ranges[type][0] + batchsize * n:
                             self.ranges[type][0] + batchsize * (n + 1)]
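
The integer encoding used by the dataset class can be tried on a toy string. The sketch below follows the same idea (indices into the sorted set of symbols) but uses a dictionary lookup instead of `str.find`; the function names are hypothetical and the string is only an example.

```python
def build_vocabulary(text):
    """Map each distinct character to an integer index, in sorted order."""
    symbols = ''.join(sorted(set(text)))
    symbol_to_index = {s: i for i, s in enumerate(symbols)}
    return symbols, symbol_to_index


def encode(text, symbol_to_index):
    return [symbol_to_index[s] for s in text]  # raises KeyError for unseen symbols


def decode(indices, symbols):
    return ''.join(symbols[i] for i in indices)


symbols, symbol_to_index = build_vocabulary('hello world')
encoded = encode('hello', symbol_to_index)
decoded = decode(encoded, symbols)

print(repr(symbols))  # ' dehlorw'
print(encoded)        # [3, 2, 4, 4, 5]
print(decoded)        # hello
```

A dictionary makes each lookup constant-time, which may matter when encoding long batches.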

In [0]:
data = HutterPrizeDataset('enwik8', device)

In [0]:
# to simplify computation you can reduce width or depth of the network
decoder = ByteNetDecoder(
    n_symbols=len(data.symbols),
    d=512, # controls width
    kernel_size=3,
    dilations=[1, 2, 4, 8, 16],
    groupsize=5) # controls depth
decoder.to(device)

Try running the decoder on an arbitrary input to debug your implementation.
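
A minimal smoke test might look like the sketch below. Because the real decoder depends on the unfinished tasks, a hypothetical `StandInDecoder` with the same input and output shapes is used here; `smoke_test` is likewise an illustrative helper. For a 1-D batch of symbol indices, the output is expected to have shape (length, n_symbols).

```python
import torch
from torch import nn


class StandInDecoder(nn.Module):
    """Hypothetical placeholder with the same interface as ByteNetDecoder."""

    def __init__(self, n_symbols, d):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, 2 * d)
        self.out = nn.Linear(2 * d, n_symbols)

    def forward(self, x):
        return self.out(self.embedding(x))


def smoke_test(decoder, n_symbols, length=500):
    """Run the decoder on random symbol indices and check the logits' shape."""
    x = torch.randint(0, n_symbols, (length,))
    with torch.no_grad():
        logits = decoder(x)
    assert logits.shape == (length, n_symbols), logits.shape
    return tuple(logits.shape)


shape = smoke_test(StandInDecoder(n_symbols=205, d=8), n_symbols=205)
print(shape)  # (500, 205)
```

Calling `smoke_test(decoder, len(data.symbols))` on the real model performs the same check.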

# Train the network

If your implementation is correct, the following script will train a ByteNet decoder. Try running it for several iterations.

- In practice, you should remember to shift the targets by one during training.
- Note that we do not use the first 100 outputs during training. Why?
- The training procedure uses a batch size of 1.
- At every iteration the network updates the predictions for 400 symbols in the input sequence.
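
The shift by one and the count of 400 symbols can be verified with plain indices. The sketch below replaces the encoded batch by the positions 0..499 and applies the two slices that appear in the training loop, `[99:-1]` for the logits and `[100:]` for the targets.

```python
batch = list(range(500))          # stand-in for 500 encoded symbols
output_positions = batch[99:-1]   # positions of the logits kept by the loss
target_positions = batch[100:]    # positions of the symbols to predict

pairs = list(zip(output_positions, target_positions))

print(len(pairs))  # 400
print(pairs[0])    # (99, 100)
print(pairs[-1])   # (498, 499)
```

Each kept output at position i is paired with the symbol at position i + 1.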

In [0]:
def trainDecoder(decoder, dataset, n_epochs=4):
    train_losses = []
    validation_losses = []

    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(decoder.parameters(),
                                 lr=0.0003,
                                 weight_decay=1e-4)
    for epoch in range(n_epochs):
        decoder.train()
        running_train_loss = 0.
        n_its = (dataset.ranges['train'][1] - dataset.ranges['train'][0]) // 500
        print('Epoch #%d \tTotal number of iterations: %d' % (epoch, n_its))
        train_iterator = tqdm(enumerate(dataset.iterate_batches(500, 'train')))
        for t, batch in train_iterator:
            optimizer.zero_grad()
            target = batch[100:]
            logits = decoder(batch)[99:-1]
            loss = criterion(logits, target)
            loss.backward()
            optimizer.step()
            running_train_loss += (loss.item() - running_train_loss) / (t + 1)
            train_iterator.set_postfix(loss=running_train_loss)            
    
        train_losses.append(running_train_loss)

        n_its = (dataset.ranges['validation'][1] - dataset.ranges['validation'][0]) // 500
        print('Validation \t Total number of iterations: %d' % n_its)
        decoder.eval()
        running_val_loss = 0.
        val_iterator = tqdm(enumerate(dataset.iterate_batches(500, 'validation')))
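        with torch.no_grad():  # gradients are not needed for evaluation
            for t, batch in val_iterator:
                target = batch[100:]
                # shift by one, as in training: output i predicts symbol i + 1
                logits = decoder(batch)[99:-1]
                loss = criterion(logits, target)
                running_val_loss += (loss.item() - running_val_loss) / (t + 1)
                val_iterator.set_postfix(loss=running_val_loss)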
            
        # parentheses are required: % binds tighter than +
        torch.save(decoder.state_dict(), 'checkpoint_epoch_%d' % (epoch + 1))
            
        validation_losses.append(running_val_loss)

    return train_losses, validation_losses

In [0]:
trainDecoder(decoder, data)
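
The losses returned by `trainDecoder` are mean cross-entropies in nats, because `CrossEntropyLoss` uses the natural logarithm. Results on enwik8 are often quoted in bits per character, so a conversion can be handy; the helper name below is hypothetical.

```python
import math


def nats_to_bits_per_character(loss_in_nats):
    """Convert a mean cross-entropy from nats to bits."""
    return loss_in_nats / math.log(2)


for loss in (1.0, 2.0):
    print(round(nats_to_bits_per_character(loss), 3))
# 1.443
# 2.885
```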

**Task 3.** Implement a function that samples the next character given the condition $x$. Train the model for several epochs and generate some samples using the trained model. If everything works properly, the validation loss should fall in the range [1., 2.]. Try both extracts from the test set and arbitrary text as a condition.

In [0]:
def sample_next_symbol(decoder, x, T=1.):
    # TODO
    pass
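
The role of the temperature `T` can be illustrated independently of the decoder. The sketch below samples an index from a vector of logits after dividing them by `T`; `sample_from_logits` is a hypothetical helper, and obtaining the logits for the last position of $x$ from the decoder is left to the task.

```python
import torch


def sample_from_logits(logits, T=1.0):
    """Sample one index from softmax(logits / T); `logits` is a 1-D tensor."""
    probabilities = torch.softmax(logits / T, dim=-1)
    return torch.multinomial(probabilities, num_samples=1).item()


# Only index 1 has non-zero probability here, so the sample is always 1.
logits = torch.tensor([float('-inf'), 0.0, float('-inf')])
sample = sample_from_logits(logits)
print(sample)  # 1
```

Values of `T` below 1 sharpen the distribution towards the most likely symbol, while values above 1 flatten it.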

In [0]:
# TODO

**Optional task 4.** In the original paper the authors used a different block for language modeling. Implement the MU block from Figure 3. Compare the performance of the basic block and the MU block.