# Assignment 2. Language modeling.

This assignment is devoted to language modeling. The goal is to write an RNN-based language model in PyTorch. Since word-based language modeling requires long training and is memory-consuming due to the large vocabulary, we start with character-based language modeling. We are going to train the model to generate words as sequences of characters. During training, we teach it to predict the characters of the words in the training set.



## Task 1. Character-based language modeling: data preparation (15 points).

We train the language models on the data of the **Sigmorphon 2018 Shared Task**. First, download the Russian datasets.

In [None]:
!wget https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-train-high
!wget https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-dev
!wget https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-test

**1.1 (1 point)**
All the files contain tab-separated triples ```<lemma>-<form>-<tags>```, where ```<form>``` may contain spaces (*будете соответствовать*). Write a function that loads a list of all word forms that do not contain spaces.

In [342]:
import codecs
import numpy as np
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [343]:
def read_infile(infile):
    words = []
    with codecs.open(infile, "r", "utf_8_sig") as f:
        for line in f.readlines():
            line_words = line.lower().split('\t')
            if len(line_words) == 3 and ' ' not in line_words[1]:
                # I decided to include both lemma and form in dataset
                words += line_words[:-1]
    return words

In [344]:
# wget saves the files under their original names, without an extension
train_words = read_infile("russian-train-high")
dev_words = read_infile("russian-dev")
test_words = read_infile("russian-test")
print(len(train_words), len(dev_words), len(test_words))
print(*train_words[:10])

18426 1834 1844
валлонский валлонскому незаконченный незаконченным истрёпывать истрёпывав личный личного серьга серьгам


**1.2 (2 points)** Write a **Vocabulary** class that allows you to transform symbols into their indexes. The class should have a ```__call__``` method that applies this transformation to sequences of symbols and to batches of sequences as well. You can also use [SimpleVocabulary](https://github.com/deepmipt/DeepPavlov/blob/c10b079b972493220c82a643d47d718d5358c7f4/deeppavlov/core/data/simple_vocab.py#L31) from DeepPavlov. Fit an instance of this class on the training data.

In [345]:
# Iterable lives in collections.abc; the alias in collections was removed in Python 3.10
from collections.abc import Iterable

class SimpleVocabulary:
    def __init__(self):
        self.sym_to_ind = {}
        self.ind_to_sym = {}
    
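    def fit(self, text):
        # Special symbols come first so that their indexes are fixed
        # (<PAD>=0, <BEG>=1, <END>=2); the remaining symbols are sorted
        # to make the mapping reproducible between runs.
        specials = ['<PAD>', '<BEG>', '<END>']
        symbols = set()
        for word in text:
            symbols.update(word)
        symbols = specials + sorted(symbols - set(specials))
        
        self.sym_to_ind = {val: i for i, val in enumerate(symbols)}
        self.ind_to_sym = {v: k for k, v in self.sym_to_ind.items()}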
    
    # we assume that each string in the batch is exactly one symbol (including special ones)
    def __call__(self, batch):
        if isinstance(batch, Iterable) and not isinstance(batch, str):
            looked_up_batch = [self(sample) for sample in batch]
        else:
            return self.sym_to_ind[batch]

        return looked_up_batch
    
    def __len__(self):
        return len(self.sym_to_ind)

vocab = SimpleVocabulary()
vocab.fit([list(x) for x in train_words])
print(len(vocab))

37


**1.3 (2 points)** Write a **Dataset** class that inherits from ```torch.utils.data.Dataset```. It should take a list of words and the ```vocab``` as initialization arguments.

In [347]:
import torch
from torch.utils.data import Dataset as TorchDataset

class Dataset(TorchDataset):
    
    """Custom data.Dataset compatible with data.DataLoader."""
    def __init__(self, data, vocab):
        self.data = data
        self.vocab = vocab

    def __getitem__(self, index):
        """
        Returns one tensor pair (source and target). The source tensor corresponds to the input word
        with the "<BEG>" symbol prepended. The target tensor contains the answers for the language
        model that receives this word as input: the same word shifted by one position,
        with the "<END>" symbol appended.
        """
        word = ['<BEG>'] + list(self.data[index]) + ['<END>']
        source = torch.tensor(self.vocab(word[:-1]))
        target = torch.tensor(self.vocab(word[1:]))
        return source, target

    def __len__(self):        
        return len(self.data)

In [348]:
train_dataset = Dataset(train_words, vocab)
dev_dataset = Dataset(dev_words, vocab)
test_dataset = Dataset(test_words, vocab)

**1.4 (3 points)** Use a standard ```torch.utils.data.DataLoader``` to obtain an iterable over batches. Print the shapes of the first 10 input batches with ```batch_size=1```.

In [349]:
from torch.utils.data import DataLoader
from itertools import islice

train_data_loader = DataLoader(train_dataset, batch_size=1)
dev_data_loader = DataLoader(dev_dataset, batch_size=1)
test_data_loader = DataLoader(test_dataset, batch_size=1)


for x in islice(train_data_loader, 10):
    print(x[0].shape)

torch.Size([1, 11])
torch.Size([1, 12])
torch.Size([1, 14])
torch.Size([1, 14])
torch.Size([1, 12])
torch.Size([1, 11])
torch.Size([1, 7])
torch.Size([1, 8])
torch.Size([1, 7])
torch.Size([1, 8])


**(1.5) 1 point** Explain why this does not work with a larger batch size.

If `batch_size > 1`, the words in a batch may have different lengths. Their source and target tensors then have different shapes, so the default collate function cannot stack them into a single tensor.
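The failure can be reproduced in isolation. The snippet below is a minimal sketch with made-up index values (they are not taken from the real vocabulary); it only shows that `torch.stack` refuses tensors of different lengths.

```python
import torch

# Two encoded "words" of different lengths (index values are illustrative only).
short = torch.tensor([1, 5, 7])
long = torch.tensor([1, 5, 7, 9, 4])

try:
    torch.stack([short, long])
    stacked_ok = True
except RuntimeError as err:
    # Stacking requires all tensors to have exactly the same shape.
    stacked_ok = False
    print("stack failed:", err)

# Tensors of equal length stack without problems.
same = torch.stack([short, short])
print(same.shape)
```

This is why the next step pads every sequence in a batch to a common length before stacking.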

**(1.6) 5 points** Write a function **collate** that allows you to deal with batches of greater size. See [discussion](https://discuss.pytorch.org/t/dataloader-for-various-length-of-data/6418/8) for an example. Implement your function as a class ```__call__``` method to make it more flexible.

In [350]:
def pad_tensor(vec, length, dim, pad_symbol):
    """
    Pads a vector ``vec`` up to length ``length`` along axis ``dim`` with pad symbol ``pad_symbol``.
    """
    pad_size = list(vec.shape)
    pad_size[dim] = length - vec.size(dim)
    return torch.cat([vec, pad_symbol 
                      * torch.ones(*pad_size, dtype=torch.long)], dim=dim)

class Padder:
    
    def __init__(self, dim=0, pad_symbol=0):
        self.dim = dim
        self.pad_symbol = pad_symbol
        
    def __call__(self, batch):
        # batch: list of batch_size (source, target) tensor pairs
        # find longest sequence
        max_len = max(map(lambda x: x[0].shape[self.dim], batch))
        # pad according to max_len
        batch = list(map(lambda x: 
                     (pad_tensor(x[0], max_len, self.dim, self.pad_symbol),
                     pad_tensor(x[1], max_len, self.dim, self.pad_symbol)), 
                    batch))
        # stack all
        xs = torch.stack(list(map(lambda x: x[0], batch)), dim=0)
        ys = torch.stack(list(map(lambda x: x[1], batch)), dim=0)

        return xs, ys
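As an alternative sketch, the same padding can be done with `torch.nn.utils.rnn.pad_sequence`. The class name `PadSequenceCollate` and the toy batch are hypothetical and only serve as an illustration; this is not a replacement required by the assignment.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


class PadSequenceCollate:
    """Hypothetical collate callable built on pad_sequence."""

    def __init__(self, pad_index=0):
        self.pad_index = pad_index

    def __call__(self, batch):
        # batch is a list of (source, target) pairs of 1-D tensors
        sources, targets = zip(*batch)
        xs = pad_sequence(list(sources), batch_first=True,
                          padding_value=self.pad_index)
        ys = pad_sequence(list(targets), batch_first=True,
                          padding_value=self.pad_index)
        return xs, ys


# Toy batch with made-up indexes: 0 plays the role of <PAD>.
toy_batch = [
    (torch.tensor([1, 4, 5]), torch.tensor([4, 5, 2])),
    (torch.tensor([1, 6]), torch.tensor([6, 2])),
]
xs, ys = PadSequenceCollate(pad_index=0)(toy_batch)
print(xs)
print(ys)
```

An instance of such a class can be passed as `collate_fn` to `DataLoader` in the same way as `Padder`.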


**(1.7) 1 point** Again, use ```torch.utils.data.DataLoader``` to obtain an iterable over batches. Print the shapes of the first 10 input batches with a batch size of your choice.

In [351]:
from torch.utils.data import DataLoader

padder = Padder(dim=0, pad_symbol=vocab.sym_to_ind['<PAD>'])
train_data_loader = DataLoader(train_dataset,
                               batch_size=1024, collate_fn=padder)
dev_data_loader = DataLoader(dev_dataset,
                               batch_size=1024, collate_fn=padder)
test_data_loader = DataLoader(test_dataset,
                               batch_size=1024, collate_fn=padder)


for x in islice(train_data_loader, 10):
    print(x[0].shape)

torch.Size([1024, 20])
torch.Size([1024, 28])
torch.Size([1024, 21])
torch.Size([1024, 22])
torch.Size([1024, 21])
torch.Size([1024, 26])
torch.Size([1024, 21])
torch.Size([1024, 23])
torch.Size([1024, 22])
torch.Size([1024, 21])


## Task 2. Character-based language modeling. (35 points)

**2.1 (5 points)** Write a network that performs language modeling. It should include three layers:
1. **Embedding** layer that transforms input symbols into vectors.
2. An **RNN** layer that outputs a sequence of hidden states (you may use https://pytorch.org/docs/stable/nn.html#gru).
3. A **Linear** layer with ``softmax`` activation that produces the output distribution for each symbol.

In [452]:
import torch.nn as nn

class RNNLM(nn.Module):

    def __init__(self, vocab_size, embeddings_dim, hidden_size):
        super(RNNLM, self).__init__()
        
        self.emb = nn.Embedding(vocab_size, embeddings_dim)
        self.gru = nn.GRU(embeddings_dim, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.softmax = nn.Softmax(dim=2)
                
    def forward(self, inputs, hidden=None):
        emb = self.emb(inputs)
        # save last hidden_state
        gru, self.last_hidden = self.gru(emb, hidden)

        # return raw logits; softmax is applied inside CrossEntropyLoss
        lin = self.linear(gru)
        return lin

**2.2 (1 point)** Write a function ``validate_on_batch`` that takes a model, the loss criterion, a batch of inputs and a batch of targets, and returns the loss tensor for the whole batch. This loss should not be normalized.

In [453]:
def validate_on_batch(model, criterion, x, y):
    y_pred = model(x)
    vocab_size = y_pred.shape[2]
    return criterion(y_pred.reshape((-1, vocab_size)), y.reshape(-1))

**2.3 (1 point)** Write a function ``train_on_batch`` that accepts all the arguments of ``validate_on_batch`` and also an optimizer, calculates the loss and makes a single step of gradient optimization. This function should call ``validate_on_batch`` inside.

In [454]:
def train_on_batch(model, criterion, x, y, optimizer):
    loss = validate_on_batch(model, criterion, x, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

**2.4 (3 points)** Write a training loop. You should define your ``RNNLM`` model, the criterion, the optimizer and the hyperparameters (number of epochs and batch size). Then train the model for a required number of epochs. On each epoch evaluate the average training loss and the average loss on the validation set. 

**2.5 (3 points)** Do not forget to average your loss over only non-padding symbols, otherwise it will be too optimistic.
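One way to exclude padding from the average is the `ignore_index` argument of `CrossEntropyLoss`. The sketch below uses random logits and an assumed padding index of 0 to check that the result equals a manual mean over the non-padding positions.

```python
import torch
import torch.nn as nn

PAD = 0  # assumed padding index for this toy example
VOCAB = 5

torch.manual_seed(0)
logits = torch.randn(2, 4, VOCAB)  # (batch, seq_len, vocab_size)
targets = torch.tensor([[3, 1, 2, PAD],
                        [4, 2, PAD, PAD]])

flat_logits = logits.reshape(-1, VOCAB)
flat_targets = targets.reshape(-1)

# Mean over non-padding positions only.
masked_mean = nn.CrossEntropyLoss(ignore_index=PAD)(flat_logits, flat_targets)

# The same value computed by hand from per-position losses.
per_token = nn.CrossEntropyLoss(reduction="none")(flat_logits, flat_targets)
mask = flat_targets != PAD
manual_mean = per_token[mask].sum() / mask.sum()

print(float(masked_mean), float(manual_mean))
```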

In [457]:
embeddings_dim = 30
hidden_size = 50 
num_epochs = 300
batch_size = 128
learning_rate = 1e-3

padder = Padder(dim=0, pad_symbol=vocab.sym_to_ind['<PAD>'])
train_data_loader = DataLoader(train_dataset,
                               batch_size=batch_size, collate_fn=padder)
model = RNNLM(len(vocab), embeddings_dim, hidden_size)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = torch.nn.CrossEntropyLoss(ignore_index=vocab.sym_to_ind['<PAD>'])

In [458]:
%%time
loss_history = []

for epoch in range(num_epochs):
    epoch_loss = []
    for batch in train_data_loader:
        loss = train_on_batch(model, criterion, batch[0], batch[1], optimizer)
        epoch_loss.append(float(loss))
        
    # save average loss of each epoch        
    loss_history.append(np.mean(epoch_loss))

Wall time: 28min 27s
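The task also asks for the average loss on the validation set after each epoch. A minimal sketch of such an evaluation is given below; the function name `evaluate` and the toy model and batches are assumptions made for illustration and are not part of the original solution.

```python
import torch
import torch.nn as nn


def evaluate(model, criterion, batches, pad_index):
    """Average per-symbol loss over (x, y) batches, ignoring padding."""
    was_training = model.training
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for x, y in batches:
            logits = model(x)
            n_tokens = int((y != pad_index).sum())
            # criterion is assumed to return the mean over non-padding symbols
            loss = criterion(logits.reshape(-1, logits.shape[-1]), y.reshape(-1))
            total_loss += float(loss) * n_tokens
            total_tokens += n_tokens
    model.train(was_training)
    return total_loss / max(total_tokens, 1)


# Toy stand-ins for the real model and data loader.
PAD, VOCAB = 0, 6
toy_model = nn.Sequential(nn.Embedding(VOCAB, 8), nn.Linear(8, VOCAB))
toy_criterion = nn.CrossEntropyLoss(ignore_index=PAD)
toy_batches = [
    (torch.tensor([[1, 3, 4], [1, 5, 0]]), torch.tensor([[3, 4, 2], [5, 2, 0]])),
    (torch.tensor([[1, 4]]), torch.tensor([[4, 2]])),
]
val_loss = evaluate(toy_model, toy_criterion, toy_batches, PAD)
print(val_loss)
```

In the notebook, a call such as `evaluate(model, criterion, dev_data_loader, vocab.sym_to_ind['<PAD>'])` at the end of each epoch would give the validation curve, provided `dev_data_loader` uses the padding collate function.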


**2.6 (5 points)** Write a function **predict_on_batch** that outputs letter probabilities of all words in the batch.

In [459]:
def predict_on_batch(model, batch):
    with torch.no_grad():
        pred = nn.functional.softmax(model(batch), dim=2)
    return pred

**2.7 (1 point)** Calculate the letter probabilities for all words in the test dataset. Print them for the last 20 words. Do not forget to disable shuffling in the ``DataLoader``.

In [460]:
test_data_loader = DataLoader(test_dataset,
                               batch_size=1, collate_fn=padder)

In [461]:
for i, batch in enumerate(test_data_loader):
    word = [vocab.ind_to_sym[int(sym)] for sym in batch[1][0]]
    probs = predict_on_batch(model, batch[0])
    # probability that the model assigns to each target symbol
    word_probs = [round(float(probs[0, j, batch[1][0, j]]), 3)
                  for j in range(len(batch[1][0]))]

    # print only the last 20 words
    if i >= len(test_data_loader) - 20:
        print(i, 'word:', ''.join(word))
        print(*list(zip(word, word_probs)))

1824 word: общеобразовательный<END>
('о', 0.071) ('б', 0.267) ('щ', 0.032) ('е', 0.382) ('о', 0.002) ('б', 0.105) ('р', 0.518) ('а', 0.398) ('з', 0.045) ('о', 0.015) ('в', 0.451) ('а', 0.69) ('т', 0.59) ('е', 0.04) ('л', 0.951) ('ь', 0.913) ('н', 0.921) ('ы', 0.599) ('й', 0.721) ('<END>', 0.999)
1825 word: общеобразовательного<END>
('о', 0.071) ('б', 0.267) ('щ', 0.032) ('е', 0.382) ('о', 0.002) ('б', 0.105) ('р', 0.518) ('а', 0.398) ('з', 0.045) ('о', 0.015) ('в', 0.451) ('а', 0.69) ('т', 0.59) ('е', 0.04) ('л', 0.951) ('ь', 0.913) ('н', 0.921) ('о', 0.262) ('г', 0.296) ('о', 0.981) ('<END>', 0.98)
1826 word: фригидный<END>
('ф', 0.01) ('р', 0.13) ('и', 0.062) ('г', 0.09) ('и', 0.086) ('д', 0.001) ('н', 0.532) ('ы', 0.672) ('й', 0.738) ('<END>', 1.0)
1827 word: фригидной<END>
('ф', 0.01) ('р', 0.13) ('и', 0.062) ('г', 0.09) ('и', 0.086) ('д', 0.001) ('н', 0.532) ('о', 0.293) ('й', 0.333) ('<END>', 0.999)
1828 word: безмолвный<END>
('б', 0.04) ('е', 0.28) ('з', 0.356) ('м', 0.052) ('о'

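The probabilities are low for the first few characters, where many continuations are possible, and mostly become high towards the end of a word, where the ending is largely determined by the preceding characters. The `<END>` symbol also gets a probability close to 1 at the end of each word, so the language model knows when a word should finish.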
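The per-character probabilities can be combined into a score for the whole word. As a worked sketch, the snippet below takes the rounded values printed above for "фригидный" and computes the word log-probability and the per-character perplexity. Because the printed values are rounded to three digits, the result is only approximate.

```python
import math

# Rounded probabilities printed above for "фригидный<END>".
char_probs = [0.01, 0.13, 0.062, 0.09, 0.086, 0.001, 0.532, 0.672, 0.738, 1.0]

# The log-probability of the word is the sum of per-character log-probabilities.
log_prob = sum(math.log(p) for p in char_probs)

# Per-character perplexity: exp of the average negative log-probability.
perplexity = math.exp(-log_prob / len(char_probs))

print(log_prob, perplexity)
```

A probability that rounds to exactly 0 would break `math.log`, so unrounded model outputs are preferable for this kind of computation.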

**2.8 (5 points)** Write a function that generates a single word (sequence of indexes) given the model. Do not forget about the hidden state! Be careful about start and end symbol indexes. Use ``torch.multinomial`` for sampling.

In [470]:
def generate(model, max_length=20, start_index=1, end_index=2):
    prev_index = start_index
    word = []
    hidden = None
    # no gradients are needed for sampling
    with torch.no_grad():
        for _ in range(max_length):
            # using batch of one letter
            batch = torch.tensor(prev_index).reshape((1, 1))
            probs = torch.nn.functional.softmax(model(batch, hidden), dim=2)
            new_index = int(torch.multinomial(probs.reshape(-1), 1))
            word.append(new_index)
            if new_index == end_index:
                break

            prev_index = new_index
            # pass the hidden state on to the next step
            hidden = model.last_hidden

    return word
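For reference, here is a small standalone sketch of how `torch.multinomial` behaves. The distributions are toy values chosen for illustration, not model outputs.

```python
import torch

torch.manual_seed(0)

# Sampling indexes in proportion to the given weights.
probs = torch.tensor([0.1, 0.2, 0.7])
samples = torch.multinomial(probs, num_samples=1000, replacement=True)
counts = torch.bincount(samples, minlength=3)
print(counts)

# An index with zero probability is never drawn.
one_hot = torch.tensor([0.0, 0.0, 1.0])
always_two = torch.multinomial(one_hot, num_samples=5, replacement=True)
print(always_two)

# With a 2-D input, each row is treated as a separate distribution.
batched = torch.multinomial(
    torch.tensor([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]), num_samples=1
)
print(batched.shape)
```

The 2-D form is what a batched generation function can rely on: one sampled index per row.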

**2.9 (1 point)** Use ``generate`` to sample 20 pseudowords. Do not forget to transform indexes to letters.

In [467]:
for i in range(20):
    word = generate(model, start_index=vocab.sym_to_ind['<BEG>'],
                    end_index=vocab.sym_to_ind['<END>'])
    word = ''.join([vocab.ind_to_sym[ind] for ind in word])
    print(word)

хригейцок<END>
блатятетька<END>
потадажка<END>
строесь<END>
перефермуав<END>
ответника<END>
мочажка<END>
двтождах<END>
изучитемия<END>
непнитарованный<END>
выматечны<END>
ладие<END>
алкнивых<END>
опода<END>
мокатять<END>
всхажтный<END>
дамляю<END>
свойственного<END>
проболками<END>
изнойный<END>



The `<END>` symbol was left in the output to indicate that the language model chose to finish the word on its own. The generated words look quite similar to real Russian words, and one of them, "свойственного", is a real word.

**(2.10) 5 points** Write a batched version of the generation function. You should sample the following symbol only for the words that are not finished yet, so apply a boolean mask to trace active words.

In [556]:
def generate_batch(model, batch_size, max_length = 20, start_index=1,
                   end_index=2):
    prev_index = start_index * torch.ones((batch_size, 1), dtype=torch.long)
    # pad small words
    words = vocab.sym_to_ind['<PAD>'] * torch.ones(
        (batch_size, max_length), dtype=torch.long
    ) 
    
    finished_mask = torch.zeros(batch_size, dtype=torch.bool)
    hidden = None
    # switch off dropout while sampling and restore the mode afterwards
    was_training = model.training
    model.eval()
    with torch.no_grad():
        for i in range(max_length):
            probs = torch.nn.functional.softmax(
                model(prev_index, hidden), dim=2
            )
            new_index = torch.multinomial(probs.reshape((batch_size, -1)), 1)
            # adding only to unfinished words
            append_ind = ~finished_mask
            words[append_ind, i] = new_index.reshape(-1)[append_ind]
            # update mask
            finished_mask = finished_mask | (new_index.reshape(-1) == end_index)
            if finished_mask.all():
                break

            prev_index = new_index
            hidden = model.last_hidden
    model.train(was_training)

    return words
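The masking logic can be traced on a toy example without a model. In the sketch below the "sampled" indexes are fixed by hand, and the values `PAD = 0` and `END = 2` are assumptions made for illustration.

```python
import torch

PAD, END = 0, 2  # assumed special indexes for this toy example

words = torch.full((3, 4), PAD, dtype=torch.long)
finished = torch.zeros(3, dtype=torch.bool)

# Hand-written stand-ins for the indexes sampled at each step.
steps = [
    torch.tensor([5, 2, 7]),
    torch.tensor([2, 9, 8]),
    torch.tensor([4, 4, 2]),
]

for t, new_index in enumerate(steps):
    active = ~finished
    # write the new symbol only for words that have not produced END yet
    words[active, t] = new_index[active]
    finished |= new_index == END

print(words)
```

The second word ends at the first step, so the values 9 and 4 drawn for it later are discarded and its row stays padded.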

In [557]:
generated = []
for _ in range(2):
    generated += generate_batch(model, batch_size=10,
                                start_index=vocab.sym_to_ind['<BEG>'],
                                end_index=vocab.sym_to_ind['<END>'])

transformed = [[vocab.ind_to_sym[int(ind)] 
                for ind in word if ind != vocab.sym_to_ind['<PAD>']]
               for word in generated]

for elem in transformed:
    print("".join(elem))

заморальникой<END>
рвкивает<END>
задальных<END>
нотранских<END>
гахстёрный<END>
лесорогазный<END>
антутка<END>
подпостождённом<END>
сондеми<END>
социотить<END>
двоежу<END>
исмитаительский<END>
чревого<END>
внедженный<END>
необ<END>
выдаем<END>
отчарвье<END>
склозном<END>
паскать<END>
непрепроскивать<END>


**(2.11) 5 points** Experiment with the type of RNN, number of layers, units and/or dropout to improve the perplexity of the model.

In [562]:
import torch.nn as nn

class RNNLM_experiment(nn.Module):

    def __init__(self, vocab_size, embeddings_dim, hidden_size, gru_kwargs=None):
        super(RNNLM_experiment, self).__init__()
        # avoid a mutable default argument
        gru_kwargs = gru_kwargs or {}
        
        self.emb = nn.Embedding(vocab_size, embeddings_dim)
        self.gru = nn.GRU(embeddings_dim, hidden_size, batch_first=True,
                         **gru_kwargs)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.softmax = nn.Softmax(dim=2)
        self.gru_kwargs = gru_kwargs
                
    def forward(self, inputs, hidden=None):
        emb = self.emb(inputs)
        # save last hidden_state
        gru, self.last_hidden = self.gru(emb, hidden)

        # return raw logits; softmax is applied inside CrossEntropyLoss
        lin = self.linear(gru)
        return lin

In [569]:
embeddings_dim = 30
hidden_size = 50 
num_epochs = 200
batch_size = 128
learning_rate = 1e-3

num_layers = 3
dropout = 0.5
gru_kwargs = {
    'num_layers': num_layers,
    'dropout': dropout
}

padder = Padder(dim=0, pad_symbol=vocab.sym_to_ind['<PAD>'])
train_data_loader = DataLoader(train_dataset,
                               batch_size=batch_size, collate_fn=padder)
model_exp = RNNLM_experiment(len(vocab), embeddings_dim, hidden_size,
                             gru_kwargs)
optimizer = torch.optim.Adam(model_exp.parameters(), lr=learning_rate)
criterion = torch.nn.CrossEntropyLoss(ignore_index=vocab.sym_to_ind['<PAD>'])

In [570]:
%%time
loss_history = []

for epoch in range(num_epochs):
    epoch_loss = []
    for batch in train_data_loader:
        loss = train_on_batch(model_exp, criterion, batch[0], batch[1],
                              optimizer)
        epoch_loss.append(float(loss))
        
    # save average loss of each epoch        
    loss_history.append(np.mean(epoch_loss))

Wall time: 39min 41s


In [577]:
generated = []
for _ in range(2):
    generated += generate_batch(model_exp, batch_size=10,
                                start_index=vocab.sym_to_ind['<BEG>'],
                                end_index=vocab.sym_to_ind['<END>'])

transformed = [[vocab.ind_to_sym[int(ind)] 
                for ind in word if ind != vocab.sym_to_ind['<PAD>']]
               for word in generated]

for elem in transformed:
    print("".join(elem))

намовомь<END>
объад<END>
хос<END>
ликикари<END>
обестравывать<END>
обеничитель<END>
хонковданный<END>
лесной<END>
плоногинты<END>
сацогись<END>
нисяс<END>
сульникольный<END>
плотромных<END>
верагис<END>
провотерев<END>
аоголивайтесь<END>
йожрёрном<END>
свлодим<END>
дисера<END>
прикнуть<END>


The model seems to be underfitted: it now has more parameters, and its perplexity has increased. It may help to train it for longer and to decrease the number of layers.
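Perplexity is the exponent of the average per-symbol cross-entropy, so it can be obtained directly from the loss. The sketch below is a sanity check under a stated assumption: a model that outputs a uniform distribution over the 37 symbols of the vocabulary should have a perplexity of 37. The helper name `perplexity_from_loss` is hypothetical.

```python
import math

import torch
import torch.nn as nn


def perplexity_from_loss(mean_cross_entropy):
    """Convert an average per-symbol cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


VOCAB = 37  # vocabulary size printed earlier in the notebook

torch.manual_seed(0)
# Equal logits correspond to a uniform distribution over the vocabulary.
uniform_logits = torch.zeros(10, VOCAB)
targets = torch.randint(0, VOCAB, (10,))

loss = nn.CrossEntropyLoss()(uniform_logits, targets)
ppl = perplexity_from_loss(float(loss))
print(ppl)
```

Applying the same conversion to the average validation loss of each configuration would make the comparison between models quantitative.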

In [578]:
embeddings_dim = 30
hidden_size = 50 
num_epochs = 300
batch_size = 128
learning_rate = 1e-3

num_layers = 2
dropout = 0.5
gru_kwargs = {
    'num_layers': num_layers,
    'dropout': dropout
}

padder = Padder(dim=0, pad_symbol=vocab.sym_to_ind['<PAD>'])
train_data_loader = DataLoader(train_dataset,
                               batch_size=batch_size, collate_fn=padder)
model_exp = RNNLM_experiment(len(vocab), embeddings_dim, hidden_size,
                             gru_kwargs)
optimizer = torch.optim.Adam(model_exp.parameters(), lr=learning_rate)
criterion = torch.nn.CrossEntropyLoss(ignore_index=vocab.sym_to_ind['<PAD>'])

In [579]:
%%time
loss_history = []

for epoch in range(num_epochs):
    epoch_loss = []
    for batch in train_data_loader:
        loss = train_on_batch(model_exp, criterion, batch[0], batch[1],
                              optimizer)
        epoch_loss.append(float(loss))
        
    # save average loss of each epoch        
    loss_history.append(np.mean(epoch_loss))

Wall time: 42min 28s


In [582]:
generated = []
for _ in range(2):
    generated += generate_batch(model_exp, batch_size=10,
                                start_index=vocab.sym_to_ind['<BEG>'],
                                end_index=vocab.sym_to_ind['<END>'])

transformed = [[vocab.ind_to_sym[int(ind)] 
                for ind in word if ind != vocab.sym_to_ind['<PAD>']]
               for word in generated]

for elem in transformed:
    print("".join(elem))

посведил<END>
каяма<END>
ковит<END>
суммунен<END>
пподстлельный<END>
осными<END>
высчитать<END>
обленеческимо<END>
анфонка<END>
кликоходный<END>
окенный<END>
располител<END>
безрабомный<END>
кала<END>
отквожелывать<END>
довадать<END>
стрятось<END>
преплескат<END>
озтоженный<END>
голбочном<END>


Now the words look better, and some of them are real ones. With more data and more layers we could probably achieve better results.