# ANLP 2020 - Assignment 3

*Sophia Student, 1234567* (enter your name/student id number here)

<div class="alert alert-block alert-danger">Due: Wednesday, January 20, 2021, 23:59</div>

<div class="alert alert-block alert-info">

**NOTE**
<br>

Please first fill in your name and id number at the top of the assignment, and **rename** the assignment file to **yourlastname-anlp-3.ipynb**<br><br>
Problems and questions are given in blue boxes like this one. All grey and white boxes marked by the comment "#student solution/discussion here" must be filled by you (they either require code or a (brief!) discussion). <br><br>
Please hand in your assignment by the deadline via Moodle. In case of questions, you can contact the TAs and the instructors via the usual channels.

</div>

<div class="alert alert-block alert-info">
    
In this assignment, you will implement an LSTM model and train it to generate text, one character at a time. (So note: We're asking you to create a character-level model; in the lectures, we've so far only seen word-level models. Think about what the difference is, and what its practical consequences are.)

For training, we prepared two text files (train and test) containing passages from Charles Dickens' novels (dickens_train.txt, dickens_test.txt).

You should use the PyTorch machine learning library to implement this exercise.

- Instructions to install PyTorch can be found here: <http://pytorch.org/>
- The introductory tutorial we prepared for PyTorch is attached to the assignment: pytorch_lecture_2019.ipynb
- Some PyTorch examples for an in-depth overview: <https://github.com/jcjohnson/pytorch-examples>
- Another common quickstart tutorial is this [PyTorch 60 Minutes Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)

(But don't get carried away: For this assignment, you mostly need the very straightforward elements from the `nn` module in PyTorch that implement the layers that you've learned about, such as RNNs, LSTMs, and embeddings.)

This assignment is designed to be runnable on a decent CPU. With a 2-layer LSTM and a hidden size of 128, training takes ~20 minutes; with a hidden size of 512, it takes ~2 hours. Please take this into consideration while doing this assignment.

Alternatively, you can use Google Colab <https://colab.research.google.com/> by uploading your notebook there, which gives you access to a GPU. (Check that you are indeed using the GPU, via `print(torch.cuda.is_available())`.) However, please keep in mind that the free edition has a limitation (i.e., a 'maximum lifetime' of 12 hours).


The goal of this assignment is to get you to specify a simple network, and play around with its hyperparameters to explore how they affect the output. This is why we're providing you with a lot of code, to ensure that the basic housekeeping is taken care of.

</div>
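
To make the character-level versus word-level distinction mentioned above concrete, here is a small sketch in plain Python. The sample sentence and the naive whitespace split are illustrative assumptions, not part of the assignment's data pipeline.

```python
# Illustrative only: compare a character-level and a word-level view of a text.
sample = "It was the best of times, it was the worst of times"

# Character-level: every single character (including spaces) is a token.
char_tokens = list(sample)
char_vocab = sorted(set(char_tokens))

# Word-level: naive whitespace split (an assumption made here for simplicity).
word_tokens = sample.split()
word_vocab = sorted(set(word_tokens))

print(len(char_tokens), "character tokens,", len(char_vocab), "distinct")
print(len(word_tokens), "word tokens,", len(word_vocab), "distinct")
```

Comparing the two pairs of numbers on a longer passage is a quick way to start thinking about the practical consequences of each choice.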

# Prepare data

The file we are using is a plain text file. We turn any potential Unicode characters into plain ASCII by using the `unidecode` package (which you can install via `pip` or `conda`). (What do you think is the purpose of this step?)

In [None]:
import unidecode
import string
import random
import re

all_characters = string.printable
n_characters = len(all_characters)

# Read the training text and close the file handle afterwards
with open('dickens_train.txt') as f:
    file = unidecode.unidecode(f.read())
file_len = len(file)
print('file_len =', file_len)

To make inputs out of this big string of data, we will be splitting it into chunks.

In [None]:
chunk_len = 200

def random_chunk():
    # random.randint is inclusive on both ends
    start_index = random.randint(0, file_len - chunk_len - 1)
    end_index = start_index + chunk_len + 1
    return file[start_index:end_index]

print(random_chunk())

# Build the Model (30 points)

<div class="alert alert-block alert-info">

The model that you are asked to build will take as input characters up to step $t-1$ and is expected to produce a distribution over characters at step $t$ (which can then be used to sample one character from that distribution). 

There are three layers: one layer that maps the input character into its embedding, one LSTM layer (which may itself have multiple layers) that operates on that embedding and a hidden and cell state, and a decoder layer that outputs the probability distribution.

The beauty of frameworks such as PyTorch is that you can express this pretty directly in code, adding (pre-defined) layers to your network.
</div>

In [None]:
#student solution/discussion here

import torch
import torch.nn as nn
#you can import additional libraries if needed.

# Here is a pseudocode to help with your LSTM implementation. 
# You can add new methods and/or change the signature (i.e., the input parameters) of the methods.
class LSTM(nn.Module):
    def __init__(self):
        """Think about which (hyper-)parameters your model needs; i.e., parameters that determine the
        exact shape (as opposed to the architecture) of the model. There's an embedding layer, which needs 
        to know how many elements it needs to embed, and into vectors of what size. There's a recurrent layer,
        which needs to know the size of its input (coming from the embedding layer). PyTorch also makes
        it easy to create a stack of such layers in one command; the size of the stack can be given
        here. Finally, the output of the recurrent layer(s) needs to be projected again into a vector
        of a specified size."""
    
    def forward(self):
        """Your implementation should accept input character, hidden and cell state,
        and output the next character distribution and the updated hidden and cell state."""

    def init_hidden(self):
        """Finally, you need to create the initial hidden and cell states (tensors
        with the correct shapes) that are passed to the first call of `forward`."""

# Inputs and Targets

Each chunk of the training data needs to be turned into a sequence of numbers (of the lookups), specifically a `LongTensor` (used for integer values). This is done by looping through the characters of the string and looking up the index of each character in `all_characters`.

In [None]:
# Note: torch.autograd.Variable is deprecated since PyTorch 0.4;
# plain tensors can be used directly.

# Turn a string into a LongTensor of character indices
def char_tensor(string):
    tensor = torch.zeros(len(string)).long()
    for c in range(len(string)):
        tensor[c] = all_characters.index(string[c])
    return tensor

print(char_tensor('abcDEF'))

Finally, we can assemble a pair of input and target tensors for training from a random chunk. The input will be all characters *except the last*, and the target will be all characters *except the first* (that is, the input shifted by one position). So if our chunk is "abc", the input will correspond to "ab" while the target is "bc".

In [None]:
def random_training_set():    
    chunk = random_chunk()
    inp = char_tensor(chunk[:-1])
    target = char_tensor(chunk[1:])
    return inp, target

Play around with these functions to understand what they do.
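
As a starting point for such experiments, the following sketch reproduces the index lookup and the input/target shift in plain Python, without tensors. It reuses `string.printable` as the lookup table, as the notebook does; the variable names `inp_text`, `target_text`, `inp_ids`, and `target_ids` are introduced here for illustration only.

```python
import string

all_characters = string.printable  # same lookup table as in the notebook

chunk = "abc"
inp_text, target_text = chunk[:-1], chunk[1:]

# Map each character to its index in the lookup table.
inp_ids = [all_characters.index(ch) for ch in inp_text]
target_ids = [all_characters.index(ch) for ch in target_text]

# At each position, the model reads inp_text[i] and should predict target_text[i].
for src, tgt in zip(inp_text, target_text):
    print(repr(src), "->", repr(tgt))

print(inp_ids, target_ids)
```

Because `string.printable` starts with the ten digits, the lowercase letters begin at index 10.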

# Generating

We also provide a generator function that shows you how you can sample from your model (and how we expect the interface to work). 

`decoder` is your model that is passed into the function. To start generating, we pass a priming string to start building up the hidden state, from which we then generate one character at a time. To generate strings with the network, we will feed one character at a time, use the outputs of the network as a probability distribution for the next character, and repeat. 

In [None]:
def generate(decoder, prime_str='A', predict_len=100, temperature=0.8):
    hidden, cell = decoder.init_hidden()
    prime_input = char_tensor(prime_str)
    predicted = prime_str

    # Use priming string to "build up" hidden state
    for p in range(len(prime_str) - 1):
        # Pass both hidden and cell state, matching the call in the loop below
        _, (hidden, cell) = decoder(prime_input[p], (hidden, cell))
    inp = prime_input[-1]
    
    for p in range(predict_len):
        output, (hidden, cell) = decoder(inp, (hidden, cell))
        
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp()
        top_i = torch.multinomial(output_dist, 1)[0]
        
        # Add predicted character to string and use as next input
        predicted_char = all_characters[top_i]
        predicted += predicted_char
        inp = char_tensor(predicted_char)

    return predicted

# Training

A helper to print the amount of time passed:

In [None]:
import time, math

def time_since(since):
    s = time.time() - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

The main training function:

In [None]:
def train(decoder, decoder_optimizer, inp, target):
    hidden, cell = decoder.init_hidden()
    decoder.zero_grad()
    loss = 0

    for c in range(chunk_len):
        output, (hidden, cell) = decoder(inp[c], (hidden, cell))
        loss += criterion(output, target[c].view(1))

    loss.backward()
    decoder_optimizer.step()

    return loss.item() / chunk_len

Then we define the training parameters, instantiate the model, and start training:

In [None]:
n_epochs = 3000
print_every = 100
plot_every = 10
hidden_size = 128
n_layers = 2

lr = 0.005
decoder = LSTM(n_characters, hidden_size, n_characters, n_layers)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

start = time.time()
all_losses = []
loss_avg = 0

for epoch in range(1, n_epochs+1):
    loss = train(decoder, decoder_optimizer, *random_training_set())
    loss_avg += loss

    if epoch % print_every == 0:
        print('[{} ({} {}%) {:.4f}]'.format(time_since(start), epoch, epoch/n_epochs * 100, loss))
        print(generate(decoder, 'A', 100), '\n')

    if epoch % plot_every == 0:
        all_losses.append(loss_avg/ plot_every)
        loss_avg = 0

Explain in your words what is going on here. What do these parameters do, what is happening inside the training loop? (**bonus question**)

# Hyperparameter tuning (30 points)

<div class="alert alert-block alert-info">

Building neural networks is to some extent more an art than a science. As we have seen above, there are several hyperparameters (i.e., parameters that are not optimized during learning, but that determine the shape of the network), and their setting influences the performance. In this problem, you're asked to *tune* these hyperparameters (that is, optimize heuristically, rather than using for example stochastic gradient descent). You can try to do this systematically (how?), or just in general explore what changing the parameter does to the performance. (Keep in mind the time it takes to train again for each setting.)

To do so, you need a target. We'll use bits per character (BPC) over the entire test set `dickens_test.txt`. 
BPC is defined as the empirical estimate of the cross-entropy between the target distribution and the model output in base 2. 

(Hint1: You can adapt the formula for word-level cross-entropy given in your textbook (chapter 9) to the character level as $-\frac{1}{T}\sum_{t=1}^{T}\log_2 m(x_t)$, where $T$ is the length of the input string, $x_t$ is the true character in the input string at location $t$, and $m(x_t)$ is the probability the model assigns to it.)

(Hint2: Tune one parameter at a time) 

(Hint3: Keep a log of your experiments for "parameters used --> minimum loss value")

</div>
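
The formula in Hint1 can be sketched in a few lines of plain Python. This is a minimal illustration, not a full evaluation routine: the function names `bits_per_character` and `nats_to_bits` and the example probabilities are made up for this sketch. The second helper relies on the fact that `nn.CrossEntropyLoss` uses the natural logarithm, so an average loss in nats can be converted to bits by dividing by $\ln 2$.

```python
import math

def bits_per_character(true_char_probs):
    """Average negative log2-probability the model assigned to the true characters."""
    T = len(true_char_probs)
    return -sum(math.log2(p) for p in true_char_probs) / T

def nats_to_bits(loss_in_nats):
    """Convert an average natural-log cross-entropy into bits."""
    return loss_in_nats / math.log(2)

# Hypothetical probabilities m(x_t) for a 4-character string (made-up numbers).
probs = [0.5, 0.25, 0.5, 0.125]
print(bits_per_character(probs))  # (1 + 2 + 1 + 3) / 4 = 1.75
```

In the notebook, the probabilities would come from running the trained model over the test text one character at a time.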

In [None]:
#student solution/discussion here

# Plotting the Training Losses (20 points)


<div class="alert alert-block alert-info">
An important aspect of training deep networks is visualization. Visualizing the training loss helps with debugging. For instance, at the extreme, a learning rate that is too large results in weight updates that are too large, and the performance of the model (such as its loss on the training dataset) oscillates over training epochs. With the help of loss charts, you can choose a learning rate that does not cause oscillation.
    
In this exercise, we ask you to add the loss curves of experiments with different learning rates to the same graph and plot the graph. Add an entry for each experiment to the legend of the graph. If there are more than 10 experiments, use more than one chart (up to 10 experiments per chart).

</div>
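
One possible way to organise such charts is sketched below. It assumes the loss curves are collected in a dictionary that maps a descriptive label to a list of loss values (for example, the `all_losses` list from each run); the helper names `group_experiments` and `plot_loss_charts` and the example numbers are hypothetical. Plotting uses `matplotlib`, which is imported inside the plotting function so that the grouping logic can be tried without it.

```python
def group_experiments(losses_by_label, max_per_chart=10):
    """Split a {label: loss_list} dict into consecutive groups of at most max_per_chart."""
    items = list(losses_by_label.items())
    return [dict(items[i:i + max_per_chart])
            for i in range(0, len(items), max_per_chart)]

def plot_loss_charts(losses_by_label, max_per_chart=10):
    """Draw one chart per group, one curve per experiment, with a legend."""
    import matplotlib.pyplot as plt  # imported lazily; requires matplotlib

    for group in group_experiments(losses_by_label, max_per_chart):
        plt.figure()
        for label, losses in group.items():
            plt.plot(losses, label=label)
        plt.xlabel("plot step")
        plt.ylabel("average training loss")
        plt.legend()
    plt.show()

# Made-up loss curves keyed by a descriptive label (hypothetical values).
example = {"lr=0.005": [3.1, 2.4, 2.0], "lr=0.05": [3.3, 3.5, 3.2]}
print([list(group) for group in group_experiments(example)])
```

Calling `plot_loss_charts(example)` in a notebook cell would then draw both curves on a single chart.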

In [None]:
#student solution/discussion here

# Generating at different "temperatures" (20 points)

<div class="alert alert-block alert-info">

In the `generate` function above, every time a prediction is made, the outputs are divided by the "temperature" argument passed. 

Generate strings by using different `temperature` values and evaluate the results qualitatively. Create chunks from the test set (200 character length as above) and take the first 10 characters of a randomly chosen chunk as a priming string.
What do you observe in the output when you increase the `temperature` value? **In your understanding**, why does changing the temperature affect the output in the way you observed?
</div>
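
Before running the full model, it can help to see the temperature operation in isolation. The sketch below mirrors the divide-then-exponentiate step from `generate` in plain Python and additionally normalises the result so it can be read as probabilities (`torch.multinomial` accepts unnormalised non-negative weights, so `generate` skips that step). The function name `apply_temperature` and the scores are made up for illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Divide scores by the temperature, exponentiate, and normalise to probabilities."""
    scaled = [math.exp(x / temperature) for x in logits]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical scores for three candidate characters (made-up numbers).
logits = [2.0, 1.0, 0.0]
for temperature in (0.2, 1.0, 5.0):
    probs = apply_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Comparing the printed distributions with the strings your model generates at the same temperatures is a useful basis for the discussion.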

In [None]:
#student solution/discussion here

(If you still have energy / time left once you've reached this point, try using other datasets (other texts; other types of text, for example the Linux source code), other layers (e.g., a GRU instead of an LSTM), etc.) (**bonus question**)