

# Lesson 0.0.0: Store this notebook! 

Go to "File" and save a copy of this notebook to either GitHub or your Google Drive. If you do not have a Google account and do not want to create one, please check Option C below. 

Option A) Google Drive WITH collaboration

If you want to work collaboratively, so that everyone in the group can see each other's contributions, one of you needs to store the notebook in Google Drive and share it with the others. To share it, click the SHARE button at the top right of this page, choose the "everyone who receives this link can edit" option, and send the link to the other team members by e-mail, Skype, or any other channel you prefer.

If you work with others, always copy the code before you edit it and always add your name as a comment (e.g. #Dagmar ) in the cell so that it is clear who wrote which part. I also recommend creating a new code cell for your contributions.

Option B) GitHub without collaboration

Collaborative functions are not available when storing the notebook in GitHub; you will see your own work but not that of others.


Option C) Download this notebook as ipynb (Jupyter notebook) or py (Python file)

Running either of these on your local machine requires installing the necessary software, which for the first tutorial is Python and NLTK. The list will grow as we continue on to machine learning (requiring sklearn) and deep learning (requiring tensorflow and/or pytorch). In Google Colab all of these are provided and do not need to be installed locally.


# Lesson 1.0: Text generation with LSTMs

Let's build our first simple text generation model. 

## Data preprocessing

We will load all required software libraries and data and then preprocess the data. This includes defining a batch generator.

In [0]:
# Get data and required software packages

!wget https://raw.githubusercontent.com/dgromann/SemanticComputing/master/tutorial8/trump_tweets.txt
!pip3 install torch

import random
import math
import unidecode
import string
import re

import torch
import torch.nn as nn
from torch.autograd import Variable

In [0]:
# Load the data 
with open('trump_tweets.txt') as f:
    file = unidecode.unidecode(f.read())
file_len = len(file)
print('file_len =', file_len)

In [0]:
# Separate data into shorter sequences 
sequence_length = 200

def data_partitioning():
    # randint is inclusive on both ends; leave room for sequence_length + 1 characters
    start_index = random.randint(0, file_len - sequence_length - 1)
    end_index = start_index + sequence_length + 1
    return file[start_index:end_index]

print(data_partitioning())

## Building the model

Next we need to build our model. We will start with a simple RNN.

In [0]:
# EXERCISE: build some kind of RNN model (GRU or LSTM preferably)
# Each step of this exercise is defined below
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers=1):
        super(RNN, self).__init__()
        # Step 1: initialize all the parameters given to the function 

        # The input to hidden connection has already been provided for you
        self.embedding = nn.Embedding(input_size, hidden_size)
        # Step 2: initialize a GRU or LSTM layer as a hidden layer

        # Step 3: define a linear output layer 

        
    def init_hidden(self):
        # Step 4: initialize the hidden state - be aware of the differences in 
        # type of network - for an LSTM you need a hidden and a cell state; 
        # the dimensions are (num_layers, minibatch_size, hidden_dim)

        # For a GRU you only need one hidden state 
        raise NotImplementedError('Step 4: return the initial hidden state')

    
    def forward(self, input, hidden):
        input = self.embedding(input.view(1, -1))
        # Step 5: define the forward function for the lstm hidden layer (input.view(1, 1, -1))
        
        # Step 6: define the forward function for the linear output layer (output.view(1, -1))

        
        return output, hidden
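
One possible way to fill in the six steps is sketched below, assuming a GRU as the hidden layer. The class name `CharGRU`, the attribute names `gru` and `decoder`, and the sizes in the usage lines are illustrative choices, not the reference solution for this exercise (100 simply matches `len(string.printable)`).

```python
import torch
import torch.nn as nn


class CharGRU(nn.Module):
    """Character-level model: embedding -> GRU -> linear output (one possible solution)."""

    def __init__(self, input_size, hidden_size, output_size, n_layers=1):
        super().__init__()
        # Step 1: keep the constructor arguments
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers

        self.embedding = nn.Embedding(input_size, hidden_size)
        # Step 2: GRU hidden layer (assumption: GRU rather than LSTM)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers)
        # Step 3: linear output layer
        self.decoder = nn.Linear(hidden_size, output_size)

    def init_hidden(self):
        # Step 4: (num_layers, minibatch_size, hidden_dim);
        # an LSTM would need a (hidden, cell) tuple of two such tensors
        return torch.zeros(self.n_layers, 1, self.hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input.view(1, -1))
        # Step 5: run the recurrent layer on one character
        output, hidden = self.gru(embedded.view(1, 1, -1), hidden)
        # Step 6: map the hidden output to one score per character
        output = self.decoder(output.view(1, -1))
        return output, hidden


# Usage: feed a single character index through the untrained network
model = CharGRU(input_size=100, hidden_size=32, output_size=100)
output, hidden = model(torch.tensor(5), model.init_hidden())
print(output.shape, hidden.shape)
```

Switching to an LSTM would mainly change Step 2 (`nn.LSTM`) and Step 4, where the hidden state becomes a pair of tensors.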


In [0]:
# EXERCISE: instantiate the above RNN model and print all model parameters that are 
# considered in backpropagation; you can use `model.named_parameters()`, 
# which yields (name, parameter) pairs (the values are in parameter.data) 
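
As a hedged illustration of the call mentioned in this exercise, the sketch below lists the trainable parameters of a small stand-in module. The name `toy` and the choice of an `nn.GRU` with tiny sizes are assumptions made only for the demo; the exercise itself asks for the tutorial's own model.

```python
import torch.nn as nn

# Stand-in module with tiny dimensions (assumption for the demo)
toy = nn.GRU(input_size=4, hidden_size=3, num_layers=1)

# named_parameters() yields (name, parameter) pairs
for name, param in toy.named_parameters():
    if param.requires_grad:  # only these take part in backpropagation
        print(name, tuple(param.shape))

# Total number of trainable values
n_trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
print('trainable parameters:', n_trainable)
```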



## Preparing data for training

We need to specify the input and the output of the network. 

In [0]:
all_characters = string.printable
n_characters = len(all_characters)

# Turn a string into a tensor of longs (one character index per position)
def char_tensor(text):
    tensor = torch.zeros(len(text)).long()
    for c in range(len(text)):
        tensor[c] = all_characters.index(text[c])
    return Variable(tensor)

# Generate individual batches matching input and target vectors
def batch_generator():    
    partition = data_partitioning()
    inp = char_tensor(partition[:-1])
    target = char_tensor(partition[1:])
    return inp, target
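
To see why the target is simply the input shifted by one character, here is a minimal sketch on a short hypothetical string. The helper `char_indices` and the string `sample` are illustrative, and plain lists stand in for the tensors used in the tutorial.

```python
import string

all_characters = string.printable


def char_indices(text):
    """Map each character to its index in string.printable."""
    return [all_characters.index(ch) for ch in text]


sample = "Hello"                     # hypothetical stand-in for one partition
inp = char_indices(sample[:-1])      # every character except the last
target = char_indices(sample[1:])    # every character except the first

# At each position the network sees inp[i] and should predict target[i]
for i, t in zip(inp, target):
    print(repr(all_characters[i]), '->', repr(all_characters[t]))
```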


## Training

Let's define everything needed for training.

In [0]:
# Sampling a predicted sequence from the network
def sample_prediction(prime_str='a', predict_len=100, threshold=0.8):
    hidden = model.init_hidden()
    prime_input = char_tensor(prime_str)
    predicted = prime_str

    # Use priming string to "build up" hidden state
    for p in range(len(prime_str) - 1):
        _, hidden = model(prime_input[p], hidden)
    inp = prime_input[-1]
    
    for p in range(predict_len):
        output, hidden = model(inp, hidden)
        
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(threshold).exp()
        top_i = torch.multinomial(output_dist, 1)[0]
        
        # Add predicted character to string and use as next input
        predicted_char = all_characters[top_i]
        predicted += predicted_char
        inp = char_tensor(predicted_char)

    return predicted

print_every = 100
plot_every = 10
n_layers = 1

# EXERCISE: set the following parameters 
n_epochs = 0
hidden_size = 0
lr = 0

# EXERCISE: initialize the model and specify an optimizer; 
# because of the distribution used for the sample predictions, 
# we need to use CrossEntropyLoss() as the loss function
model = None
model_optimizer = None
loss_function = nn.CrossEntropyLoss()

# Track losses for later visualization
all_losses = []
loss_avg = 0

for epoch in range(1, n_epochs + 1):
    # Get the batch for this epoch
    inp, target = batch_generator()
    
    # Clear the accumulated gradients before we start training  
    model.zero_grad()
    
    # Initialize the hidden state
    hidden = model.init_hidden()
    
    temp_loss = 0

    # Calculate and accumulate the loss for each character in the sequence
    for c in range(sequence_length):
        output, hidden = model(inp[c], hidden)
        temp_loss += loss_function(output, target[c].unsqueeze(0))

    # Backpropagation and optimization of parameters
    temp_loss.backward()
    model_optimizer.step()

    # Calculate average loss per character of the sequence 
    loss = temp_loss.item() / sequence_length       
    loss_avg += loss

    if epoch % print_every == 0:
        print('[Epoch %d, %d%% of total, current loss: %.4f]' % (epoch, epoch / n_epochs * 100, loss))
        print(sample_prediction('Th', 100), '\n')

    if epoch % plot_every == 0:
        all_losses.append(loss_avg / plot_every)
        loss_avg = 0
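
The optimizer part of the exercise can be sketched in isolation. The example below assumes Adam with a learning rate of 0.005 and uses a small `nn.Linear` layer (`toy_model`) as a stand-in for the RNN; these values and names are illustrative, not recommended settings for this tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

toy_model = nn.Linear(8, 5)                          # stand-in for the RNN
optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.005)
loss_function = nn.CrossEntropyLoss()

x = torch.randn(1, 8)                                # one fake input vector
y = torch.tensor([2])                                # index of the correct class

before = loss_function(toy_model(x), y).item()
for _ in range(20):
    optimizer.zero_grad()                            # clear accumulated gradients
    loss = loss_function(toy_model(x), y)            # scores (1, 5) vs. target (1,)
    loss.backward()                                  # backpropagation
    optimizer.step()                                 # parameter update
after = loss_function(toy_model(x), y).item()

print('loss before: %.4f, after: %.4f' % (before, after))
```

`CrossEntropyLoss` expects raw scores and a class index, which is why the training loop passes `target[c].unsqueeze(0)` rather than a one-hot vector.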


In [0]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

plt.figure()
plt.plot(all_losses)

# Lesson 1.1: Experiment with hyperparameters and optimization

Experiment with the following attributes to improve the performance of the above model (performance here can only be evaluated by judging how sensible the generated tweets are): 

* Change the ```threshold``` value in the ```sample_prediction``` function above: the higher it is, the flatter the sampling distribution becomes, so less weight is given to high-probability characters; the lower it is, the more strongly the sampling favors high-probability characters
* Experiment with different optimizers (see [PyTorch Optimizers](https://pytorch.org/docs/stable/optim.html))
* Increase the depth of the network (what was the depth again?)
* Experiment with the number of epochs
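
The effect of ```threshold``` (often called temperature) can be seen on a small hypothetical score vector. The helper `scaled_distribution` and the values in `scores` are assumptions for illustration; the helper mirrors the `div(threshold).exp()` step and then normalizes, which `torch.multinomial` otherwise does implicitly.

```python
import math


def scaled_distribution(scores, threshold):
    """Divide scores by the threshold, exponentiate, and normalize to probabilities."""
    weights = [math.exp(s / threshold) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


scores = [2.0, 1.0, 0.1]   # hypothetical network outputs for three characters

# Low thresholds concentrate probability on the top score, high ones flatten it
for threshold in (0.2, 0.8, 2.0):
    probs = scaled_distribution(scores, threshold)
    print(threshold, [round(p, 3) for p in probs])
```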

Once you are done optimizing the results and have hopefully obtained some nice automatically generated tweets, you can try the generation process on a different corpus, for instance a collection of someone else's tweets. These types of language models are also frequently applied to a Shakespeare corpus, e.g. [this one provided by Andrej Karpathy](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt).
