<a href="https://colab.research.google.com/github/dgromann/SemComp_WS2018/blob/master/tutorial7/Tutorial7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Lesson 0.0.0: Store this notebook! 

Go to "File" and make sure you save a copy of this file to either GitHub or your Google Drive. If you do not have a Google account and do not want to create one, please check Option C below.

Option A) Google Drive WITH collaboration

If you want to work collaboratively so that everyone in the group can see each other's contributions, one of you needs to store the notebook in Google Drive and share it with the others. To share it, click the SHARE button at the top right of this page, choose the "everyone who receives this link can edit" option, and send the link to the other team members by e-mail, Skype, or any other channel you prefer.

If you work with others, remember to always copy the code before you edit it and to add your name as a comment (e.g. #Dagmar ) in the cell so that it is clear who wrote which part. I also recommend creating a new code cell for your contributions.

Option B) GitHub WITHOUT collaboration

Collaborative functions are not available when storing the notebook in GitHub; you will see your own work but not that of others.


Option C) Download this notebook as ipynb (Jupyter notebook) or py (Python file)

Running either of these on your local machine requires installing the necessary programs, which for the first tutorial are Python and NLTK. The list will grow as we continue on to machine learning (requiring sklearn) and deep learning (requiring tensorflow and/or pytorch). In Google Colab all of these are provided and do not need to be installed locally.


# Lesson 0.0.1: Repository of PyTorch tutorials

Online free PyTorch tutorials:

* Official PyTorch tutorials: https://pytorch.org/tutorials/
* Official PyTorch documentation: https://pytorch.org/docs/0.4.1/
* Nice basic PyTorch tutorials: https://github.com/yunjey/pytorch-tutorial
* Sequence modeling toolkit in PyTorch, e.g. for Neural Machine Translation: https://github.com/pytorch/fairseq


# Lesson 0.1: PyTorch tutorial - Brief refresher on backprop and optimizers

Let's look at our first neural network in PyTorch.




In [0]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random

torch.manual_seed(1)

We have looked at how backpropagation works in PyTorch a little bit before. For a quick refresher, please look 
at the code below and do the small exercises. 

In [0]:
x = torch.randn(2, 2) # as opposed to torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.randn(2, 2)

# By default, user created Tensors have "requires_grad=False"
print(x.requires_grad, y.requires_grad)
z = x + y

# So backpropagation on z is not possible
print("z does not have enough info to compute gradients: ", z.grad_fn)

# ".requires_grad_( ... )" changes an existing tensor's "requires_grad"
# flag in-place. The input flag defaults to "True" if not given.
x = x.requires_grad_()
y = y.requires_grad_()

# now z contains enough information to compute the gradients
z = x + y
print("z now has enough info:", z.grad_fn)

# If the input to an operation has "requires_grad=True", also the output has the same flag
print("z allows for backprop: ", z.requires_grad)

# Now z has the computation history that relates itself to x and y
# EXERCISE: How can we detach the values of z from this history and relation to x and y? 

# EXERCISE: does new_z have information to backpropagate to x and y? 


# new_z should not be able to backpropagate to x and y, since detach
# "forgets" the computation history and only copies the values of z to new_z.
# Thus, new_z does not know how it was computed.

# We can also detach a tensor temporarily from its history by using torch.no_grad():

print("x requires_grad == True: ", x.requires_grad)
print("x ** 2 requires_grad == ", (x ** 2).requires_grad)

with torch.no_grad():
  print("x ** 2 requires_grad == ", (x ** 2).requires_grad)

print("x ** 2 requires_grad == ", (x ** 2).requires_grad)
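The cell above inspects `requires_grad` and `grad_fn` but never calls `backward()`. As a minimal sketch of what happens when it is called, the following example uses a hypothetical toy tensor `a` (not one of the notebook's variables). It also shows that gradients are added up across backward passes, which is why training loops reset them.

```python
import torch

# Hypothetical toy tensor, not part of the notebook's variables
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# b = sum(a_i ** 2), so the derivative of b with respect to a_i is 2 * a_i
b = (a ** 2).sum()
b.backward()
first_grad = a.grad.clone()
print(first_grad)        # gradient values 2, 4, 6

# A second backward pass adds to the stored gradient instead of replacing it
b2 = (a ** 2).sum()
b2.backward()
accumulated_grad = a.grad.clone()
print(accumulated_grad)  # gradient values 4, 8, 12

# Resetting the gradient in place gives a clean start for the next pass
a.grad.zero_()
print(a.grad)            # all zeros
```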


# Lesson 1: PyTorch Tutorial - First Model

Neural networks obtain much of their power by combining linear and non-linear functions in clever ways. In this lesson, we will learn these core components, define an objective function, and see how to train a model.

## Base Model - Linear

The base class for all neural network modules is ```torch.nn.Module```, and all of your models should be subclasses of this base class. For example, ```torch.nn.Linear``` is a module that implements a linear layer. Each module holds a list of learnable parameters (for a linear layer, the weight matrix and the bias vector, whose shapes follow from the number of input and output features); these are instances of the class ```torch.nn.Parameter```.

Remember that a simple linear layer maps its input $x$ to the next layer in the following way:

> $F(x) = Wx + b$
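As a small sketch of how this formula relates to ```torch.nn.Linear```, the example below recomputes the layer output by hand. The sizes (4 input features, 2 output features, 3 examples) and the names `layer` and `batch` are chosen only for illustration. Because PyTorch stores one example per row, the layer computes $xW^T + b$ for each row $x$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(4, 2)      # hypothetical sizes: 4 input features, 2 output features
batch = torch.randn(3, 4)    # 3 examples with 4 features each

# nn.Linear stores W with shape (out_features, in_features)
# and b with shape (out_features,)
print(layer.weight.shape, layer.bias.shape)

# Applying the layer is the same as computing x W^T + b for every row x
manual = batch @ layer.weight.t() + layer.bias
output = layer(batch)

print(output.shape)                    # torch.Size([3, 2])
print(torch.allclose(manual, output, atol=1e-6))
```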


In [0]:
# Initialize a linear layer with 5 input features and 3 output features;
# the bias is enabled (bias=True) by default
linear = nn.Linear(5, 3)
print(linear)

# Randomly initialize a tensor
data = torch.randn(2, 5)
print(data)

# EXERCISE: Can we map data under linear? That is, map from a 5-dimensional to a 3-dimensional space as defined above?


## Non-Linearity

If we want to add a non-linear activation function to the above equation, we can for instance choose ReLU: 

> $ F(x) = ReLU (Wx + b)$

In [0]:
# F is the library "torch.nn.functional" imported above
# EXERCISE: What happens here? What does ReLU do again?
print(F.relu(data))

# EXERCISE: Put "data" through a softmax and explain the output
# Hint: for softmax you need to specify a dimension, e.g. F.softmax(data, dim=0)
# The dimension specifies along which axis softmax will be computed

# EXERCISE: What happens if you change the dimension to 1? 
# What does the "dim" argument mean? 
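To make the two activation functions concrete, here is a minimal sketch on a small hand-written tensor, so the results can be checked by eye. The tensor `scores` and its values are hypothetical and independent of `data` above.

```python
import torch
import torch.nn.functional as F

# Hypothetical 2 x 3 example tensor
scores = torch.tensor([[-1.0,  0.0, 2.0],
                       [ 3.0, -2.0, 0.5]])

# ReLU replaces every negative entry with 0 and keeps the rest unchanged
activated = F.relu(scores)
print(activated)

# Softmax along dim=0 normalizes each column, along dim=1 each row
by_column = F.softmax(scores, dim=0)
by_row = F.softmax(scores, dim=1)
print(by_column.sum(dim=0))  # each of the 3 columns sums to 1
print(by_row.sum(dim=1))     # each of the 2 rows sums to 1
```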


## Loss function

The loss function is the function that your network is trained to minimize. It measures how far the network's prediction is from the correct answer, taking into account how confident the network is in that prediction. If the network is confident in its answer and the answer is wrong, the loss is very high. If the answer is correct and the network is confident in it, the loss is low. A model with a small training loss will hopefully generalize well, unless it has overfitted to the training data.

In backpropagation, we take the derivative of the loss function to start computing the gradient as we go back through the network. All network components inherit from the ```nn.Module``` class and override its ```forward()``` method.
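The relation between confidence and loss can be checked with a few hand-written predictions. This is a minimal sketch using ```nn.NLLLoss``` with made-up probability distributions over three classes; the names `confident` and `uncertain` are illustrative only.

```python
import torch
import torch.nn as nn

loss_fn = nn.NLLLoss()

# Hypothetical predicted distributions over 3 classes, given as log probabilities
confident = torch.log(torch.tensor([[0.05, 0.90, 0.05]]))
uncertain = torch.log(torch.tensor([[0.30, 0.40, 0.30]]))

right = torch.tensor([1])   # the class the predictions favor
wrong = torch.tensor([0])   # a class the confident prediction considers unlikely

loss_confident_right = loss_fn(confident, right)   # -log(0.90), about 0.105
loss_uncertain_right = loss_fn(uncertain, right)   # -log(0.40), about 0.916
loss_confident_wrong = loss_fn(confident, wrong)   # -log(0.05), about 2.996

print(loss_confident_right.item(),
      loss_uncertain_right.item(),
      loss_confident_wrong.item())
```

The loss is smallest for the confident correct prediction and largest for the confident wrong one.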

In [0]:
class FirstModel(nn.Module):
  
  def __init__(self, vocab_size, num_labels):
    super(FirstModel, self).__init__()
    self.linear = nn.Linear(vocab_size, num_labels)
  
  def forward(self, vec):
    return F.log_softmax(self.linear(vec), dim=1)

def generate_input(num):
  return torch.rand(num, 10)

def generate_target():
  return torch.LongTensor([random.randint(0,2)])

model = FirstModel(10, 3) 
# EXERCISE: use torch.randn to create a random vector of the correct input size for this model
vector = torch.randn(4, 10)
print("Output: ", model(vector))

      
loss_function = nn.NLLLoss() # NLLLoss() Negative Log Likelihood, aka multi-class cross-entropy 
optimizer = optim.SGD(model.parameters(), lr=0.1) # Stochastic Gradient Descent as optimizer 


for epoch in range(100): 
    for vec in generate_input(4):
      # Let's clear the gradients before we start training since PyTorch accumulates them
      model.zero_grad()
      
      # EXERCISE: What happens to the loss currently if we change the target to one of three classes?
      # You can use the provided function generate_target for that purpose
      # The model has three output classes, but for now we always set the label to 1
      target = torch.LongTensor([1])

      # The forward pass returns the log probabilities 
      log_probs = model(vec.view(1,-1))
      
      # This function calculates the loss 
      loss = loss_function(log_probs, target)
      
      print("Loss: ", loss.item())
      loss.backward()
      optimizer.step()
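The model above returns log probabilities via `F.log_softmax` and pairs them with `nn.NLLLoss`. As a brief sketch of why this combination is described as multi-class cross-entropy, the example below compares it with `nn.CrossEntropyLoss` applied directly to the raw scores. The tensors `logits` and `targets` are random illustrative values, not taken from the training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 3)             # hypothetical raw scores: 4 examples, 3 classes
targets = torch.tensor([0, 2, 1, 1])   # hypothetical gold labels

# Variant 1: log_softmax followed by NLLLoss (as in FirstModel)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# Variant 2: CrossEntropyLoss on the raw scores
ce = nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), ce.item())
print(torch.allclose(nll, ce))
```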




# Lesson 2: Long Short-Term Memory (LSTM)

LSTMs are the first recurrent model that we are going to build. Below is a very simple example of how to start building such a model.

In [0]:
lstm = nn.LSTM(3, 3)
inputs = torch.rand(1,3)

hidden = (torch.randn(1,1,3), torch.randn(1,1,3))

for i in inputs: 
  out, hidden = lstm(i.view(1,1,-1), hidden)  
print("Toy output: ", out, "Hidden: ", hidden)

Toy output:  tensor([[[ 0.1219,  0.4065, -0.0905]]], grad_fn=<StackBackward>) Hidden:  (tensor([[[ 0.1219,  0.4065, -0.0905]]], grad_fn=<StackBackward>), tensor([[[ 0.5348,  0.8511, -0.2229]]], grad_fn=<StackBackward>))
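The toy example feeds a single time step to the LSTM. As a sketch of how this extends to longer inputs, the following example processes a five-step sequence once step by step and once in a single call, and compares the results. The name `toy_lstm`, the sequence length, and the zero initial states are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_lstm = nn.LSTM(input_size=3, hidden_size=3)

# By default nn.LSTM expects input of shape (seq_len, batch, input_size)
sequence = torch.randn(5, 1, 3)
h0 = torch.zeros(1, 1, 3)
c0 = torch.zeros(1, 1, 3)

with torch.no_grad():
    # Variant 1: feed one time step at a time and carry the state along
    state = (h0, c0)
    step_outputs = []
    for step in sequence:
        out, state = toy_lstm(step.view(1, 1, -1), state)
        step_outputs.append(out)
    stepwise = torch.cat(step_outputs, dim=0)

    # Variant 2: feed the whole sequence in one call
    full_out, (h_n, c_n) = toy_lstm(sequence, (h0, c0))

print(full_out.shape)                                  # torch.Size([5, 1, 3])
print(torch.allclose(stepwise, full_out, atol=1e-6))   # both variants agree
print(torch.allclose(full_out[-1], h_n[0]))            # last output is the final hidden state
```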


Let's make it a bit more complicated and build a POS tagger as an LSTM. Remember, an LSTM calculates the following functions for each element $x_t$ of the input sequence: 


> $f_t = \sigma(W_f x_t + U_f h_{(t-1)} + b_f)$

> $i_t = \sigma(W_i x_t + U_i h_{(t-1)} + b_i)$

> $\tilde c_t = \tanh(W_g x_t + U_g h_{(t-1)} + b_g)$

> $o_t = \sigma(W_o x_t + U_o h_{(t-1)} + b_o)$

> $c_t = f_t * c_{(t-1)} + i_t * \tilde c_t$

> $h_t = o_t * \tanh(c_t)$
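The equations can be checked against PyTorch by computing one step by hand. This is a sketch under a few assumptions: it uses `nn.LSTMCell` (a single-step LSTM) with hypothetical toy sizes, it relies on PyTorch stacking the gate weights in the order input, forget, candidate, output, and it treats the sum of PyTorch's two bias vectors as the single bias $b$ of the equations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 4, 3   # hypothetical toy sizes
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)
h_prev = torch.randn(1, hidden_size)
c_prev = torch.randn(1, hidden_size)

with torch.no_grad():
    # Gate weights are stacked row-wise in the order: input, forget, candidate (g), output
    W_i, W_f, W_g, W_o = cell.weight_ih.chunk(4, dim=0)
    U_i, U_f, U_g, U_o = cell.weight_hh.chunk(4, dim=0)
    # PyTorch keeps two bias vectors; their sum plays the role of b in the equations
    b_i, b_f, b_g, b_o = (cell.bias_ih + cell.bias_hh).chunk(4, dim=0)

    f_t = torch.sigmoid(x_t @ W_f.t() + h_prev @ U_f.t() + b_f)   # forget gate
    i_t = torch.sigmoid(x_t @ W_i.t() + h_prev @ U_i.t() + b_i)   # input gate
    c_tilde = torch.tanh(x_t @ W_g.t() + h_prev @ U_g.t() + b_g)  # candidate cell state
    o_t = torch.sigmoid(x_t @ W_o.t() + h_prev @ U_o.t() + b_o)   # output gate
    c_t = f_t * c_prev + i_t * c_tilde                            # new cell state
    h_t = o_t * torch.tanh(c_t)                                   # new hidden state

    # Reference result from PyTorch
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_t, h_ref, atol=1e-6), torch.allclose(c_t, c_ref, atol=1e-6))
```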




In [0]:
# Functions to prepare the input
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    ("The frog ate the fly".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Paul likes the book".split(), ["NN", "V", "DET", "NN"]),
    ("All type the word".split(), ["NN", "V", "DET", "NN"]),
    ("The car broke".split(), ["DET", "NN", "V"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])]
word_to_ix = {}

for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# These will usually be more like 300 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'The': 0, 'frog': 1, 'ate': 2, 'the': 3, 'fly': 4, 'Paul': 5, 'likes': 6, 'book': 7, 'All': 8, 'type': 9, 'word': 10, 'car': 11, 'broke': 12, 'Everybody': 13, 'read': 14, 'that': 15}
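Note that the mapping is case-sensitive: 'The' and 'the' receive different indices. As a sketch of what happens to such indices inside the tagger, the example below looks them up in an `nn.Embedding` layer. The miniature vocabulary `toy_word_to_ix` and the repeated word in the sentence are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical miniature vocabulary, built the same way as word_to_ix
toy_word_to_ix = {"The": 0, "frog": 1, "ate": 2}
sentence = "The frog ate The".split()

# Words to indices, as prepare_sequence does
indices = torch.tensor([toy_word_to_ix[w] for w in sentence], dtype=torch.long)

# One trainable 6-dimensional vector per vocabulary entry
embedding = nn.Embedding(num_embeddings=len(toy_word_to_ix), embedding_dim=6)
vectors = embedding(indices)

print(indices)         # indices 0, 1, 2, 0
print(vectors.shape)   # torch.Size([4, 6])

# The same word always maps to the same vector
print(torch.equal(vectors[0], vectors[3]))
```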


In [0]:
# Code cell to define the model 
class LSTMTagger(nn.Module):
    """LSTM model to tag words with their correct parts of speech"""
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
       
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # input dimensionality (embedding_dim) and output dimensionality (hidden_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # EXERCISE: initialize the following linear layer to connect the hidden and the output (tag space) layer
        self.hidden2tag = None  # TODO: replace None with your linear layer
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # The axes here are (num_layers, minibatch_size, hidden_dim)
        h0 = torch.zeros(1, 1, self.hidden_dim)
        c0 = torch.zeros(1, 1, self.hidden_dim)
        return (h0, c0)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

In [0]:
# Auxiliary functions

def get_accuracy(targets, prediction): 
  (max_vals, arg_maxs) = torch.max(prediction.data, dim=1) 
  num_correct = torch.sum(targets==arg_maxs)
  acc = (num_correct * 100.0 / len(targets))
  return acc.item()  # percentage based
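A quick way to see what `get_accuracy` returns is to call it on hand-written scores. In this sketch the function is repeated (without the `.data` access) so that the example is self-contained; `toy_scores` and `toy_targets` are hypothetical values.

```python
import torch

def get_accuracy(targets, prediction):
    # Index of the highest score per row is the predicted class
    max_vals, arg_maxs = torch.max(prediction, dim=1)
    num_correct = torch.sum(targets == arg_maxs)
    return (num_correct * 100.0 / len(targets)).item()  # percentage based

# Hypothetical scores for 4 words and 3 tags
toy_scores = torch.tensor([[0.1, 0.7, 0.2],   # predicted tag 1
                           [0.8, 0.1, 0.1],   # predicted tag 0
                           [0.2, 0.2, 0.6],   # predicted tag 2
                           [0.3, 0.4, 0.3]])  # predicted tag 1
toy_targets = torch.tensor([1, 0, 2, 2])      # the last prediction is wrong

acc = get_accuracy(toy_targets, toy_scores)
print(acc)  # 3 of 4 correct, i.e. 75.0
```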


In [0]:
# Code cell to train the model 
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))

# EXERCISE: define the loss function and an optimizer


# Here you can output some sample scores
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)
    print(F.softmax(tag_scores, dim=1))
  
num_epochs = 100
for epoch in range(num_epochs):  # again, normally you would NOT train for this many epochs; this is toy data
    accuracy = 0 
    for sentence, tags in training_data:
        
        # EXERCISE: fill in this part of the code - the required steps are provided
        # Step 1: clear the accumulated gradients before we start training 

        # Step 2: since this is an LSTM we need to initialize the hidden states
        # and clear the history from the last instance

        # Step 3: prepare the input sequence for the network (words to indices) - see available functions
        
        # Step 4: prepare the target sequences to be able to calculate the loss - see available functions 
               
        # Step 5: run a forward pass 
          
        # Step 6: compute the loss, calculate the gradient and optimize the weights
        
        accuracy += get_accuracy(targets, prediction)
        
    print("Epoch %s of %s, Loss %s, Accuracy %s " % (epoch, num_epochs, loss, accuracy/len(training_data)))