## Homework 3 - Supervised Learning II - MDS Computational Linguistics

### Assignment Topics
- Recurrent Neural Networks
- Long Short-Term Memory (LSTM) networks
- Saving and loading NN models using PyTorch
- Very-short answer questions

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)

### Submission Info.
- Due Date: February 1, 2020, 18:00:00 (Vancouver time)

## Getting Started

In [None]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset
from torchtext.data import Iterator, BucketIterator

In [None]:
# set the seed
manual_seed = 123
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

## Exercise 1: Building a Recurrent Neural Network

In this exercise, we provide a corpus from the [CL-Aff shared task](https://sites.google.com/view/affcon2019/cl-aff-shared-task?authuser=0). HappyDB is a dataset of about 100,000 `happy moments` crowd-sourced via Amazon’s Mechanical Turk where each worker was asked to describe in a complete sentence `what made them happy in the past 24 hours`. Each user was asked to describe three such moments. 
In this exercise, we focus on sociality classification. We use only the labelled portion of the dataset, which contains 10,562 samples.

We have already preprocessed the tweets (tokenization; removal of URLs, mentions, hashtags, and so on) and placed them in the ``data/happy_db`` folder as three files: ``train.tsv``, ``dev.tsv``, and ``test.tsv``.

#### 1.1 Write code to define a `whitespace_tokenize` function whose input is a tweet text.

rubric={accuracy:1}

In [None]:
def whitespace_tokenize(text):
    # your code goes here 
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return text.strip().split()

#### 1.2 Write code to define TorchText `Field`s that handle how the data should be processed.
Hint: You need 2 Fields, `TEXT` and `LABEL`, for the tweet text and the label respectively.

* The tokenizer is the `whitespace_tokenize` function from 1.1.
* Use the true case of words (i.e., do not lowercase).

rubric={accuracy:2} 

In [None]:
# your code goes here 
TEXT = Field(sequential=True, tokenize=whitespace_tokenize, lower=False)
LABEL = Field(sequential=False, unk_token = None)

#### 1.3 Write code to use `TabularDataset class`  and `Fields` to process the tsv files (train, dev, and test). 

Hint 1: `Fields` will call the tokenizer and convert tokens to numerical indices.

Hint 2: Use `TabularDataset.splits(...)` to load train, dev, and test sets.

rubric={accuracy:2}

In [None]:
# your code goes here 
train, val, test = TabularDataset.splits(
               path="./data/happy_db/", # the root directory where the data lies
               train='train.tsv', validation="dev.tsv", test="test.tsv", # file names
               format='tsv',
               skip_header=True, # if your tsv file has a header, make sure to pass this to ensure it doesn't get processed as data!
               fields=[('tweet', TEXT), ('label', LABEL)])

#### 1.4 Write code to build your vocabulary to map words and labels to integers. The maximum vocabulary size is 5,000 (not counting `<pad>` and `<unk>`).
Hint: Use `build_vocab` to build vocabularies for `TEXT` field and `LABEL` field respectively. 

rubric={accuracy:2}

In [None]:
# your code goes here 
TEXT.build_vocab(train, max_size=5000)
LABEL.build_vocab(train)
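
To see why capping the vocabulary at 5,000 words gives 5,002 entries, here is a minimal pure-Python sketch of frequency-based vocabulary building. It only illustrates the idea and is not torchtext's implementation; `build_vocab_sketch`, `toy_corpus`, and the toy `max_size` are hypothetical.

```python
from collections import Counter

def build_vocab_sketch(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Map the `max_size` most frequent tokens to integer ids, after the specials.

    Hypothetical helper for illustration; it mimics the idea behind
    `build_vocab(max_size=...)`, not torchtext's actual code.
    """
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(itos)}

toy_corpus = [
    "I had lunch with my friend".split(),
    "I walked my dog".split(),
    "my friend called".split(),
]
stoi = build_vocab_sketch(toy_corpus, max_size=3)

# Specials come first, so the vocabulary has max_size + 2 entries.
print(len(stoi))
# Tokens outside the vocabulary fall back to the <unk> id.
print([stoi.get(tok, stoi["<unk>"]) for tok in "my cat".split()])
```

With `max_size=5000`, the same reasoning gives 5,000 words plus `<unk>` and `<pad>`, provided the training set contains at least 5,000 distinct tokens.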

#### 1.5 Write code to print the sizes of your two vocabularies (i.e., `TEXT` and `LABEL`) individually. 

rubric={accuracy:2}

In [None]:
# your code goes here 
print("Vocabulary size of TEXT:",len(TEXT.vocab.stoi))
print("Vocabulary size of LABEL:",len(LABEL.vocab.stoi))

#### 1.6 Write code to construct the Iterators for the train, dev, and test splits. Use `BucketIterator` to initialize them.
* Apply the same batch size of 32 to the train, dev, and test sets.
* Sort samples by length.
* Sort samples within each batch.

Hint: Use `BucketIterator.splits(...)`

rubric={accuracy:2}

In [None]:
# your code goes here 

train_iter, val_iter, test_iter = BucketIterator.splits(
    (train, val, test), # we pass in the datasets we want the iterator to draw data from
    batch_sizes=(32, 32, 32),
    # key used to sort examples so that examples with similar lengths are batched together, minimizing padding
    sort_key=lambda x: len(x.tweet),
    sort=True,
    sort_within_batch=True
)
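
The benefit of sorting by length can be illustrated with a small pure-Python sketch that counts how many padding positions are needed with and without sorting. It only approximates the idea behind `BucketIterator`; `count_padding` and the toy lengths are hypothetical.

```python
def count_padding(lengths, batch_size):
    """Total number of pad tokens when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

# Hypothetical sentence lengths (in tokens) for eight examples.
toy_lengths = [3, 25, 4, 30, 5, 28, 2, 27]

unsorted_padding = count_padding(toy_lengths, batch_size=4)
sorted_padding = count_padding(sorted(toy_lengths), batch_size=4)
print(unsorted_padding, sorted_padding)
```

Batching examples of similar length means far fewer `<pad>` tokens pass through the model, which saves computation.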

#### 1.7.1 Write code to create a class called `LSTMmodel` that defines a classifier for the task. In this model, you should have:

1. An embedding layer whose input dimension equals the size of your `TEXT` vocabulary and that represents each token as a 300-dimensional vector. The parameters of this embedding layer should be randomly initialized with numbers sampled from a normal distribution (mean 0 and variance 0.05).

2. Two uni-directional `LSTM` layers. Each layer has 500 hidden units.

3. Pass the last hidden state of the last layer into a `Tanh` activation function, then feed the output of `Tanh` to a linear layer whose output dimensionality equals the number of classes in the dataset (i.e., 2 in our case).

4. Then, a `LogSoftmax` layer on top of the output of the linear layer.

5. Return the output of `LogSoftmax` layer.

Hint: `Tanh` might not be the ideal function to use, but we want you to explore it. (Usually `ReLU` works well).

rubric={accuracy:8, quality:4}

In [None]:
# your code goes here 

class LSTMmodel(nn.Module):
  
  def __init__(self, embedding_size, vocab_size, output_size, hidden_size, num_layers):
    # In the constructor we define the layers for our model (same as our previous RNN)
    super(LSTMmodel, self).__init__()
    # word embedding lookup table
    self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size)
    self.embedding.weight.data.normal_(0.0, 0.05) # normal_(mean, std): mean=0.0, std=0.05
    # core LSTM module
    self.lstm_rnn = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size, num_layers=num_layers) # input_size, hidden_size, num_layers
    self.activation_fn = nn.Tanh()
    self.linear_layer = nn.Linear(hidden_size, output_size) 
    self.softmax_layer = nn.LogSoftmax(dim=1) # normalize over the class dimension of the (batch, classes) output
  
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    out = self.embedding(x)
    out, (h_state, c_state) = self.lstm_rnn(out) # h_0 initialized to zeros by default
    # classify based on the hidden representation at the last time step of the top layer
    out = out[-1] # out: (seq_len, batch, hidden_size) -> (batch, hidden_size)
    out = self.activation_fn(out) # Tanh, as required by the specification
    out = self.linear_layer(out)
    out = self.softmax_layer(out) # log-probabilities over the classes
    return out

#### 1.7.2 Write code to instantiate the model class with the aforementioned hyper-parameters and print out the model architecture.
rubric={accuracy:3}

In [None]:
# your code goes here 

EMBEDDING_SIZE = 300 
VOCAB_SIZE = 5002 # 5,000 words + <unk> + <pad>
NUM_CLASSES = 2
HIDDEN_SIZE = 500
NUM_LAYERS = 2
model = LSTMmodel(EMBEDDING_SIZE, VOCAB_SIZE, NUM_CLASSES, HIDDEN_SIZE, NUM_LAYERS)
model = model.to(device)
print(model)

**Create an optimizer for training**

In [None]:
LEARNING_RATE = 0.1
criterion = nn.NLLLoss()
# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

#### 1.8.1 How many learnable (or updatable) parameters are present in the model defined in 1.7? Compute the result by writing code.
rubric={accuracy:1}

In [None]:
# your code goes here 
count = 0
for p in model.parameters():
    count += p.numel()
print("the number of parameters:",count)

#### 1.8.2 OPTIONAL QUESTION: How many megabytes of memory will this model use? Please show your work.

Hint1: Each parameter is a `torch.float32` tensor which is 32-bit floating point. 

Hint2: All the parameters of this model are learnable (or updatable) parameters.

rubric={spark:2}

$5109602 \times 32 \text{ bits} = 163507264 \text{ bits}$

$1 \text{ bit} = 1.25 \times 10^{-7} \text{ megabytes}$

$163507264 \text{ bits} = 20.438408 \text{ megabytes}$

Indeed, the checkpoint file (i.e., model_23.pt) is 20.4 MB in size.
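
As a cross-check, the parameter count and the memory estimate can be reproduced by hand in plain Python. This is a sketch that assumes the standard `nn.LSTM` parameter layout (four gates, each with input weights, hidden weights, and two bias vectors) and takes 1 MB as $10^6$ bytes; `lstm_layer_params` is a hypothetical helper.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters of one uni-directional LSTM layer: 4 gates, each with
    input-to-hidden weights, hidden-to-hidden weights, and two bias vectors."""
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

vocab_size, embedding_size, hidden_size, num_classes = 5002, 300, 500, 2

embedding = vocab_size * embedding_size
lstm = (lstm_layer_params(embedding_size, hidden_size)   # layer 1 reads embeddings
        + lstm_layer_params(hidden_size, hidden_size))   # layer 2 reads layer 1
linear = hidden_size * num_classes + num_classes          # weights + biases

total_params = embedding + lstm + linear
megabytes = total_params * 4 / 1e6  # float32 = 4 bytes per parameter
print(total_params, megabytes)
```

The total matches the 5,109,602 parameters used in the calculation above.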

**To facilitate your work, we provide two functions for training and evaluation.**

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

def train(loader):
    total_loss = 0.0
    # iterate through the data loader
    num_sample = 0
    for batch in loader:
        # load the current batch
        batch_input = batch.tweet
        batch_output = batch.label
        
        batch_input = batch_input.to(device)
        batch_output = batch_output.to(device)
        # forward propagation
        # pass the data through the model
        model_outputs = model(batch_input)
        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()
        # compute the gradients
        cur_loss.backward()
        # update the weights
        optimizer.step()

        num_sample += batch_output.shape[0]
    return total_loss/num_sample

# evaluation logic based on classification accuracy
def evaluate(loader):
    all_pred=[]
    all_label = []
    with torch.no_grad(): # disables gradient tracking, which reduces memory usage and speeds up computation
        for batch in loader:
             # load the current batch
            batch_input = batch.tweet
            batch_output = batch.label

            batch_input = batch_input.to(device)
            # forward propagation
            # pass the data through the model
            model_outputs = model(batch_input)
            # identify the predicted class (highest log-probability) for each example in the batch
            log_probs, predicted = torch.max(model_outputs.cpu(), 1)
            # collect the true labels and predictions as plain Python ints
            all_pred.extend(predicted.tolist())
            all_label.extend(batch_output.cpu().tolist())
            
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    return accuracy,f1score

**Here is the code I used to train the model on a GPU. The best model on the validation set was trained for 23 epochs.**

In [None]:
# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

# start the training
for epoch in range(30):
    # train the model for one pass over the data
    train_loss = train(train_iter)  
    # compute the training accuracy
    train_acc,f1t = evaluate(train_iter)
    # compute the validation accuracy
    val_acc,f1v = evaluate(val_iter)
    
    # print the loss for every epoch
    print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f}, Validation Accuracy: {:.4f}'.format(epoch+1, 30, train_loss, train_acc, val_acc))
    
    # save model, optimizer, and number of epoch to a dictionary
    model_save = {
            'epoch': epoch,  # number of epoch
            'model_state_dict': model.state_dict(), # model parameters 
            'optimizer_state_dict': optimizer.state_dict(), # save optimizer 
            'loss': train_loss # training loss
            }
    
    # use torch.save to store 
    torch.save(model_save, "./ckpt/model_{}.pt".format(epoch))

#### 1.9 Please read the [PyTorch documentation](https://pytorch.org/tutorials/beginner/saving_loading_models.html) and write code to load the trained model checkpoint from the file `./ckpt/model_23.pt`.

Note that this checkpoint is a dictionary with four keys: epoch, model_state_dict, optimizer_state_dict, and loss.
The model parameters are stored under the key "model_state_dict".



rubric={accuracy:2}

In [None]:
# your code goes here 
checkpoint = torch.load("./ckpt/model_23.pt", map_location=device) # map_location lets a GPU-trained checkpoint load on CPU-only machines
model.load_state_dict(checkpoint['model_state_dict'])

#### 1.10 Write code to evaluate the trained model on the test set and report the test accuracy and F-score.
The accuracy of the trained model on the test set is 84.186% (fill in your accuracy).

The F1 score of the trained model on the test set is 84.021 (fill in your F1 score).

rubric={accuracy:2,quality:1}

In [None]:
print(evaluate(test_iter))

# 2 Short Answer Questions

### 2.1 Non-Linearity Review
rubric={reasoning:3}

There are a number of different non-linearities that can be used in our neural networks, for instance: **Sigmoid, tanh, and ReLU**. Different variations and tweaks to these non-linearities get introduced fairly frequently, and it pays to have a sense of why you might pick one non-linearity for your network instead of another. **Explain how these three (Sigmoid, tanh, and ReLU) are different and what might be some pros and cons of using them in a neural network**. (There are some great blog/quora posts talking about this topic, but if you are going to be summarizing from a post please include a link to the resource).


Hint: Your answer should be a maximum of 2-3 paragraphs. Short answers are just fine.

#### Write your answer here.
Examples:
- Sigmoid and Tanh both saturate: when the input becomes very large in magnitude (toward +inf or -inf), their gradients get very small. This is the vanishing gradient problem.

- Tanh outputs lie in (-1, 1) and are zero-centered, whereas Sigmoid outputs lie in (0, 1); this makes training a little nicer with Tanh than with Sigmoid.

- ReLU is very fast to compute and does not saturate for positive inputs, so it largely avoids the vanishing gradient issue. Its gradient is zero for negative inputs, though, so some units can stop updating.

There are of course other things you could have found in your research; we are mainly checking that you can look through some resources and weigh the pros and cons of the different non-linearities.
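
The saturation behaviour can be checked numerically. The following is a small sketch using only the standard library; the helper names (`d_sigmoid`, `d_tanh`, `d_relu`) are hypothetical and simply implement the textbook derivatives.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    return 1.0 if x > 0 else 0.0

# Compare the gradients near zero and far from zero.
for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  sigmoid'={d_sigmoid(x):.2e}  "
          f"tanh'={d_tanh(x):.2e}  relu'={d_relu(x):.1f}")
```

At large positive inputs the Sigmoid and Tanh gradients are close to zero, while the ReLU gradient stays at 1; at negative inputs the ReLU gradient is exactly zero.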

### 2.2 ReLU Variations
rubric={reasoning:1}

PyTorch supports several other non-linearities. Find two variations on ReLU that PyTorch implements (see https://pytorch.org/docs/stable/nn.html) and explain how they differ from standard ReLU.



#### Write your answer here.

Picking one as an example: 

- Leaky ReLU is like ReLU but, instead of outputting zero for negative inputs, it outputs the input scaled by a small (tiny!) slope. This matters if you still want to differentiate between negative input values, and it could be slightly more powerful than ReLU at the cost of being slightly more expensive to compute.

There are many other choices; for each, consider what benefits these more "advanced" activation functions might offer.
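
A minimal sketch of the difference, written in plain Python rather than with `nn.LeakyReLU`; the slope of 0.01 mirrors that module's default `negative_slope`, and the function names are hypothetical.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: identity for positive inputs, a small slope for negative ones."""
    return x if x > 0 else negative_slope * x

# Negative inputs are all mapped to 0 by ReLU but stay distinguishable under Leaky ReLU.
for x in (-2.0, -0.5, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x))
```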

### 2.3 Wait why are we using Logs?
rubric={reasoning:1}

What is the purpose of taking logs of probabilities, as in the case of NLLLoss and LogSoftmax?

Hint: A short answer is just fine.


#### Write your answer here.

You can run into underflow issues when multiplying probabilities. Likewise, logs of probabilities let you compute with sums (cheap) instead of products, potentially allowing for faster computation. If you only care about the largest probability, the largest log probability still belongs to the same class, because log is monotonic, so we can do everything with logs that we could do without them.
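
The underflow issue is easy to reproduce. Here is a short sketch with made-up per-token probabilities (the value 0.01 and the length 500 are arbitrary assumptions):

```python
import math

# Hypothetical per-token probabilities for a 500-token sequence.
probs = [0.01] * 500

product = 1.0
for p in probs:
    product *= p  # shrinks by a factor of 100 at every step

log_sum = sum(math.log(p) for p in probs)

print(product)  # the product underflows in double precision
print(log_sum)  # the sum of logs stays a finite negative number
```

The true product is $10^{-1000}$, far below the smallest positive double, whereas the log-space value (about -2302.6) is represented without trouble.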

### 2.4 Softmax
rubric={reasoning:2}

For a multiclass problem with a Softmax layer, give an example of a hypothetical Softmax output with 3 classes (hint, think back to Lab1 for what that output might look like). Generally, which of the three classes in your example should your classifier pick? Why might we care about values corresponding to more than one of these classes?

Hint: Sometimes knowing the top $n$ most likely classes is of interest. Why?

#### Write your answer here.

Say, for instance, that your output from Softmax looks like $[.3,.6,.1]$. First, it should be a probability distribution (the elements should sum to 1). Second, you should generally pick the second class here (.6), since it has the highest probability. There are a couple of reasons why you might care about the other numbers, though. One is handling ambiguous situations differently: if the top two classes are close together and both fairly low, you might in some situations want to do something other than just predicting the top class.
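
As an illustration, the sketch below computes a Softmax over three made-up scores and extracts the top $n$ classes. The logits are hypothetical, and `softmax` and `top_n` are illustrative helpers rather than PyTorch functions.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_n(probs, n):
    """Return (class_index, probability) pairs for the n most likely classes."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

logits = [1.0, 1.7, -0.1]  # hypothetical scores for 3 classes
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(top_n(probs, 2))
```

Looking at the top two entries shows not only which class wins but also how close the runner-up is, which is the information needed to treat ambiguous cases differently.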

### 2.5 RNNs
rubric={reasoning:1}

RNNs are quite powerful for dealing with certain types of data (such as sequences), however, they have some drawbacks. What are some of these drawbacks? (List two drawbacks)

Hint: Look at RNN slides, including the last few slides.


#### Write your answer here.

Clear issues include difficulty with long sequences (gradients vanish over many time steps) and runtime: the computation can't be parallelized across time steps, because each step depends on the previous one.
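
Both drawbacks show up even in a toy one-unit RNN. The sketch below is an illustration under assumed weights (`w=0.5`, `u=1.0`) and constant inputs, not a model from the course; `scalar_rnn_gradient` is a hypothetical helper.

```python
import math

def scalar_rnn_gradient(inputs, w=0.5, u=1.0):
    """Run h_t = tanh(w * h_{t-1} + u * x_t) and return d h_T / d h_0."""
    h, grad = 0.0, 1.0
    for x in inputs:  # each step needs the previous h, so the loop is sequential
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)  # chain rule through one time step
    return grad

short_grad = scalar_rnn_gradient([0.1] * 5)
long_grad = scalar_rnn_gradient([0.1] * 100)
print(short_grad, long_grad)
```

The loop cannot be split across time steps, and the gradient with respect to the initial state shrinks multiplicatively with sequence length, so early tokens have almost no influence on learning for long sequences.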

### 2.6 LSTMs vs RNNs

rubric={reasoning:2}

LSTMs help alleviate one of the issues with RNNs (see 2.5). What do they alleviate? Do you see any problems that the model might still have (just reason generally about the model based on your understanding)?

Hint: Short answer is good.


#### Write your answer here.

LSTMs handle long sequences better, although attention-based methods, which we will get to later in the semester, turn out to be much better. LSTMs will still run into issues with very long sequences, although they are now able to learn to retain certain information over the course of the sequence. They still have the same weakness in terms of speed, as they can't be parallelized across time steps.