## Homework 3 - Supervised Learning II - MDS Computational Linguistics

### Assignment Topics
- Recurrent Neural Networks
- Long Short-Term Memory (LSTM) networks
- Saving and loading NN models using PyTorch
- Very-short answer questions

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)

### Submission Info.
- Due Date: February 1, 2020, 18:00:00 (Vancouver time)

## Getting Started

In [None]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset
from torchtext.data import Iterator, BucketIterator

In [None]:
# set the seed
manual_seed = 123
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

## Exercise 1: Building a Recurrent Neural Network

In this exercise, we provide a corpus from the [CL-Aff shared task](https://sites.google.com/view/affcon2019/cl-aff-shared-task?authuser=0). HappyDB is a dataset of about 100,000 `happy moments` crowd-sourced via Amazon’s Mechanical Turk, where each worker was asked to describe in a complete sentence `what made them happy in the past 24 hours`. Each worker was asked to describe three such moments. 
In this exercise, we focus on sociality classification. We use only the labelled portion of the dataset, which includes 10,562 labelled samples. 

We have already preprocessed the tweets (tokenization; removal of URLs, mentions, hashtags, and so on) and placed them under the ``data/happy_db`` folder in three files: ``train.tsv``, ``dev.tsv``, and ``test.tsv``.

#### 1.1 Write code to define a `whitespace_tokenize` function whose input is a tweet text.

rubric={accuracy:1}

In [None]:
def whitespace_tokenize(text):
    # your code goes here 
    pass
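
As a hedged illustration of what whitespace tokenization means (a sketch, not necessarily the expected solution), Python's built-in `str.split()` called without arguments splits on any run of whitespace. The helper name `split_on_whitespace` and the sample sentence are made up for this example.

```python
def split_on_whitespace(text):
    """Split a string on runs of whitespace (spaces, tabs, newlines)."""
    # With no separator, str.split() ignores leading/trailing whitespace
    # and treats consecutive whitespace characters as a single delimiter.
    return text.split()


sample = "  I  went hiking\twith my friends \n"
tokens = split_on_whitespace(sample)
print(tokens)
```

Note that `text.split(" ")` behaves differently: it splits on single spaces only and can return empty strings when spaces are repeated.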

#### 1.2 Write code to define TorchText `Field`s that handle how the data should be processed.  
Hint: You need 2 Fields, `TEXT` and `LABEL`, for the tweet text and the label respectively. 

* The tokenizer is the `whitespace_tokenize` function from 1.1.
* Use the truecase of words (i.e., keep the original casing; do not lowercase). 

rubric={accuracy:2} 

In [None]:
# your code goes here 


#### 1.3 Write code to use the `TabularDataset` class and your `Fields` to process the tsv files (train, dev, and test). 

Hint 1: `Fields` will call the tokenizer and convert tokens to numerical indices. 

Hint 2: Use `TabularDataset.splits(...)` to load train, dev, and test sets.

rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.4 Write code to build your vocabulary to map words and labels to integers. The maximum size of the vocabulary is 5,000 (not counting `<pad>` and `<unk>`).
Hint: Use `build_vocab` to build vocabularies for `TEXT` field and `LABEL` field respectively. 
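
To make the idea of a size-limited vocabulary concrete, here is a minimal pure-Python sketch. It is not torchtext's implementation: the toy corpus, the `MAX_SIZE` of 2 (standing in for 5,000), and the names `itos`, `stoi`, and `encode` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical tokenized corpus; in the assignment the tokens come from train.tsv.
tokenized = [["i", "ate", "cake"], ["i", "ate", "pie"], ["i", "slept"]]
MAX_SIZE = 2  # stands in for the 5,000-word limit

# Count token frequencies over the whole corpus.
counts = Counter(tok for sent in tokenized for tok in sent)

# Special tokens come first and do not count towards MAX_SIZE.
specials = ["<unk>", "<pad>"]
itos = specials + [tok for tok, _ in counts.most_common(MAX_SIZE)]
stoi = {tok: idx for idx, tok in enumerate(itos)}


def encode(tokens):
    """Map tokens to indices, sending out-of-vocabulary tokens to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]


print(len(itos))
print(encode(["i", "ate", "pizza"]))
```

With this convention the total number of entries is the size limit plus the two special tokens.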

rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.5 Write code to print the sizes of your two vocabularies (i.e., `TEXT` and `LABEL`) individually. 

rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.6 Write code to construct the Iterators for the train, dev, and test splits. Use `BucketIterator` to initialize the Iterators for the train, dev, and test data.
* Apply the same batch size of 32 to the train, dev, and test sets.
* Sort samples by length.
* Sort samples within each batch.

Hint: Use `BucketIterator.splits(...)`
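
The motivation for grouping samples of similar length can be sketched in plain Python. This is a simplified illustration only, not how `BucketIterator` is actually implemented; the toy samples, `BATCH_SIZE`, and the helper `pad_batch` are made up here.

```python
# Hypothetical tokenized samples with lengths 5, 2, 9, 3, 7 and 1.
samples = [["a"] * n for n in [5, 2, 9, 3, 7, 1]]
BATCH_SIZE = 2

# Sort by length so that each batch contains samples of similar length.
by_length = sorted(samples, key=len)
batches = [by_length[i:i + BATCH_SIZE] for i in range(0, len(by_length), BATCH_SIZE)]


def pad_batch(batch, pad_token="<pad>"):
    """Pad every sample in a batch to the length of its longest sample."""
    longest = max(len(s) for s in batch)
    return [s + [pad_token] * (longest - len(s)) for s in batch]


padded = [pad_batch(b) for b in batches]
batch_lengths = [[len(s) for s in b] for b in batches]
print(batch_lengths)
```

Because neighbouring lengths end up in the same batch, each batch needs only a small amount of padding.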

rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.7.1 Write code to create a class called `LSTMmodel` to define a classifier of the task. In this model, you should have:

1. An embedding layer with input dimension equal to the size of your `TEXT` vocabulary, and that represents each token as a 300-dimensional vector.  The parameters of this embedding layer should be randomly initialized with numbers sampled from a normal distribution (mean 0 and variance 0.05).

2. Two uni-directional `LSTM` layers. Each layer has 500 hidden units.

3. Pass the last hidden state of the last layer into a `Tanh` activation function, then feed the output of `Tanh` to a linear layer whose output dimensionality is equal to the number of classes in the dataset (i.e., 2 in our case).

4. Then, a `LogSoftmax` layer on top of the output of the linear layer.

5. Return the output of `LogSoftmax` layer.

Hint: `Tanh` might not be the ideal function to use, but we want you to explore it. (Usually `ReLU` works well).
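
The tensor shapes involved can be checked on a toy example. The sketch below uses small made-up sizes rather than the assignment's hyper-parameters, and it is not a full model; it only shows how `nn.init.normal_` and the outputs of `nn.LSTM` behave.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only (not the assignment's hyper-parameters).
VOCAB, EMB, HID, LAYERS = 20, 8, 6, 2
SEQ_LEN, BATCH = 5, 3

embedding = nn.Embedding(VOCAB, EMB)
# nn.init.normal_ takes a standard deviation, so a variance v means std = sqrt(v).
nn.init.normal_(embedding.weight, mean=0.0, std=math.sqrt(0.05))

# By default nn.LSTM expects input of shape (seq_len, batch, input_size).
lstm = nn.LSTM(EMB, HID, num_layers=LAYERS)

token_ids = torch.randint(0, VOCAB, (SEQ_LEN, BATCH))
output, (h_n, c_n) = lstm(embedding(token_ids))

# h_n holds the final hidden state of every layer; index -1 is the top layer.
last_hidden = h_n[-1]
print(output.shape, h_n.shape, last_hidden.shape)
```

For a uni-directional LSTM, `h_n[-1]` equals the last time step of `output`, so either can serve as the "last hidden state of the last layer".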

rubric={accuracy:8, quality:4}

In [None]:
# your code goes here 


#### 1.7.2 Write code to instantiate the model class with the aforementioned hyper-parameters and print out the model architecture.
rubric={accuracy:3}

Hint: Your output should generally look like the following (with correct names of the model, functions, etc.):
```
modelName (
  (embedding): Embedding(xxx, xxx)
  (lstm_rnn): MODEL(xxx, xxx, num_layers=xxx)
  (activation_fn): Tanh()
  (linear_layer): Linear(in_features=xxx, out_features=xxx, bias=True)
  (softmax_layer): somethingRelatedToSoftmax()
)
```

In [None]:
# your code goes here 


**Create an optimizer for training**

In [None]:
LEARNING_RATE = 0.1
criterion = nn.NLLLoss()
# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
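
`nn.NLLLoss` expects log-probabilities as input, which is why the model ends in a `LogSoftmax` layer. The following sketch, on made-up logits and targets, checks that `LogSoftmax` followed by `NLLLoss` gives the same value as `nn.CrossEntropyLoss` applied to the raw scores.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical raw scores for 4 samples and 2 classes, plus made-up gold labels.
logits = torch.randn(4, 2)
targets = torch.tensor([0, 1, 1, 0])

# LogSoftmax turns raw scores into log-probabilities along the class dimension.
log_probs = nn.LogSoftmax(dim=1)(logits)

# NLLLoss on log-probabilities vs. CrossEntropyLoss on raw scores.
nll = nn.NLLLoss()(log_probs, targets)
ce = nn.CrossEntropyLoss()(logits, targets)
print(nll.item(), ce.item())
```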

#### 1.8.1 How many learnable (or updatable) parameters are present in the model defined in 1.7? Compute the result by writing code.
rubric={accuracy:1}
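
One common way to count parameters is to sum `numel()` over the tensors returned by `parameters()`. The sketch below does this for a small stand-in layer (`toy_model` is hypothetical, not the model from 1.7).

```python
import torch.nn as nn

# A small stand-in model: 10 * 5 weights + 5 biases.
toy_model = nn.Linear(10, 5)

# Count only the parameters that receive gradients.
n_trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
print(n_trainable)
```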

In [None]:
# your code goes here 


#### 1.8.2 OPTIONAL QUESTION: How many megabytes of memory will this model use? Please show your work. 

Hint1: Each parameter is stored as a `torch.float32` value, i.e., a 32-bit (4-byte) floating point number. 

Hint2: All the parameters of this model are learnable (or updatable) parameters.
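
The arithmetic behind such an estimate can be sketched as follows. The parameter count here is a made-up round number, not the answer for this model, and the sketch covers parameter storage only (it ignores gradients, optimizer state, and activations).

```python
BYTES_PER_FLOAT32 = 4  # 32 bits / 8 bits per byte

n_params = 1_000_000  # hypothetical parameter count, for illustration only
size_bytes = n_params * BYTES_PER_FLOAT32

size_mb = size_bytes / 10**6     # megabytes (decimal convention)
size_mib = size_bytes / 1024**2  # mebibytes (binary convention)
print(size_mb, round(size_mib, 4))
```

Whichever convention you choose (10^6 or 1024^2 bytes per megabyte), state it in your answer.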

rubric={spark:2}

#### Write your answer here.


**To facilitate your work for the next questions (1.9 and 1.10), we provide two functions for training and evaluation.**

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

def train(loader):
    total_loss = 0.0
    # iterate through the data loader
    num_sample = 0
    for batch in loader:
        # load the current batch
        batch_input = batch.tweet
        batch_output = batch.label
        
        batch_input = batch_input.to(device)
        batch_output = batch_output.to(device)
        # forward propagation
        # pass the data through the model
        model_outputs = model(batch_input)
        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()
        # compute the gradients
        cur_loss.backward()
        # update the weights
        optimizer.step()

        num_sample += batch_output.shape[0]
    return total_loss/num_sample

# evaluation logic based on classification accuracy
def evaluate(loader):
    all_pred=[]
    all_label = []
    with torch.no_grad():  # disable the autograd engine: reduces memory usage and speeds up computation
        for batch in loader:
            # load the current batch
            batch_input = batch.tweet
            batch_output = batch.label

            batch_input = batch_input.to(device)
            # forward propagation
            # pass the data through the model
            model_outputs = model(batch_input)
            # identify the predicted class for each example in the batch
            max_log_probs, predicted = torch.max(model_outputs.cpu(), 1)
            # collect the true labels and predictions as plain Python ints
            # (moved to the CPU first so that scikit-learn can consume them)
            all_pred.extend(predicted.tolist())
            all_label.extend(batch_output.cpu().tolist())
            
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    return accuracy,f1score
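
The `average='macro'` option computes an F1 score per class and then takes their unweighted mean. A hand-rolled sketch on made-up labels shows the computation; the function name `macro_f1` and the toy data are assumptions for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2 * TP / (2 * TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
accuracy = sum(1 for t, p in zip(gold, pred) if t == p) / len(gold)
macro = macro_f1(gold, pred)
print(accuracy, round(macro, 4))
```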

#### 1.9 Please read the [PyTorch documentation](https://pytorch.org/tutorials/beginner/saving_loading_models.html) and write code to load the trained model checkpoint from `./ckpt/model_24.pt`.

Note that this checkpoint is a dictionary with three keys: `epoch`, `model_state_dict`, and `optimizer_state_dict`.
The model parameters are stored under the key `model_state_dict`.
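
The general save/load pattern for a checkpoint dictionary of this shape can be sketched on a toy model. Everything below (the `nn.Linear` stand-in, the temporary file path, the epoch value) is hypothetical and only mirrors the three keys mentioned above.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-ins for the real model and optimizer.
toy_model = nn.Linear(4, 2)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

# Save a checkpoint dictionary with the same three keys.
ckpt_path = os.path.join(tempfile.mkdtemp(), "toy_ckpt.pt")
torch.save(
    {
        "epoch": 3,
        "model_state_dict": toy_model.state_dict(),
        "optimizer_state_dict": toy_optimizer.state_dict(),
    },
    ckpt_path,
)

# Restore into freshly constructed objects with the same architecture.
restored_model = nn.Linear(4, 2)
restored_optimizer = torch.optim.SGD(restored_model.parameters(), lr=0.1)

checkpoint = torch.load(ckpt_path, map_location="cpu")
restored_model.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
restored_model.eval()  # switch to evaluation mode before running inference
```

`map_location` lets a checkpoint saved on a GPU be loaded on a machine without one.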



rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.10 Write code to evaluate the trained model on the test set and report the test accuracy and F1 score. 
Report your results in the following form:

- The accuracy of the trained model on the test set is XX.XXX% (fill in your accuracy).
- The F1 score of the trained model on the test set is XX.XXX (fill in your F1 score).

rubric={accuracy:2,quality:1}

In [None]:
# your code goes here 


## Exercise 2: Short Answer Questions

### 2.1 Non-Linearity Review
rubric={reasoning:3}

There are a number of different non-linearities that can be used in our neural networks, for instance: **Sigmoid, tanh, and ReLU**. Variations and tweaks to these non-linearities are introduced fairly frequently, and it pays to have a sense of why you might pick one non-linearity for your network instead of another. **Explain how these three (Sigmoid, tanh, and ReLU) are different and what might be some pros and cons of using them in a neural network**. (There are some great blog/Quora posts on this topic, but if you summarize from a post, please include a link to the resource.)


Hint: Your answer should be at most 2-3 paragraphs. Short answers are just fine.

#### Write your answer here.

### 2.2 ReLU Variations
rubric={reasoning:1}

PyTorch supports several other non-linearities. Find two variations on ReLU that PyTorch implements (see https://pytorch.org/docs/stable/nn.html) and explain how they are different from standard ReLU.


#### Write your answer here.


### 2.3 Wait, why are we using logs?
rubric={reasoning:1}

What is the purpose of taking logs of probabilities, as in the case of NLLLoss and LogSoftmax?

Hint: A short answer is just fine.


#### Write your answer here.

### 2.4 Softmax
rubric={reasoning:2}

For a multiclass problem with a Softmax layer, give an example of a hypothetical Softmax output with 3 classes (hint: think back to Lab1 for what that output might look like). Generally, which of the three classes in your example should your classifier pick? Why might we care about values corresponding to more than one of these classes?

Hint: Sometimes it is of interest to know the top n most likely classes. Why?

#### Write your answer here.

### 2.5 RNNs
rubric={reasoning:1}

RNNs are quite powerful for dealing with certain types of data (such as sequences); however, they have some drawbacks. What are some of these drawbacks? (List two drawbacks.)

Hint: Look at RNN slides, including the last few slides.


#### Write your answer here.

### 2.6 LSTMs vs RNNs

rubric={reasoning:2}

LSTMs help alleviate one of the issues with RNNs (see 2.5). What do they alleviate? Do you see any problems that the model might still have? (Just reason generally about the model based on your understanding.)

Hint: A short answer is fine.


#### Write your answer here.