## Homework 3 - Supervised Learning II - MDS Computational Linguistics

## T3 Helper Code


In [1]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset
from torchtext.data import Iterator, BucketIterator

In [2]:
# set the seed
manual_seed = 572
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

### T3 - CNNs for text classification

Convolutional Neural Networks (CNNs) are essentially a special case of a normal feed-forward network. Instead of being "densely" connected (you may see people refer to feed-forward layers as "dense layers"), each node in a CNN connects to a smaller set of nodes, defined by the "filter size" (kernel size) of the network. CNNs thus have a smaller local window through which to look at the data, but make up for it by generally using many filters, each of which may learn a different aspect of the data. These networks turn out to be extremely useful for processing images, audio, and any data with some sort of spatial structure.

They can also be used for sentence classification tasks. Words in a sentence have a sort of 1D spatial ordering, which means some classification tasks can benefit from a CNN's ability to operate over the length of the sequence. In addition, because of the sparsity of the connections, you can build much smaller networks that retain a great deal of power.

<img src="images/cnn.png" alt="cnn" title="CNN vs FF-net comparison" />

<p style="text-align: center;">Top: A CNN with filter size of 3, Bottom: A Feedforward neural net. (From Goodfellow et al. 2016)</p>

This illustrates a single filter of size 3. Your network might have many additional filters that all pass over the length of the data, each one producing an output channel to be passed to the next part of the network.
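
As a rough illustration of the sliding-filter idea, here is a minimal pure-Python sketch. It assumes a single input channel, no padding, and a stride of 1; the helper name `conv1d_single_filter` and the kernel weights are made up for this example and are not part of PyTorch.

```python
def conv1d_single_filter(sequence, kernel, bias=0.0):
    """Slide one filter over a 1D sequence (no padding, stride 1)."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        # weighted sum of the values inside the current window
        outputs.append(sum(w * x for w, x in zip(kernel, window)) + bias)
    return outputs

sequence = [1.0, 2.0, 3.0, 4.0, 5.0]

# two hypothetical size-3 filters; each one produces its own output channel
filter_a = [1.0, 0.0, -1.0]
filter_b = [0.0, 1.0, 0.0]

channels = [conv1d_single_filter(sequence, f) for f in (filter_a, filter_b)]
print(channels[0])  # [-2.0, -2.0, -2.0]
print(channels[1])  # [2.0, 3.0, 4.0]
```

Without padding, a length-5 input and a size-3 filter give 3 output positions per channel. `nn.Conv1d` additionally sums over all input channels (the embedding dimension) at each position.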

<img src="images/1dcnn.png" alt="1dcnn" title="Pytorch 1DCNN overview" />

*We'll get into 2D CNNs later (as an additional topic in COLX 585) but here is a quick taste for how you could use a 1D CNN in the context of text-based NLP tasks.*


#### PyTorch 1D CNNs and max-pooling

Here's a quick example of how these networks function:

In [5]:

x = torch.rand((2,5,10))   # batch size 2 with length 10 and 5 dim embedding.

in_dim = 5
filters = 4

cnn1d = nn.Conv1d(in_dim,filters,kernel_size=3,padding=1) 
max_pool = nn.MaxPool1d(kernel_size=2)
ad_max_pool = nn.AdaptiveMaxPool1d(1)
activation = nn.ReLU()
x = cnn1d(x)
x = activation(x)
print(x)
print("Take the highest value in each window using max pool")
print(max_pool(x))
print(ad_max_pool(x))


tensor([[[0.0000, 0.2302, 0.4976, 0.1462, 0.2659, 0.1160, 0.2331, 0.2770,
          0.1295, 0.3005],
         [0.2625, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000],
         [0.3008, 0.0134, 0.0624, 0.2059, 0.1582, 0.1532, 0.0704, 0.0551,
          0.0224, 0.0000],
         [0.1708, 0.0017, 0.0000, 0.0516, 0.0199, 0.0000, 0.1053, 0.0412,
          0.0722, 0.0000]],

        [[0.0933, 0.2024, 0.3714, 0.2602, 0.1871, 0.1673, 0.0322, 0.5629,
          0.1314, 0.5456],
         [0.0278, 0.0589, 0.0000, 0.1606, 0.0000, 0.0000, 0.1000, 0.0000,
          0.0000, 0.0000],
         [0.0788, 0.1129, 0.0000, 0.2778, 0.0000, 0.1022, 0.2206, 0.0000,
          0.2809, 0.0000],
         [0.1701, 0.1184, 0.0053, 0.0000, 0.0000, 0.0000, 0.0847, 0.0000,
          0.1222, 0.0000]]], grad_fn=<ReluBackward0>)
Take the highest value in each window using max pool
tensor([[[0.2302, 0.4976, 0.2659, 0.2770, 0.3005],
         [0.2625, 0.0000, 0.0000, 0.0000, 0.0000],
        

Here we've used a 1D CNN and max pooling to "summarize" our data: `MaxPool1d(kernel_size=2)` halves the length from 10 to 5, while `AdaptiveMaxPool1d(1)` boils each length-10 series down to only 4 items, one per filter. One could repeat combinations of CNNs and activation functions, with or without pooling layers, to build up a network in the same way that we have worked with FF nets.
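
To make the pooling step concrete, the sketch below reproduces it in plain Python on the first output channel printed above. The helper names `max_pool1d` and `global_max_pool` are hypothetical stand-ins for what `nn.MaxPool1d(kernel_size=2)` and `nn.AdaptiveMaxPool1d(1)` do on a single channel.

```python
def max_pool1d(values, kernel_size):
    """Non-overlapping max pooling: the stride equals the kernel size."""
    return [max(values[i:i + kernel_size])
            for i in range(0, len(values) - kernel_size + 1, kernel_size)]

def global_max_pool(values):
    """Keep only the single largest activation of the channel."""
    return max(values)

# first channel of the first example in the printed output above
channel = [0.0000, 0.2302, 0.4976, 0.1462, 0.2659,
           0.1160, 0.2331, 0.2770, 0.1295, 0.3005]

print(max_pool1d(channel, kernel_size=2))  # [0.2302, 0.4976, 0.2659, 0.277, 0.3005]
print(global_max_pool(channel))            # 0.4976
```

The pooled values match the first row of the max-pool output shown above.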

But first, one quick check about 1D CNNs.

#### 1D CNN parameters
rubric={reasoning:1}  
  
Including bias weights, how many trainable parameters are in our example CNN? (Think about how many weights are needed to operate over your kernel size and each segment of data.) Explain your reasoning in your answer, and consider checking it by printing out those weights/biases.


*Your Answer Here*

#### Size through CNNs
rubric={accuracy:1, reasoning:1}

CNNs can be a little tricky because, as the data passes through them, its dimensions may change based on the number of filters, the kernel size, the padding, and two other things we won't talk about now: stride and dilation. Max pooling can also quickly decrease the size of the data. This is especially useful as a form of feature extraction or dimensionality reduction, but it's important to make sure we get the right output dimensions.
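
The output length can be worked out by hand. The sketch below applies the output-length formula given in the PyTorch documentation for `nn.Conv1d` and `nn.MaxPool1d` to the earlier example (length 10, kernel size 3, padding 1, followed by a max pool with kernel size 2). The helper name `output_length` is made up for illustration, and the pool's stride of 2 reflects `nn.MaxPool1d` defaulting its stride to the kernel size.

```python
def output_length(length, kernel_size, padding=0, stride=1, dilation=1):
    """Length of the output sequence of a 1D convolution or pooling layer."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Conv1d(kernel_size=3, padding=1) keeps the length unchanged
after_conv = output_length(10, kernel_size=3, padding=1)

# MaxPool1d(kernel_size=2) uses stride == kernel_size by default
after_pool = output_length(after_conv, kernel_size=2, stride=2)

print(after_conv, after_pool)  # 10 5
```

The number of output channels is not covered by this formula; it is simply the number of filters of the convolution, and pooling leaves it unchanged.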

For a tensor with a batch size of 10, an embedding dimension of 25, and a length of 30 that is fed to a 1D CNN + MaxPool combination, what parameters of the 1D CNN (kernel size, number of filters, padding) and MaxPool (kernel_size, padding) would we need to get an output with an embedding dimension of 5 and a length of 10?


*Your Answer Here*

In [None]:
## Test your answer by making a random tensor of the appropriate size, 
## passing it through your proposed CNN+Maxpool and printing out the final size.

## Your Code Here ##

#### 1D CNN Model for Sentiment Analysis
rubric={accuracy:3, quality:1}

Based on our tutorial code that we've used for RNNs, modify our network to use 1D CNNs instead of FF networks. We'll search over the architecture in the next question, but for now just make it a basic 2-layer CNN with 5 filters, a kernel size of 3, and ReLU activation (no max pooling).

In [6]:
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset

# define the white space tokenizer to get tokens
def tokenize_en(tweet):
    """
    Tokenizes English tweet from a string into a list of strings (tokens)
    """
    return tweet.strip().split()

# define the TorchText's fields
TEXT = Field(sequential=True, tokenize=tokenize_en, lower=True)
LABEL = Field(sequential=False, unk_token = None)


train, val, test = TabularDataset.splits(
    path="./data/sentiment-twitter-2016-task4/", # the root directory where the data lies
    train='train.tsv', validation="dev.tsv", test="test.tsv", # file names
    format='tsv',
    skip_header=False, # set this to True if your tsv file has a header, so that the header doesn't get processed as data!
    fields=[('tweet', TEXT), ('label', LABEL)])

TEXT.build_vocab(train, min_freq=3) # builds vocabulary based on all the words that occur at least three times in the training set
LABEL.build_vocab(train)

train_iter, val_iter, test_iter = BucketIterator.splits(
 (train, val, test), # we pass in the datasets we want the iterator to draw data from
 batch_sizes=(4,64,64),
 # A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding.
 sort_key=lambda x: len(x.tweet), 
 sort=True,
 sort_within_batch=True
)

VOCAB_SIZE = len(TEXT.vocab.stoi)

WORD_VEC_SIZE=300
# Note, the parameters to Embedding class below are:
# num_embeddings (int): size of the dictionary of embeddings
# embedding_dim (int): the size of each embedding vector
# For more details on Embedding class, see: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/sparse.py

    


In [7]:
print(VOCAB_SIZE)

3330


In [8]:

class ConvNet(nn.Module):
  
  def __init__(self, layer_num, filtersize, filters,nonlin, output_size, VOCAB_SIZE,  WORD_VEC_SIZE):  #feel free to add additional parameters
    super(ConvNet, self).__init__()
    self.embedding = nn.Embedding(VOCAB_SIZE, WORD_VEC_SIZE, sparse=True)
    #your code here
    self.embedding.weight.data.normal_(0.0,0.05)
    self.layers = nn.ModuleList()
    self.nonlin = nonlin
    for i in range(layer_num):
        if i == 0:
            self.layers.append(nn.Conv1d(WORD_VEC_SIZE, filters, filtersize, padding=(filtersize-1)//2))
        else:
            self.layers.append(nn.Conv1d(filters,filters,filtersize, padding=(filtersize-1)//2))
        self.layers.append(self.nonlin)    
    self.max_layer = nn.AdaptiveMaxPool1d(1)
    self.output = nn.Linear(filters, output_size)
    self.softmax = nn.LogSoftmax(dim=1) 

  def forward(self, x):
    x = self.embedding(x.permute(1,0)).permute(0,2,1)
    for layer in self.layers:
        x = layer(x)
    x = self.max_layer(x).squeeze(dim=-1)
    x =self.softmax(self.output(x))
    
    return x

  

In [10]:

from sklearn.metrics import accuracy_score
def train(loader,model,criterion,optimizer,device):
    total_loss = 0.0
    # iterate through the data loader
    num_sample = 0
    for batch in loader:
        # load the current batch
        batch_input = batch.tweet
        batch_output = batch.label
        
        batch_input = batch_input.to(device)
        batch_output = batch_output.to(device)
        # forward propagation
        # pass the data through the model
        model_outputs = model(batch_input)
        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()
        # compute the gradients
        cur_loss.backward()
        # update the weights
        optimizer.step()

        num_sample += batch_output.shape[0]
    return total_loss/num_sample

# evaluation logic based on classification accuracy
def evaluate(loader,model,criterion,device):
    all_pred=[]
    all_label = []
    with torch.no_grad(): # deactivates the autograd engine, which reduces memory usage and speeds up computation
        for batch in loader:
             # load the current batch
            batch_input = batch.tweet
            batch_output = batch.label

            batch_input = batch_input.to(device)
            # forward propagation
            # pass the data through the model
            model_outputs = model(batch_input)
            # identify the predicted class for each example in the batch
            probabilities, predicted = torch.max(model_outputs.cpu().data, 1)
            # put all the true labels and predictions to two lists
            all_pred.extend(predicted)
            all_label.extend(batch_output)
            
    accuracy = accuracy_score(all_label, all_pred)
    return accuracy


In [11]:
import scipy.stats

LEARNING_RATE=.1
MAX_EPOCHS=10

def random_search(num_iter):
    results = []
    for i in range(num_iter):
        config = {
            # define hyperparameters here (the upper bound of randint is exclusive)
            "layers": scipy.stats.randint.rvs(1,3),      # 1 or 2 layers
            "filters": scipy.stats.randint.rvs(10,200)   # 10 to 199 filters
        }
        
        print("new config")
        print(config)
        model = ConvNet(config["layers"],3,config["filters"],nn.ReLU(),output_size=3, VOCAB_SIZE=VOCAB_SIZE, WORD_VEC_SIZE=WORD_VEC_SIZE)
        model.to(device)
        criterion = nn.NLLLoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

    
        max_val = 0
        best_epoch = 0
        for epoch in range(MAX_EPOCHS):
            # train the model for one pass over the data
            train_loss = train(train_iter,model,criterion,optimizer,device)  
            # compute the training accuracy
            train_acc = evaluate(train_iter,model,criterion,device)
            # compute the validation accuracy
            val_acc = evaluate(val_iter,model,criterion,device)
            # keep track of the best validation accuracy and the epoch it occurred in
            if val_acc > max_val:
                max_val = val_acc
                best_epoch = epoch+1
            # print the loss for every epoch
            print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f}, Validation Accuracy: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss, train_acc, val_acc))
        results.append((max_val,best_epoch,config))
    return results

In [None]:
random_search(3)

new config
{'layers': 2, 'filters': 110}
Epoch [1/10], Loss: 0.2467, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [2/10], Loss: 0.2451, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [3/10], Loss: 0.2364, Training Accuracy: 0.5763, Validation Accuracy: 0.4712
Epoch [4/10], Loss: 0.2268, Training Accuracy: 0.6090, Validation Accuracy: 0.4742
Epoch [5/10], Loss: 0.2190, Training Accuracy: 0.6288, Validation Accuracy: 0.4747


#### 1D CNN Performance
rubric={accuracy:2, quality:1}

Based on our initial network, we'd like to compare how much depth matters versus the number of filters in a given layer. For this section, perform a random search (total trials: 20) over the parameter space (number of layers, kernel size, number of filters, activation, etc.) and report your results (ranges searched, best performance) on the sentiment task. Consider splitting up the trials as in Lab 1, with group members each running a set from different random start states. Hint: Feel free to use Skorch + sklearn + scipy.stats (for distributions).
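
One way to set up the sampling step of a random search is sketched below, using only the standard library. The ranges, the candidate kernel sizes, and the activation names are assumptions chosen for illustration, not the ranges you are required to search. Note that `random.randint` includes both endpoints, whereas `scipy.stats.randint` excludes its upper bound.

```python
import random

def sample_config(rng):
    """Draw one hypothetical hyperparameter configuration."""
    return {
        "layers": rng.randint(1, 3),             # 1, 2, or 3 (both ends inclusive)
        "kernel_size": rng.choice([3, 5, 7]),    # odd sizes keep "same" padding simple
        "filters": rng.randint(10, 200),
        "activation": rng.choice(["relu", "tanh"]),
    }

rng = random.Random(572)   # fixed seed so the trials can be reproduced
configs = [sample_config(rng) for _ in range(20)]

for config in configs[:3]:
    print(config)
```

Each sampled dictionary would then be used to build and train one model, keeping the best validation accuracy per configuration.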

In [None]:
# your code here.

#### 1D CNN summary
rubric={reasoning:1}

Based on the performance on the task how does the CNN compare to the Feed Forward network from Lab 2? What factors seemed to be most important to the performance of the model (number of filters, depth, size of filters...? ) 

*Your answer here*

## Exercise 1: Building a Recurrent Neural Network

In this exercise, we provide a corpus from the [CL-Aff shared task](https://sites.google.com/view/affcon2019/cl-aff-shared-task?authuser=0). HappyDB is a dataset of about 100,000 `happy moments` crowd-sourced via Amazon’s Mechanical Turk where each worker was asked to describe in a complete sentence `what made them happy in the past 24 hours`. Each user was asked to describe three such moments. 
In this exercise, we focus on sociality classification. We only use the labelled portion of the dataset, which includes 10,562 labelled samples.

We have already preprocessed the tweets (tokenization, removing URLs, mentions, hashtags, and so on) and placed them under the ``data/happy_db`` folder in three files: ``train.tsv``, ``dev.tsv``, and ``test.tsv``.

#### 1.1 Write code to define a `whitespace_tokenizer` whose input is a tweet text.

rubric={accuracy:1}

In [None]:
def whitespace_tokenize(text):
    # your code goes here
    pass  # placeholder so the cell runs; replace with your implementation

#### 1.2 Write code to define TorchText's `Fields`, which handle how data should be processed.
Hint: You need 2 Fields `TEXT` and `LABEL` for tweet text and label respectively. 

* The tokenizer is the whitespace_tokenizer in 1.1.
* Keep the true case of words (i.e., do not lowercase them).

rubric={accuracy:2} 

In [None]:
# your code goes here 


#### 1.3 Write code to use `TabularDataset class`  and `Fields` to process the tsv files (train, dev, and test). 

Hint 1: `Fields` will call tokenizer, and convert tokens to numerical index. 

Hint 2: Use `TabularDataset.splits(...)` to load train, dev, and test sets.

rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.4 Write code to build your vocabulary to map words and labels to integers. The maximum size of the vocabulary is 5,000 (not counting `<pad>` and `<unk>`).
Hint: Use `build_vocab` to build vocabularies for `TEXT` field and `LABEL` field respectively. 
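
To illustrate what a size-capped vocabulary looks like, here is a library-free sketch using `collections.Counter`. It is not how TorchText implements `build_vocab`; the function name `build_capped_vocab`, the toy corpus, and the cap of 3 are assumptions for the example.

```python
from collections import Counter

def build_capped_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens, plus the special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [
    ["i", "love", "my", "dog"],
    ["i", "love", "coffee"],
    ["my", "dog", "barked"],
]

itos, stoi = build_capped_vocab(corpus, max_size=3)
print(len(itos))  # 5 (3 words + 2 special tokens)

# words outside the vocabulary fall back to the <unk> index
print(stoi.get("coffee", stoi["<unk>"]))  # 0
```

The special tokens are added on top of the cap, which is why the total size is `max_size + 2`.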

rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.5 Write code to print the sizes of your two vocabularies (i.e., `TEXT` and `LABEL`) individually. 

rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.6 Write code to construct the Iterators to get the train, dev, and test splits. Use `BucketIterator` to initialize the Iterators for the train, dev, and test data.
* Apply the same batch size of 32 to the train, dev, and test sets.
* Sort samples by length.
* Sort samples within each batch.

Hint: Use `BucketIterator.splits(...)`
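
The reason for sorting by length is that batches of similar-length examples need less padding. The sketch below shows the effect in plain Python; the helper names and the toy examples are hypothetical, and this is only an approximation of the idea behind `BucketIterator`, not its implementation.

```python
def make_batches(examples, batch_size, sort_by_length):
    """Split examples into batches, optionally sorting them by length first."""
    ordered = sorted(examples, key=len) if sort_by_length else list(examples)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_needed(batches):
    """Count the pad tokens required to make every batch rectangular."""
    return sum(max(len(ex) for ex in batch) - len(ex)
               for batch in batches for ex in batch)

# toy tokenized examples with lengths 5, 1, 4, 2, 3, 1
examples = [["w"] * n for n in (5, 1, 4, 2, 3, 1)]

unsorted_batches = make_batches(examples, batch_size=2, sort_by_length=False)
sorted_batches = make_batches(examples, batch_size=2, sort_by_length=True)

print(padding_needed(unsorted_batches))  # 8
print(padding_needed(sorted_batches))    # 2
```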

rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.7.1 Write code to create a class called `LSTMmodel` to define a classifier of the task. In this model, you should have:

1. An embedding layer with input dimension equal to the size of your `TEXT` vocabulary, which represents each token as a 300-dimensional vector. The parameters of this embedding layer should be randomly initialized with numbers sampled from a normal distribution (mean 0 and variance 0.05).

2. Two uni-directional `LSTM` layers. Each layer has 500 hidden units.

3. Pass the last hidden state of the last layer into a `Tanh` activation function, then feed the output of `Tanh` to a linear layer whose output dimensionality is equal to the number of classes in the dataset (i.e., 2 in our case).

4. Then, a `LogSoftmax` layer on top of the output of the linear layer.

5. Return the output of `LogSoftmax` layer.

Hint: `Tanh` might not be the ideal function to use, but we want you to explore it. (Usually `ReLU` works well).
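
For intuition about the final two steps of the model, the sketch below computes a log-softmax over the linear layer's outputs and the negative log-likelihood that `nn.NLLLoss` would take from it for one example. It is a pure-Python illustration with made-up logits for a two-class problem, not the model code itself.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the correct class for one example."""
    return -log_probs[target]

logits = [2.0, 0.5]            # hypothetical outputs of the linear layer
log_probs = log_softmax(logits)

print(log_probs)
print(nll_loss(log_probs, target=0))  # small loss: class 0 has the larger score
print(nll_loss(log_probs, target=1))  # larger loss: class 1 has the smaller score
```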

rubric={accuracy:8, quality:4}

In [None]:
# your code goes here 


#### 1.7.2 Write code to instantiate the model class with the aforementioned hyper-parameters and print out the model architecture.
rubric={accuracy:3}

Hint: Your output should generally look like the following (with correct names of the model, functions, etc.):
```
modelName (
  (embedding): Embedding(xxx, xxx)
  (lstm_rnn): MODEL(xxx, xxx, num_layers=xxx)
  (activation_fn): Tanh()
  (linear_layer): Linear(in_features=xxx, out_features=xxx, bias=True)
  (softmax_layer): somethingRelatedToSoftmax()
)
```

In [None]:
# your code goes here 


**Create an optimizer for training**

In [None]:
LEARNING_RATE = 0.1
criterion = nn.NLLLoss()
# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

#### 1.8.1 How many learnable (or updatable) parameters are present in the model defined in 1.7? Compute the result by writing code.
rubric={accuracy:1}

In [None]:
# your code goes here 


#### 1.8.2 OPTIONAL QUESTION: How many megabytes of memory will this model use? Please show your work.

Hint 1: Each parameter is stored as a `torch.float32` value, i.e., a 32-bit floating point number.

Hint 2: All the parameters of this model are learnable (or updatable) parameters.
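
The arithmetic behind this kind of estimate can be sketched as follows. The parameter count used here is a made-up round number, not the answer for the model in 1.7, and the sketch assumes 1 MB = 1024 × 1024 bytes and counts only the parameters themselves (no gradients or optimizer state).

```python
BYTES_PER_FLOAT32 = 4  # 32 bits / 8 bits per byte

def model_size_megabytes(num_parameters, bytes_per_param=BYTES_PER_FLOAT32):
    """Memory taken by the parameters alone, in megabytes (1 MB = 1024**2 bytes)."""
    return num_parameters * bytes_per_param / (1024 ** 2)

hypothetical_num_parameters = 1_000_000
print(f"{model_size_megabytes(hypothetical_num_parameters):.3f} MB")  # 3.815 MB
```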

rubric={spark:2}

#### Write your answer here.


**To facilitate your work for the next questions (1.9 and 1.10), we provide two functions for training and evaluation.**

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

def train(loader):
    total_loss = 0.0
    # iterate through the data loader
    num_sample = 0
    for batch in loader:
        # load the current batch
        batch_input = batch.tweet
        batch_output = batch.label
        
        batch_input = batch_input.to(device)
        batch_output = batch_output.to(device)
        # forward propagation
        # pass the data through the model
        model_outputs = model(batch_input)
        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()
        # compute the gradients
        cur_loss.backward()
        # update the weights
        optimizer.step()

        num_sample += batch_output.shape[0]
    return total_loss/num_sample

# evaluation logic based on classification accuracy
def evaluate(loader):
    all_pred=[]
    all_label = []
    with torch.no_grad(): # deactivates the autograd engine, which reduces memory usage and speeds up computation
        for batch in loader:
             # load the current batch
            batch_input = batch.tweet
            batch_output = batch.label

            batch_input = batch_input.to(device)
            # forward propagation
            # pass the data through the model
            model_outputs = model(batch_input)
            # identify the predicted class for each example in the batch
            probabilities, predicted = torch.max(model_outputs.cpu().data, 1)
            # put all the true labels and predictions to two lists
            all_pred.extend(predicted)
            all_label.extend(batch_output)
            
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    return accuracy,f1score
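
The `average='macro'` option above computes an F1 score per class and then takes the unweighted mean. The following pure-Python sketch spells that out on a tiny made-up example; the helper name `macro_f1` is hypothetical and it is meant only to illustrate the definition, not to replace `sklearn.metrics.f1_score`.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

labels = [0, 0, 1, 1, 1]
preds = [0, 1, 1, 1, 0]
print(f"{macro_f1(labels, preds):.4f}")  # 0.5833
```

Here class 0 has F1 = 0.5 and class 1 has F1 = 2/3, so the macro average is 7/12.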

#### 1.9 Please read the [PyTorch documentation](https://pytorch.org/tutorials/beginner/saving_loading_models.html) and write code to load the trained model checkpoint from `./ckpt/model_24.pt`.

Note that this checkpoint is a dictionary with three keys: `epoch`, `model_state_dict`, and `optimizer_state_dict`.
The model parameters are the values of `model_state_dict`.



rubric={accuracy:2}

In [None]:
# your code goes here 


#### 1.10 Write code to evaluate the trained model on test set and report the test accuracy and F-score. 
The accuracy of the trained model on the test set is XX.XXX% (fill in your accuracy).

The F1-score of the trained model on the test set is XX.XXX (fill in your F1-score).

rubric={accuracy:2,quality:1}

In [None]:
# your code goes here 


# 2 Short Answer Questions

### 2.1 Non-Linearity Review
rubric={reasoning:2}

There are a number of different non-linearities that can be used in our neural networks, for instance: **Sigmoid, tanh, and ReLU**. Different variations and tweaks to these non-linearities get introduced fairly frequently, and it pays to have a sense of why you might pick one non-linearity for your network instead of another. **Explain how these three (Sigmoid, tanh, and ReLU) are different and what might be some pros and cons of using them in a neural network**. (There are some great blog/quora posts talking about this topic, but **if you are going to be summarizing from a post or other online resource please include a link to the resource**).


Hint: Your answer should be a maximum of 2-3 paragraphs. Short answers are just fine.

#### Write your answer here.

### 2.2 ReLU Variations
rubric={reasoning:1}

PyTorch supports several other non-linearities. Find two variations on ReLU that PyTorch implements (see https://pytorch.org/docs/stable/nn.html) and explain how they differ from standard ReLU.

#### Write your answer here.


### 2.3 Wait, why are we using logs?
rubric={reasoning:1}

What is the purpose of taking logs of probabilities, as in the case of NLLLoss and LogSoftmax?

Hint: A short answer is just fine.


#### Write your answer here.

### 2.4 Softmax
rubric={reasoning:2}

For a multiclass problem with a Softmax layer, why might we care about the values corresponding to more than one of the classes?

Hint: Sometimes knowing the top n most likely classes is of interest. Why?
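
As a small illustration of what "top n most likely classes" means mechanically, the sketch below turns raw scores into softmax probabilities and ranks the classes. The scores and the label names are invented for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_n(probs, labels, n):
    """Return the n (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

labels = ["negative", "neutral", "positive"]   # hypothetical class names
logits = [0.2, 1.1, 1.0]                       # hypothetical model scores

for label, prob in top_n(softmax(logits), labels, n=2):
    print(f"{label}: {prob:.3f}")
```

In this made-up case the two highest-ranked classes receive nearly equal probability, which the single arg-max prediction would hide.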

#### Write your answer here.

### 2.5 RNNs
rubric={reasoning:1}

RNNs are quite powerful for dealing with certain types of data (such as sequences), however, they have some drawbacks. What are some of these drawbacks? (List two drawbacks)

Hint: Look at RNN slides, including the last few slides.


#### Write your answer here.

### 2.6 LSTMs vs RNNs

rubric={reasoning:2}

LSTMs help alleviate one of the issues with RNNs (see 2.5). What do they alleviate? Do you see any problems that the model might still have? (Just reason generally about the model based on your understanding.)

Hint: Short answer is good.


#### Write your answer here.