## Homework 2 - Supervised Learning II - MDS Computational Linguistics

### Assignment Topics
- More linearities, non-linearities
- Embedding layer
- Neural Network for sentiment analysis
- Very-short answer questions

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)

### Submission Info.
- Due Date: 1/23/21 12:59pm Pacific Time

## Getting Started

In [1]:
# all necessary imports
import numpy as np
import torch
import torch.nn as nn

# set the seed (allows reproducibility of the results)
manual_seed = 572
torch.manual_seed(manual_seed) # allows us to reproduce results when using random generation on the cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # creates the device object, either GPU (cuda) or CPU
torch.backends.cudnn.deterministic=True
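# request deterministic algorithms (the function name depends on the installed PyTorch version)
if hasattr(torch, "use_deterministic_algorithms"):
    torch.use_deterministic_algorithms(True)
elif hasattr(torch, "set_deterministic"):
    torch.set_deterministic(True)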

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

### T2 Additional NN practice
rubric={accuracy:2, quality:1}

Some PyTorch modules expect their input in a particular dimension order. This gets tricky when modules with different conventions are combined in one model (for example, a CNN followed by an RNN). Let's practice handling this: in the model below, complete the forward pass so that the data flows correctly through CNN -> ReLU -> RNN -> Linear Layer -> Softmax.

Double-check the PyTorch documentation for the input/output dimensions of each of these modules: https://pytorch.org/docs/stable/index.html
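As a minimal sketch of the dimension conventions involved (on a separate toy tensor, not the model below): `nn.Conv1d` expects input shaped (batch, channels, length), while `nn.RNN` with its default `batch_first=False` expects (length, batch, features). The sizes and variable names here are illustrative assumptions, and `permute` is only one possible way to reorder dimensions.

```python
import torch
import torch.nn as nn

# toy input laid out as (batch, length, features); sizes are arbitrary
toy_batch = torch.rand(2, 6, 4)

# Conv1d wants (batch, channels, length), so move the feature axis to position 1
toy_conv = nn.Conv1d(in_channels=4, out_channels=3, kernel_size=3, padding=1)
conv_out = toy_conv(toy_batch.permute(0, 2, 1))

# RNN (batch_first=False) wants (length, batch, features)
toy_rnn = nn.RNN(input_size=3, hidden_size=5)
rnn_out, h_n = toy_rnn(conv_out.permute(2, 0, 1))

print(conv_out.shape)  # (batch, out_channels, length)
print(rnn_out.shape)   # (length, batch, hidden)
print(h_n.shape)       # (num_layers, batch, hidden)
```

Printing the shape after every module, as done here, is a quick way to find where a dimension mismatch occurs.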

In [None]:
#Assume (Batch,Length,Embeddings)
x = torch.rand((7,10,20))

class DimensionTestModel(nn.Module):
  
  def __init__(self, input_size, filters, hidden, output_size):
    super(DimensionTestModel, self).__init__()
    self.cnn = nn.Conv1d(input_size, filters, kernel_size=3, padding=1) # we'll look at CNNs in more detail next lab and in COLX 585
    self.activation = nn.ReLU()
    self.rnn = nn.RNN(filters, hidden)  # RNNs are covered in detail in lecture
    self.linear_layer = nn.Linear(hidden, output_size) 
    self.softmax_layer = nn.LogSoftmax(dim=1)
  
  def forward(self, x):
    # Your Code Here
    
    return 

model = DimensionTestModel(20,5,11,9)

print(model(x).shape)

### Exercise 1 Initializing Weights

Below is a linear layer with the weights printed out.


In [15]:
layer_1 = torch.nn.Linear(5, 4)
print("layer_1 weight:\n", layer_1.weight.data)

layer_1 weight:
 tensor([[ 0.2852,  0.0739,  0.3815, -0.2404,  0.4404],
        [ 0.0549, -0.4405,  0.2186,  0.2818, -0.3965],
        [-0.2733,  0.3812,  0.1627, -0.0800,  0.1522],
        [-0.3782, -0.4295,  0.1811,  0.1469, -0.2809]])



### 1.1 Initialized distribution
rubric={reasoning:1}

From what distribution (and over what range) are the default values in ``layer_1.weight.data`` sampled? 
(Hint: You can look at the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.Linear) and/or [source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear) of ``nn.Linear``. The symbols $\mathcal{U}$ and $\mathcal{N}$ denote the uniform and normal (Gaussian) distributions, respectively.)
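Alongside reading the documentation, the initial values can also be inspected empirically. The sketch below prints summary statistics for a larger, hypothetical layer (`big_layer`, whose sizes are an arbitrary assumption) so that the numbers are stable enough to compare against what the documentation says.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical layer, chosen large so that the statistics are stable
big_layer = nn.Linear(1000, 200)
w = big_layer.weight.data

# summary statistics of the freshly initialized weights
stats = {
    "min": w.min().item(),
    "max": w.max().item(),
    "mean": w.mean().item(),
    "std": w.std().item(),
}
print(stats)
```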



**Your answer here**

### 1.2 Reproducibility
rubric={reasoning:1}

What happens when you add ``torch.manual_seed(manual_seed)`` before the layer is created in the above code? Run it a few times with and without the seed.
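To see what seeding does in isolation, here is a small sketch that uses ``torch.rand`` rather than a layer. The variable names are illustrative; the seed value matches the one set at the top of the notebook.

```python
import torch

torch.manual_seed(572)
first = torch.rand(3)

torch.manual_seed(572)   # reseed with the same value
second = torch.rand(3)

third = torch.rand(3)    # drawn without reseeding

print(torch.equal(first, second))
print(torch.equal(second, third))
```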

**your answer goes here:**


### 1.3 Parameter Initialization
rubric={accuracy:1}

Write code to initialize the parameters (weights and biases) in ``layer_1`` (from the previous question, 1.1) with numbers sampled randomly from the standard normal distribution (mean 0 and variance 1).


Hint: Look at the ``torch.randn`` function
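As a rough illustration of the hint, the sketch below draws samples with ``torch.randn`` and copies fresh values into a stand-alone, hypothetical ``nn.Parameter`` (not ``layer_1``). Copying inside ``torch.no_grad()`` is one common way to overwrite parameter values in place; other approaches exist.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# torch.randn samples from the standard normal distribution
sample = torch.randn(10000)
print(sample.mean().item(), sample.std().item())

# hypothetical stand-alone parameter, overwritten in place
param = nn.Parameter(torch.zeros(2, 3))
with torch.no_grad():
    param.copy_(torch.randn(2, 3))
print(param.shape, param.requires_grad)
```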



In [2]:

#your code

## Exercise 2: Embedding layer


To get some intuition for how embedding layers work, let's just run through the expected results from passing some input into an embedding model.

Below is some code to do this (so the output should be your expected answer). For this problem, show how to get the expected output by writing out the math that the model performs.
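For intuition, an embedding layer can be viewed as a lookup table: each input index selects one row of the weight matrix. The sketch below checks this on a separate toy table (the weights and indices are made up for illustration and differ from the exercise).

```python
import torch
import torch.nn as nn

# toy table: 3 embeddings with 2 dimensions each (made-up values)
toy_embedding = nn.Embedding(3, 2)
toy_embedding.weight.data = torch.tensor([[1., 0.], [0., 1.], [2., 2.]])

# one example with two inputs
toy_inputs = torch.tensor([[2, 0]])

looked_up = toy_embedding(toy_inputs)             # shape (1, 2, 2)
indexed = toy_embedding.weight.data[toy_inputs]   # plain row indexing

print(looked_up.shape)
print(torch.equal(looked_up.data, indexed))
```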

In [11]:
# the model
embedding_model = nn.Embedding(4, 3) # 4 embeddings, each with 3 dimensions

# set the weights for each of the four embeddings
embedding_model.weight.data = torch.tensor([[1., 2., 3.], [1., 1., 1.], [3., 0., 0.], [10., 20., 30.]])

# data (2 examples each with two inputs)
inputs = torch.tensor([[0, 2], [1, 3]])

# forward propagation for computing average of input embeddings
embeddings_out = embedding_model(inputs)
embeddings_avg = embeddings_out.mean(1)
print("embeddings_avg:\n", embeddings_avg.data)

embeddings_avg:
 tensor([[  2.0000,   1.0000,   1.5000],
        [  5.5000,  10.5000,  15.5000]])


### 2.1 Computing Embeddings
rubric={accuracy:1}

Compute the values in **embeddings_avg** by hand. Show your work.

**your answer goes here:**  You can write \\$ **Math Equations Here** \\$ to use math formatting




### 2.2 EmbeddingBag

Reimplement the embedding model above using [``nn.EmbeddingBag``](https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag) instead of ``nn.Embedding``, setting the appropriate mode.  

Hint: You can look at the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag) and/or [source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/sparse.html#EmbeddingBag) of ``nn.EmbeddingBag``
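As a sketch of how ``nn.EmbeddingBag`` pools the embeddings in each bag, the example below uses made-up weights and the ``"sum"`` mode purely for illustration; which mode reproduces the model above is left to the exercise. With a 2D input, each row is treated as one bag.

```python
import torch
import torch.nn as nn

# made-up table: 3 embeddings with 2 dimensions each
toy_weights = torch.tensor([[1., 0.], [0., 1.], [2., 2.]])

# build the bag from existing weights; "sum" adds up the embeddings in each bag
bag = nn.EmbeddingBag.from_pretrained(toy_weights, mode="sum")

# two bags with two indices each
bags = torch.tensor([[0, 1], [2, 2]])
pooled = bag(bags)
print(pooled)
```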

rubric={accuracy:2}

In [12]:
# your code goes here

### 2.3 Word to Index Map
rubric={accuracy:2}

In the tutorial, we constructed a word-to-index mapping for a one-sentence corpus. Write code to build a word-to-index mapping for the toy corpus below, which contains three sentences.
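As a reminder of the general idea, here is a minimal sketch on a separate, made-up two-sentence corpus. Splitting on whitespace and assigning indices in order of first appearance are assumptions of this sketch, not requirements taken from the tutorial.

```python
# made-up corpus, unrelated to the toy corpus of this exercise
example_corpus = ["the cat sat", "the dog sat down"]

word_to_idx = {}
for sentence in example_corpus:
    for word in sentence.split():
        # assign the next free index the first time a word is seen
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)

print(word_to_idx)
```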



Toy corpus of 3 sentences (each row, excluding the header, corresponds to a sentence):

|  sentence no. | sentence text |
| --------------- | ------------------------------ |
| 1  | UBC’s Master of Data Science in Computational Linguistics is the credential to set you apart. |
| 2  | Offered at the Vancouver campus, this unique degree is tailored to those with a passion for language and data.|
| 3  | Over 10 months, the program combines foundational data science courses with advanced computational linguistics courses—equipping graduates with the skills to turn language-related data into knowledge and to build AI that can interpret human language. |

In [5]:
sentences_list = ["UBC’s Master of Data Science in Computational Linguistics is the credential to set you apart.",
            "Offered at the Vancouver campus, this unique degree is tailored to those with a passion for language and data.",
            "Over 10 months, the program combines foundational data science courses with advanced computational linguistics courses—equipping graduates with the skills to turn language-related data into knowledge and to build AI that can interpret human language."]


In [None]:
#your code here

### 2.4 Building Training Data for CBOW
rubric={accuracy:2}  
In the tutorial, we constructed the training data (input and output for each training example) for a one-sentence corpus. Write code that builds the training data for a CBOW model from this toy corpus and prints the number of training examples. 

Note:
- Assume the **window size to be 3**. 
- Keep the **original casing (truecase)** of the words. 
- Use a [whitespace tokenizer](https://kite.com/python/docs/nltk.WhitespaceTokenizer) to get the words from each sentence, with no further preprocessing. 
- A training example is generated from a single sentence and does not span multiple sentences. 
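To make the notion of a context window concrete, here is a small sketch on a made-up sentence with a window size of 2. The helper `context_target_pairs` is hypothetical, and skipping target positions that lack a full window on both sides is an assumption of this sketch; the tutorial may handle sentence edges differently.

```python
def context_target_pairs(tokens, window_size):
    """Return (context, target) pairs for targets that have a full window on both sides."""
    pairs = []
    for i in range(window_size, len(tokens) - window_size):
        # words before the target followed by words after the target
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        pairs.append((context, tokens[i]))
    return pairs

# made-up sentence, tokenized on whitespace
toy_tokens = "We like small toy examples here".split()
toy_pairs = context_target_pairs(toy_tokens, window_size=2)

for context, target in toy_pairs:
    print(context, "->", target)
```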



In [8]:
from torch.utils.data import Dataset, DataLoader

## Hint: create a custom Dataset class to process the data and then use DataLoader.
class CBOWDataset(Dataset):
    def __init__(self, window_size=3):
        #to complete
            
    def __getitem__(self, index):
        #to complete
        
    def __len__(self):
        #to complete

In [None]:
dataset = CBOWDataset(window_size = 3)
print("number of samples in the dataset:", dataset.n)
print("feature matrix:", dataset.features)
print("label matrix:", dataset.labels)


#to complete
#create the batch to load the dataset into a dataloader

### 2.5 Build CBOW
rubric={accuracy:3, quality:2}

Write code that
- defines the CBOW model
- trains the CBOW model (updates its parameters) on all the training examples (use SGD)
- prints the word embedding of one of the words involved in training, before and after training
- uses the hyperparameters EMBEDDING_SIZE = 3, LEARNING_RATE = 0.5, WINDOW_SIZE = 3, and MAX_EPOCHS = 1.
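One way to compare an embedding before and after an update is to clone the relevant row of the weight matrix at both points. The sketch below does this for a single SGD step on a stand-alone, hypothetical embedding layer with a placeholder loss (a plain sum, not the CBOW loss); all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical embedding table: 5 words, 3 dimensions
toy_embedding = nn.Embedding(5, 3)
toy_optimizer = torch.optim.SGD(toy_embedding.parameters(), lr=0.5)

word_index = torch.tensor([2])

# clone so that the saved copy is not affected by the in-place update
before = toy_embedding.weight[2].detach().clone()

loss = toy_embedding(word_index).sum()   # placeholder loss, NOT the CBOW loss
toy_optimizer.zero_grad()
loss.backward()
toy_optimizer.step()

after = toy_embedding.weight[2].detach().clone()

print("before:", before)
print("after: ", after)
```

Without `clone()`, `before` would share memory with the weight matrix and would show the updated values as well.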



In [None]:

class CBOWmodel(nn.Module):
  
  def __init__(self, embedding_size, vocab_size, output_size):
    #to complete
  
  def forward(self, x):
    #to complete
    
    return out

In [15]:
# hyperparameters of the CBOW model
EMBEDDING_SIZE = 3 # size of the word embedding
LEARNING_RATE = 0.5 # learning rate of gradient descent
WINDOW_SIZE = 3  # number of words to be considered before (or after) the target word when building the context
MAX_EPOCHS = 1 # number of passes over the training data

Complete the code to instantiate your model, loss function, and optimizer.

In [7]:


# instantiate model  (often nice to also print out the model here to see that all the layers are correct)  

# define the loss function

# create an instance of SGD with required hyperparameters


In [8]:
# train logic (similar to that of linear regression model)
def train(loader):
    total_loss = 0.0
    # iterate through the data loader
    num_batches = 0
    for batch in loader:
    #TO COMPLETE: (Use the comments to guide your code)
        
        # load the current batch

        
        # forward propagation:
        # pass the data through the model
        
        # compute the loss


        # backward propagation (compute the gradients and update the model):
        # clear the buffer
        
        # compute the gradients
        
        # update the weights

        num_batches += 1
    return total_loss/num_batches

# evaluation logic based on classification accuracy
def evaluate(loader):
    accuracy, num_examples = 0.0, 0
    with torch.no_grad(): # disables gradient tracking in autograd, which reduces memory usage and speeds up computation
        for batch in loader:
        #TO COMPLETE (USE THE COMMENTS TO GUIDE YOUR CODE):
            # load the current batch

            # forward propagation:
            # pass the data through the model

            # identify the predicted class for each example in the batch (row-wise)

            # compare with batch_output (gold labels) to compute accuracy

            num_examples += batch_output.size(0)
    return accuracy/num_examples

In [None]:
# start the training
for epoch in range(MAX_EPOCHS):
  # train the model for one pass over the data
  train_loss = train(train_loader)   
  # print the loss for every epoch
  print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss))

### 2.6 How many learnable (or updatable) parameters are present in the model defined in 2.5? Compute the result by writing code or by hand.
rubric={accuracy:2}
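For reference, a common idiom for counting trainable parameters in any ``nn.Module`` is sketched below on a small, hypothetical model (not the CBOW model): each ``nn.Linear(in, out)`` contributes ``in * out`` weights plus ``out`` biases.

```python
import torch.nn as nn

# hypothetical model used only to illustrate the counting idiom
toy_model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# sum the number of elements over all parameters that receive gradients
num_params = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
print(num_params)  # (4*3 + 3) + (3*2 + 2) = 23
```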

In [16]:
# your code goes here (if you are writing code, use this block. Otherwise, change this block to a Markdown block.)


## Exercise 3: Neural Network for Sentiment Analysis


The multilayer neural network code used in our tutorial is as follows:

In [22]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim

# set the seed
torch.manual_seed(manual_seed)


# hyperparameters
BATCH_SIZE = 5
MAX_EPOCHS = 15
LEARNING_RATE = 0.5
MAX_FEATURES = 5000 # size of the feature vector x (the number of features x_j per tweet)
NUM_CLASSES = 3

# dataset
DATA_FOLDER = "data/sentiment-twitter-2016-task4"
TRAIN_FILE = DATA_FOLDER + "/train.tsv"
VALID_FILE = DATA_FOLDER + "/dev.tsv"
TEST_FILE = DATA_FOLDER + "/test.tsv"

from sklearn.feature_extraction.text import TfidfVectorizer

# function for reading tsv file
def read_corpus(file):
    corpus = [] 
    # open the file in a context manager so that it is closed after reading
    with open(file) as f:
        for line in f:
            content, label = line.strip().split("\t") # first column is the tweet, second column is the gold label
            corpus.append(content)
    return corpus

# reads the train corpus
train_corpus = read_corpus(TRAIN_FILE)  # get a list of tweets

# define the vectorizer
# builds a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus
vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)

# fit the vectorizer on train set
vectorizer.fit(train_corpus)

# create a new class inheriting torch.utils.data.Dataset
class TweetSentimentDataset(Dataset):
  """ sentiment-twitter-2016-task4 dataset."""
  def __init__(self, file, vectorizer):
    # read the corpus
    corpus, labels = [], []
    for line in open(file):
        content, label = line.strip().split("\t")
        corpus.append(content)
        labels.append(int(label))
    
    # set the size of the corpus
    self.n = len(corpus)
    
    # vectorize all the tweets
    features = vectorizer.transform(corpus)
    
    # convert features and labels to torch.tensor
    self.features = torch.from_numpy(features.toarray()).float()
    self.features = self.features.to(device)  # .to() returns a new tensor, so the result must be assigned
    self.labels = torch.tensor(labels, device=device, requires_grad=False)
    
  # return input and output of a single example
  # Input: Feature vectors, where each vector corresponds to a tweet. 
  # Output: Labels, where each label is one index for each of our tags in the set {positive, negative, neutral}
  def __getitem__(self, index):
    return self.features[index], self.labels[index]
  
  # return the total number of examples
  def __len__(self):
    return self.n

# create the dataloader object
train_loader = DataLoader(dataset=TweetSentimentDataset(TRAIN_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=True, num_workers=2) 
valid_loader = DataLoader(dataset=TweetSentimentDataset(VALID_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=False, num_workers=1)
test_loader = DataLoader(dataset=TweetSentimentDataset(TEST_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=False, num_workers=1) 

# train logic (similar to that of linear regression model)
def train(loader):
    total_loss = 0.0
    num_batches = 0
    # iterate through the data loader
    for batch in loader:
        # load the current batch
        batch_input, batch_output = batch

        # forward propagation
        # pass the data through the model
        model_outputs = model(batch_input)
        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()
        # compute the gradients
        cur_loss.backward()
        # update the weights
        optimizer.step()

        num_batches += 1
    return total_loss/num_batches

# evaluation logic based on classification accuracy
def evaluate(loader):
    accuracy, num_examples = 0.0, 0
    with torch.no_grad(): # disables gradient tracking in autograd, which reduces memory usage and speeds up computation
        for batch in loader:
            # load the current batch
            batch_input, batch_output = batch
            # forward propagation
            # pass the data through the model
            model_outputs = model(batch_input)
            # identify the predicted class for each example in the batch
            _, predicted = torch.max(model_outputs.data, 1)
            # compare with batch_output (gold labels) to compute accuracy
            accuracy += (predicted == batch_output).sum().item()
            num_examples += batch_output.size(0)
    return accuracy/num_examples

"""
create a custom model class inheriting torch.nn.Module
"""
class MultiLayerNeuralNetworkModel(nn.Module):
  
  def __init__(self, num_inputs, hidden_layers, num_outputs):
    # In the constructor we define the layers for our model
    super(MultiLayerNeuralNetworkModel, self).__init__()
    
    modules = [] # stores all the layers for the neural network
    input_dim = num_inputs
    # add input layer followed by hidden layers (excluding the classification module)
    for hidden_layer in hidden_layers:
      # add one layer followed by non-linearity (nn.Sigmoid)
        modules.append(nn.Linear(input_dim, hidden_layer))
        modules.append(nn.Sigmoid())
        input_dim = hidden_layer
    # add the classification module
    modules.append(nn.Linear(input_dim, num_outputs))
    modules.append(nn.LogSoftmax(dim=1))
    
    # create the model from all the modules
    self.model = nn.Sequential(*modules) # container of layers, for more details: https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential
  
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    out = self.model(x)
    return out

# hyperparameters of the neural network
hidden_layers = [50, 50]  # [num. of hidden units in first layer, num. of hidden units in second layer]

# instantiate the model and move it to the device
model = MultiLayerNeuralNetworkModel(MAX_FEATURES, hidden_layers, NUM_CLASSES)
model.to(device)
print(model)
# define the loss function (last node of the graph)
criterion = nn.NLLLoss()

# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

MultiLayerNeuralNetworkModel(
  (model): Sequential(
    (0): Linear(in_features=5000, out_features=50, bias=True)
    (1): ReLU()
    (2): Linear(in_features=50, out_features=50, bias=True)
    (3): ReLU()
    (4): Linear(in_features=50, out_features=3, bias=True)
    (5): LogSoftmax()
  )
)


In [23]:
# start the training
for epoch in range(MAX_EPOCHS):
  # train the model for one pass over the data
  train_loss = train(train_loader)
  # compute the training accuracy
  train_acc = evaluate(train_loader)
  # compute the validation accuracy 
  val_acc = evaluate(valid_loader)
  # print the loss for every epoch
  print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f}, Validation Accuracy: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss, train_acc, val_acc))

Epoch [1/15], Loss: 0.9924, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [2/15], Loss: 0.9652, Training Accuracy: 0.5517, Validation Accuracy: 0.4487
Epoch [3/15], Loss: 0.8733, Training Accuracy: 0.6648, Validation Accuracy: 0.5143
Epoch [4/15], Loss: 0.8072, Training Accuracy: 0.5840, Validation Accuracy: 0.4507
Epoch [5/15], Loss: 0.7601, Training Accuracy: 0.7340, Validation Accuracy: 0.5033
Epoch [6/15], Loss: 0.7045, Training Accuracy: 0.7255, Validation Accuracy: 0.4907
Epoch [7/15], Loss: 0.6577, Training Accuracy: 0.8053, Validation Accuracy: 0.5048
Epoch [8/15], Loss: 0.6108, Training Accuracy: 0.8378, Validation Accuracy: 0.4887
Epoch [9/15], Loss: 0.5653, Training Accuracy: 0.7482, Validation Accuracy: 0.4462
Epoch [10/15], Loss: 0.5189, Training Accuracy: 0.8258, Validation Accuracy: 0.4867
Epoch [11/15], Loss: 0.4812, Training Accuracy: 0.7813, Validation Accuracy: 0.4462
Epoch [12/15], Loss: 0.4474, Training Accuracy: 0.8922, Validation Accuracy: 0.4977
E

### 3.1 Bigrams and Trigrams
rubric={accuracy:1}  
In the original tutorial, we considered only *unigrams* as features to represent a tweet. Change the original tutorial code to consider *bigrams and trigrams* (along with unigrams, which are considered by default) as features to represent a tweet.
Hints: 
- Look at the documentation of [``TfidfVectorizer``](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to see how to incorporate bigrams, trigrams and so on.
- Modify the line ``vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)`` and keep the rest of the code intact.
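To see how n-gram features show up in a vectorizer's vocabulary, here is a sketch on a made-up two-document corpus. It uses ``ngram_range=(1, 2)`` (unigrams and bigrams only) as an illustration; the range required by this exercise is for you to work out from the documentation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up corpus, unrelated to the tweet data
toy_corpus = ["good movie", "bad movie"]

# (1, 2) means: extract n-grams of length 1 up to length 2
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_vectorizer.fit(toy_corpus)

# vocabulary_ maps each extracted n-gram to its column index
print(sorted(toy_vectorizer.vocabulary_))
```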

### **Hand in the**
- Accuracy on the validation set after training
- Python code (ONLY the changed lines)



**Report performance of your model here:** (double-click to edit)

**3.1.1 Performance of my neural network model on the validation set after training is** (fill in your accuracy).

**3.1.2 Your code:**

In [19]:
# your code goes here (put only the changed lines)


### 3.2  In the original tutorial, we considered only two hidden layers with 50 units each. Change the original tutorial code to consider *five hidden layers* with *100 dimensions each*.
rubric={accuracy:1}  
Hints: 
- Modify the line ``hidden_layers = [50, 50]`` and keep the rest of the code intact.

### **Hand in the**
- Accuracy on the validation set after training
- Python code (ONLY the changed lines)



**Report performance of your model here:** (double-click to edit)

**3.2.1  Performance of my neural network model on the validation set after training is**  (fill in your accuracy).

**3.2.2 Your code:**

In [4]:
# your code goes here (put only the changed lines)


### 3.3 In the original tutorial, we used [*Sigmoid*](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.sigmoid) as the activation function. Change the original tutorial code to consider other nonlinearities such as [*ReLU*](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.relu) and [*Tanh*](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.tanh) and report the nonlinearity that gives the best validation performance.
rubric={accuracy:1}  
Hints: 
- Modify the line ``modules.append(nn.Sigmoid())`` and keep the rest of the code intact.
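As a quick way to get a feel for how these nonlinearities differ, the sketch below applies the corresponding ``nn`` modules to a few made-up values. It says nothing about which one performs best on the sentiment task.

```python
import torch
import torch.nn as nn

# made-up input values covering negative, zero, and positive cases
values = torch.tensor([-2.0, 0.0, 2.0])

activations = {"sigmoid": nn.Sigmoid(), "relu": nn.ReLU(), "tanh": nn.Tanh()}
outputs = {name: fn(values) for name, fn in activations.items()}

for name, out in outputs.items():
    print(name, out)
```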

### **Hand in the**
- Nonlinearity function that gives the best validation performance
- Accuracy on the validation set after training with the best nonlinearity
- Python code (ONLY the changed lines)



**Report your results here:** (double-click to edit)

**3.3.1. Report the nonlinearity function that gives the best validation performance:**  

**3.3.2. Performance of my neural network model on the validation set after training with the best nonlinearity is:** (fill in your accuracy).

**3.3.3 Your code:**

In [3]:
# your code goes here (put only the changed lines)

### 3.4 In the original tutorial, we used a *learning rate of 0.5*. Change the original tutorial code by trying out different learning rates, preferably *between 0.0001 and 1*. Report the learning rate that gives the best validation performance.
rubric={accuracy:1}  
Hints: 
- Modify the line ``LEARNING_RATE = 0.5`` and keep the rest of the code intact.
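To build intuition for why the learning rate matters, here is a toy sketch that is unrelated to the sentiment model: plain gradient descent on $f(w) = w^2$ with three made-up learning rates. The function `gradient_descent` and the chosen rates are illustrative assumptions.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; the minimum is at w = 0."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad
    return w

# final value of w after 20 steps for each learning rate
final_w = {lr: gradient_descent(lr) for lr in [0.001, 0.1, 1.0]}
for lr, w in final_w.items():
    print(lr, w)
```

In this toy setting, a very small rate barely moves, a moderate rate gets close to the minimum, and a rate of 1.0 makes `w` flip sign on every step without ever shrinking.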

### **Hand in the**
- Learning rate that gives the best validation performance
- Accuracy on the validation set after training with the best learning rate
- Python code (ONLY the changed lines)



**Report your results here:** (double-click to edit)

**3.4.1. Report the learning rate that gives the best validation performance:**  

**3.4.2. Performance of my neural network model on the validation set after training with the best learning rate is:** (fill in your accuracy).

**3.4.3 Your code:**

This answer depends on which learning rates you tried. You should get a validation accuracy of roughly 52%.

## Exercise 4: Very-Short answer questions

(Double-click each question block and place your answer at the end of the question) 

### 4.1 Can we build a deep neural network without any nonlinearity layers? What is the nature of such a network?
rubric={reasoning:1}




### 4.2 What is the distinction between a single-layer neural network and a multilayer neural network?
rubric={reasoning:1}



### 4.3 What is the difference between a hidden layer and a hidden unit (or neuron)?
rubric={reasoning:1}



### 4.4 What is the gradient of the sigmoid function with respect to a scalar input ($\nabla \sigma(a)$)? What is the interesting property of this gradient?
rubric={reasoning:1}

