## Homework 2 - Supervised Learning II - MDS Computational Linguistics

### Assignment Topics
- More linearities, non-linearities
- Embedding layer
- Neural Network for sentiment analysis
- Very-short answer questions

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)

### Submission Info.
- Due Date: 1/23/21 11:59pm Pacific Time

## Getting Started

In [1]:
# all necessary imports
import numpy as np
import torch
import torch.nn as nn

# set the seed (allows reproducibility of the results)
manual_seed = 572
torch.manual_seed(manual_seed) # allows us to reproduce results when using random generation on the cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # creates the device object, either GPU (cuda) or CPU

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

### T2 Additional NN practice

Some PyTorch modules expect their input dimensions in a particular order. This can be tricky when they are combined with other modules that expect a different order (e.g. CNNs with RNNs). Let's look at dealing with this: in the model below, complete the forward pass so that the data passes correctly through CNN -> ReLU -> RNN -> Linear Layer -> Softmax.

Double check the PyTorch documentation to find the input/output dimensions for each of these modules! https://pytorch.org/docs/stable/index.html

In [22]:
#Assume (Batch,Length,Embeddings)
x = torch.rand((7,10,20))

class DimensionTestModel(nn.Module):
  
  def __init__(self, input_size, filters, hidden, output_size):
    super(DimensionTestModel, self).__init__()
    self.cnn = nn.Conv1d(input_size, filters, kernel_size=3, padding =1) # we'll look more at CNNs a little next lab and COLX 585
    self.activation = nn.ReLU()
    self.rnn = nn.RNN(filters, hidden)  # RNNs in detail next week
    self.linear_layer = nn.Linear(hidden, output_size) 
    self.softmax_layer = nn.LogSoftmax(dim=1)
  
  def forward(self, x):   #PRINT STATEMENTS ADDED TO ILLUSTRATE HOW THE DIMENSIONS CHANGE AS IT GOES THROUGH
    # Your Code Here
    print("Input Dim: " + str(x.shape))
    x = self.cnn(x.permute(0,2,1))
    print("CNN Layer: " + str(x.shape))

    x = self.activation(x)
    x, _ = self.rnn(x.permute(2,0,1)) 
    print("RNN Layer: " + str(x.shape))

    x = self.linear_layer(x.permute(1,0,2))
    print("FF Layer: " + str(x.shape))

    x = self.softmax_layer(x)
    # Your Code Here
    return x

model = DimensionTestModel(20,5,11,9)

print(model(x).shape)



Input Dim: torch.Size([7, 10, 20])
CNN Layer: torch.Size([7, 5, 10])
RNN Layer: torch.Size([10, 7, 11])
FF Layer: torch.Size([7, 10, 9])
torch.Size([7, 10, 9])


Permutations $[N,L,H] \dashrightarrow [N,H,L] \dashrightarrow [L,N,H] \dashrightarrow [N,L,H]$  ending with the same dimensions as the start, but compressing the Hidden/Embedding size from 20 to 9.  

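A tricky point in this problem: you could have fed a tensor of size $[L,N,H]$ into the linear layer without any error, since it only checks that the last ($H$) dimension matches. The output would then be $[L,N,9]$, however, so the following softmax (dim=1) would be applied across the batch instead of across the sequence, which is incorrect.  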
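The pitfall can be checked directly. The following is a minimal sketch that reuses the sizes from the cell above; the tensor `rnn_out` is a random stand-in for the RNN output, and the variable names are illustrative rather than part of the assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
L, N, H = 10, 7, 11                      # sequence length, batch size, RNN hidden size
rnn_out = torch.rand(L, N, H)            # stand-in for the RNN output, shape [L, N, H]
linear = nn.Linear(H, 9)
log_softmax = nn.LogSoftmax(dim=1)

# nn.Linear only transforms the last dimension, so both layouts are accepted
wrong = log_softmax(linear(rnn_out))                    # [L, N, 9]: dim=1 is the batch
right = log_softmax(linear(rnn_out.permute(1, 0, 2)))   # [N, L, 9]: dim=1 is the sequence

print(wrong.shape, right.shape)
# both sum to 1 along dim=1, but dim=1 means something different in each case
print(wrong.exp().sum(dim=1).shape)   # [L, 9]: normalized over the batch
print(right.exp().sum(dim=1).shape)   # [N, 9]: normalized over the sequence
```

No error is raised in either case, which is why this kind of bug is easy to miss: only the shapes reveal which dimension was normalized.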

What does this network do?? Well it's just a made up network to test your ability to get dimensions right. But if we wanted to think about it:    

The CNN layer learns a representation with localized features (with kernel size 3, it can 'pay attention' to the neighbor on either side of each position in the sequence). The RNN layer can then learn a bit more across the entire sequence, and the linear layer changes the hidden dimension size to match our output classes. Finally, the softmax (dim=1) normalizes across the entire sequence, which would mean finding the part of the sequence most likely for each of the 9 output classes.  

Probably not useful as it is, but if you squint and replace the linear and softmax layer with something called connectionist temporal classification, you would have the basic skeleton of an early 2010s automatic speech recognition model.  

### Exercise 1 Initializing Weights
#### 1.1 Default Initialization


In [2]:
layer_1 = torch.nn.Linear(5, 4)
print("layer_1 weight:\n", layer_1.weight.data)

layer_1 weight:
 tensor([[-0.1824,  0.0148, -0.2221,  0.1687, -0.3811],
        [ 0.3278, -0.3251, -0.3556, -0.2826,  0.2025],
        [-0.1652,  0.1674, -0.3796, -0.2713, -0.1642],
        [-0.0879, -0.3412,  0.2928, -0.1055,  0.1436]])


From what distribution (and range) are the default values in ``layer_1.weight.data`` sampled? 
(Hint: You can look at the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.Linear) and/or [source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear) of ``nn.Linear``. And the symbols  $\textit{U}$ and $\mathcal{N}$ correspond to uniform and normal (Gaussian) distribution.)



**your answer goes here:**
The weights are initialized from $\textit{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{1}{\text{in\_features}}$. For ``layer_1``, $\text{in\_features} = 5$, so the range is roughly $(-0.447, 0.447)$.
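As a quick sanity check of this range, the sketch below creates a fresh layer and tests that every default weight and bias falls inside $\pm\sqrt{k}$. It assumes the default initialization of recent PyTorch releases; the variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

in_features, out_features = 5, 4
layer = nn.Linear(in_features, out_features)

bound = math.sqrt(1.0 / in_features)   # sqrt(k) with k = 1 / in_features
print("bound:", round(bound, 4))
print("weights inside bound:", bool((layer.weight.abs() <= bound).all()))
print("biases inside bound:", bool((layer.bias.abs() <= bound).all()))
```

The values printed in 1.1 are consistent with this: the largest magnitude there is 0.3811, below the bound of about 0.447.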

### 1.2 Reproducibility

What happens when you add torch.manual_seed(manual_seed) before the layer is created in the above code? Run it a few times with and without.

**your answer goes here:**

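With the seed set right before the layer is created, the layer is initialized with the same weights on every run. Without it, the weights differ from run to run.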
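A minimal sketch of this behaviour, assuming a hypothetical helper `make_layer` that optionally re-seeds the generator before building the layer:

```python
import torch
import torch.nn as nn

def make_layer(seed=None):
    """Create a fresh nn.Linear(5, 4), optionally re-seeding the RNG first."""
    if seed is not None:
        torch.manual_seed(seed)
    return nn.Linear(5, 4)

seeded_a = make_layer(seed=572)
seeded_b = make_layer(seed=572)
unseeded = make_layer()          # continues from the current RNG state

print(torch.equal(seeded_a.weight, seeded_b.weight))  # same seed, same weights
print(torch.equal(seeded_b.weight, unseeded.weight))  # no re-seed, a different draw
```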

### 1.3 Write code to initialize parameters (weights and biases) in ``layer_1`` (previous question, 1.1) with numbers sampled randomly from the standard normal distribution (mean 0 and variance 1).

Hint: Look at ``torch.randn`` function



In [7]:
layer_1.weight.data.normal_(0,1) # sets the values of weight matrix for 1st layer (W_1)
layer_1.bias.data.normal_(0,1)

print(layer_1.weight.data)
print(layer_1.bias.data)
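# you could also solve this using torch.randn:


layer_1.weight.data = torch.randn(4, 5)
print(layer_1.weight.data)
# note: nn.Linear normally stores its bias with shape (4,); the (1, 4) tensor below relies on broadcasting
layer_1.bias.data = torch.randn(1,4)
print(layer_1.bias.data)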

tensor([[ 1.8576,  2.1321, -0.5056, -0.7988, -0.9592],
        [-1.2213, -0.9590,  1.0224, -1.1364, -0.3501],
        [ 0.1233,  1.6076, -1.4483,  1.2933,  1.1161],
        [ 0.3888,  1.3232,  1.1393,  0.8656,  0.5044]])
tensor([[ 1.7822,  1.9736, -0.3101, -0.8211]])
tensor([[-0.4055, -0.8368,  1.2277, -0.4297, -0.0306],
        [-0.0894, -0.1965, -0.9713,  0.2790, -0.7587],
        [ 0.5473,  0.4301,  0.8558,  1.6098, -1.1893],
        [ 1.1677,  0.6220,  2.5737, -0.6239, -1.2965]])
tensor([[-1.9029,  0.8260, -0.6644,  1.6663]])


## Exercise 2: Embedding layer


### 2.1

In [11]:
# the model
embedding_model = nn.Embedding(4, 3) # 4 embeddings, each with three dimensions

# set the weights for each of the four embeddings
embedding_model.weight.data = torch.tensor([[1., 2., 3.], [1., 1., 1.], [3., 0., 0.], [10., 20., 30.]])

# data (2 examples each with two inputs)
inputs = torch.tensor([[0, 2], [1, 3]])

# forward propagation for computing average of input embeddings
embeddings_out = embedding_model(inputs)
embeddings_avg = embeddings_out.mean(1)
print("embeddings_avg:\n", embeddings_avg.data)

embeddings_avg:
 tensor([[  2.0000,   1.0000,   1.5000],
        [  5.5000,  10.5000,  15.5000]])


### Compute the values in **embeddings_avg** by hand. Show your work.
rubric={accuracy:2}

**your answer goes here:**

The inputs select the appropriate weight rows in the embedding model so for input $[0,2]$ we get:  

$([1,2,3]+[3,0,0])/2 = [2,1,1.5]$

and for input $[1,3]$ we get:  
$([1,1,1]+[10,20,30])/2 = [5.5,10.5,15.5]$
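The hand computation can be mirrored with plain tensor indexing, without any embedding module. This is only a sketch to check the arithmetic; `weights` and `by_hand` are illustrative names.

```python
import torch

weights = torch.tensor([[1., 2., 3.], [1., 1., 1.], [3., 0., 0.], [10., 20., 30.]])
inputs = torch.tensor([[0, 2], [1, 3]])

# indexing with a [2, 2] index tensor yields shape [2, 2, 3]: (example, position, dimension)
looked_up = weights[inputs]
by_hand = looked_up.mean(dim=1)   # average over the two positions of each example
print(by_hand)
```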


### 2.2 Can you reimplement 2.1 by using [``nn.EmbeddingBag``](https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag) instead of ``nn.Embedding`` by setting the appropriate mode? 

Hint: You can look at the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag) and/or [source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/sparse.html#EmbeddingBag) of ``nn.EmbeddingBag``



In [12]:
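# your code goes here

# same as 2.1, but replace nn.Embedding with nn.EmbeddingBag in 'mean' mode
embedding_model = nn.EmbeddingBag(4, 3, mode='mean')
embedding_model.weight.data = torch.tensor([[1., 2., 3.], [1., 1., 1.], [3., 0., 0.], [10., 20., 30.]])
inputs = torch.tensor([[0, 2], [1, 3]])

# EmbeddingBag already averages each row (bag) of inputs, so the explicit .mean(1) is no longer needed
embeddings_avg = embedding_model(inputs)
print("embeddings_avg:\n", embeddings_avg.data)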



A toy corpus of 3 sentences (each row, excluding the header, corresponds to a sentence):

|  sentence no. | sentence text |
| --------------- | ------------------------------ |
| 1  | UBC’s Master of Data Science in Computational Linguistics is the credential to set you apart. |
| 2  | Offered at the Vancouver campus, this unique degree is tailored to those with a passion for language and data.|
| 3  | Over 10 months, the program combines foundational data science courses with advanced computational linguistics courses—equipping graduates with the skills to turn language-related data into knowledge and to build AI that can interpret human language. |

### 2.3 In the tutorial, we constructed word to index mapping for a one sentence corpus. Write code to build word to index mapping for this toy corpus containing three sentences.



In [10]:
# let us construct the word to index mapping
word2id = {}

all_tokens = []
sentences_list = ["UBC’s Master of Data Science in Computational Linguistics is the credential to set you apart.",
            "Offered at the Vancouver campus, this unique degree is tailored to those with a passion for language and data.",
            "Over 10 months, the program combines foundational data science courses with advanced computational linguistics courses—equipping graduates with the skills to turn language-related data into knowledge and to build AI that can interpret human language."]
for sentence in sentences_list:
    for word in sentence.split():
        all_tokens.append(word)
types=list(set(all_tokens))

for word in types:
    if word not in word2id:
        word2id[word] = len(word2id)
print("*" *20 ," word2id dictionary ", "*" *20)
print(word2id,len(word2id))

********************  word2id dictionary  ********************
{'data.': 0, 'credential': 1, 'science': 2, 'in': 3, 'for': 4, 'combines': 5, 'campus,': 6, 'you': 7, 'foundational': 8, 'apart.': 9, 'passion': 10, 'turn': 11, '10': 12, 'those': 13, 'Vancouver': 14, 'linguistics': 15, 'Computational': 16, 'set': 17, 'knowledge': 18, 'Linguistics': 19, 'and': 20, 'interpret': 21, 'of': 22, 'AI': 23, 'Science': 24, 'can': 25, 'to': 26, 'is': 27, 'Over': 28, 'months,': 29, 'courses—equipping': 30, 'Data': 31, 'Master': 32, 'program': 33, 'UBC’s': 34, 'data': 35, 'Offered': 36, 'computational': 37, 'human': 38, 'at': 39, 'skills': 40, 'this': 41, 'unique': 42, 'language': 43, 'that': 44, 'degree': 45, 'build': 46, 'courses': 47, 'tailored': 48, 'with': 49, 'a': 50, 'the': 51, 'advanced': 52, 'language.': 53, 'into': 54, 'graduates': 55, 'language-related': 56} 57


### 2.4 In the tutorial, we constructed the train data (input and output for each training example) for a one sentence corpus. Write code that outputs the train data for CBOW model created from this toy corpus and prints the number of training examples. 

Note:
- Assume the **window size to be 3**. 
- Use **truecase** of the words. 
- Use [white space tokenizer](https://kite.com/python/docs/nltk.WhitespaceTokenizer) to get the words from each sentence and no further preprocessing. 
- A training example is generated from a sentence and doesn't span across multiple sentences. 
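To make the windowing rule concrete, here is a minimal sketch on a hypothetical 8-token sentence (not part of the toy corpus). The helper name `cbow_examples` is an assumption made for illustration. With a window size of 3, each example needs 7 consecutive tokens, so a sentence with $T$ tokens yields $T - 6$ examples.

```python
def cbow_examples(tokens, window_size):
    """Return (context, target) pairs; windows never cross the sentence boundary."""
    span = 2 * window_size + 1
    examples = []
    for start in range(len(tokens) - span + 1):
        window = tokens[start:start + span]
        target = window[window_size]
        context = window[:window_size] + window[window_size + 1:]
        examples.append((context, target))
    return examples

# hypothetical 8-token sentence, used only for illustration
sentence = "the program combines data science with computational linguistics"
pairs = cbow_examples(sentence.split(), window_size=3)
for context, target in pairs:
    print(context, "->", target)
print("number of examples:", len(pairs))
```

Applying the same function to each sentence separately and concatenating the results keeps examples from spanning sentence boundaries.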



In [8]:
from torch.utils.data import Dataset, DataLoader
"""
create a dataset reader
"""
class CBOWDataset(Dataset):
  """ toy three-sentence CBOW dataset."""
  def __init__(self, window_size=2):
    # read the corpus
    corpus = ["UBC’s Master of Data Science in Computational Linguistics is the credential to set you apart.",
            "Offered at the Vancouver campus, this unique degree is tailored to those with a passion for language and data.",
            "Over 10 months, the program combines foundational data science courses with advanced computational linguistics courses—equipping graduates with the skills to turn language-related data into knowledge and to build AI that can interpret human language."]

    window = 2 * window_size + 1
    token2id = {}
    # populate the word to id mapping and generate train inputs/targets
    # generate all training samples for this sentence
    train_inputs, train_outputs = [], []
    for sentence in corpus:
      tokens = sentence.strip().split()
      for token in tokens:
        if token not in token2id.keys(): # add new word
          token2id[token] = len(token2id) # new word index = length of token2id
    
      # map to index
      tokens = [token2id[token] for token in tokens]
    

      for num_win in range(len(tokens) - window + 1):
        cur_tokens = tokens[num_win:num_win + window]
        tgt_token = cur_tokens[window_size]
        train_inputs.append(cur_tokens[0:window_size]+cur_tokens[window_size+1:])
        train_outputs.append(tgt_token)
    self.token2id = token2id
    
    # set the vocab. size
    self.vocab_size = len(token2id)
    
    # set the total number of training examples
    self.n = len(train_inputs)
    
    # convert features and labels to torch.tensor
    self.features = torch.tensor(train_inputs, dtype=torch.long, device=device)
    self.labels = torch.tensor(train_outputs, dtype=torch.long, device=device)
    
  # return input and output of a single example
  # Input: indices of the context words around the target position
  # Output: index of the target (center) word
  def __getitem__(self, index):
    return self.features[index], self.labels[index]
  
  # return the total number of examples
  def __len__(self):
    return self.n

In [9]:
dataset = CBOWDataset(window_size = 3)
print("number of samples in the dataset:", dataset.n)
print("feature matrix:", dataset.features)
print("label matrix:", dataset.labels)
# create batch
train_loader = DataLoader(dataset=dataset, batch_size=1, shuffle=True, num_workers=1) 

number of samples in the dataset: 50
feature matrix: tensor([[ 0,  1,  2,  4,  5,  6],
        [ 1,  2,  3,  5,  6,  7],
        [ 2,  3,  4,  6,  7,  8],
        [ 3,  4,  5,  7,  8,  9],
        [ 4,  5,  6,  8,  9, 10],
        [ 5,  6,  7,  9, 10, 11],
        [ 6,  7,  8, 10, 11, 12],
        [ 7,  8,  9, 11, 12, 13],
        [ 8,  9, 10, 12, 13, 14],
        [15, 16,  9, 18, 19, 20],
        [16,  9, 17, 19, 20, 21],
        [ 9, 17, 18, 20, 21,  8],
        [17, 18, 19, 21,  8, 22],
        [18, 19, 20,  8, 22, 11],
        [19, 20, 21, 22, 11, 23],
        [20, 21,  8, 11, 23, 24],
        [21,  8, 22, 23, 24, 25],
        [ 8, 22, 11, 24, 25, 26],
        [22, 11, 23, 25, 26, 27],
        [11, 23, 24, 26, 27, 28],
        [23, 24, 25, 27, 28, 29],
        [24, 25, 26, 28, 29, 30],
        [31, 32, 33, 34, 35, 36],
        [32, 33,  9, 35, 36, 37],
        [33,  9, 34, 36, 37, 38],
        [ 9, 34, 35, 37, 38, 39],
        [34, 35, 36, 38, 39, 24],
        [35, 36, 37, 39, 24, 

### 2.5 Write code that
- defines the CBOW model
- trains the CBOW model (updates its parameters) with all the training examples (use SGD)
- prints the word embedding for one of the words involved in the training before and after training
- uses the hyperparameters EMBEDDING_SIZE = 3, LEARNING_RATE = 0.5, WINDOW_SIZE = 3 and MAX_EPOCHS = 1.



In [12]:
"""
create a model for CBOW
"""
class CBOWmodel(nn.Module):
  
  def __init__(self, embedding_size, vocab_size, output_size):
    # In the constructor we define the layers for our model
    super(CBOWmodel, self).__init__()
    self.embedding = nn.Embedding(vocab_size, embedding_size, sparse=True)
    
    self.embedding.weight.data.normal_(0.0,0.05) # mean=0.0, std=0.05
    
    self.linear_layer = nn.Linear(embedding_size, output_size, bias=False) # the layer will not learn an additive bias
    self.softmax_layer = nn.LogSoftmax(dim=1)
  
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    out = self.embedding(x).mean(1)
    out = self.linear_layer(out)
    out = self.softmax_layer(out)
    return out

In [15]:
# hyperparameter of CBOW model
EMBEDDING_SIZE = 3 # size of the word embedding
LEARNING_RATE = 0.5 # learning rate of gradient descent
WINDOW_SIZE = 3  # number of words to be considered before (or after) the target word for making the context
MAX_EPOCHS = 1 # number of passes over the training data

In [16]:
# define the loss function (last node of the graph)
model = CBOWmodel(EMBEDDING_SIZE, dataset.vocab_size, dataset.vocab_size)
model.to(device)
print(model)
criterion = nn.NLLLoss()

# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

CBOWmodel(
  (embedding): Embedding(57, 3, sparse=True)
  (linear_layer): Linear(in_features=3, out_features=57, bias=False)
  (softmax_layer): LogSoftmax()
)


In [18]:
# train logic (similar to that of linear regression model)
def train(loader):
    total_loss = 0.0
    # iterate through the data loader
    num_batches = 0
    for batch in loader:
        # load the current batch
        batch_input, batch_output = batch

        # forward propagation
        # pass the data through the model
        model_outputs = model(batch_input)
        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()
        # compute the gradients
        cur_loss.backward()
        # update the weights
        optimizer.step()

        num_batches += 1
    return total_loss/num_batches

# evaluation logic based on classification accuracy
def evaluate(loader):
    accuracy, num_examples = 0.0, 0
    with torch.no_grad(): # disables gradient tracking, which reduces memory usage and speeds up computation
        for batch in loader:
            # load the current batch
            batch_input, batch_output = batch
            # forward propagation
            # pass the data through the model
            model_outputs = model(batch_input)
            # identify the predicted class for each example in the batch (row-wise)
            _, predicted = torch.max(model_outputs.data, 1) # Returns a (values, indices) 
            # compare with batch_output (gold labels) to compute accuracy
            accuracy += (predicted == batch_output).sum().item()
            num_examples += batch_output.size(0)
    return accuracy/num_examples

In [19]:
# start the training
for epoch in range(MAX_EPOCHS):
  # train the model for one pass over the data
  train_loss = train(train_loader)   
  # print the loss for every epoch
  print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss))

Epoch [1/1], Loss: 4.0420
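To see the effect of training on a single word embedding, as 2.5 asks, one option is to snapshot the embedding row before and after an update. The sketch below is self-contained and uses toy sizes and one made-up training example (all names and values here are illustrative). With the homework objects, the same idea would apply to ``model.embedding.weight``.

```python
import torch
import torch.nn as nn

torch.manual_seed(572)
vocab_size, embedding_size = 6, 3            # toy sizes, not the homework corpus

embedding = nn.Embedding(vocab_size, embedding_size)
linear = nn.Linear(embedding_size, vocab_size, bias=False)
params = list(embedding.parameters()) + list(linear.parameters())
optimizer = torch.optim.SGD(params, lr=0.5)
criterion = nn.NLLLoss()

context = torch.tensor([[0, 1, 3, 4]])       # one made-up training example
target = torch.tensor([2])
word_id = 0                                  # a word that appears in the context

# clone the row, otherwise the snapshot would change in place with the weights
before = embedding.weight[word_id].detach().clone()

log_probs = torch.log_softmax(linear(embedding(context).mean(1)), dim=1)
loss = criterion(log_probs, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

after = embedding.weight[word_id].detach().clone()
print("before:", before)
print("after: ", after)
```

The `clone()` matters: without it, `before` would be a view of the weight tensor and would show the updated values as well.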


### 2.6 How many learnable (or updatable) parameters are present in the model defined in 2.5? Compute the result by writing code or by hand.


In [16]:
# your code goes here (if you are writing code, use this block. Otherwise, change this block to a Markdown block.)

In [21]:
# Python code to get the same result
count = 0

for p in model.parameters():
    count += p.numel()
print("the number of parameters:",count)

the number of parameters: 342
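The same number can be obtained by hand from the layer shapes printed in 2.5: the embedding holds one 3-dimensional vector per word, and the linear layer has a 57 x 3 weight matrix and no bias. A small sketch of that arithmetic (variable names are illustrative):

```python
vocab_size, embedding_size = 57, 3

embedding_params = vocab_size * embedding_size   # nn.Embedding weight: 57 x 3
linear_params = embedding_size * vocab_size      # nn.Linear weight: 57 x 3, no bias (bias=False)
total = embedding_params + linear_params
print(embedding_params, linear_params, total)    # 171 171 342
```

The LogSoftmax layer contributes nothing, since it has no learnable parameters.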


## Exercise 3: Neural Network for Sentiment Analysis


The multilayer neural network code used in our tutorial is as follows:

In [22]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim

# set the seed
manual_seed = 123
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

# hyperparameters
BATCH_SIZE = 5
MAX_EPOCHS = 15
LEARNING_RATE = 0.1
MAX_FEATURES = 5000 # maximum number of TF-IDF features (the input dimension of the network)
NUM_CLASSES = 3

# dataset
DATA_FOLDER = "data/sentiment-twitter-2016-task4"
TRAIN_FILE = DATA_FOLDER + "/train.tsv"
VALID_FILE = DATA_FOLDER + "/dev.tsv"
TEST_FILE = DATA_FOLDER + "/test.tsv"

from sklearn.feature_extraction.text import TfidfVectorizer

# function for reading tsv file
def read_corpus(file):
    corpus = [] 
    for line in open(file):
        content, label = line.strip().split("\t") # first column is the tweet, second column is the gold label.
        corpus.append(content)
    return corpus

# reads the train corpus
train_corpus = read_corpus(TRAIN_FILE)  # get a list of tweets

# define the vectorizer
# builds a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
vectorizer = TfidfVectorizer(max_features=MAX_FEATURES,ngram_range=(1, 3))

# fit the vectorizer on train set
vectorizer.fit(train_corpus)

# create a new class inheriting torch.utils.data.Dataset
class TweetSentimentDataset(Dataset):
  """ sentiment-twitter-2016-task4 dataset."""
  def __init__(self, file, vectorizer):
    # read the corpus
    corpus, labels = [], []
    for line in open(file):
      content, label = line.strip().split("\t")
      corpus.append(content)
      labels.append(int(label))
    
    # set the size of the corpus
    self.n = len(corpus)
    
    # vectorize all the tweets
    features = vectorizer.transform(corpus)
    
    # convert features and labels to torch.tensor
    # .to(device) returns a new tensor, so its result has to be assigned
    self.features = torch.from_numpy(features.toarray()).float().to(device)
    self.labels = torch.tensor(labels, device=device, requires_grad=False)
    
  # return input and output of a single example
  # Input: Feature vectors, where each vector corresponds to a tweet. 
  # Output: Labels, where each label is one index for each of our tags in the set {positive, negative, neutral}
  def __getitem__(self, index):
    return self.features[index], self.labels[index]
  
  # return the total number of examples
  def __len__(self):
    return self.n

# create the dataloader object
train_loader = DataLoader(dataset=TweetSentimentDataset(TRAIN_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=True)#, num_workers=2) 
valid_loader = DataLoader(dataset=TweetSentimentDataset(VALID_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=False)#, num_workers=1)
test_loader = DataLoader(dataset=TweetSentimentDataset(TEST_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=False)#, num_workers=1) 

# train logic (similar to that of linear regression model)
def train(loader):
  total_loss = 0.0
  # iterate through the data loader
  num_batches = 0
  for batch in loader:
    # load the current batch
    batch_input, batch_output = batch
    
    # forward propagation
    # pass the data through the model
    model_outputs = model(batch_input.to(device))
    # compute the loss
    cur_loss = criterion(model_outputs, batch_output.to(device))
    total_loss += cur_loss.item()
    
    # backward propagation (compute the gradients and update the model)
    # clear the buffer
    optimizer.zero_grad()
    # compute the gradients
    cur_loss.backward()
    # update the weights
    optimizer.step()
    
    num_batches += 1
  return total_loss/num_batches

# evaluation logic based on classification accuracy
def evaluate(loader):
  accuracy, num_examples = 0.0, 0
  with torch.no_grad(): # disables gradient tracking, which reduces memory usage and speeds up computation
    for batch in loader:
      # load the current batch
      batch_input, batch_output = batch
      # forward propagation
      # pass the data through the model
      model_outputs = model(batch_input.to(device))
      # identify the predicted class for each example in the batch
      _, predicted = torch.max(model_outputs.data, 1)
      # compare with batch_output (gold labels) to compute accuracy
      accuracy += (predicted == batch_output).sum().item()
      num_examples += batch_output.size(0)
  return accuracy/num_examples

"""
create a custom model class inheriting torch.nn.Module
"""
class MultiLayerNeuralNetworkModel(nn.Module):
  
  def __init__(self, num_inputs, hidden_layers, num_outputs):
    # In the constructor we define the layers for our model
    super(MultiLayerNeuralNetworkModel, self).__init__()
    
    modules = [] # stores all the layers for the neural network
    input_dim = num_inputs
    # add input layer followed by hidden layers (excluding the classification module)
    for hidden_layer in hidden_layers:
      # add one linear layer followed by a non-linearity (nn.ReLU here; the tutorial used nn.Sigmoid)
      modules.append(nn.Linear(input_dim, hidden_layer))
      #modules.append(nn.Tanh())
      modules.append(nn.ReLU())
      input_dim = hidden_layer
    # add the classification module
    modules.append(nn.Linear(input_dim, num_outputs))
    modules.append(nn.LogSoftmax(dim=1))
    
    # create the model from all the modules
    self.model = nn.Sequential(*modules) # container of layers, for more details: https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential
  
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    out = self.model(x)
    return out

# hyperparameter of neural network
hidden_layers = [50, 50]  # [num. of hidden units in first layer, num. of hidden units in second layer]
#hidden_layers = [100, 100,100,100,100] 
# define the loss function (last node of the graph)
model = MultiLayerNeuralNetworkModel(MAX_FEATURES, hidden_layers, NUM_CLASSES)
model.to(device)
print(model)
criterion = nn.NLLLoss()

# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

MultiLayerNeuralNetworkModel(
  (model): Sequential(
    (0): Linear(in_features=5000, out_features=50, bias=True)
    (1): ReLU()
    (2): Linear(in_features=50, out_features=50, bias=True)
    (3): ReLU()
    (4): Linear(in_features=50, out_features=3, bias=True)
    (5): LogSoftmax()
  )
)


In [23]:
# start the training
for epoch in range(MAX_EPOCHS):
  # train the model for one pass over the data
  train_loss = train(train_loader)
  # compute the training accuracy
  train_acc = evaluate(train_loader)
  # compute the validation accuracy 
  val_acc = evaluate(valid_loader)
  # print the loss for every epoch
  print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f}, Validation Accuracy: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss, train_acc, val_acc))

Epoch [1/15], Loss: 0.9924, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [2/15], Loss: 0.9652, Training Accuracy: 0.5517, Validation Accuracy: 0.4487
Epoch [3/15], Loss: 0.8733, Training Accuracy: 0.6648, Validation Accuracy: 0.5143
Epoch [4/15], Loss: 0.8072, Training Accuracy: 0.5840, Validation Accuracy: 0.4507
Epoch [5/15], Loss: 0.7601, Training Accuracy: 0.7340, Validation Accuracy: 0.5033
Epoch [6/15], Loss: 0.7045, Training Accuracy: 0.7255, Validation Accuracy: 0.4907
Epoch [7/15], Loss: 0.6577, Training Accuracy: 0.8053, Validation Accuracy: 0.5048
Epoch [8/15], Loss: 0.6108, Training Accuracy: 0.8378, Validation Accuracy: 0.4887
Epoch [9/15], Loss: 0.5653, Training Accuracy: 0.7482, Validation Accuracy: 0.4462
Epoch [10/15], Loss: 0.5189, Training Accuracy: 0.8258, Validation Accuracy: 0.4867
Epoch [11/15], Loss: 0.4812, Training Accuracy: 0.7813, Validation Accuracy: 0.4462
Epoch [12/15], Loss: 0.4474, Training Accuracy: 0.8922, Validation Accuracy: 0.4977
E

### 3.1 In the original tutorial, we considered only *unigrams* as features to represent a tweet. Change the original tutorial code to consider *bigrams, trigrams* (along with unigrams, which is considered by default) as features to represent a tweet.

Hints: 
- Look at the documentation of [``TfidfVectorizer``](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to see how to incorporate bigrams, trigrams and so on.
- Modify the line ``vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)`` and keep the rest of the code intact.

### **Hand in the**
- Accuracy on the validation set after training
- Python code (ONLY the changed lines)

rubric={accuracy:3, quality:1}

**Report performance of your model here:** (double-click to edit)

**3.1.1 Performance of my neural network model on the validation set after training is ~51.58%** on accuracy (fill in your accuracy).

**3.1.2 Your code:**

In [19]:
# your code goes here (put only the changed lines)
vectorizer = TfidfVectorizer(max_features=MAX_FEATURES,ngram_range=(1, 3))


### 3.2  In the original tutorial, we considered only two hidden layers with 50 units each. Change the original tutorial code to consider *five hidden layers* with *100 dimensions each*.

Hints: 
- Modify the line ``hidden_layers = [50, 50]`` and keep the rest of the code intact.

### **Hand in the**
- Accuracy on the validation set after training
- Python code (ONLY the changed lines)

rubric={accuracy:3, quality:1}

**Report performance of your model here:** (double-click to edit)

**3.2.1  Performance of my neural network model on the validation set after training is ~50.43%** on accuracy (fill in your accuracy).

**3.2.2 Your code:**

In [20]:
# your code goes here (put only the changed lines)
hidden_layers = [100, 100,100,100,100] 

### 3.3 In the original tutorial, we used [*Sigmoid*](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.sigmoid) as the activation function. Change the original tutorial code to consider other nonlinearities such as [*ReLU*](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.relu) and [*Tanh*](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.tanh) and report the nonlinearity that gives the best validation performance.

Hints: 
- Modify the line ``modules.append(nn.Sigmoid())`` and keep the rest of the code intact.

### **Hand in the**
- Nonlinearity function that gives the best validation performance
- Accuracy on the validation set after training with the best nonlinearity
- Python code (ONLY the changed lines)

rubric={accuracy:4, quality:1}

**Report your results here:** (double-click to edit)

**3.3.1. Report the nonlinearity function that gives the best validation performance: nn.ReLU()** Depending on your random state, you may instead have found nn.Tanh() to perform best.

**3.3.2. Report performance of my neural network model on the validation set after training is ~51.43%** on accuracy (fill in your accuracy).

**3.3.3 Your code:**

In [12]:
# your code goes here (put only the changed lines)
modules.append(nn.Tanh())

#or

modules.append(nn.ReLU())


### 3.4 In the original tutorial, we used *learning rate of 0.5*. Change the original tutorial code by trying out different learning rates preferably *between 0.0001 and 1*. Report the learning rate that gives the best validation performance.

Hints: 
- Modify the line ``LEARNING_RATE = 0.5`` and keep the rest of the code intact.

### **Hand in the**
- Learning rate that gives the best validation performance
- Accuracy on the validation set after training with the best learning rate
- Python code (ONLY the changed lines)

rubric={accuracy:6, quality:1}

**Report your results here:** (double-click to edit)

**3.4.1. Report the learning rate that gives the best validation performance:** This depends on which values you tried, but it should be around 0.7.

**3.4.2. Report the performance of my neural network model on the validation set after training is ~52%** on accuracy (fill in your accuracy).

**3.4.3 Your code:**

This answer depends on which learning rates you tried. You should be getting roughly 52%.

## Exercise 4: Very-Short answer questions

(Double-click each question block and place your answer at the end of the question) 

### 4.1 Can we build deep neural network without any nonlinearity layers? What's the nature of such deep neural network without any nonlinearity layers?
rubric={reasoning:2}


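Yes, we can stack layers without nonlinearities, but the result isn't really a deep network: a composition of linear (affine) layers is itself a single linear map, so the whole stack is just a fancy way of building a linear model. The nonlinearity is what lets the network deal with data that is not linearly separable.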
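A small numerical check of this point: two stacked ``nn.Linear`` layers with no activation in between give the same output as one affine map whose weight and bias are computed from the two layers. The layer sizes and names below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f1 = nn.Linear(4, 8)
f2 = nn.Linear(8, 3)
x = torch.rand(5, 4)

deep_out = f2(f1(x))    # two stacked linear layers, no non-linearity in between

# collapse: W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
W = f2.weight @ f1.weight
b = f2.weight @ f1.bias + f2.bias
single_out = x @ W.T + b

print(torch.allclose(deep_out, single_out, atol=1e-6))
```

Inserting any nonlinearity between `f1` and `f2` breaks this equivalence, which is what gives depth its extra expressive power.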

### 4.2 What is the distinction between single layer neural network and multilayer neural network?
rubric={reasoning:2}

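A single-layer network maps the inputs directly to the outputs with one layer of weights (no hidden layer). A multilayer network has one or more hidden layers between the input and the output.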

### 4.3 What is the difference between hidden layer and hidden unit (or neuron)?
rubric={reasoning:2}

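A hidden unit (neuron) is a single computational element: it takes a weighted sum of its inputs and applies an activation function. A hidden layer is the set of hidden units at the same depth of the network, i.e. layers are made up of units.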

### 4.4 What is the gradient of sigmoid function over a scalar ($\nabla \sigma(a)$)? What is the interesting property of this gradient?
rubric={reasoning:2}

The derivative of the sigmoid is $\sigma'(a)=\sigma(a)(1-\sigma(a))$, so one interesting property is that the gradient can be written in terms of the sigmoid's own output. The gradient is nearly flat toward the edges ($a \to +\infty$ or $a \to -\infty$) and steepest at the center ($a=0$, where it equals $0.25$), which means outputs get pushed toward either 1 or 0 and the gradient vanishes once a unit saturates.
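The closed form can be compared against autograd on a few sample points. This is a minimal sketch; the sample values are arbitrary.

```python
import torch

a = torch.tensor([-6.0, -2.0, 0.0, 2.0, 6.0], requires_grad=True)
s = torch.sigmoid(a)
s.sum().backward()                      # a.grad now holds d sigma / d a at each point

closed_form = (s * (1 - s)).detach()    # sigma(a) * (1 - sigma(a))
print(a.grad)
print(torch.allclose(a.grad, closed_form))
```

The printed gradient peaks at the middle entry ($a = 0$) and shrinks toward both ends, matching the saturation behaviour described above.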