## Homework 2 - Supervised Learning II - MDS Computational Linguistics

### Assignment Topics
- More linearities, non-linearities
- Embedding layer
- Neural Network for sentiment analysis
- Very-short answer questions

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)

### Submission Info.
- Due Date: January 25, 2020, 18:00:00 (Vancouver time)

## Getting Started

In [1]:
# all necessary imports
import numpy as np
import torch
import torch.nn as nn

# set the seed (allows reproducibility of the results)
manual_seed = 123
torch.manual_seed(manual_seed) # allows us to reproduce results when using random generation on the cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # creates the device object, either GPU (cuda) or CPU

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

## Background

In the previous assignment, you computed the values in tensors by hand.

Sample question:

In [2]:
# the model
linear_layer = torch.nn.Linear(5, 2)

# set the parameters (weights, biases)
linear_layer.weight.data = torch.tensor([[1., 2., 3., 4., 5.], [1., 3., 0., 0., 10.]]) # sets the values of weight matrix (W)
linear_layer.bias.data = torch.tensor([3., 1.]) # sets the values of bias vector (b) (note unlike previous assignment, we assign different bias for each output unit)

# data (2 examples each with 5 input features and 2 target values)
inputs = torch.tensor([[100., 10., 20., 15., 1.], [10., 5., 2., 1., 0.]]) # initialize the inputs (X)
targets = torch.tensor([[245., 140.], [30., 30.]]) # initialize the targets (Y)

# forward propagation
model_out = linear_layer(inputs)
criterion = torch.nn.MSELoss()
loss_out = criterion(model_out, targets)

print("model loss (loss_out):", loss_out.data.item())

model loss (loss_out): 8.75


Compute the values in ``loss_out`` by hand. Show your work.

Sample answer: (write it in Markdown, not as code. If you don't like Markdown, you can write the steps on a piece of paper, take a photo, and attach the image in the answer block.)

your answer goes here:

First let's compute the $model\_out$ for our inputs:

$XW^T + b  = [[100,10,20,15,1],[10,5,2,1,0]] \times [[1,2,3,4,5],[1,3,0,0,10]]^T  + [[3, 1], [3, 1]]$

$= [[(100*1+10*2+20*3+15*4+1*5+3),(100*1+10*3+20*0+15*0+1*10+1)],[(10*1+5*2+2*3+1*4+0*5+3),(10*1+5*3+2*0+1*0+0*10+1)]] $

$= [[248,141],[33,26]]$

Now let's apply the mean squared error loss to compute $loss\_out$:
$\frac{1}{n}\sum_i^n(\tilde{y_i} -y_i)^2$

$\frac{1}{2}\left(mean([(248-245),(141-140)]^2) + mean([(33-30),(26-30)]^2)\right) = \frac{1}{2} \left(\frac{9+1}{2} +\frac{9+16}{2}\right) = 8.75$
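The hand computation above can be cross-checked numerically. The following is a minimal sketch using NumPy rather than PyTorch, so each step of $XW^T + b$ and the mean squared error is explicit. The variable names (`W`, `b`, `X`, `Y`, `mse`) are chosen here for illustration; the numbers are those of the sample question.

```python
import numpy as np

# parameters and data copied from the sample question
W = np.array([[1., 2., 3., 4., 5.], [1., 3., 0., 0., 10.]])  # weight matrix (2 x 5)
b = np.array([3., 1.])                                       # one bias per output unit
X = np.array([[100., 10., 20., 15., 1.], [10., 5., 2., 1., 0.]])  # inputs (2 x 5)
Y = np.array([[245., 140.], [30., 30.]])                     # targets (2 x 2)

# linear layer: X W^T + b (b is broadcast across the two examples)
model_out = X @ W.T + b

# mean squared error: average of the squared differences over all entries
mse = np.mean((model_out - Y) ** 2)

print("model_out:\n", model_out)
print("mse:", mse)
```

Averaging over all four entries gives the same value as averaging per example and then across examples, because every example has the same number of target values.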


For some questions in this assignment, you will be asked to compute the number of parameters of a model.

Sample question: **How many learnable (or updatable) parameters are present in the above model?**

Sample answer:

linear_layer.weight = $2 \times 5$ = 10

linear_layer.bias = $2 \times 1$ = 2

Thus, the number of parameters is = 10 + 2 = 12

In [3]:
# Python code to get the same result
modules_in_model = [linear_layer]
count = 0
for module in modules_in_model:
  for p in module.parameters():
    print(p.data)
    count += p.numel()
print("the number of parameters:",count)

tensor([[  1.,   2.,   3.,   4.,   5.],
        [  1.,   3.,   0.,   0.,  10.]])
tensor([ 3.,  1.])
the number of parameters: 12


## Exercise 1: More linearities and non-linearities

### 1.1 Logistic Regression (2-class classification)

In [4]:
# the model
linear_layer = torch.nn.Linear(5, 2)

# set the parameters (weights, biases)
linear_layer.weight.data = torch.tensor([[1., 2., 3., 4., 5.], [1., 3., 0., 0., 10.]]) # sets the values of weight matrix (W)
linear_layer.bias.data = torch.tensor([3., 1.]) # sets the values of bias vector (b) (note unlike previous assignment, we assign different bias for each output unit)

# data (2 examples each with 5 input features and 1 target class value)
inputs = torch.tensor([[100., 10., 20., 15., 1.], [10., 5., 2., 1., 0.]]) # initialize the inputs (X)
targets = torch.tensor([0, 1]) # initialize the target classes (Y)

# forward propagation
model_out = linear_layer(inputs)
softmax_out = nn.LogSoftmax(dim=1)(model_out) # documentation: https://pytorch.org/docs/stable/nn.html#torch.nn.LogSoftmax
nll_loss = nn.NLLLoss() # documentation: https://pytorch.org/docs/stable/nn.html#torch.nn.NLLLoss
loss_out = nll_loss(softmax_out, targets)

print("model loss (loss_out):",loss_out.data.item())

model loss (loss_out): 3.500455617904663


### 1.1 Compute the values in ``loss_out`` by hand. Show your work. 
(Hint: It might be easier if you compute ``model_out`` first, followed by ``softmax_out`` and then compute ``loss_out``) 
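To make the order of operations in the hint concrete, here is a minimal NumPy sketch of log-softmax followed by the negative log-likelihood loss. It uses made-up logits, not the values from this exercise. The helper names `log_softmax` and `mean_nll` and the toy numbers are assumptions introduced for illustration; the averaging over examples mirrors the default mean reduction of ``nn.NLLLoss``.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax: logits minus the log of the sum of exponentials."""
    # subtracting the row maximum does not change the result but avoids overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def mean_nll(log_probs, targets):
    """Negative log-probability of the gold class, averaged over the examples."""
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

# toy values (hypothetical, NOT the exercise data)
toy_logits = np.array([[2.0, 0.0], [0.0, 1.0]])
toy_targets = np.array([0, 1])

toy_log_probs = log_softmax(toy_logits)
toy_loss = mean_nll(toy_log_probs, toy_targets)

print("log-probabilities:\n", toy_log_probs)
print("loss:", toy_loss)
```

Exponentiating each row of `toy_log_probs` gives probabilities that sum to 1, which is a quick sanity check for a hand computation.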

rubric={accuracy:3}

**your answer goes here:**

### 1.2 Single layer neural network (2-class classification)

In [5]:
# the model
layer_1 = torch.nn.Linear(5, 2) 
activation_fn = nn.ReLU()
layer_2 = torch.nn.Linear(2, 2)
nll_loss = nn.NLLLoss()

# set the parameters (weights, biases)
layer_1.weight.data = torch.tensor([[1., 2., 3., 4., 5.], [1., 3., 0., 0., 10.]]) # sets the values of weight matrix for 1st layer (W_1)
layer_1.bias.data = torch.tensor([3., 1.]) # sets the values of bias vector for 1st layer (b_1)
layer_2.weight.data = torch.tensor([[2., 2.], [1, 1]]) # sets the values of weight matrix for 2nd layer (W_2)
layer_2.bias.data = torch.tensor([2., 1.]) # sets the values of bias vector for 2nd layer (b_2)

# data (2 examples each with 5 features and 1 target class value)
inputs = torch.tensor([[100., 10., 20., 15., 1.], [10., 5., 2., 1., 0.]]) # initialize the inputs (X)
targets = torch.tensor([0, 1]) # initialize the target classes (Y)

# forward propagation
layer1_out = layer_1(inputs)
layer1_hidden = activation_fn(layer1_out)
layer2_out = layer_2(layer1_hidden)
softmax_out = nn.LogSoftmax(dim=1)(layer2_out)
loss_out = nll_loss(softmax_out, targets)

print("model loss (loss_out):", loss_out.data.item())

model loss (loss_out): 30.0


### 1.2 Compute the values in ``loss_out`` by hand. Show your work. 
(Hint: It might be easier if you compute ``layer1_out`` first, followed by ``layer1_hidden`` and the rest  ``layer2_out``, ``softmax_out``, ``loss_out`` in order.) 

rubric={accuracy:4}

**your answer goes here:**

### 1.3 Multi layer neural network (2-class classification)

In [6]:
# the model
layer_1 = torch.nn.Linear(5, 4)
activation_fn = nn.ReLU()
layer_2 = torch.nn.Linear(4, 3)
layer_3 = torch.nn.Linear(3, 2)
nll_loss = nn.NLLLoss()

# set the parameters (weights, biases)
layer_1.weight.data = torch.tensor([[1., 2., 3., 4., 5.], [1., 3., 0., 0., 10.], [1., 0., 0., 4., 5.], [1., 3., 1., 0., 0.]]) # sets the values of weight matrix for 1st layer (W_1)
layer_1.bias.data = torch.tensor([3., 1., 1., 0.]) # sets the values of bias vector for 1st layer (b_1)
layer_2.weight.data = torch.tensor([[2., 2., 2., 2.], [1., 1., 1., 1.], [0., 1., 1., 0.]]) # sets the values of weight matrix for 2nd layer (W_2)
layer_2.bias.data = torch.tensor([2., 1., 0.]) # sets the values of bias vector for 2nd layer (b_2)
layer_3.weight.data = torch.tensor([[2., 1., 2.], [1., 1., 0.]]) # sets the values of weight matrix for 3rd layer (W_3)
layer_3.bias.data = torch.tensor([1., 0.]) # sets the values of bias vector for 3rd layer (b_3)

# data (2 examples each with 5 features and 1 target class value)
inputs = torch.tensor([[100., 10., 20., 15., 1.], [10., 5., 2., 1., 0.]]) # initialize the inputs (X)
targets = torch.tensor([0, 1]) # initialize the target classes (Y)

# forward propagation
layer1_out = layer_1(inputs)
layer1_hidden = activation_fn(layer1_out)
layer2_out = layer_2(layer1_hidden)
layer2_hidden = activation_fn(layer2_out)
layer3_out = layer_3(layer2_hidden)
softmax_out = nn.LogSoftmax(dim=1)(layer3_out)
loss_out = nll_loss(softmax_out, targets)

print("model loss (loss_out):", loss_out.data.item())

model loss (loss_out): 143.5


### 1.3 Compute the values in **loss_out** by hand. Show your work.
(Hint: It might be easier if you compute ``layer1_out`` first, followed by ``layer1_hidden`` and the rest  ``layer2_out``, ``layer2_hidden``, ``layer3_out``, ``softmax_out``, ``loss_out`` in order.) 

rubric={accuracy:5}

**your answer goes here:**

### 1.4

In [7]:
layer_1 = torch.nn.Linear(5, 4)
print("layer_1 weight:\n", layer_1.weight.data)

layer_1 weight:
 tensor([[ 0.1564, -0.3429,  0.3451,  0.1402,  0.3094],
        [-0.1759,  0.0948,  0.4367,  0.3008,  0.3587],
        [-0.0939,  0.3407, -0.3503,  0.0387, -0.2518],
        [-0.1043, -0.1145,  0.0335,  0.4070,  0.2214]])


From what distribution (and range) are the default values in ``layer_1.weight.data`` sampled? 
(Hint: You can look at the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.Linear) and/or [source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear) of ``nn.Linear``. And the symbols  $\textit{U}$ and $\mathcal{N}$ correspond to uniform and normal (Gaussian) distribution.)

rubric={accuracy:2}

**your answer goes here:**

### 1.5 Write code to initialize the parameters (weights and biases) in ``layer_1`` (previous question, 1.4) with numbers sampled randomly from the standard normal distribution (mean 0 and variance 1).

Hint: Look at ``torch.randn`` function

rubric={accuracy:2}

In [8]:
# your code goes here

### 1.6 How many learnable (or updatable) parameters are present in the single layer neural network defined in ``1.2``? Compute the result by writing code or by hand.
rubric={accuracy:2}

In [9]:
# your code goes here (if you are writing code, use this block. Otherwise, change this block to a Markdown block.)

### 1.7 How many learnable (or updatable) parameters are present in the multi layer neural network defined in ``1.3``? Compute the result by writing code or by hand.

rubric={accuracy:2}

In [10]:
# your code goes here (if you are writing code, use this block. Otherwise, change this block to a Markdown block.)

## Exercise 2: Embedding layer


### 2.1

In [11]:
# the model
embedding_model = nn.Embedding(4, 3) # 4 embeddings, each with 3 dimensions

# set the weights for each of the four embeddings
embedding_model.weight.data = torch.tensor([[1., 2., 3.], [1., 1., 1.], [3., 0., 0.], [10., 20., 30.]])

# data (2 examples each with two inputs)
inputs = torch.tensor([[0, 2], [1, 3]])

# forward propagation for computing average of input embeddings
embeddings_out = embedding_model(inputs)
embeddings_avg = embeddings_out.mean(1)
print("embeddings_avg:\n", embeddings_avg.data)

embeddings_avg:
 tensor([[  2.0000,   1.0000,   1.5000],
        [  5.5000,  10.5000,  15.5000]])


### 2.1 Compute the values in **embeddings_avg** by hand. Show your work.
rubric={accuracy:2}

**your answer goes here:**

### 2.2 Can you reimplement 2.1 by using [``nn.EmbeddingBag``](https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag) instead of ``nn.Embedding`` by setting the appropriate mode?  (OPTIONAL question)

Hint: You can look at the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag) and/or [source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/sparse.html#EmbeddingBag) of ``nn.EmbeddingBag``

rubric={spark:2}

In [12]:
# your code goes here

### 2.3 

Toy corpus of 3 sentences (each row, excluding the header, corresponds to a sentence)

|  sentence no. | sentence text |
| --------------- | ------------------------------ |
| 1  | UBC’s Master of Data Science in Computational Linguistics is the credential to set you apart. |
| 2  | Offered at the Vancouver campus, this unique degree is tailored to those with a passion for language and data.|
| 3  | Over 10 months, the program combines foundational data science courses with advanced computational linguistics courses—equipping graduates with the skills to turn language-related data into knowledge and to build AI that can interpret human language. |

### 2.3.1 In the tutorial, we constructed word to index mapping for a one sentence corpus. Write code to build word to index mapping for this toy corpus containing three sentences.

rubric={accuracy:2}

In [13]:
# your code goes here

### 2.3.2 In the tutorial, we constructed the train data (input and output for each training example) for a one sentence corpus. Write code that outputs the train data for CBOW model created from this toy corpus and prints the number of training examples. 

Note:
- Assume the **window size to be 3**. 
- Use **truecase** of the words. 
- Use [white space tokenizer](https://kite.com/python/docs/nltk.WhitespaceTokenizer) to get the words from each sentence and no further preprocessing. 
- A training example is generated from a sentence and doesn't span across multiple sentences. 

rubric={accuracy:4}

In [14]:
# your code goes here

### 2.3.3 Write code that
- defines the CBOW model
- trains the CBOW model (updates its parameters) with all the training examples (use SGD)
- prints the word embedding for one of the words involved in the training, before and after training
- assumes the hyperparameters: EMBEDDING_SIZE to 3, LEARNING_RATE to 0.5, WINDOW_SIZE to 3 and MAX_EPOCHS to 1.

rubric={accuracy:6}

In [15]:
# your code goes here

### 2.3.4 How many learnable (or updatable) parameters are present in the model defined in 2.3.3? Compute the result by writing code or by hand.
rubric={accuracy:2}

In [16]:
# your code goes here (if you are writing code, use this block. Otherwise, change this block to a Markdown block.)

## Exercise 3: Neural Network for Sentiment Analysis


The multilayer neural network code used in our tutorial is as follows:

In [17]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim

# set the seed
manual_seed = 123
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
  torch.cuda.manual_seed(manual_seed)

# hyperparameters
BATCH_SIZE = 5
MAX_EPOCHS = 15
LEARNING_RATE = 0.5
MAX_FEATURES = 5000 # number of input features x_j per tweet (maximum vocabulary size of the vectorizer)
NUM_CLASSES = 3

# dataset
DATA_FOLDER = "data/sentiment-twitter-2016-task4"
TRAIN_FILE = DATA_FOLDER + "/train.tsv"
VALID_FILE = DATA_FOLDER + "/dev.tsv"
TEST_FILE = DATA_FOLDER + "/test.tsv"

from sklearn.feature_extraction.text import TfidfVectorizer

# function for reading tsv file
def read_corpus(file):
    corpus = [] 
    for line in open(file):
        content, label = line.strip().split("\t") # first column is tweet, second column is golden label.
        corpus.append(content)
    return corpus

# reads the train corpus
train_corpus = read_corpus(TRAIN_FILE)  # get a list of tweets

# define the vectorizer
# builds a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus
vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)

# fit the vectorizer on train set
vectorizer.fit(train_corpus)

# create a new class inheriting torch.utils.data.Dataset
class TweetSentimentDataset(Dataset):
  """ sentiment-twitter-2016-task4 dataset."""
  def __init__(self, file, vectorizer):
    # read the corpus
    corpus, labels = [], []
    for line in open(file):
      content, label = line.strip().split("\t")
      corpus.append(content)
      labels.append(int(label))
    
    # set the size of the corpus
    self.n = len(corpus)
    
    # vectorize all the tweets
    features = vectorizer.transform(corpus)
    
    # convert features and labels to torch.tensor
    self.features = torch.from_numpy(features.toarray()).float()
    self.features = self.features.to(device) # .to() returns a new tensor, so the result must be assigned back
    self.labels = torch.tensor(labels, device=device, requires_grad=False)
    
  # return input and output of a single example
  # Input: Feature vectors, where each vector corresponds to a tweet. 
  # Output: Labels, where each label is one index for each of our tags in the set {positive, negative, neutral}
  def __getitem__(self, index):
    return self.features[index], self.labels[index]
  
  # return the total number of examples
  def __len__(self):
    return self.n

# create the dataloader object
train_loader = DataLoader(dataset=TweetSentimentDataset(TRAIN_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=True, num_workers=2) 
valid_loader = DataLoader(dataset=TweetSentimentDataset(VALID_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=False, num_workers=1)
test_loader = DataLoader(dataset=TweetSentimentDataset(TEST_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=False, num_workers=1) 

# train logic (similar to that of linear regression model)
def train(loader):
  total_loss = 0.0
  # iterate through the data loader
  num_batches = 0
  for batch in loader:
    # load the current batch
    batch_input, batch_output = batch
    
    # forward propagation
    # pass the data through the model
    model_outputs = model(batch_input)
    # compute the loss
    cur_loss = criterion(model_outputs, batch_output)
    total_loss += cur_loss.item()
    
    # backward propagation (compute the gradients and update the model)
    # clear the buffer
    optimizer.zero_grad()
    # compute the gradients
    cur_loss.backward()
    # update the weights
    optimizer.step()
    
    num_batches += 1
  return total_loss/num_batches

# evaluation logic based on classification accuracy
def evaluate(loader):
  accuracy, num_examples = 0.0, 0
  with torch.no_grad(): # disables gradient tracking in the autograd engine; reduces memory usage and speeds up computation
    for batch in loader:
      # load the current batch
      batch_input, batch_output = batch
      # forward propagation
      # pass the data through the model
      model_outputs = model(batch_input)
      # identify the predicted class for each example in the batch
      _, predicted = torch.max(model_outputs.data, 1)
      # compare with batch_output (gold labels) to compute accuracy
      accuracy += (predicted == batch_output).sum().item()
      num_examples += batch_output.size(0)
  return accuracy/num_examples

"""
create a custom model class inheriting torch.nn.Module
"""
class MultiLayerNeuralNetworkModel(nn.Module):
  
  def __init__(self, num_inputs, hidden_layers, num_outputs):
    # In the constructor we define the layers for our model
    super(MultiLayerNeuralNetworkModel, self).__init__()
    
    modules = [] # stores all the layers for the neural network
    input_dim = num_inputs
    # add input layer followed by hidden layers (excluding the classification module)
    for hidden_layer in hidden_layers:
      # add one layer followed by non-linearity (nn.Sigmoid)
      modules.append(nn.Linear(input_dim, hidden_layer))
      modules.append(nn.Sigmoid())
      input_dim = hidden_layer
    # add the classification module
    modules.append(nn.Linear(input_dim, num_outputs))
    modules.append(nn.LogSoftmax(dim=1))
    
    # create the model from all the modules
    self.model = nn.Sequential(*modules) # container of layers, for more details: https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential
  
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    out = self.model(x)
    return out

# hyperparameter of neural network
hidden_layers = [50, 50]  # [num. of hidden units in first layer, num. of hidden units in second layer]

# create the model and define the loss function (last node of the graph)
model = MultiLayerNeuralNetworkModel(MAX_FEATURES, hidden_layers, NUM_CLASSES)
model.to(device)
print(model)
criterion = nn.NLLLoss()

# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

MultiLayerNeuralNetworkModel(
  (model): Sequential(
    (0): Linear(in_features=5000, out_features=50, bias=True)
    (1): Sigmoid()
    (2): Linear(in_features=50, out_features=50, bias=True)
    (3): Sigmoid()
    (4): Linear(in_features=50, out_features=3, bias=True)
    (5): LogSoftmax()
  )
)
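The printed architecture can be sanity-checked by counting its learnable parameters by hand, in the same way as the Background example: each `Linear(in_features, out_features, bias=True)` layer contributes `in_features * out_features` weights plus `out_features` biases. The sketch below is plain Python. The helper name `count_mlp_parameters` is hypothetical and not part of the tutorial code, and it assumes every layer is fully connected with a bias.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a stack of fully connected layers.

    layer_sizes lists the number of units per layer, input first,
    e.g. [5000, 50, 50, 3] for the model printed above.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix of shape (n_out, n_in)
        total += n_out         # bias vector, one value per output unit
    return total

# sizes read off the printed model: 5000 inputs, two hidden layers of 50, 3 classes
tutorial_total = count_mlp_parameters([5000, 50, 50, 3])
# the Background example: a single Linear(5, 2) layer
background_total = count_mlp_parameters([5, 2])

print("tutorial model:", tutorial_total)
print("background example:", background_total)
```

The activation modules (`Sigmoid`, `LogSoftmax`) have no learnable parameters, so they do not appear in the count.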


In [18]:
# start the training
for epoch in range(MAX_EPOCHS):
  # train the model for one pass over the data
  train_loss = train(train_loader)
  # compute the training accuracy
  train_acc = evaluate(train_loader)
  # compute the validation accuracy 
  val_acc = evaluate(valid_loader)
  # print the loss for every epoch
  print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f}, Validation Accuracy: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss, train_acc, val_acc))

Epoch [1/15], Loss: 1.0182, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [2/15], Loss: 1.0047, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [3/15], Loss: 1.0016, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [4/15], Loss: 1.0024, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [5/15], Loss: 1.0025, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [6/15], Loss: 1.0009, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [7/15], Loss: 0.9985, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [8/15], Loss: 0.9924, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [9/15], Loss: 0.9610, Training Accuracy: 0.5943, Validation Accuracy: 0.4747
Epoch [10/15], Loss: 0.9114, Training Accuracy: 0.6082, Validation Accuracy: 0.4927
Epoch [11/15], Loss: 0.8673, Training Accuracy: 0.6205, Validation Accuracy: 0.4902
Epoch [12/15], Loss: 0.8248, Training Accuracy: 0.6720, Validation Accuracy: 0.5208
...

### 3.1 In the original tutorial, we considered only *unigrams* as features to represent a tweet. Change the original tutorial code to consider *bigrams, trigrams* (along with unigrams, which is considered by default) as features to represent a tweet.

Hints: 
- Look at the documentation of [``TfidfVectorizer``](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to see how to incorporate bigrams, trigrams and so on.
- Modify the line ``vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)`` and keep the rest of the code intact.

### 3.1 **Hand in the**
- Accuracy on the validation set after training
- Python code (ONLY the changed lines)

rubric={accuracy:3, quality:1}

**Report performance of your model here:** (double-click to edit)

**3.1.1 Performance of my neural network model on the validation set after training is X.XXX%** on accuracy (fill in your accuracy).

**3.1.2 Your code:**

In [19]:
# your code goes here (put only the changed lines)

### 3.2  In the original tutorial, we considered only two hidden layers with 50 units each. Change the original tutorial code to consider *five hidden layers* with *100 dimensions each*.

Hints: 
- Modify the line ``hidden_layers = [50, 50]`` and keep the rest of the code intact.

### 3.2 **Hand in the**
- Accuracy on the validation set after training
- Python code (ONLY the changed lines)

rubric={accuracy:3, quality:1}

**Report performance of your model here:** (double-click to edit)

**3.2.1  Performance of my neural network model on the validation set after training is X.XXX%** on accuracy (fill in your accuracy).

**3.2.2 Your code:**

In [20]:
# your code goes here (put only the changed lines)

### 3.3 In the original tutorial, we used [*Sigmoid*](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.sigmoid) as the activation function. Change the original tutorial code to consider other nonlinearities such as [*ReLU*](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.relu) and [*Tanh*](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.tanh) and report the nonlinearity that gives the best validation performance.

Hints: 
- Modify the line ``modules.append(nn.Sigmoid())`` and keep the rest of the code intact.

### 3.3 **Hand in the**
- Nonlinearity function that gives the best validation performance
- Accuracy on the validation set after training with the best nonlinearity
- Python code (ONLY the changed lines)

rubric={accuracy:4, quality:1}

**Report your results here:** (double-click to edit)

**3.3.1. Report the nonlinearity function that gives the best validation performance: nn.XXX()** (change this)

**3.3.2. Performance of my neural network model on the validation set after training is X.XXX%** on accuracy (fill in your accuracy).

**3.3.3 Your code:**

In [21]:
# your code goes here (put only the changed lines)

### 3.4 In the original tutorial, we used *learning rate of 0.5*. Change the original tutorial code by trying out different learning rates preferably *between 0.0001 and 1*. Report the learning rate that gives the best validation performance.

Hints: 
- Modify the line ``LEARNING_RATE = 0.5`` and keep the rest of the code intact.
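Because the suggested range spans four orders of magnitude, one common way to choose candidate values is to space them evenly on a logarithmic scale instead of a linear one. The sketch below is only an illustration of that idea, not a requirement of the assignment. The name `candidate_learning_rates` and the choice of five candidates are assumptions.

```python
import numpy as np

# five candidates spaced evenly in log10 between 10^-4 and 10^0
candidate_learning_rates = np.logspace(-4, 0, num=5)

for lr in candidate_learning_rates:
    # each value would be assigned to LEARNING_RATE for a separate training run
    print("candidate learning rate: {:g}".format(lr))
```

Each candidate requires retraining the model from scratch (re-creating the model and the optimizer), so that the runs are comparable.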

### 3.4 **Hand in the**
- Learning rate that gives the best validation performance
- Accuracy on the validation set after training with the best learning rate
- Python code (ONLY the changed lines)

rubric={accuracy:6, quality:1}

**Report your results here:** (double-click to edit)

**3.4.1. Report the learning rate that gives the best validation performance: X.XXX** (fill in your learning rate).

**3.4.2. Performance of my neural network model on the validation set after training is X.XXX%** on accuracy (fill in your accuracy).

**3.4.3 Your code:**

In [22]:
# your code goes here (put only the changed lines)

## Exercise 4: Very-Short answer questions

(Double-click each question block and place your answer at the end of the question) 

### 4.1 Can we build a deep neural network without any nonlinearity layers? What is the nature of such a deep neural network?
rubric={reasoning:2}

### 4.2 What is the distinction between a single layer neural network and a multilayer neural network?
rubric={reasoning:2}

### 4.3 What is the difference between a hidden layer and a hidden unit (or neuron)?
rubric={reasoning:2}

### 4.4 What is the gradient of the sigmoid function with respect to a scalar input ($\nabla \sigma(a)$)? What is the interesting property of this gradient?
rubric={reasoning:2}