## Homework 1 - Supervised Learning II - MDS Computational Linguistics

### Assignment Topics
- Introduction to Neural Nets (Feed-Forward NNs / Multi-layer Perceptrons)
- Neural Net Hyperparameter Tuning
- Operations on tensors
- Linearities, non-linearities and loss functions 
- Very-short answer questions

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)
- Scikit Learn (>=0.23.2)
- Skorch (>=0.9)

### Submission Info. 
- Due Date: 1/16/21 11:59pm (Pacific Time)

## Getting Started

In [2]:
# all necessary imports
import numpy as np
import random
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import sklearn

# set the seed (allows reproducibility of the results)
manual_seed = 572
torch.manual_seed(manual_seed) # allows us to reproduce results when using random generation on the cpu
# use the GPU if one is available on this system, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.deterministic = True


## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Make sure any code that relies on random initialization is seeded with manual_seed=572
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)
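
As a quick sanity check of the seeding requirement above, the following minimal sketch (not part of the graded work) shows that re-seeding before a random draw reproduces the same values, while drawing again without re-seeding does not. The variable names `first`, `second` and `third` are only illustrative.

```python
import torch

manual_seed = 572  # same seed value as used in this notebook

torch.manual_seed(manual_seed)
first = torch.rand(3)

torch.manual_seed(manual_seed)  # re-seed before drawing again
second = torch.rand(3)

third = torch.rand(3)  # drawn without re-seeding

print(torch.equal(first, second))  # same draws after re-seeding
print(torch.equal(first, third))   # different draws without re-seeding
```

This is why the seeding lines should be re-run each time a network is (re)initialized.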

# Exercise T1: Simple Neural Net Optimization

For this group assignment, we will apply a simple 2-layer feed-forward neural net classifier to a classification task. We'll use a newsgroup dataset, which consists of the words in posts from a set of different forums, with the post data represented in TF-IDF format. Can our classifiers predict which forum each post came from? In essence, this is a quick recap of 571 applied to neural networks, which end up having many more hyperparameters to deal with than the algorithms you encountered in 571. For your sanity, we will limit how many different trials we expect you to run in each section. 

Our comparison will be an SVM classifier with a simple linear kernel.

### Data Loading 

In [2]:
from sklearn.datasets import fetch_20newsgroups_vectorized 
from sklearn.model_selection import train_test_split

train_data = fetch_20newsgroups_vectorized(subset='train', remove=('headers', 'footers', 'quotes'))
test_data =fetch_20newsgroups_vectorized(subset='test', remove=('headers', 'footers', 'quotes'))

X_train = train_data.data
y_train = train_data.target
X_test = test_data.data
y_test = test_data.target

### SVM Baseline

Let's run a quick test to see how well an SVM does on this dataset (might take a couple minutes):

In [3]:
from sklearn import metrics
from sklearn import svm

clf = svm.SVC(kernel='linear') #Use a linear kernel with default regularization for simplicity

clf.fit(X_train, y_train)

pred = clf.predict(X_test)

acc = metrics.accuracy_score(y_test, pred)
print(acc)

0.5624004248539565


Like SVMs and other ML classifiers, neural nets are flexible in their configuration: the so-called hyperparameters control how the network is structured and how it learns. And, as with SVMs, tuning them turns out to be extremely important for good performance.

### Example 2 Layer NN

We'll use Skorch (see https://github.com/skorch-dev/skorch for conda install instructions) to allow us to use Sklearn alongside Pytorch. Note: in later assignments in this course we'll just focus on Pytorch directly, but it's always a good tool to have in your back pocket (if you want to use say Sklearn's cross validation tools).

In [12]:
from skorch import NeuralNetClassifier


##Define our NN
class TwoLayerNN(nn.Module): #feel free to change the inputs it takes
    def __init__(self, input_dim, hidden_units=20, output_classes=20, nonlin=nn.ReLU()):  
        super(TwoLayerNN, self).__init__()
        self.dense0 = nn.Linear(input_dim, hidden_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(p=.1)
        self.dense1 = nn.Linear(hidden_units, hidden_units)
        self.output = nn.Linear(hidden_units, output_classes)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = self.nonlin(self.dense1(X))
        X = self.softmax(self.output(X))
        return X


#Make sure you run these each time before you (re)initialize the network!!! 
torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)

net = NeuralNetClassifier(
    TwoLayerNN(input_dim=X_train.shape[1]),
    max_epochs=10,
    lr=0.01,
    optimizer=torch.optim.SGD,   
    optimizer__weight_decay=0.001,  #roughly equivalent to L2 regularization
    iterator_train__shuffle=True,
    device=device  #'cpu' or 'cuda'
)

net.fit(torch.from_numpy(X_train.todense()).float(), torch.tensor(y_train,dtype=torch.long))


  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        3.0029       0.0486        3.0022  4.1832
      2        3.0018       0.0486        3.0012  3.6511
      3        3.0007       0.0486        3.0002  3.5746
      4        2.9998       0.0486        2.9993  3.4552
      5        2.9990       0.0486        2.9986  3.4531
      6        2.9982       0.0486        2.9978  3.3736
      7        2.9975       0.0486        2.9972  3.4266
      8        2.9969       0.0486        2.9966  3.4702
      9        2.9963       0.0486        2.9960  3.4065
     10        2.9957       0.0499        2.9955  3.4624


<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=TwoLayerNN(
    (dense0): Linear(in_features=101631, out_features=20, bias=True)
    (nonlin): ReLU()
    (dropout): Dropout(p=0.1)
    (dense1): Linear(in_features=20, out_features=20, bias=True)
    (output): Linear(in_features=20, out_features=20, bias=True)
    (softmax): Softmax()
  ),
)

In [13]:
#when you're ready to run on TEST:
pred = net.predict(torch.from_numpy(X_test.todense()).float())
acc = metrics.accuracy_score(y_test, pred)
print(acc)

0.05018587360594796


Not great! Let's look at improving our network.

### Hyperparameter Optimization

It turns out there are quite a few choices that we can make to optimize our NN. The architecture of the net is a major factor: the number of hidden layers, the number of units in each layer, and the choice of nonlinearity (sigmoid, ReLU etc.). We also care about the training procedure, such as which optimizer to use (SGD, Adam etc.), the learning rate, and the number of epochs. Last but not least, regularization is also important (L2 regularization can be thought of as weight decay in Pytorch).   
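
To make the link between weight decay and L2 regularization concrete, here is a minimal sketch for plain SGD only. It assumes a toy one-step problem (the vectors `w_init` and `x`, the toy loss, and the values of `lr` and `wd` are made up for illustration) and compares the optimizer's `weight_decay` argument against an explicit penalty of `0.5 * wd * ||w||^2` added to the loss.

```python
import torch

torch.manual_seed(572)
w_init = torch.randn(3)   # toy starting weights (illustrative only)
x = torch.randn(3)        # toy input (illustrative only)
lr, wd = 0.1, 0.01

# (a) weight decay handled by the optimizer
w_a = w_init.clone().requires_grad_(True)
opt_a = torch.optim.SGD([w_a], lr=lr, weight_decay=wd)
loss_a = (w_a * x).sum() ** 2
loss_a.backward()
opt_a.step()

# (b) explicit L2 penalty added to the loss, no weight decay in the optimizer
w_b = w_init.clone().requires_grad_(True)
opt_b = torch.optim.SGD([w_b], lr=lr)
loss_b = (w_b * x).sum() ** 2 + 0.5 * wd * (w_b ** 2).sum()
loss_b.backward()
opt_b.step()

same_update = torch.allclose(w_a, w_b, atol=1e-6)
print(same_update)
```

Note the factor of 0.5 in the explicit penalty: its gradient is `wd * w`, which is what the optimizer adds to the gradient when `weight_decay=wd`.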

*NOTE 1*: One thing we won't touch yet, but which is extremely important for good optimization, is how model weights are initialized. We'll talk about this later.  

*NOTE 2*: Keep it to 2 layers. You'll be building a multilayer model in Exercise 2 (there are tricks to make it easier to implement).

#### Brief recap of these concepts
  
*Learning Rate*: Multiplication factor that sets how far your model parameters move on each update.    
*Optimizer*: The algorithm used to update the model parameters (Stochastic Gradient Descent, Adam, Adadelta etc.; we'll talk about these later).  
*Epoch*: One full pass through the entire training set; the number of epochs counts how many times the model has cycled through the training set during training.  
*Nonlinearity*: A non-linear function that allows neural nets to 'learn' to solve non-linearly separable problems. Common ones include sigmoid, tanh, ReLU.  
*Regularization*: A penalty based on the magnitude of the model's weights, which discourages overfitting. Pytorch uses weight decay, which is roughly equal to L2 regularization (but subtly differs). Other regularization types (L0, L1, L_inf etc.) might need to be manually calculated and added to the loss directly.  
*Dropout*: A regularization technique used with neural networks. It randomly zeros values passed through the dropout layer (with a specified probability) to encourage the model to spread what it learns across nodes, rather than relying on a single node for something. 
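
The dropout behaviour described above can be observed directly. The following is a small sketch on a toy input of ones (the input and `p=0.5` are chosen only for illustration): in training mode roughly a fraction `p` of values are zeroed and the survivors are scaled by `1 / (1 - p)`, while in evaluation mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(572)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()            # training mode: dropout is active
train_out = drop(x)

drop.eval()             # evaluation mode: dropout does nothing
eval_out = drop(x)

zero_fraction = (train_out == 0).float().mean().item()
kept_values = train_out[train_out != 0].unique()

print(zero_fraction)              # close to p
print(kept_values)                # surviving values, scaled by 1 / (1 - p)
print(torch.equal(eval_out, x))   # unchanged in evaluation mode
```

Skorch switches between these modes for you during `fit` and `predict`; when writing your own training loop, you would call `model.train()` and `model.eval()` yourself.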


#### For these sections it is advised to split up the work with teammates.

First, let's do a quick manual search of the hyperparameters (you'll be graded here on just following a logical process and explaining it). Second, try a grid search, and finally a random search.

### T1 Manual Search
rubric={reasoning:1}

As a reminder, manual search is basically just using trial-and-error to find the best combination, starting with an educated guess as to the best parameters, and then making adjustments as you go. For this part, you are only allowed to try *5* different combinations of hyperparameters.

Document your starting point, and how you adjusted the hyperparameters along the way, reporting the accuracy for each round. Explain your reasoning for some of the choices you made.


#### Documentation Goes Here

ex:  
Round 1  

Regularization(L2=.001)  
Optimizer(SGD)  
Max_epochs(20)  
Learning_rate(0.01)  
NN_layers(2)  
NN_hidden_units(20 for all layers)  
Dropout(.1 on output of dense0)  
Nonlinearity(ReLU)  
~Anything Else~    
ACC = 0.05  Awful!  No better than chance!

*YOUR WORK*

*YOUR EXPLANATION*

### T1 Grid Search
rubric={accuracy:1, quality:1}

Another approach is to choose a set of values to check for each hyperparameter and then cover all possible combinations. Use Scikit-learn's grid search functionality to check a total of around 20 different possible configurations of hyperparameters. **With very large epoch/parameter values it might take 30 minutes to run a trial (depending on your CPU/GPU), so plan accordingly and potentially split up the grid between teammates.** (Teammates who didn't run the code can copy the outputs into a Raw NBConvert cell.)
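
Before launching a grid search, it can help to count how many configurations a grid actually contains. The sketch below uses only the standard library and a hypothetical grid: the values are placeholders rather than recommendations. The key names follow skorch's convention of routing `module__...` and `optimizer__...` values to the module and optimizer constructors, which assumes the module is passed to skorch as a class rather than as an already-built instance.

```python
from itertools import product

# Hypothetical grid: placeholder values, not recommendations.
param_grid = {
    "lr": [0.001, 0.01, 0.1, 0.5, 1.0],
    "module__hidden_units": [20, 100],
    "optimizer__weight_decay": [0.0, 0.001],
}

# A grid search tries every combination, so the number of configurations
# is the product of the list lengths: 5 * 2 * 2 = 20.
configs = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

print(len(configs))
print(configs[0])
```

A dictionary of this shape is what `GridSearchCV` takes as `param_grid`. Keep in mind that the total number of model fits is the number of configurations multiplied by the number of cross-validation folds.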

In [None]:
##Your code to run the grid search here


### T1 Random Search
rubric={accuracy:1, quality:1}

Finally, use scikit-learn's random search functionality to check a total of around 20 different possible configurations of hyperparameters. 
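
To illustrate what a random search samples, here is a standard-library sketch that draws 20 hypothetical configurations. The ranges and choices are placeholders, and `sample_config` is a helper invented for this illustration. Learning rate and weight decay are drawn log-uniformly, a common choice for parameters that span several orders of magnitude.

```python
import random

rng = random.Random(572)  # seeded so the same configurations are drawn each run

def sample_config(rng):
    """Draw one hypothetical hyperparameter configuration."""
    return {
        # log-uniform between 1e-4 and 1e-1
        "lr": 10 ** rng.uniform(-4, -1),
        "module__hidden_units": rng.choice([20, 50, 100, 200]),
        # log-uniform between 1e-5 and 1e-2
        "optimizer__weight_decay": 10 ** rng.uniform(-5, -2),
    }

configs = [sample_config(rng) for _ in range(20)]
for cfg in configs[:3]:
    print(cfg)
```

With scikit-learn, the equivalent is `RandomizedSearchCV` with `n_iter=20`, where `param_distributions` maps each name to either a list of values or a distribution object with an `rvs` method (for example `scipy.stats.loguniform`).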

In [None]:
##Your code to run the random search here


### T1 Hyperparameter Optimization Reflection
rubric={reasoning:2}  
Reflect on the process to get this basic Neural Network to work. Were you able to get it to perform better than our baseline SVM? How "easy"/"hard" was it to get to work well? What might you also consider (but perhaps didn't have time) to try to improve it?

## Exercise 1: Tensor Operations (Pytorch Intro)

We'll primarily be using Pytorch for coding neural nets in this class (as well as COLX 531 and some others). This exercise gives an introduction to the basic operations performed on the tensors that Pytorch uses. Pytorch tensors are basically numpy arrays, with the advantage that they can be loaded onto a GPU to perform extremely fast parallel calculations. Remember in DSCI 512 the speedup from parallelizing code? Deep learning utilizes this in spades!
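
As a small illustration of how closely tensors and NumPy arrays are related, the sketch below (with a made-up array `arr`) converts between the two and moves a tensor to whichever device is available. On the CPU, `torch.from_numpy` shares memory with the source array, so a change to one is visible through the other.

```python
import numpy as np
import torch

arr = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(arr)       # shares memory with arr on the CPU

arr[0, 0] = 100.0               # the change is visible through the tensor
print(t[0, 0].item())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_dev = t.to(device)            # moved to the GPU when one is available
back = t_dev.cpu().numpy()      # back to NumPy via the CPU
print(back.shape)
```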

### 1.1 Write code that creates a tensor, **X** of size $5 \times 5$ containing longs with values initialized to ones. 
rubric={accuracy:1}

In [2]:
# your code goes here



### 1.2 Write code that takes the tensor, **X** (from the previous question 1.1) and sets the values along the diagonal to two.
rubric={accuracy:1}

In [3]:
# your code goes here



### 1.3 Write code that takes the tensor, **X** (from the previous question 1.2), squares all the values in **X**, sums all the squared values in **X** and prints the square root of this sum (the L2-norm).
rubric={accuracy:1}

In [4]:
# your code goes here



## 1.4 Given the following two tensors, **X** $\in \mathbb{R}^{4\times4}$ and **Y** $\in \mathbb{R}^{4\times4}$

In [5]:
X = torch.rand(4,4)
print(X)
Y = torch.rand(4,4)
print(Y)

tensor([[0.2961, 0.5166, 0.2517, 0.6886],
        [0.0740, 0.8665, 0.1366, 0.1025],
        [0.1841, 0.7264, 0.3153, 0.6871],
        [0.0756, 0.1966, 0.3164, 0.4017]])
tensor([[0.1186, 0.8274, 0.3821, 0.6605],
        [0.8536, 0.5932, 0.6367, 0.9826],
        [0.2745, 0.6584, 0.2775, 0.8573],
        [0.8993, 0.0390, 0.9268, 0.7388]])


### 1.4.1 Write code that performs standard matrix multiplication, multiply **X** and **Y** without changing their values and prints the result.
rubric={accuracy:1}

In [6]:
# your code goes here



### 1.4.2 Write code that performs standard addition of two matrices, add **X** and **Y** without changing their values and prints the result.
rubric={accuracy:1}

In [7]:
# your code goes here



### 1.4.3 Write code that subtracts matrix **Y** from **X** without changing their values and prints the result.
rubric={accuracy:1}

In [8]:
# your code goes here



### 1.4.4 Write code that performs standard matrix multiplication of **X** and **Y**, places the result directly in **X** (modifying **X**) and prints the result.
rubric={accuracy:1}

In [9]:
# your code goes here



## 1.5 Given the following tensor, **X** $\in \mathbb{R}^{5\times3}$

In [10]:
X = torch.rand(5,3)

### 1.5.1 Write code to print all the elements in the last row of **X**.
rubric={accuracy:1}

In [11]:
# your code goes here



### 1.5.2 Write code to print all the elements in the middle column of **X**.
rubric={accuracy:1}

In [12]:
# your code goes here



### 1.5.3 Write code to create a 3D tensor of size $1 \times 5 \times 3$ using the $5 \times 3$ values from **X** (unsqueeze operation)
rubric={accuracy:1}

(You'll often need to "squeeze" or "unsqueeze" tensors to make sure that the dimensions are correct for certain parts of your model.)
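
One situation where this comes up is broadcasting. The sketch below uses made-up tensors (`scores` and `row_weights` are illustrative names, unrelated to **X**): a vector with one weight per row cannot be multiplied against a $5 \times 3$ tensor until it is given a trailing dimension of size 1.

```python
import torch

scores = torch.ones(5, 3)                          # 5 rows, 3 columns
row_weights = torch.tensor([1., 2., 3., 4., 5.])   # one weight per row, shape (5,)

# scores * row_weights would raise an error: shape (5,) does not
# broadcast against (5, 3) because the trailing dimensions (5 vs 3) differ.
col = row_weights.unsqueeze(1)                     # shape (5, 1)
weighted = scores * col                            # broadcasts across the columns

print(col.shape)
print(weighted[:, 0])
print(col.squeeze(1).shape)                        # back to shape (5,)
```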

In [13]:
# your code goes here



### 1.5.4 Write code that converts the 3D tensor (created in the previous question, 1.5.3) back into a 2D tensor (of size $5 \times 3$). (squeeze operation)
rubric={accuracy:1}

In [14]:
# your code goes here



## Exercise 2: Putting the "Multi" in Multilayer Perceptron

In T1 you optimized the hyperparameters of a 2-layer neural network. For this exercise, you'll be using the same dataset (20 Newsgroups), but you'll need to build a model that can support an arbitrary number of layers (rather than just two)! There are two tricks to help out with this. The first is *nn.Sequential*, which allows you to stack Pytorch modules together in a *cascade* fashion (*cascade* here meaning one thing passed to another, like a pipeline). The other trick is *nn.ModuleList*, which allows us to keep track of lists of modules that we can then iterate through to build our network and perform *forward* passes through it.

#### nn.Sequential example

In [11]:
#Make sure you run these each time before you (re)initialize the network or data!!! 
torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)


## Fake data for testing
x = torch.rand((5,10))


## Building the model.  
## Note: to use inside a larger model, set it as:  self.name_of_layer  = nn.Sequential(...)
## in the initialization function

example_model = nn.Sequential(
          nn.Linear(10,5),
          nn.ReLU(),
          nn.Softmax(dim=-1)
        )

## Forward pass:

output = example_model(x)

print(output)

tensor([[0.2510, 0.1498, 0.1444, 0.1669, 0.2878],
        [0.2776, 0.1573, 0.1573, 0.1692, 0.2386],
        [0.2137, 0.2039, 0.1484, 0.1878, 0.2462],
        [0.2488, 0.1647, 0.1614, 0.1614, 0.2638],
        [0.1998, 0.1814, 0.1532, 0.1532, 0.3123]], grad_fn=<SoftmaxBackward>)


### 2.1 A 5 layer MLP using nn.Sequential
rubric={accuracy:2, quality:2}

Using nn.Sequential build a network with the following parameters: 5 Hidden Layers, each with 40 hidden units, using a Tanh activation function between layers, 20% dropout on each layer, and finally a softmax output.

In [None]:
### My MLP Here!

### 2.2 Training and testing our 5 layer MLP
rubric={accuracy:1, quality:1}

Now train the network on the 20 Newsgroups training data, and report the test set accuracy.

In [None]:
### My MLP training / testing code!

#### nn.ModuleList example


In [12]:

#Make sure you run these each time before you (re)initialize the network or data!!! 
torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)


## Fake data for testing
x = torch.rand((5,10))

layers = nn.ModuleList()   ## Again you'll need to assign as: self.layers = nn.ModuleList()   in the initialization function
## Two layers
for i in range(2):
    layers.append(nn.Linear(10,10))
    layers.append(nn.ReLU())
layers.append(nn.Linear(10,5))  #output layer
layers.append(nn.Softmax(dim=-1))

## Forward pass:
for layer in layers:
    x = layer(x)

print(x)

tensor([[0.1650, 0.2360, 0.2021, 0.2144, 0.1825],
        [0.1614, 0.2381, 0.2030, 0.2141, 0.1834],
        [0.1638, 0.2382, 0.2023, 0.2146, 0.1811],
        [0.1626, 0.2370, 0.2024, 0.2158, 0.1822],
        [0.1650, 0.2403, 0.2015, 0.2112, 0.1820]], grad_fn=<SoftmaxBackward>)
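
One reason to use nn.ModuleList instead of a plain Python list is parameter registration. The following sketch, using two toy classes invented for this illustration, counts the parameters each one exposes: layers stored in a plain list are invisible to `parameters()`, so an optimizer would never update them.

```python
import torch.nn as nn

class PlainListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # plain Python list: the layers are NOT registered as submodules
        self.layers = [nn.Linear(10, 10), nn.Linear(10, 5)]

class ModuleListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer with the parent module
        self.layers = nn.ModuleList([nn.Linear(10, 10), nn.Linear(10, 5)])

n_plain = sum(p.numel() for p in PlainListNet().parameters())
n_registered = sum(p.numel() for p in ModuleListNet().parameters())

print(n_plain)        # no parameters are visible
print(n_registered)   # (10*10 + 10) + (10*5 + 5) parameters
```

The same registration is what lets `.to(device)` move every layer of the model at once.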


### 2.3 Arbitrary depth MLP using nn.ModuleList()
rubric={accuracy:2, quality:2}

Using nn.ModuleList, build an MLP class that takes the number of layers, input/output dimensions, hidden units, activation function, and dropout percent as input, and builds the corresponding network. The output should pass through the appropriate layers to a final softmax.

In [None]:
### My MLP Here!

### 2.4 Training and testing a 6 layer MLP
rubric={accuracy:1, quality:1}

Using the class you built in 2.3, build and train a 6-layer MLP network on the 20 Newsgroups training data, and report the test set accuracy. You should set hidden units per layer to 50 units, dropout percent to 20%, and use sigmoid as your activation.

In [None]:
### My MLP training / testing code!

## Exercise 3: Very-Short answer questions

(Double-click each question block and place your answer at the end of the question) 

### 3.1 What is NumPy? What are the differences between PyTorch's Tensor and NumPy Array?
rubric={reasoning:1}

### 3.2 What is the key difference between ``torch.LongTensor`` and ``torch.cuda.LongTensor``?
rubric={reasoning:1}

### 3.3 What is the default data type of a PyTorch tensor?
rubric={accuracy:1}

### 3.4 What is ``autograd`` in PyTorch? How is it related to computational graph?
rubric={reasoning:1}

### 3.5 What is SGD? How is it different from Gradient Descent? Finally, what role does SGD play in building machine learning models?
rubric={reasoning:1}

## (Not Graded)  Linearities, Non-linearities, Loss functions by Hand

This is a quick linear algebra review exercise. If you're a little rusty, or if you haven't taken a linear algebra class, this is a peek under the hood at the math that Pytorch is doing when it's calculating tensor operations.  (See the separate solutions in the github lab folder to check your work.)
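
For reference, these are the standard definitions of the functions used in the questions below. The symbols $z$, $\hat{y}$, $y$ and $N$ are notation introduced here, and the loss formula assumes the default averaging behaviour of `torch.nn.MSELoss` (`reduction='mean'`), where $N$ counts every element of the output.

$$\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

$$\mathrm{MSE}(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$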

Sample question:

In [15]:
linear_layer = torch.nn.Linear(5, 1)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.bias.data = torch.tensor([3]).float() # sets the bias value
model_out = linear_layer(torch.tensor([0, 10, 20, 15, 5]).float())
print(model_out)

tensor([168.], grad_fn=<AddBackward0>)


Compute the values in **model\_out** by hand. Show your work.

Sample answer: (write it in markdown, not as code. If you don't like markdown, you can write the steps on a piece of paper, take a photo, and attach the image in the answer block)

your answer goes here:

$model\_out = A x + b = [1, 2, 3, 4, 5] \cdot [0, 10, 20, 15, 5] + 3 = (1 \cdot 0 + 2 \cdot 10 + 3 \cdot 20 + 4 \cdot 15 + 5 \cdot 5) + 3 = 165 + 3 = 168$
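
A by-hand answer like this can be checked with plain tensor operations, without building a layer. This is a minimal sketch; the names `w`, `b`, `x` and `manual` are illustrative, with the values copied from the sample question.

```python
import torch

w = torch.tensor([1., 2., 3., 4., 5.])      # weight row from the sample question
b = torch.tensor(3.)                        # bias from the sample question
x = torch.tensor([0., 10., 20., 15., 5.])   # input vector from the sample question

manual = torch.dot(w, x) + b                # dot product of weights and input, plus bias
print(manual.item())                        # 168.0
```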


### 4.1

In [1]:
linear_layer = torch.nn.Linear(5, 1, bias=False)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
model_out = linear_layer(torch.tensor([0, 10, 20, 15, 5]).float())
print(model_out)


### Compute the values in **model\_out** by hand. Show your work.

your answer goes here (double-click this block to edit):

### 4.2

In [17]:
linear_layer = torch.nn.Linear(5, 2)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.weight.data[1] = torch.tensor([1, 0, 0, 0, 1]) # sets the weight value
linear_layer.bias.data = torch.tensor([1]).float() # sets the bias value
model_out = linear_layer(torch.tensor([0, 10, 20, 15, 1]).float())
sigmoid_out = torch.nn.Sigmoid()(model_out)
print(sigmoid_out)

tensor([1.0000, 0.8808], grad_fn=<SigmoidBackward>)


### Compute the values in **sigmoid\_out** by hand. Show your work.


your answer goes here:

### 4.3

In [18]:
linear_layer = torch.nn.Linear(5, 2)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.weight.data[1] = torch.tensor([1, 3, 0, 0, 10]) # sets the weight value
linear_layer.bias.data = torch.tensor([3]).float() # sets the bias value
model_out = linear_layer(torch.tensor([[100, 10, 20, 15, 1], [10, 5, 2, 1, 0]]).float())
softmax_out = torch.nn.Softmax(dim=1)(model_out)
print(softmax_out)

tensor([[1.0000, 0.0000],
        [0.9933, 0.0067]], grad_fn=<SoftmaxBackward>)


### Compute the values in **softmax\_out** by hand. Show your work.


your answer goes here:

### 4.4

In [19]:
linear_layer = torch.nn.Linear(5, 2)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.weight.data[1] = torch.tensor([1, 3, 0, 0, 10]) # sets the weight value
linear_layer.bias.data = torch.tensor([3]).float() # sets the bias value
model_out = linear_layer(torch.tensor([[100, 10, 20, 15, 1], [10, 5, 2, 1, 0]]).float())
criterion = torch.nn.MSELoss()
loss = criterion(model_out, torch.tensor([[245, 140], [30, 30]]).float())
print(loss)

tensor(7.7500, grad_fn=<MseLossBackward>)


### Compute the values in **loss** by hand. Show your work.


your answer goes here: