## Homework 1 - Supervised Learning II - MDS Computational Linguistics

### Assignment Topics
- Introduction to Neural Nets (Feed-Forward NNs / Multi-layer Perceptrons)
- Neural Net Hyperparameter Tuning
- Operations on tensors
- Linearities, non-linearities and loss functions 
- Very-short answer questions

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)
- Scikit Learn (>=0.23.2)
- Skorch (>=0.9)

### Submission Info. 
- Due Date: 1/16/21 6pm (Pacific Time)

## Getting Started

In [2]:
# all necessary imports
import numpy as np
import random
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import sklearn

# set the seed (allows reproducibility of the results)
manual_seed = 572
torch.manual_seed(manual_seed) # allows us to reproduce results when using random generation on the cpu
# use the GPU if one is available on this system, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.deterministic=True
print(device)

cuda


## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Make sure code that is randomly initialized is set correctly with the manual_seed=572
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

# Exercise T1: Simple Neural Net Optimization

For this group assignment we will apply a simple 2-layer feed-forward neural net classifier to a classification task. We'll use a newsgroup dataset, which consists of the words in posts from a set of different forums, with the post data represented in TF-IDF format. Can our classifiers predict which forum each post came from? In essence this is a quick recap of 571, but applied to neural networks, which end up having many more hyperparameters to deal with than the algorithms you encountered in 571. For your sanity, we will limit how many different trials we expect you to run in each section. 

Our comparison will be an SVM classifier with a simple linear kernel.

### Data Loading 

In [3]:
from sklearn.datasets import fetch_20newsgroups_vectorized 
from sklearn.model_selection import train_test_split

train_data = fetch_20newsgroups_vectorized(subset='train', remove=('headers', 'footers', 'quotes'))
test_data =fetch_20newsgroups_vectorized(subset='test', remove=('headers', 'footers', 'quotes'))

X_train = train_data.data
y_train = train_data.target
X_test = test_data.data
y_test = test_data.target

### SVM Baseline

Let's run a quick test to see how well an SVM does on this dataset (might take a couple minutes):

In [3]:
from sklearn import metrics
from sklearn import svm

clf = svm.SVC(kernel='linear') #Use a linear kernel with default regularization for simplicity

clf.fit(X_train, y_train)

pred = clf.predict(X_test)

acc = metrics.accuracy_score(y_test, pred)
print(acc)

0.5624004248539565


Like SVMs and other ML classifiers, neural nets are flexible in their configuration: the so-called hyperparameters control how the network is structured and how it learns. And as with SVMs, tuning them turns out to be extremely important for good performance.

### Example 2 Layer NN

We'll use Skorch (see https://github.com/skorch-dev/skorch for conda install instructions) to allow us to use Sklearn alongside Pytorch. Note: in later assignments in this course we'll just focus on Pytorch directly, but it's always a good tool to have in your back pocket (if you want to use say Sklearn's cross validation tools).

In [8]:
from skorch import NeuralNetClassifier


##Define our NN
class TwoLayerNN(nn.Module): #feel free to change the inputs it takes
    def __init__(self, input_dim, hidden_units=20, output_classes=20, nonlin=nn.ReLU()):  
        super(TwoLayerNN, self).__init__()
        self.dense0 = nn.Linear(input_dim, hidden_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(p=.1)
        self.dense1 = nn.Linear(hidden_units, hidden_units)
        self.output = nn.Linear(hidden_units, output_classes)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = self.nonlin(self.dense1(X))
        X = self.softmax(self.output(X))
        return X


#Make sure you run these each time before you (re)initialize the network!!! 
torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)

net = NeuralNetClassifier(
    TwoLayerNN(input_dim=X_train.shape[1]),
    max_epochs=10,
    lr=0.01,
    optimizer=torch.optim.SGD,   
    optimizer__weight_decay=0.001,  #roughly equivalent to L2 regularization
    iterator_train__shuffle=True,
    device=device  #'cpu' or 'cuda'
)

net.fit(torch.from_numpy(X_train.todense()).float(), torch.tensor(y_train,dtype=torch.long))


  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        3.0029       0.0486        3.0022  5.8500
      2        3.0018       0.0486        3.0012  5.2916
      3        3.0007       0.0486        3.0002  4.5647
      4        2.9998       0.0486        2.9993  4.6539
      5        2.9990       0.0486        2.9986  4.7641
      6        2.9982       0.0486        2.9978  4.5727
      7        2.9975       0.0486        2.9972  4.6661
      8        2.9969       0.0486        2.9966  4.6782
      9        2.9963       0.0486        2.9960  4.6634
     10        2.9957       0.0499        2.9955  4.7008


<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=TwoLayerNN(
    (dense0): Linear(in_features=101631, out_features=20, bias=True)
    (nonlin): ReLU()
    (dropout): Dropout(p=0.1)
    (dense1): Linear(in_features=20, out_features=20, bias=True)
    (output): Linear(in_features=20, out_features=20, bias=True)
    (softmax): Softmax()
  ),
)

In [13]:
#when you're ready to run on TEST:
pred = net.predict(torch.from_numpy(X_test.todense()).float())
acc = metrics.accuracy_score(y_test, pred)
print(acc)

0.05018587360594796


Not great! Let's look at improving our network.
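
To make "not great" concrete, it helps to know what pure guessing would score. The following is a minimal sketch of that arithmetic, assuming the 20 classes are roughly balanced (the variable names are ours, not part of the assignment):

```python
import math

n_classes = 20  # number of newsgroups in the dataset

# Accuracy of guessing uniformly at random (assuming roughly balanced classes)
chance_accuracy = 1 / n_classes

# Negative log-likelihood when every class is assigned probability 1/20
uniform_loss = -math.log(1 / n_classes)

print(f"chance accuracy: {chance_accuracy:.4f}")
print(f"loss of a uniform prediction: {uniform_loss:.4f}")
```

Both numbers are close to the accuracy and loss reported in the training log above, which suggests the network is still producing a nearly uniform distribution over classes after 10 epochs.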

### Hyperparameter Optimization

It turns out there are quite a few choices we can make to optimize our NN. The architecture of the net is a major factor: the number of hidden layers, the number of units in each layer, and the choice of nonlinearity (sigmoid, ReLU, etc.). We also care about the training procedure, such as which optimizer (SGD, Adam, etc.), learning rate, and number of epochs to use. Last but not least, regularization is also important (L2 regularization can be thought of as weight decay in Pytorch).   

*NOTE 1*: One thing we won't touch yet, but is extremely important for good optimization is how model weights are initialized, we'll talk about this later.  

*NOTE 2*: Keep it to 2 layers, you'll be doing a multilayer model in question 2 (there are tricks to make it easier to implement).

#### Brief recap of these concepts
  
*Learning Rate*: Multiplicative factor that sets how far your model parameters move on each update.    
*Optimizer*: Which algorithm is used to update the model parameters (Stochastic Gradient Descent, Adam, Adadelta etc. we'll talk about these later).  
*Epoch*: One full pass through the entire training set; the number of epochs counts how many times the model has cycled through the training data.  
*Nonlinearity*: A non-linear function that allows neural nets to 'learn' to solve non-linearly separable problems. Common ones include sigmoid, tanh, ReLU.  
*Regularization*: Penalty based on the magnitude of the model's weights, encouraging your model not to overfit. Pytorch uses weight decay, which is roughly equal to L2 regularization (but subtly differs). Other regularization types (L0, L1, L_inf etc.) might need to be manually calculated and added to the loss directly.  
*Dropout*: A regularization technique used with neural networks. It randomly zeros values passed through the dropout layer (with a specified probability) to encourage the model to spread what it learns across nodes rather than relying on a single node for something. 
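
The relationship between learning rate, weight decay and L2 regularization can be sketched in a few lines. This is a simplified illustration, assuming plain SGD without momentum, where weight decay adds `weight_decay * w` to each gradient before the update; `sgd_step` is a hypothetical helper written for this example, not a Pytorch function:

```python
def sgd_step(weights, grads, lr=0.01, weight_decay=0.0):
    """One plain SGD update; weight decay adds weight_decay * w to each gradient."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
grads = [0.5, 0.5]  # made-up gradients of the loss with respect to each weight

no_decay = sgd_step(weights, grads, lr=0.1)
with_decay = sgd_step(weights, grads, lr=0.1, weight_decay=0.1)

print(no_decay)
print(with_decay)
```

With weight decay switched on, each weight ends up slightly closer to zero than it would otherwise, which is the shrinking effect an L2 penalty has under plain SGD.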


#### For these sections it is advised to split up the work with teammates.

First let's do a quick manual search of the hyper parameters (you'll be graded here on just following a logical process and explaining it), second try a grid search, and finally a random search.

### T1 Manual Search
rubric={reasoning:1}

As a reminder, manual search is basically just using trial-and-error to find the best combination, starting with an educated guess as to the best parameters, and then making adjustments as you go. For this part, you are only allowed to try *5* different combinations of hyperparameters.

Document your starting point, and how you adjusted the hyperparameters along the way, reporting the accuracy for each round. Explain your reasoning for some of the choices you made.


#### Documentation Goes Here

ex:  
Round 1  

Regularization(L2=.001)  
Optimizer(SGD)  
Max_epochs(20)  
Learning_rate(0.01)  
NN_layers(2)  
NN_hidden_units(20 for all layers)  
Dropout(.1 on output of dense0)  
Nonlinearity(ReLU)  
~Anything Else~    
ACC = 0.05  Awful!  No better than chance!

*YOUR WORK*

*YOUR EXPLANATION*


Grading overview: As long as you provide a rationale for which hyperparameters you change, and logically follow some process, you will get full credit.

### T1 Grid Search
rubric={accuracy:1, quality:1}

Another approach is to try a number of different values by setting intervals to check and then covering all possibilities. Use Scikit-learn's grid search functionality to check a total of around 20 different possible configurations of hyperparameters. **With very large epoch/parameter values it might take 30 minutes to run a trial (depending on your CPU/GPU), so plan accordingly and potentially split up the grid between teammates.** (Teammates who didn't run the code can copy outputs into a Raw NBConvert cell.)
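
Because every added value multiplies the number of fits, it can help to count the configurations before launching a search. A minimal sketch, using a hypothetical grid with 4 layer sizes, 2 nonlinearities and 3 dropout rates (plain strings stand in for the Pytorch modules):

```python
from itertools import product

# Hypothetical grid: 4 layer sizes x 2 nonlinearities x 3 dropout rates
grid = {
    "hidden_units": [50, 100, 150, 200],
    "nonlin": ["relu", "tanh"],
    "dropout": [0.1, 0.5, 0.9],
}

# Every combination of one value per hyperparameter
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

cv_folds = 2
total_fits = len(configs) * cv_folds
print(len(configs), total_fits)
```

Multiplying the number of fits by the time one fit takes gives a rough estimate of how long the whole search will run.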

In [6]:
##Your code to run the grid search here
from sklearn.model_selection import GridSearchCV
from skorch import NeuralNetClassifier




class TwoLayerNN(nn.Module): #I changed the arguments to allow dropout as an input
    def __init__(self, input_dim, hidden_units=20, output_classes=20, nonlin=nn.ReLU(), dropout=.5):  
        super(TwoLayerNN, self).__init__()
        self.dense0 = nn.Linear(input_dim, hidden_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(p=dropout)
        self.dense1 = nn.Linear(hidden_units, hidden_units)
        self.output = nn.Linear(hidden_units, output_classes)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = self.nonlin(self.dense1(X))
        X = self.softmax(self.output(X))
        return X

torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)

net = NeuralNetClassifier(
    TwoLayerNN(input_dim=X_train.shape[1]),
    max_epochs=10,
    lr=0.01,
    optimizer=torch.optim.Adam,   #Adam can be a little faster than SGD, we'll discuss the difference in lecture
    optimizer__weight_decay=0.001,  #roughly equivalent to L2 regularization
    iterator_train__shuffle=True,
    device=device  #'cpu' or 'cuda'
)

# train_split=False turns off skorch's internal validation split (GridSearchCV does the cross-validation);
# verbose=0 silences the per-epoch training log
net.set_params(train_split=False, verbose=0)

params = {
    #task specific
    'module__input_dim': [X_train.shape[1]],
    'module__output_classes': [20],
    #training hyperparameters
    'lr': [0.01],
    'max_epochs': [20],
    #model architecture hyperparameters 
    'module__hidden_units': [50,100,150,200],
    'module__nonlin': [nn.ReLU(),nn.Tanh()],
    'module__dropout': [.1,.5,.9],
}
gs = GridSearchCV(net, params, refit=True,cv=2, scoring='accuracy', verbose=2)  #could do more CV folds in practice 

gs.fit(torch.from_numpy(X_train.todense()).float(), torch.tensor(y_train,dtype=torch.long))


Fitting 2 folds for each of 24 candidates, totalling 48 fits
[CV] lr=0.01, max_epochs=20, module__dropout=0.1, module__hidden_units=50, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  lr=0.01, max_epochs=20, module__dropout=0.1, module__hidden_units=50, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20, total=  52.6s
[CV] lr=0.01, max_epochs=20, module__dropout=0.1, module__hidden_units=50, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   53.4s remaining:    0.0s


[CV]  lr=0.01, max_epochs=20, module__dropout=0.1, module__hidden_units=50, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20, total=  57.8s
[CV] lr=0.01, max_epochs=20, module__dropout=0.1, module__hidden_units=50, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20 
[CV]  lr=0.01, max_epochs=20, module__dropout=0.1, module__hidden_units=50, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, total= 1.1min
[CV] lr=0.01, max_epochs=20, module__dropout=0.1, module__hidden_units=50, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20 
[CV]  lr=0.01, max_epochs=20, module__dropout=0.1, module__hidden_units=50, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, total=  58.5s
[CV] lr=0.01, max_epochs=20, module__dropout=0.1, module__hidden_units=100, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20 
[CV]  lr=0.01, max_epochs=20, module__dropout=0.

[CV]  lr=0.01, max_epochs=20, module__dropout=0.5, module__hidden_units=150, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, total= 1.0min
[CV] lr=0.01, max_epochs=20, module__dropout=0.5, module__hidden_units=200, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20 
[CV]  lr=0.01, max_epochs=20, module__dropout=0.5, module__hidden_units=200, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20, total= 1.1min
[CV] lr=0.01, max_epochs=20, module__dropout=0.5, module__hidden_units=200, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20 
[CV]  lr=0.01, max_epochs=20, module__dropout=0.5, module__hidden_units=200, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20, total= 1.2min
[CV] lr=0.01, max_epochs=20, module__dropout=0.5, module__hidden_units=200, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20 
[CV]  lr=0.01, max_epochs=20, module__dropo

[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed: 51.0min finished


GridSearchCV(cv=2,
             estimator=<class 'skorch.classifier.NeuralNetClassifier'>[uninitialized](
  module=TwoLayerNN(
    (dense0): Linear(in_features=101631, out_features=20, bias=True)
    (nonlin): ReLU()
    (dropout): Dropout(p=0.5)
    (dense1): Linear(in_features=20, out_features=20, bias=True)
    (output): Linear(in_features=20, out_features=20, bias=True)
    (softmax): Softmax()
  ),
),
             param_grid={'lr': [0.01], 'max_epochs': [20],
                         'module__dropout': [0.1, 0.5, 0.9],
                         'module__hidden_units': [50, 100, 150, 200],
                         'module__input_dim': [101631],
                         'module__nonlin': [ReLU(), Tanh()],
                         'module__output_classes': [20]},
             scoring='accuracy', verbose=2)

In [12]:
print(gs.best_score_)
print(gs.best_params_)

0.61331094219551
{'lr': 0.01, 'max_epochs': 20, 'module__dropout': 0.5, 'module__hidden_units': 50, 'module__input_dim': 101631, 'module__nonlin': Tanh(), 'module__output_classes': 20}


In [13]:
from sklearn import metrics
pred = gs.best_estimator_.predict(torch.from_numpy(X_test.todense()).float())
acc = metrics.accuracy_score(y_test, pred)
print(acc)

0.5821826872012745


### T1 Random Search
rubric={accuracy:1, quality:1}

Finally, use scikit-learn's random search functionality to check a total of around 20 different possible configurations of hyperparameters. 
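
For hyperparameters that span several orders of magnitude, such as weight decay, random search is usually given a log-uniform distribution rather than a uniform one. The sketch below shows what such a distribution samples, using only the standard library; `sample_loguniform` is a hypothetical stand-in written for illustration, not the scipy implementation:

```python
import math
import random

rng = random.Random(572)  # same seed value as the notebook, for reproducibility

def sample_loguniform(low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

samples = [sample_loguniform(1e-6, 1e-2) for _ in range(1000)]

# 1e-4 is the midpoint of the range on a log scale
small = sum(s < 1e-4 for s in samples)
print(f"share of samples below 1e-4: {small / len(samples):.2f}")
```

Roughly half of the draws land below 1e-4, whereas a uniform distribution over the same interval would almost never sample values that small.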

In [20]:
##Your code to run the random search here
from sklearn.model_selection import RandomizedSearchCV
from skorch import NeuralNetClassifier
import scipy.stats


class TwoLayerNN(nn.Module): #feel free to change the inputs it takes
    def __init__(self, input_dim, hidden_units=20, output_classes=20, nonlin=nn.ReLU(), dropout=.5):  
        super(TwoLayerNN, self).__init__()
        self.dense0 = nn.Linear(input_dim, hidden_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(p=dropout)
        self.dense1 = nn.Linear(hidden_units, hidden_units)
        self.output = nn.Linear(hidden_units, output_classes)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = self.nonlin(self.dense1(X))
        X = self.softmax(self.output(X))
        return X

torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)

net = NeuralNetClassifier(
    TwoLayerNN(input_dim=X_train.shape[1]),
    optimizer=torch.optim.Adam,   #Adam can be a little faster than SGD, we'll discuss the difference in lecture
    iterator_train__shuffle=True,
    device=device  #'cpu' or 'cuda'
)

net.set_params(train_split=False, verbose=0)  

params = {
    #task specific
    'module__input_dim': [X_train.shape[1]],
    'module__output_classes': [20],
    #training hyperparameters
    'lr': [.01],
    'max_epochs': [20],
    'optimizer__weight_decay': scipy.stats.loguniform(.000001,.01) ,
    #model architecture hyperparameters 
    'module__hidden_units': scipy.stats.randint(low=50,high=500) ,
    'module__nonlin': [nn.ReLU(),nn.Tanh()],
    'module__dropout': scipy.stats.uniform(),
}
rs = RandomizedSearchCV(net, params, refit=True, cv=2, n_iter=20, random_state=manual_seed, scoring='accuracy', verbose=2)  

rs.fit(torch.from_numpy(X_train.todense()).float(), torch.tensor(y_train,dtype=torch.long))

Fitting 2 folds for each of 20 candidates, totalling 40 fits
[CV] lr=0.01, max_epochs=20, module__dropout=0.5116352079319715, module__hidden_units=482, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, optimizer__weight_decay=9.811577102745637e-06 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  lr=0.01, max_epochs=20, module__dropout=0.5116352079319715, module__hidden_units=482, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, optimizer__weight_decay=9.811577102745637e-06, total= 1.8min
[CV] lr=0.01, max_epochs=20, module__dropout=0.5116352079319715, module__hidden_units=482, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, optimizer__weight_decay=9.811577102745637e-06 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.8min remaining:    0.0s


[CV]  lr=0.01, max_epochs=20, module__dropout=0.5116352079319715, module__hidden_units=482, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, optimizer__weight_decay=9.811577102745637e-06, total= 1.8min
[CV] lr=0.01, max_epochs=20, module__dropout=0.059080698832860934, module__hidden_units=425, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20, optimizer__weight_decay=0.00037907222167547734 
[CV]  lr=0.01, max_epochs=20, module__dropout=0.059080698832860934, module__hidden_units=425, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20, optimizer__weight_decay=0.00037907222167547734, total= 1.7min
[CV] lr=0.01, max_epochs=20, module__dropout=0.059080698832860934, module__hidden_units=425, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20, optimizer__weight_decay=0.00037907222167547734 
[CV]  lr=0.01, max_epochs=20, module__dropout=0.059080698832860934, module__hidden_units=425, module__inp

[CV]  lr=0.01, max_epochs=20, module__dropout=0.07073324041453222, module__hidden_units=273, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, optimizer__weight_decay=0.0049779537405538685, total= 1.3min
[CV] lr=0.01, max_epochs=20, module__dropout=0.07073324041453222, module__hidden_units=273, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, optimizer__weight_decay=0.0049779537405538685 
[CV]  lr=0.01, max_epochs=20, module__dropout=0.07073324041453222, module__hidden_units=273, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, optimizer__weight_decay=0.0049779537405538685, total= 1.3min
[CV] lr=0.01, max_epochs=20, module__dropout=0.5188081489541363, module__hidden_units=461, module__input_dim=101631, module__nonlin=ReLU(), module__output_classes=20, optimizer__weight_decay=0.0017813733700752365 
[CV]  lr=0.01, max_epochs=20, module__dropout=0.5188081489541363, module__hidden_units=461, module__input_dim=1

[CV]  lr=0.01, max_epochs=20, module__dropout=0.5652982129529925, module__hidden_units=308, module__input_dim=101631, module__nonlin=Tanh(), module__output_classes=20, optimizer__weight_decay=2.6904903987431842e-05, total= 1.4min


[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed: 53.7min finished


RandomizedSearchCV(cv=2,
                   estimator=<class 'skorch.classifier.NeuralNetClassifier'>[uninitialized](
  module=TwoLayerNN(
    (dense0): Linear(in_features=101631, out_features=20, bias=True)
    (nonlin): ReLU()
    (dropout): Dropout(p=0.5)
    (dense1): Linear(in_features=20, out_features=20, bias=True)
    (output): Linear(in_features=20, out_features=20, bias=True)
    (softmax): Softmax()
  ),
)...
                                        'module__dropout': <scipy.stats._distn_infrastructure.rv_frozen object at 0x00000227ACBE2390>,
                                        'module__hidden_units': <scipy.stats._distn_infrastructure.rv_frozen object at 0x00000227ACBE26D8>,
                                        'module__input_dim': [101631],
                                        'module__nonlin': [ReLU(), Tanh()],
                                        'module__output_classes': [20],
                                        'optimizer__weight_decay': <scipy.stats._d

In [21]:
print(rs.best_params_)
print(rs.best_score_)

{'lr': 0.01, 'max_epochs': 20, 'module__dropout': 0.8418717674059527, 'module__hidden_units': 388, 'module__input_dim': 101631, 'module__nonlin': ReLU(), 'module__output_classes': 20, 'optimizer__weight_decay': 5.946138722551874e-06}
0.6776560014141771


In [22]:
from sklearn import metrics
pred = rs.best_estimator_.predict(torch.from_numpy(X_test.todense()).float())
acc = metrics.accuracy_score(y_test, pred)
print(acc)

0.6500265533722783


### T1 Hyperparameter Optimization Reflection
rubric={reasoning:2}  
Reflect on the process to get this basic Neural Network to work. Were you able to get it to perform better than our baseline SVM? How "easy"/"hard" was it to get to work well? What might you also consider (but perhaps didn't have time) to try to improve it?

**Example of something you could have discussed:   We found that randomized search provides an easier means of fine tuning the hyperparameters of neural nets compared to grid search. Because there are so many potential combinations of hyper parameters, using a random search can cover greater "area" in the search than grid search, specifically the ability to search over a distribution rather than sets of values can discover optima outside of the standard "grid". Hand tuning the network was difficult because having an intuition for a good size for the network only really came after running a few times, similarly some choices we made didn't seem to have a clear impact, which makes it a little frustrating to feel like you're backtracking.**
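
The claim that random search covers more "area" can be illustrated with a toy comparison. This sketch assumes a budget of 9 trials over two hypothetical hyperparameters and counts how many distinct dropout values each strategy ends up trying:

```python
import random
from itertools import product

rng = random.Random(572)
n_trials = 9

# Grid search: 3 dropout values x 3 weight-decay values = 9 trials
grid_trials = list(product([0.1, 0.5, 0.9], [1e-6, 1e-4, 1e-2]))

# Random search: 9 trials, each drawing both hyperparameters afresh
random_trials = [
    (rng.uniform(0.0, 1.0), 10 ** rng.uniform(-6, -2)) for _ in range(n_trials)
]

grid_dropouts = {dropout for dropout, _ in grid_trials}
random_dropouts = {dropout for dropout, _ in random_trials}
print(len(grid_dropouts), len(random_dropouts))
```

With the same budget, the grid only ever tests three dropout values, while random search tests a different one in every trial. This matters most when only one or two hyperparameters have a strong effect on the result.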

## Exercise 1: Tensor Operations (Pytorch Intro)

We'll primarily be using Pytorch for coding neural nets in this class (as well as COLX 531 and some others). This exercise gives an introduction to the basic operations that are performed on the tensors that Pytorch uses. Pytorch tensors are basically numpy arrays, with the advantage that they can be loaded onto a GPU to perform extremely fast parallel calculations. Remember the speedup from parallelizing code in DSCI 512? Deep learning utilizes this in spades!

### 1.1 Write code that creates a tensor, **X** of size $5 \times 5$ containing longs with values initialized to ones. 
rubric={accuracy:1}

In [5]:
# your code goes here
x = torch.tensor((),dtype=torch.long).new_ones((5,5))

### 1.2 Write code that takes the tensor, **X** (from the previous question 1.1) and sets the values along the diagonal to two.
rubric={accuracy:1}

In [6]:
# your code goes here

mask = torch.eye(5,5).bool()  #create a mask of the diagonal
x.masked_fill_(mask,2)

tensor([[2, 1, 1, 1, 1],
        [1, 2, 1, 1, 1],
        [1, 1, 2, 1, 1],
        [1, 1, 1, 2, 1],
        [1, 1, 1, 1, 2]])

### 1.3 Write code that takes the tensor, **X** (from the previous question 1.2), squares all the values in **X**, sums all the squared values in **X** and prints the square root of this sum (the L2-norm).
rubric={accuracy:1}

In [7]:
# your code goes here
#need to convert to a floating point type
x = x.double()
print(torch.sqrt(torch.sum(torch.mul(x,x))))


#to check:
print(torch.norm(x)) 


tensor(6.3246, dtype=torch.float64)
tensor(6.3246, dtype=torch.float64)
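
The printed value can also be checked by hand. A short sketch of the arithmetic, using only the standard library: the matrix has 5 diagonal entries equal to 2 and 20 off-diagonal entries equal to 1.

```python
import math

n = 5
diagonal_value, off_diagonal_value = 2, 1

n_diagonal = n               # 5 entries on the diagonal
n_off_diagonal = n * n - n   # the remaining 20 entries

sum_of_squares = n_diagonal * diagonal_value**2 + n_off_diagonal * off_diagonal_value**2
l2_norm = math.sqrt(sum_of_squares)
print(sum_of_squares, round(l2_norm, 4))
```

The sum of squares is 40, and its square root rounds to the 6.3246 printed above.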


## 1.4 Given the following two tensors, **X** $\in \mathcal{R}^{4\times4}$ and $\textbf{Y} \in \mathcal{R}^{4\times4}$

In [7]:
torch.manual_seed(manual_seed)

X = torch.rand(4,4)
print(X)
Y = torch.rand(4,4)
print(Y)

tensor([[0.9786, 0.3998, 0.7621, 0.0330],
        [0.4713, 0.0497, 0.1247, 0.4957],
        [0.5379, 0.8330, 0.0382, 0.4521],
        [0.4375, 0.9377, 0.5235, 0.5487]])
tensor([[0.5677, 0.9688, 0.5192, 0.4743],
        [0.6601, 0.4801, 0.5510, 0.7869],
        [0.2726, 0.8603, 0.2272, 0.2190],
        [0.4126, 0.3936, 0.4801, 0.9458]])


### 1.4.1 Write code that performs standard matrix multiplication, multiply **X** and **Y** without changing their values and prints the result.
rubric={accuracy:1}

In [8]:
# your code goes here

print(torch.matmul(X,Y))
#or...
print(X@Y)

tensor([[1.0409, 1.8087, 0.9173, 0.9769],
        [0.5389, 0.7828, 0.5384, 0.7588],
        [1.0522, 1.1318, 0.9640, 1.3466],
        [1.2365, 1.5404, 1.1262, 1.5790]])
tensor([[1.0409, 1.8087, 0.9173, 0.9769],
        [0.5389, 0.7828, 0.5384, 0.7588],
        [1.0522, 1.1318, 0.9640, 1.3466],
        [1.2365, 1.5404, 1.1262, 1.5790]])


### 1.4.2 Write code that performs standard addition of two matrices, add **X** and **Y** without changing their values and prints the result.
rubric={accuracy:1}

In [9]:
# your code goes here
print(X + Y)


tensor([[1.5463, 1.3686, 1.2813, 0.5073],
        [1.1314, 0.5298, 0.6757, 1.2826],
        [0.8105, 1.6933, 0.2654, 0.6711],
        [0.8501, 1.3313, 1.0036, 1.4945]])


### 1.4.3 Write code that subtracts matrix **Y** from **X** without changing their values and prints the result.
rubric={accuracy:1}

In [10]:
# your code goes here

print(X - Y)

tensor([[ 0.4109, -0.5690,  0.2429, -0.4413],
        [-0.1889, -0.4304, -0.4264, -0.2912],
        [ 0.2652, -0.0274, -0.1889,  0.2331],
        [ 0.0249,  0.5441,  0.0434, -0.3971]])


### 1.4.4 Write code that performs standard matrix multiplication, multiply **X** and **Y** and placing the results directly in **X** (modifying **X**) and prints the result.
rubric={accuracy:1}

In [11]:
# your code goes here

#note: matmul is usually not an in-place operation (think about why that is), so we rebind the name X to the result instead
print(X)
X = torch.matmul(X,Y)
print(X)

tensor([[0.9786, 0.3998, 0.7621, 0.0330],
        [0.4713, 0.0497, 0.1247, 0.4957],
        [0.5379, 0.8330, 0.0382, 0.4521],
        [0.4375, 0.9377, 0.5235, 0.5487]])
tensor([[1.0409, 1.8087, 0.9173, 0.9769],
        [0.5389, 0.7828, 0.5384, 0.7588],
        [1.0522, 1.1318, 0.9640, 1.3466],
        [1.2365, 1.5404, 1.1262, 1.5790]])
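
One way to see why matrix multiplication is not normally done in place: every output entry in a row needs the whole original row of **X**, so overwriting entries as they are computed corrupts the later ones. The sketch below uses plain nested lists rather than tensors, and `matmul_inplace_naive` is a deliberately wrong, hypothetical helper written only to show the problem:

```python
def matmul(a, b):
    """Standard matrix product that builds a new result matrix."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def matmul_inplace_naive(a, b):
    """Deliberately wrong: overwrites a[i][j] while row i is still needed (square matrices only)."""
    n = len(a)
    for i in range(n):
        for j in range(n):
            a[i][j] = sum(a[i][k] * b[k][j] for k in range(n))
    return a

X_small = [[1.0, 2.0], [3.0, 4.0]]
Y_small = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two columns of X_small

correct = matmul(X_small, Y_small)
naive = matmul_inplace_naive([row[:] for row in X_small], Y_small)
print(correct)
print(naive)
```

The naive version reuses entries it has already overwritten, so its second column is wrong. A correct implementation needs a temporary buffer for the result anyway, which is why assigning the result back to the name `X` is the usual approach.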


## 1.5 Given the following tensor, **X** $\in \mathcal{R}^{5\times3}$

In [12]:
torch.manual_seed(manual_seed)
X = torch.rand(5,3)
print(X)

tensor([[0.9786, 0.3998, 0.7621],
        [0.0330, 0.4713, 0.0497],
        [0.1247, 0.4957, 0.5379],
        [0.8330, 0.0382, 0.4521],
        [0.4375, 0.9377, 0.5235]])


### 1.5.1 Write code to print all the elements in the last row of **X**.
rubric={accuracy:1}

In [13]:
# your code goes here

print(X[4,:])

tensor([0.4375, 0.9377, 0.5235])


### 1.5.2 Write code to print all the elements in the middle column of **X**.
rubric={accuracy:1}

In [15]:
# your code goes here

print(X[:,1])

tensor([0.3998, 0.4713, 0.4957, 0.0382, 0.9377])


### 1.5.3 Write code to create a 3D tensor of size $1 \times 5 \times 3$ using the $5 \times 3$ values from **X** (unsqueeze operation)
rubric={accuracy:1}

(You'll often need to "squeeze" or "unsqueeze" tensors to make sure that the dimensions are correct for certain parts of your model.)

In [16]:
# your code goes here

y = X.unsqueeze(0)

### 1.5.4 Write code that converts the 3D tensor (created in the previous question 1.5.3) back into a 2D tensor (of size $5 \times 3$). (squeeze operation)
rubric={accuracy:1}

In [17]:
# your code goes here

y = y.squeeze(0)

## Exercise 2: Putting the "Multi" in Multilayer Perceptron

In T1 you did hyperparameter optimization on a 2-layer neural network. For this assignment, you'll be using the same dataset (20 Newsgroups), but you'll need to build a model that can support an arbitrary number of layers (rather than just two)! There are two tricks to help out with this. The first is *nn.Sequential*, which allows you to stack Pytorch modules together in a *cascade* fashion (*cascade* here meaning one thing passed to another, like a pipeline). The other trick is *nn.ModuleList*, which allows us to keep track of lists of modules that we can then iterate through to build our network and perform *forward* passes through it.

#### nn.Sequential example

In [11]:
#Make sure you run these each time before you (re)initialize the network or data!!! 
torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)


## Fake data for testing
x = torch.rand((5,10))


## Building the model.  
## Note: to use inside a larger model, set it as:  self.name_of_layer  = nn.Sequential(...)
## in the initialization function

example_model = nn.Sequential(
          nn.Linear(10,5),
          nn.ReLU(),
          nn.Softmax(dim=-1)
        )

## Forward pass:

output = example_model(x)

print(output)

tensor([[0.2510, 0.1498, 0.1444, 0.1669, 0.2878],
        [0.2776, 0.1573, 0.1573, 0.1692, 0.2386],
        [0.2137, 0.2039, 0.1484, 0.1878, 0.2462],
        [0.2488, 0.1647, 0.1614, 0.1614, 0.2638],
        [0.1998, 0.1814, 0.1532, 0.1532, 0.3123]], grad_fn=<SoftmaxBackward>)
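
The second trick, nn.ModuleList, can be sketched in a similar way. This is a minimal illustration rather than the reference solution: the class name `MLPList` and its default arguments are our own choices, and the hidden layers are applied in a loop inside `forward`.

```python
import torch
import torch.nn as nn

class MLPList(nn.Module):  # hypothetical name, chosen for this example
    def __init__(self, input_dim, hidden_units=40, n_hidden=5, output_classes=20, dropout=0.2):
        super().__init__()
        # Layer sizes: input_dim -> hidden_units -> ... -> hidden_units
        dims = [input_dim] + [hidden_units] * n_hidden
        # ModuleList registers each layer so its parameters are tracked by the model
        self.hidden = nn.ModuleList(
            [nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]
        )
        self.nonlin = nn.Tanh()
        self.dropout = nn.Dropout(p=dropout)
        self.output = nn.Linear(hidden_units, output_classes)

    def forward(self, X):
        # Iterate through the stored layers instead of naming each one
        for layer in self.hidden:
            X = self.dropout(self.nonlin(layer(X)))
        return torch.softmax(self.output(X), dim=-1)

torch.manual_seed(572)
model = MLPList(input_dim=10)
model.eval()  # switch dropout off for a deterministic forward pass
probs = model(torch.rand(5, 10))
print(probs.shape)
```

Because the number of hidden layers is just an argument, the same class covers a 2-layer or a 10-layer network without any extra code.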


### 2.1 A 5 layer MLP using nn.Sequential
rubric={accuracy:2, quality:2}

Using nn.Sequential build a network with the following parameters: 5 Hidden Layers, each with 40 hidden units, using a Tanh activation function between layers, 20% dropout on each layer, and finally a softmax output.

In [15]:
### My MLP Here!

class MLP_Seq(nn.Module): #feel free to change the inputs it takes
    def __init__(self, input_dim, hidden_units=20, output_classes=20, nonlin=nn.ReLU(), dropout=.5):  
        super(MLP_Seq, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim,hidden_units),
            nonlin,
            nn.Dropout(p=dropout),
            nn.Linear(hidden_units,hidden_units),
            nonlin,
            nn.Dropout(p=dropout),
            nn.Linear(hidden_units,hidden_units),
            nonlin,
            nn.Dropout(p=dropout),
            nn.Linear(hidden_units,hidden_units),
            nonlin,
            nn.Dropout(p=dropout),
            nn.Linear(hidden_units,hidden_units),
            nonlin,
            nn.Dropout(p=dropout),
            nn.Linear(hidden_units,output_classes),
            nn.Softmax(dim=-1)
        )

    def forward(self, X, **kwargs):
        return self.layers(X)


### 2.2 Training and testing our 5 layer MLP
rubric={accuracy:1, quality:1}

Now train the network on the 20 Newsgroups training data, and report the test set accuracy.

In [16]:
### My MLP training / testing code!
from sklearn import metrics
from skorch import NeuralNetClassifier

torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)

net = NeuralNetClassifier(
    MLP_Seq(input_dim=X_train.shape[1], hidden_units=40,nonlin=nn.Tanh(),dropout=.2),
    max_epochs=10,
    lr=0.1,
    optimizer=torch.optim.Adam,   
    optimizer__weight_decay=0.001,  #roughly equivalent to L2 regularization
    iterator_train__shuffle=True,
    device=device  #'cpu' or 'cuda'
)

net.fit(torch.from_numpy(X_train.todense()).float(), torch.tensor(y_train,dtype=torch.long))

pred = net.predict(torch.from_numpy(X_test.todense()).float())
acc = metrics.accuracy_score(y_test, pred)
print(acc)


  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        3.3822       0.0521        3.2230  4.9509
      2        3.2460       0.0517        3.1112  4.0364
      3        3.2678       0.0526        3.1324  4.0514
      4        3.2426       0.0499        3.2169  3.9956
      5        3.2405       0.0614        3.2076  3.9814
      6        3.2279       0.0411        3.3399  4.0851
      7        3.2493       0.0521        3.2356  3.9920
      8        3.2767       0.0526        3.2047  4.0560
      9        3.3164       0.0526        3.2044  3.9200
     10        3.2562       0.0521        3.1390  3.9988
0.052177376526818905


If you look at this and think it's garbage (about 5% accuracy is roughly chance level for 20 classes), think back to T1: we had to tune the hyperparameters of the network to get any good performance, and we haven't done that at all with this model! We are also introducing 9 new hyperparameter choices: 3 for the number of hidden units in the 3 new layers, 3 for their activation functions, and 3 for their dropout rates. Even if we only considered 2 values for each of these, a grid search would take 2^9 = 512 times longer! You could, however, still do a randomized search and expect to find a reasonably good configuration.
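
To make the 2^9 figure concrete, here is a minimal sketch that counts the grid and, for comparison, samples a fixed budget of random configurations. The parameter names, candidate values, and the budget of 10 are all hypothetical and chosen only for illustration; no model is trained.

```python
import itertools
import random

# Hypothetical search space: two candidate values for each of the 9 new choices
# (hidden units, activation and dropout for each of the 3 extra layers).
search_space = {}
for layer in (3, 4, 5):
    search_space[f"hidden_units_{layer}"] = [20, 40]
    search_space[f"activation_{layer}"] = ["tanh", "relu"]
    search_space[f"dropout_{layer}"] = [0.2, 0.5]

# A grid search has to try every combination.
grid = list(itertools.product(*search_space.values()))
print(len(grid))  # 2**9 = 512 configurations

# A randomized search only tries a fixed budget of sampled configurations.
rng = random.Random(572)
budget = 10  # hypothetical budget
random_configs = [
    {name: rng.choice(values) for name, values in search_space.items()}
    for _ in range(budget)
]
print(len(random_configs))
```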

In practice, major companies (Google, Facebook, etc.) often publish results with known-good hyperparameter configurations for common models (e.g. the Transformer). Other researchers often use these configurations "as is" to save the enormous amount of time that neural architecture search would take.

#### nn.ModuleList example


In [12]:

#Make sure you run these each time before you (re)initialize the network or data!!! 
torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)


## Fake data for testing
x = torch.rand((5,10))

layers = nn.ModuleList()   ## Again you'll need to assign as: self.layers = nn.ModuleList()   in the initialization function
## Two layers
for i in range(2):
    layers.append(nn.Linear(10,10))
    layers.append(nn.ReLU())
layers.append(nn.Linear(10,5))  #output layer
layers.append(nn.Softmax(dim=-1))

## Forward pass:
for layer in layers:
    x = layer(x)

print(x)

tensor([[0.1650, 0.2360, 0.2021, 0.2144, 0.1825],
        [0.1614, 0.2381, 0.2030, 0.2141, 0.1834],
        [0.1638, 0.2382, 0.2023, 0.2146, 0.1811],
        [0.1626, 0.2370, 0.2024, 0.2158, 0.1822],
        [0.1650, 0.2403, 0.2015, 0.2112, 0.1820]], grad_fn=<SoftmaxBackward>)


### 2.3 Arbitrary depth MLP using nn.ModuleList()
rubric={accuracy:2, quality:2}

Use nn.ModuleList to build an MLP class that takes the number of layers, input/output dimensions, hidden units, activation function, and dropout percentage as inputs and builds the corresponding network. The output should pass through the appropriate layers to a final softmax.

In [18]:
### My MLP Here!
class MLP_Mod(nn.Module): 
    def __init__(self, input_dim,num_layers=5, hidden_units=20, output_classes=20, nonlin=nn.ReLU(), dropout=.5):  
        super(MLP_Mod, self).__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            if i==0:
                self.layers.append(nn.Linear(input_dim,hidden_units))
            else:
                self.layers.append(nn.Linear(hidden_units,hidden_units))
            self.layers.append(nonlin)
            self.layers.append(nn.Dropout(p=dropout))
        self.layers.append(nn.Linear(hidden_units,output_classes))
        self.layers.append(nn.Softmax(dim=-1))

    def forward(self, X, **kwargs):
        for layer in self.layers:
            X = layer(X)
        return X
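
As a point of comparison, the same arbitrary-depth idea can also be sketched by collecting the modules in a plain Python list and unpacking it into nn.Sequential. The helper name `build_mlp` and its `nonlin_cls` argument are hypothetical (not part of the assignment); the fake input below only checks the output shape and that each row is a probability distribution.

```python
import torch
import torch.nn as nn

def build_mlp(input_dim, num_layers, hidden_units, output_classes,
              nonlin_cls=nn.ReLU, dropout=0.5):
    """Hypothetical helper: num_layers hidden blocks, then a softmax output."""
    layers = []
    in_features = input_dim
    for _ in range(num_layers):
        layers.append(nn.Linear(in_features, hidden_units))
        layers.append(nonlin_cls())           # fresh activation module per layer
        layers.append(nn.Dropout(p=dropout))
        in_features = hidden_units
    layers.append(nn.Linear(in_features, output_classes))
    layers.append(nn.Softmax(dim=-1))
    return nn.Sequential(*layers)

torch.manual_seed(572)
model = build_mlp(input_dim=10, num_layers=6, hidden_units=50,
                  output_classes=20, nonlin_cls=nn.Sigmoid, dropout=0.2)
model.eval()                                  # disable dropout for this check
probs = model(torch.rand(5, 10))
print(probs.shape)                            # one row of class probabilities per sample
print(probs.sum(dim=-1))                      # each row should sum to 1
```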


### 2.4 Training and testing a 6 layer MLP
rubric={accuracy:1, quality:1}

Using the class you built in 2.3, build and train a 6 layer MLP network on the 20 Newsgroups training data, and report the test set accuracy. You should set hidden units per layer to 50 units, dropout percent to 20%, and use sigmoid as your activation.

In [19]:
### My MLP training / testing code!

from sklearn import metrics
from skorch import NeuralNetClassifier

torch.manual_seed(manual_seed)
np.random.seed(manual_seed)
torch.cuda.manual_seed(manual_seed)

net = NeuralNetClassifier(
    MLP_Mod(input_dim=X_train.shape[1],num_layers=6, hidden_units=50,nonlin=nn.Sigmoid(),dropout=.2),
    max_epochs=10,
    lr=0.1,
    optimizer=torch.optim.Adam,   
    optimizer__weight_decay=0.001,  #roughly equivalent to L2 regularization
    iterator_train__shuffle=True,
    device=device  #'cpu' or 'cuda'
)

net.fit(torch.from_numpy(X_train.todense()).float(), torch.tensor(y_train,dtype=torch.long))

pred = net.predict(torch.from_numpy(X_test.todense()).float())
acc = metrics.accuracy_score(y_test, pred)
print(acc)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        3.0247       0.0526        2.9990  4.6299
      2        3.0026       0.0526        2.9967  4.2137
      3        3.0048       0.0521        3.0003  4.1292
      4        3.0082       0.0526        2.9969  4.2345
      5        3.0027       0.0521        3.0210  4.4798
      6        3.0129       0.0526        3.0039  4.4293
      7        3.0138       0.0521        3.0064  4.2616
      8        3.0046       0.0526        2.9961  4.2798
      9        3.0063       0.0486        3.0011  4.7234
     10        3.0046       0.0526        2.9971  4.3684
0.0524429102496017


Garbage yet again! The B-plot of this week's DSCI 572 episode is "Hyperparameter tuning is really important in deep learning!"

## Exercise 3: Very-Short answer questions

(Double-click each question block and place your answer at the end of the question) 

### 3.1 What is NumPy? What are the differences between PyTorch's Tensor and NumPy Array?
rubric={reasoning:1}

NumPy is a Python package for scientific computing that provides n-dimensional arrays along with linear algebra and other numerical routines. PyTorch tensors and NumPy arrays are conceptually very similar; the main practical differences are that PyTorch tensors can be placed on CUDA-capable devices (GPUs) and can track gradients through autograd.
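
A minimal sketch of how closely the two are related, assuming only that NumPy and PyTorch are installed (the variable names are illustrative): a CPU tensor created with `torch.from_numpy` shares memory with the array, and the tensor can additionally be moved to a GPU when one is present.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
t = torch.from_numpy(arr)      # CPU tensor that shares memory with arr
arr[0, 0] = 10.0               # a change made through NumPy...
print(t[0, 0].item())          # ...is visible through the tensor

back = t.numpy()               # and back to an ndarray (CPU tensors only)
print(type(back).__name__)

# Unlike an ndarray, a tensor can be moved to a GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_dev = t.to(device)
print(t_dev.device.type)
```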

### 3.2 What is the key difference between ``torch.LongTensor`` and ``torch.cuda.LongTensor``?
rubric={reasoning:1}

Both hold 64-bit integers; a ``torch.LongTensor`` is stored in CPU memory, while a ``torch.cuda.LongTensor`` has been loaded onto a GPU.
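
You can inspect this yourself with a short check like the one below (a sketch; the variable names are illustrative, and the GPU branch only runs if CUDA is available).

```python
import torch

cpu_ids = torch.tensor([1, 2, 3])      # integer data defaults to 64-bit integers
print(cpu_ids.dtype, cpu_ids.device)
print(cpu_ids.type())                  # type string of a CPU long tensor

if torch.cuda.is_available():
    gpu_ids = cpu_ids.to("cuda")       # same values and dtype, stored on the GPU
    print(gpu_ids.type())
```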

### 3.3 What is the default data type of a PyTorch tensor?
rubric={accuracy:1}

float32. You can check this yourself by creating a tensor (without data), as below.


In [24]:
x = torch.tensor(())
print(x.dtype)

torch.float32


### 3.4 What is ``autograd`` in PyTorch? How is it related to computational graph?
rubric={reasoning:1}

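Autograd allows PyTorch to automatically calculate the gradients of functions. As operations are applied to tensors that have `requires_grad=True`, PyTorch records them as nodes in a computational graph. Calling `backward()` walks this graph in reverse, applying the chain rule and storing the resulting gradients on the leaf tensors, which allows for fast computation of backpropagation to learn the weights.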
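
A tiny sketch of this on a scalar function (the function $y = x^2 + 2x$ is just an illustrative choice): the derivative is $2x + 2$, so at $x = 3$ we expect a gradient of 8.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)   # leaf node of the graph
y = x ** 2 + 2 * x                          # each operation is recorded
print(y.grad_fn)                            # the graph node that produced y

y.backward()                                # reverse traversal using the chain rule
print(x.grad)                               # dy/dx = 2*x + 2 = 8 at x = 3
```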

### 3.5 What is SGD? How is it different from Gradient Descent? Finally what role SGD plays in building machine learning models?
rubric={reasoning:1}

Stochastic Gradient Descent is a technique for training a machine learning model by sampling from the training set, calculating the loss on that sample, and slightly updating the weights of the model based on the gradient of that loss. This is the difference from (batch) Gradient Descent, which computes the gradient over the entire training set before every update. Generally in SGD you reduce the step size as you see more samples, eventually converging on a stable set of weights. It's an extremely useful tool for training many machine learning models, particularly neural networks.
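
The contrast can be sketched on a toy one-parameter regression. Everything here (the data, the learning rates, the decay schedule, and the number of steps) is an illustrative assumption rather than a recommended setting.

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]      # toy data generated with true weight 2

def grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * x * (w * x - y)

# Gradient descent: every update averages the gradient over the full training set.
w_gd = 0.0
for _ in range(50):
    full_grad = sum(grad(w_gd, x, y) for x, y in zip(xs, ys)) / len(xs)
    w_gd -= 0.05 * full_grad

# SGD: every update uses one randomly drawn sample, with a decaying step size.
rng = random.Random(572)
w_sgd = 0.0
for step in range(200):
    i = rng.randrange(len(xs))
    lr = 0.05 / (1.0 + 0.01 * step)
    w_sgd -= lr * grad(w_sgd, xs[i], ys[i])

print(w_gd, w_sgd)              # both should end up close to 2
```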

## (Not Graded)  Linearities, Non-linearities, Loss functions by Hand

This is a quick linear algebra review exercise. If you're a little rusty, or if you haven't taken a linear algebra class, this is a peek under the hood at the math that PyTorch is doing when it's calculating tensor operations. (See the separate solutions in the GitHub lab folder to check your work.)

Sample question:

In [15]:
linear_layer = torch.nn.Linear(5, 1)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.bias.data = torch.tensor([3]).float() # sets the bias value
model_out = linear_layer(torch.tensor([0, 10, 20, 15, 5]).float())
print(model_out)

tensor([168.], grad_fn=<AddBackward0>)


Compute the values in **model\_out** by hand. Show your work.

Sample answer: (write it in markdown, not as code. If you don't like markdown, you can write the steps on a piece of paper, take a photo, and attach the image in the answer block)

your answer goes here:

$model\_out = A x + b = [1, 2, 3, 4, 5] * [0, 10, 20, 15, 5] + 3 = (1*0 + 2*10 + 3*20 + 4*15 + 5*5) + 3 = 165 + 3 = 168 $


### 4.1

In [14]:
linear_layer = torch.nn.Linear(5, 1, bias=False)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
model_out = linear_layer(torch.tensor([0, 10, 20, 15, 5]).float())
print(model_out)

tensor([165.], grad_fn=<SqueezeBackward3>)


### Compute the values in **model\_out** by hand. Show your work.

your answer goes here (double-click this block to edit):

There is a small error in the example, which we can fix by transposing the $x$ vector from a row ($1 \times 5$) vector to a column ($5 \times 1$) vector. I'll also start using the convention of calling our weight matrix $W$; this is the same as the $A$ matrix in the example, but a more standard convention. Since this layer has no bias, there is no $b$ term.

$model\_out = W x^T = [1, 2, 3, 4, 5] * [0, 10, 20, 15, 5]^T = (1*0 + 2*10 + 3*20 + 4*15 + 5*5) = 165$




### 4.2

In [17]:
linear_layer = torch.nn.Linear(5, 2)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.weight.data[1] = torch.tensor([1, 0, 0, 0, 1]) # sets the weight value
linear_layer.bias.data = torch.tensor([1]).float() # sets the bias value
model_out = linear_layer(torch.tensor([0, 10, 20, 15, 1]).float())
sigmoid_out = torch.nn.Sigmoid()(model_out)
print(sigmoid_out)

tensor([1.0000, 0.8808], grad_fn=<SigmoidBackward>)


### Compute the values in **sigmoid\_out** by hand. Show your work.


your answer goes here:

$W$ is a $2 \times 5$ matrix with each row consisting of the weights of a particular node.

We've defined $x$ as a $1 \times 5$ row vector, which we'll again need to transpose to match dimensions with $W$.

$Wx^T$ has dimensions $2 \times 1$ after multiplication, which is a column vector, but PyTorch unfortunately adopts a row notation for convenience, so just think of the output as being transposed (in the next problem I'll show how we can rewrite things to better align with PyTorch).

$W x^T + b = [[1, 2, 3, 4, 5],[1,0,0,0,1]] * [0, 10, 20, 15, 1]^T + [1,1]^T $

$= [(1*0 + 2*10 + 3*20 + 4*15 + 5*1) + 1,(1*0+0*10+0*20+0*15+1*1)+1]^T = [146,2]^T $  

now take the sigmoid of this (applying it element-wise)

$sigmoid([146,2]^T) = [\frac{1}{1+exp(-146)},\frac{1}{1+exp(-2)}]^T = [1.0,0.88]^T$
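
The hand computation above can be double-checked with plain Python. This is only a sketch that mirrors the arithmetic; it does not use PyTorch, and the variable names are chosen to match the notation in the answer.

```python
import math

W = [[1, 2, 3, 4, 5], [1, 0, 0, 0, 1]]   # one row of weights per output node
x = [0, 10, 20, 15, 1]
b = 1                                     # the same bias is added to both nodes

# Linear layer: dot product of each weight row with x, plus the bias.
model_out = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b for row in W]

# Element-wise sigmoid.
sigmoid_out = [1.0 / (1.0 + math.exp(-z)) for z in model_out]

print(model_out)
print([round(s, 4) for s in sigmoid_out])
```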


### 4.3

In [18]:
linear_layer = torch.nn.Linear(5, 2)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.weight.data[1] = torch.tensor([1, 3, 0, 0, 10]) # sets the weight value
linear_layer.bias.data = torch.tensor([3]).float() # sets the bias value
model_out = linear_layer(torch.tensor([[100, 10, 20, 15, 1], [10, 5, 2, 1, 0]]).float())
softmax_out = torch.nn.Softmax(dim=1)(model_out)
print(softmax_out)

tensor([[1.0000, 0.0000],
        [0.9933, 0.0067]], grad_fn=<SoftmaxBackward>)


### Compute the values in **softmax\_out** by hand. Show your work.


your answer goes here:

For convenience, with batches I personally prefer using a row notation for the matrix $X$. This happens to align with how PyTorch does things, but if you ended up getting your output transposed, just note the convention difference. For the row-matrix form of $X$ to work, the equation changes slightly: instead of $Wx^T + b$ I'll use $XW^T + b$.

Here $X$ is a $2 \times 5$ matrix whose rows are the training samples, and $W^T$ is a $5 \times 2$ matrix with the weights for each node now transposed into its columns.

I'll also use a $2 \times 2$ matrix of ones to deal with the bias being added (since it is the same for each node).

Our output will thus be a $2 \times 2$ matrix, with each row consisting of the softmax applied to a particular training sample.

$XW^T + b = [[100, 10, 20, 15, 1], [10, 5, 2, 1, 0]]\times [[1,2,3,4,5],[1,3,0,0,10]]^T + 3 \times OnesMatrix$

$= [[ (100*1+10*2 + 20*3 + 15*4 + 1*5 +3), (100*1 + 10*3 + 20*0 + 15 * 0 + 1*10  +3)],[(10*1+5*2+2*3+1*4+0*5 +3), (10*1+5*3+2*0+1*0+0*10  +3)]]$

$ = [[248,143],[33,28]]$

Now apply softmax to each row:
$softmax([[248,143],[33,28]])  = [[\frac{exp(248)}{exp(248)+exp(143)}, \frac{exp(143)}{exp(248)+exp(143)}],[\frac{exp(33)}{exp(33)+exp(28)},\frac{exp(28)}{exp(33)+exp(28)}]] = [[1.0,0.0],[0.9933,0.0067]]$
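
As a sanity check on the arithmetic, here is a plain-Python sketch of the same computation. Subtracting the row maximum before exponentiating is an added numerical-stability step that does not change the softmax result; the helper name `softmax` is illustrative.

```python
import math

X = [[100, 10, 20, 15, 1], [10, 5, 2, 1, 0]]   # one row per sample
W = [[1, 2, 3, 4, 5], [1, 3, 0, 0, 10]]        # one row of weights per node
b = 3

def softmax(row):
    """Softmax of one row, shifted by its maximum for numerical stability."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# X W^T + b, computed entry by entry.
model_out = [[sum(x_i * w_i for x_i, w_i in zip(x, w)) + b for w in W] for x in X]
softmax_out = [softmax(row) for row in model_out]

print(model_out)
print([[round(p, 4) for p in row] for row in softmax_out])
```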



### 4.4

In [19]:
linear_layer = torch.nn.Linear(5, 2)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.weight.data[1] = torch.tensor([1, 3, 0, 0, 10]) # sets the weight value
linear_layer.bias.data = torch.tensor([3]).float() # sets the bias value
model_out = linear_layer(torch.tensor([[100, 10, 20, 15, 1], [10, 5, 2, 1, 0]]).float())
criterion = torch.nn.MSELoss()
loss = criterion(model_out, torch.tensor([[245, 140], [30, 30]]).float())
print(loss)

tensor(7.7500, grad_fn=<MseLossBackward>)


your answer goes here:

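Same idea as 4.3, but now we'll apply a loss function, the Mean Squared Error, defined as $\frac{1}{n}\sum_i^n(\tilde{y_i} -y_i)^2$, where $\tilde{y_i}$ is a predicted value and $y_i$ is the corresponding target. By default, PyTorch's `MSELoss` averages over every element of the output, so here $n = 4$ (2 samples with 2 outputs each); because every sample has the same number of outputs, this equals the average of the per-sample means.

First let's compute $model\_out$ for our inputs:

$XW^T + b  = [[100,10,20,15,1],[10,5,2,1,0]] \times [[1,2,3,4,5],[1,3,0,0,10]]^T  + 3 \times OnesMatrix$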

$= [[(100*1+10*2+20*3+15*4+1*5+3),(100*1+10*3+20*0+15*0+1*10+3)],[(10*1+5*2+2*3+1*4+0*5+3),(10*1+5*3+2*0+1*0+0*10+3)]] $

$= [[248,143],[33,28]]$

Now let's apply the loss:
$\frac{1}{n}\sum_i^n(\tilde{y_i} -y_i)^2$

$\frac{1}{2}(mean([(248-245),(143-140)]^2) + mean([(33-30),(28-30)]^2)) = \frac{1}{2} (\frac{9+9}{2} +\frac{9+4}{2}) = 7.75$
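
The same number falls out of a short plain-Python check that averages the squared error over all four output elements (a sketch of the arithmetic only; the variable names are illustrative).

```python
model_out = [[248, 143], [33, 28]]
target = [[245, 140], [30, 30]]

# Squared error for every element of the 2 x 2 output.
squared_errors = [
    (p - t) ** 2
    for p_row, t_row in zip(model_out, target)
    for p, t in zip(p_row, t_row)
]

# Mean over all elements.
loss = sum(squared_errors) / len(squared_errors)
print(squared_errors, loss)
```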
