##  Homework 1 - Supervised Learning II - MDS Computational Linguistics 

###  Assignment Topics 
- Operations on Tensor
- Linearities, Non-linearities and Loss functions 
- Linear Regression on single input feature
- Linear Regression on multiple input features
- Very-short answer questions

###  Software Requirements 
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Matplotlib (>=3.1.2)
- Jupyter (latest)

###  Submission Info. 
- Due Date: January 18, 2020, 18:00:00 (Vancouver time)

##  Getting Started 

In [4]:
# all necessary imports
import numpy as np
import random
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

# set the seed (allows reproducibility of the results)
manual_seed = 123
random.seed(manual_seed) # reproducible results from Python's random module
np.random.seed(manual_seed) # reproducible results from NumPy's random generation
torch.manual_seed(manual_seed) # reproducible results from PyTorch's random generation on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use the GPU if one is available, otherwise the CPU
n_gpu = torch.cuda.device_count() # number of GPUs in this system
if n_gpu > 0:
  torch.cuda.manual_seed(manual_seed) # reproducible results from PyTorch's random generation on the GPU

##  Tidy Submission 

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

##  Exercise 1: Operations on Tensor 

### 1.1 Write code that creates a tensor, **X** of size $5 \times 5$ containing longs with values initialized to ones. 
rubric={accuracy:1}

In [5]:
# your code goes here
x = torch.ones(5, 5, dtype=torch.long)  # 5x5 tensor of 64-bit integer ones

### 1.2 Write code that takes the tensor, **X** (from the previous question 1.1) and sets the values along the diagonal to two.
rubric={accuracy:1}

In [6]:
# your code goes here

mask = torch.eye(5,5).bool()  #create a mask of the diagonal
x.masked_fill_(mask,2)

tensor([[2, 1, 1, 1, 1],
        [1, 2, 1, 1, 1],
        [1, 1, 2, 1, 1],
        [1, 1, 1, 2, 1],
        [1, 1, 1, 1, 2]])
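
As an alternative to the mask, and assuming PyTorch 1.2 or later, the in-place `Tensor.fill_diagonal_` method should give the same result. A minimal sketch (the names `x_alt` and `x_mask` are illustrative and not part of the assignment):

```python
import torch

# Rebuild the 5x5 tensor of long ones from question 1.1.
x_alt = torch.ones(5, 5, dtype=torch.long)

# fill_diagonal_ writes the value onto the main diagonal in place.
x_alt.fill_diagonal_(2)

# Compare against the mask-based approach used above.
x_mask = torch.ones(5, 5, dtype=torch.long)
x_mask.masked_fill_(torch.eye(5, 5).bool(), 2)

print(torch.equal(x_alt, x_mask))
```

Both approaches modify the tensor in place; the mask version generalizes more easily to patterns other than the diagonal.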

### 1.3 Write code that takes the tensor, **X** (from the previous question 1.2), squares all the values in **X**, sums all the squared values in **X** and prints the square root of this sum (the L2-norm).
rubric={accuracy:1}

In [7]:
# your code goes here
#need to convert to a floating point type
x = x.double()
print(torch.norm(x))

print(torch.sqrt(torch.sum(torch.mul(x,x))))


tensor(6.3246, dtype=torch.float64)
tensor(6.3246, dtype=torch.float64)


## 1.4 Given the following two tensors, **X** $\in \mathbb{R}^{4\times4}$ and $\textbf{Y} \in \mathbb{R}^{4\times4}$

In [8]:
X = torch.rand(4,4)
print(X)
Y = torch.rand(4,4)
print(Y)

tensor([[0.2961, 0.5166, 0.2517, 0.6886],
        [0.0740, 0.8665, 0.1366, 0.1025],
        [0.1841, 0.7264, 0.3153, 0.6871],
        [0.0756, 0.1966, 0.3164, 0.4017]])
tensor([[0.1186, 0.8274, 0.3821, 0.6605],
        [0.8536, 0.5932, 0.6367, 0.9826],
        [0.2745, 0.6584, 0.2775, 0.8573],
        [0.8993, 0.0390, 0.9268, 0.7388]])


### 1.4.1 Write code that performs standard matrix multiplication, multiply **X** and **Y** without changing their values and prints the result.
rubric={accuracy:1}

In [9]:
# your code goes here

print(torch.matmul(X,Y))
#or...
print(X@Y)

tensor([[1.1644, 0.7440, 1.1501, 1.4276],
        [0.8781, 0.6691, 0.7129, 1.0931],
        [1.3464, 0.8175, 1.2572, 1.6133],
        [0.6250, 0.4032, 0.6143, 0.8112]])
tensor([[1.1644, 0.7440, 1.1501, 1.4276],
        [0.8781, 0.6691, 0.7129, 1.0931],
        [1.3464, 0.8175, 1.2572, 1.6133],
        [0.6250, 0.4032, 0.6143, 0.8112]])


### 1.4.2 Write code that performs standard addition of two matrices, add **X** and **Y** without changing their values and prints the result.
rubric={accuracy:1}

In [10]:
# your code goes here
print(X + Y)


tensor([[0.4147, 1.3440, 0.6338, 1.3491],
        [0.9275, 1.4597, 0.7733, 1.0851],
        [0.4586, 1.3848, 0.5928, 1.5444],
        [0.9750, 0.2357, 1.2432, 1.1405]])


### 1.4.3 Write code that subtracts matrix **Y** from **X** without changing their values and prints the result.
rubric={accuracy:1}

In [11]:
# your code goes here

print(X - Y)

tensor([[ 0.1775, -0.3108, -0.1304,  0.0281],
        [-0.7796,  0.2734, -0.5001, -0.8802],
        [-0.0904,  0.0681,  0.0377, -0.1702],
        [-0.8237,  0.1576, -0.6104, -0.3370]])


### 1.4.4 Write code that performs standard matrix multiplication, multiply **X** and **Y** and placing the results directly in **X** (modifying **X**) and prints the result.
rubric={accuracy:1}

In [12]:
# your code goes here
print(X)
X = torch.matmul(X,Y)
print(X)

tensor([[0.2961, 0.5166, 0.2517, 0.6886],
        [0.0740, 0.8665, 0.1366, 0.1025],
        [0.1841, 0.7264, 0.3153, 0.6871],
        [0.0756, 0.1966, 0.3164, 0.4017]])
tensor([[1.1644, 0.7440, 1.1501, 1.4276],
        [0.8781, 0.6691, 0.7129, 1.0931],
        [1.3464, 0.8175, 1.2572, 1.6133],
        [0.6250, 0.4032, 0.6143, 0.8112]])
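
An assignment of the form `X = torch.matmul(X, Y)` rebinds the name `X` to a newly allocated tensor rather than overwriting the original storage. If a strict in-place update is wanted, one option is to copy the product back with `copy_`. The sketch below uses hypothetical stand-ins `A` and `B` for **X** and **Y**, and checks the storage address with `data_ptr()`:

```python
import torch

A = torch.rand(4, 4)
B = torch.rand(4, 4)
expected = A @ B            # computed before A is touched
ptr_before = A.data_ptr()   # address of A's underlying storage

# copy_ writes the product into A's existing storage (a true in-place update).
A.copy_(A @ B)

print(torch.allclose(A, expected))
print(A.data_ptr() == ptr_before)
```

The product is fully computed into a temporary tensor before the copy, so reading from and writing to `A` do not interfere.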


## 1.5 Given the following tensor, **X** $\in \mathbb{R}^{5\times3}$

In [13]:
X = torch.rand(5,3)
print(X)

tensor([[0.7179, 0.7058, 0.9156],
        [0.4340, 0.0772, 0.3565],
        [0.1479, 0.5331, 0.4066],
        [0.2318, 0.4545, 0.9737],
        [0.4606, 0.5159, 0.4220]])


### 1.5.1 Write code to print all the elements in the last row of **X**.
rubric={accuracy:1}

In [14]:
# your code goes here

print(X[4,:])

tensor([0.4606, 0.5159, 0.4220])


### 1.5.2 Write code to print all the elements in the middle column of **X**.
rubric={accuracy:1}

In [15]:
# your code goes here

print(X[:,1])

tensor([0.7058, 0.0772, 0.5331, 0.4545, 0.5159])


### 1.5.3 Write code to create a 3D tensor of size $1 \times 5 \times 3$ using the $5 \times 3$ values from **X** (unsqueeze operation)
rubric={accuracy:1}

In [16]:
# your code goes here

y = X.unsqueeze(0)

### 1.5.4 Write code that converts the 3D tensor (created in the previous question 1.5.3) back into a 2D tensor (of size $5 \times 3$). (squeeze operation)
rubric={accuracy:1}

In [17]:
# your code goes here

y = y.squeeze(0)
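
Neither cell above prints anything, so a quick shape check can confirm the round trip. A small sketch, with `M` as a hypothetical stand-in for **X**:

```python
import torch

M = torch.rand(5, 3)     # stand-in for X
M3 = M.unsqueeze(0)      # add a leading dimension of size 1
M2 = M3.squeeze(0)       # remove that dimension again

print(M3.shape)
print(M2.shape)
print(torch.equal(M, M2))  # the values are unchanged by the round trip
```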

##  Exercise 2: Linearities, Non-linearities, Loss functions

Sample question:

In [18]:
linear_layer = torch.nn.Linear(5, 1)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.bias.data = torch.tensor([3]).float() # sets the bias value
model_out = linear_layer(torch.tensor([0, 10, 20, 15, 5]).float())
print(model_out)

tensor([168.], grad_fn=<AddBackward0>)


Compute the values in **model\_out** by hand. Show your work.

Sample answer: (Write it in Markdown, not as code. If you don't like Markdown, you can write the steps on a piece of paper, take a photo, and attach the image in the answer block.)

your answer goes here:

$model\_out = A x + b = [1, 2, 3, 4, 5] * [0, 10, 20, 15, 5] + 3 = (1*0 + 2*10 + 3*20 + 4*15 + 5*5) + 3 = 165 + 3 = 168 $


### 2.1

In [19]:
linear_layer = torch.nn.Linear(5, 1, bias=False)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
model_out = linear_layer(torch.tensor([0, 10, 20, 15, 5]).float())
print(model_out)

tensor([165.], grad_fn=<SqueezeBackward3>)


### Compute the values in **model\_out** by hand. Show your work.
rubric={accuracy:2}

your answer goes here (double-click this block to edit):

(There is a small error in the example, which we can fix by transposing the $x$ vector from a row ($1 \times 5$) vector to a column ($5 \times 1$) vector. I'll also start calling our weight matrix $W$; it is the same as the $A$ matrix in the example, but $W$ is the more standard convention.)

Since the layer is created with `bias=False`, there is no $b$ term:

$model\_out = W x^T = [1, 2, 3, 4, 5] * [0, 10, 20, 15, 5]^T = (1*0 + 2*10 + 3*20 + 4*15 + 5*5) = 165$



### 2.2

In [20]:
linear_layer = torch.nn.Linear(5, 2)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.weight.data[1] = torch.tensor([1, 0, 0, 0, 1]) # sets the weight value
linear_layer.bias.data = torch.tensor([1,1]).float() # sets the bias value
model_out = linear_layer(torch.tensor([0, 10, 20, 15, 1]).float())
sigmoid_out = torch.nn.Sigmoid()(model_out)
print(sigmoid_out)

tensor([1.0000, 0.8808], grad_fn=<SigmoidBackward>)


### Compute the values in **sigmoid\_out** by hand. Show your work.
rubric={accuracy:2}

your answer goes here:
$W$ is a $2 \times 5$ matrix with each row consisting of the weights of a particular node.

We've defined $x$ as a $1 \times 5$ row vector, which we'll again need to transpose to match dimensions with $W$.

After multiplication, $Wx^T$ should have dimensions $2 \times 1$, i.e. a column vector. PyTorch, however, returns the result as a row, so just think of the output as being transposed (in the next problem I'll show how we can rewrite things to better align with PyTorch).

$W x^T + b = [[1, 2, 3, 4, 5],[1,0,0,0,1]] * [0, 10, 20, 15, 1]^T + [1,1]^T $

$= [(1*0 + 2*10 + 3*20 + 4*15 + 5*1) + 1,(1*0+0*10+0*20+0*15+1*1)+1]^T = [146,2]^T $  

Now take the sigmoid of this, applying it element-wise:

$sigmoid([146,2]^T) = [\frac{1}{1+exp(-146)},\frac{1}{1+exp(-2)}]^T = [1.0,0.88]^T$
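
The hand computation can be replayed in plain Python as a sanity check. This is only a sketch; the names `W_rows`, `bias`, `inputs`, and `sigmoid` are illustrative:

```python
import math

W_rows = [[1, 2, 3, 4, 5], [1, 0, 0, 0, 1]]  # one row of weights per output node
bias = [1, 1]
inputs = [0, 10, 20, 15, 1]

# One dot product per output node, plus that node's bias.
z = [sum(w * v for w, v in zip(row, inputs)) + b for row, b in zip(W_rows, bias)]

def sigmoid(v):
    """Logistic function 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

out = [sigmoid(v) for v in z]
print(z)
print([round(v, 4) for v in out])
```

The first output is not exactly 1 mathematically, but $exp(-146)$ is far too small to change the sum in floating point, so it prints as 1.0.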


### 2.3

In [21]:
linear_layer = torch.nn.Linear(5, 2)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.weight.data[1] = torch.tensor([1, 3, 0, 0, 10]) # sets the weight value
linear_layer.bias.data = torch.tensor([3]).float() # sets the bias value
model_out = linear_layer(torch.tensor([[100, 10, 20, 15, 1], [10, 5, 2, 1, 0]]).float())
softmax_out = torch.nn.Softmax(dim=1)(model_out)
print(softmax_out)

tensor([[1.0000, 0.0000],
        [0.9933, 0.0067]], grad_fn=<SoftmaxBackward>)


### Compute the values in **softmax\_out** by hand. Show your work.
rubric={accuracy:2}

your answer goes here:

For batches, I personally prefer a row notation for the matrix $X$, which happens to align with how PyTorch does things. If your output ended up transposed, just note the difference in convention. For the row form of $X$ to work, the equation changes slightly: instead of $Wx^T + b$ I'll use $XW^T + b$.

Here $X$ is a $2 \times 5$ matrix whose rows are the training samples, and $W^T$ is a $5 \times 2$ matrix whose columns are the weights of each node.

I'll also use a $2 \times 2$ matrix of ones to handle the bias being added (since it is the same for each node).

Our output will thus be a $2 \times 2$ matrix, with each row holding the softmax applied to a particular training sample.

$XW^T + b = [[100, 10, 20, 15, 1], [10, 5, 2, 1, 0]]\times [[1,2,3,4,5],[1,3,0,0,10]]^T + 3 \times OnesMatrix$

$= [[ (100*1+10*2 + 20*3 + 15*4 + 1*5 +3), (100*1 + 10*3 + 20*0 + 15 * 0 + 1*10  +3)],[(10*1+5*2+2*3+1*4+0*5 +3), (10*1+5*3+2*0+1*0+0*10  +3)]]$

$ = [[248,143],[33,28]]$

Now apply softmax to each row:
$softmax([[248,143],[33,28]])  = [[\frac{exp(248)}{exp(248)+exp(143)}, \frac{exp(143)}{exp(248)+exp(143)}],[\frac{exp(33)}{exp(33)+exp(28)},\frac{exp(28)}{exp(33)+exp(28)}]] = [[1.0,0.0],[0.9933,0.0067]]$
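
Evaluating $exp(248)$ directly is awkward (it overflows 32-bit floats), so a common trick is to subtract the row maximum before exponentiating, which leaves the softmax unchanged. The sketch below applies that trick in plain Python; the names `logits`, `softmax`, and `probs` are illustrative:

```python
import math

logits = [[248.0, 143.0], [33.0, 28.0]]

def softmax(row):
    """Softmax of one row, shifted by the row max for numerical stability."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

probs = [softmax(row) for row in logits]
print([[round(p, 4) for p in row] for row in probs])
```

After the shift, the first row becomes $[exp(0), exp(-105)]$ and the second $[exp(0), exp(-5)]$, which makes it easier to see why the results are roughly $[1, 0]$ and $[0.9933, 0.0067]$.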


### 2.4

In [22]:
linear_layer = torch.nn.Linear(5, 2)
linear_layer.weight.data[0] = torch.tensor([1, 2, 3, 4, 5]) # sets the weight value
linear_layer.weight.data[1] = torch.tensor([1, 3, 0, 0, 10]) # sets the weight value
linear_layer.bias.data = torch.tensor([3]).float() # sets the bias value
model_out = linear_layer(torch.tensor([[100, 10, 20, 15, 1], [10, 5, 2, 1, 0]]).float())
criterion = torch.nn.MSELoss()
loss = criterion(model_out, torch.tensor([[245, 140], [30, 30]]).float())
print(loss)

tensor(7.7500, grad_fn=<MseLossBackward>)


### Compute the values in **loss** by hand. Show your work.
rubric={accuracy:2}

your answer goes here:

Same idea as 2.3, but now we'll apply a loss function, the Mean Squared Error, defined as $\frac{1}{n}\sum_i^n(\tilde{y_i} -y_i)^2$ where $\tilde{y_i}$ is a predicted value, $y_i$ is the corresponding target, and $n$ is the number of values being averaged. With its default settings, `MSELoss` averages over every element of the output, so here $n = 4$ (two outputs for each of two samples).

First let's compute **model\_out** for our inputs:

$XW^T + b  = [[100,10,20,15,1],[10,5,2,1,0]] \times [[1,2,3,4,5],[1,3,0,0,10]]^T  + 3 \times OnesMatrix$

$= [[(100*1+10*2+20*3+15*4+1*5+3),(100*1+10*3+20*0+15*0+1*10+3)],[(10*1+5*2+2*3+1*4+0*5+3),(10*1+5*3+2*0+1*0+0*10+3)]] $

$= [[248,143],[33,28]]$

Now let's apply the loss:
$\frac{1}{n}\sum_i^n(\tilde{y_i} -y_i)^2$

$\frac{1}{2}\left(mean([(248-245),(143-140)]^2) + mean([(33-30),(28-30)]^2)\right) = \frac{1}{2} (\frac{9+9}{2} +\frac{9+4}{2}) = \frac{9+9+9+4}{4} = 7.75$


In [23]:
#in code:
y1 = torch.tensor([248.,143.])
y1_t = torch.tensor([245.,140.])
y2 = torch.tensor([33.,28.])
y2_t = torch.tensor([30.,30.])

print(((y1 - y1_t)**2).mean())
print(((y2 - y2_t)**2).mean())
print((9. + 6.5)/2)

tensor(9.)
tensor(6.5000)
7.75
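
Averaging the two per-sample means works here because both samples have the same number of outputs. As a sketch (the names `pred`, `target`, and `sq_err` are illustrative), the same number falls out of averaging all four squared errors at once, which is what `MSELoss` does with its default reduction:

```python
import torch

pred = torch.tensor([[248., 143.], [33., 28.]])
target = torch.tensor([[245., 140.], [30., 30.]])

sq_err = (pred - target) ** 2
per_element_mean = sq_err.sum() / sq_err.numel()  # (9 + 9 + 9 + 4) / 4
per_row_then_mean = sq_err.mean(dim=1).mean()     # mean of the per-sample means
builtin = torch.nn.MSELoss()(pred, target)        # default reduction is 'mean'

print(per_element_mean.item(), per_row_then_mean.item(), builtin.item())
```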


## Exercise 3: Very-Short answer questions 

(Double-click each question block and place your answer at the end of the question) 

### 3.1 What is NumPy? What are the differences between PyTorch's Tensor and NumPy?
rubric={reasoning:2}

NumPy is a Python scientific computing package built around n-dimensional arrays, with support for linear algebra, random number generation and other numerical routines. PyTorch tensors and NumPy arrays are conceptually very similar; the main differences are that PyTorch tensors can be loaded onto CUDA-capable devices (GPUs) and can track gradients through autograd, while NumPy arrays live on the CPU and have no built-in differentiation.
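
To illustrate how close the two are, here is a small sketch of converting between them on the CPU. The names `arr`, `t`, and `back` are illustrative; `torch.from_numpy` and `Tensor.numpy` share memory with the source rather than copying it:

```python
import numpy as np
import torch

arr = np.ones((2, 3), dtype=np.float32)
t = torch.from_numpy(arr)   # tensor view of the same memory (CPU only)

t.add_(1)                   # in-place change on the tensor...
print(arr)                  # ...is visible through the NumPy array

back = t.numpy()            # back to NumPy, again without copying
print(type(back), back.dtype)
```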

### 3.2 What is the key difference between ``torch.LongTensor`` and ``torch.cuda.LongTensor``?
rubric={reasoning:2}

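Both hold 64-bit integers; the difference is where the data lives. A ``torch.LongTensor`` is stored in CPU memory, while a ``torch.cuda.LongTensor`` has been loaded onto a GPU.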
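
A short sketch of how this can be inspected (the names `cpu_t` and `gpu_t` are illustrative; the GPU branch only runs if CUDA is available):

```python
import torch

cpu_t = torch.ones(2, 2, dtype=torch.long)
print(cpu_t.type(), cpu_t.device)      # type string and device of the CPU tensor

if torch.cuda.is_available():
    gpu_t = cpu_t.to("cuda")           # copy of the same data on the GPU
    print(gpu_t.type(), gpu_t.device)
```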

### 3.3 What is the default data type of a PyTorch tensor?
rubric={accuracy:1}

`float32`. You can check this yourself by creating an empty tensor (one with no data) and inspecting its `dtype`, as below.


In [24]:
x = torch.tensor(())
print(x.dtype)

torch.float32
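
The default can also be queried directly. A small sketch, assuming the default has not been changed with `torch.set_default_dtype` (the variable names are illustrative); note that the default applies to floating-point data, while Python integers are inferred as 64-bit integers:

```python
import torch

default_dtype = torch.get_default_dtype()
float_dtype = torch.tensor([1.5, 2.5]).dtype  # floats pick up the default dtype
int_dtype = torch.tensor([1, 2]).dtype        # integers are inferred as int64

print(default_dtype, float_dtype, int_dtype)
```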


### 3.4 What is ``autograd`` in PyTorch? How is it related to computational graph?
rubric={reasoning:2}

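Autograd is PyTorch's automatic differentiation engine. As operations are applied to tensors that have `requires_grad=True`, PyTorch records them in a computational graph that is built dynamically during the forward pass. Calling `backward()` on the result walks this graph in reverse, applies the chain rule at each node, and accumulates the gradients in the `.grad` attribute of the leaf tensors (e.g. the weights). This is what allows fast computation of backward propagation to learn the weights.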
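
A minimal sketch of this in action, using a toy function $\sum_i w_i^2$ whose gradient is $2w$ (the names `w` and `loss` are illustrative):

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

loss = (w ** 2).sum()   # the forward pass records the graph
print(loss.grad_fn)     # the graph node that produced `loss`

loss.backward()         # reverse pass through the recorded graph
print(w.grad)           # d(loss)/dw = 2 * w
```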

### 3.5 What is SGD? What role does SGD play in building machine learning models?
rubric={reasoning:2}

Stochastic Gradient Descent (SGD) is a technique for training a machine learning model by repeatedly sampling an example (or a small mini-batch) from the training set, calculating the loss on it, and slightly updating the weights of the model in the direction opposite to the gradient of that loss. Generally the step size (learning rate) is reduced as you see more samples, so the weights eventually converge on a stable set of values. It's an extremely useful tool in training many machine learning models, particularly neural networks.
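
As a rough sketch of the idea, the loop below fits a one-feature linear model with mini-batch SGD. Everything in it is an illustrative assumption rather than part of the assignment: the synthetic data ($y = 3x + 2$ plus noise), the batch size of 20, the fixed learning rate of 0.1, and the 50 epochs.

```python
import torch

torch.manual_seed(0)

# Synthetic data for illustration: y = 3x + 2 plus a little noise.
xs = torch.linspace(-1, 1, 200).unsqueeze(1)
ys = 3 * xs + 2 + 0.05 * torch.randn(200, 1)

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.MSELoss()

for epoch in range(50):
    perm = torch.randperm(200)            # reshuffle the samples every epoch
    for start in range(0, 200, 20):
        idx = perm[start:start + 20]      # one mini-batch of 20 samples
        loss = criterion(model(xs[idx]), ys[idx])
        optimizer.zero_grad()             # clear gradients from the last step
        loss.backward()                   # gradient of the mini-batch loss
        optimizer.step()                  # small step against the gradient

print(model.weight.item(), model.bias.item())
```

The learned weight and bias should end up close to 3 and 2. This sketch keeps the learning rate fixed for simplicity; a decaying schedule is a common refinement.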