This tutorial is adapted from [University of Washington CSE446](https://courses.cs.washington.edu/courses/cse446/19au/section9.html) and the [PyTorch Official Tutorials](https://pytorch.org/tutorials/).

- Automatic differentiation is a powerful tool
- PyTorch implements common functions used in deep learning
- Data Processing with PyTorch DataSet
- Mixed Precision Training in PyTorch (with an NVIDIA GPU)

**If you are short on time, go straight to this [cheatsheet](https://hackmd.io/@rh0jTfFDTO6SteMDq91tgg/HkDRHKLrU#RNN-Recurrent-Neural-Networks)**

# NumPy and PyTorch

> PyTorch is, roughly speaking, NumPy on the GPU.

1. A tensor is essentially a (multidimensional) array.
2. NumPy methods generally have PyTorch equivalents.

## Tensor = Array

In [4]:
import numpy as np
import torch


# create tensors
x_numpy = np.array([0.1,0.2,0.3])
x_torch = torch.tensor([0.1,0.2,0.3]) # list embedded


print('x_numpy, x_torch')
print(x_numpy, x_torch)

x_numpy, x_torch
[0.1 0.2 0.3] tensor([0.1000, 0.2000, 0.3000])


In [7]:
# convert between NumPy and PyTorch
print(torch.from_numpy(x_numpy)) # from numpy
print(x_torch.numpy())   # to numpy


tensor([0.1000, 0.2000, 0.3000], dtype=torch.float64)
[0.1 0.2 0.3]
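One detail worth knowing about the conversion above: `torch.from_numpy` returns a tensor that shares memory with the NumPy array, whereas `torch.tensor` copies the data. The following is a minimal sketch; the variable names `a`, `t`, and `c` are illustrative and not part of the original notebook.

```python
import numpy as np
import torch

a = np.array([0.1, 0.2, 0.3])
t = torch.from_numpy(a)   # shares memory with `a` and keeps dtype float64
c = torch.tensor(a)       # independent copy of the data

a[0] = 9.0                # modify the NumPy array in place
print(t)                  # the change is visible through `t`
print(c)                  # the copy is unaffected
```

If an independent tensor is needed, copying with `torch.tensor(a)` or `torch.from_numpy(a).clone()` avoids surprises from later in-place edits.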


## Simple manipulation

Many NumPy functions also exist in PyTorch.


In [13]:
print("norm")
print(np.linalg.norm(x_numpy), torch.norm(x_torch))


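norm
5.477225575051661 tensor(5.4772)

> Note: 5.4772 is $\sqrt{30}$, the norm of the 2×2 array `[[1,2],[3,4.]]` defined in the next cell (the cells were run out of order). For `[0.1, 0.2, 0.3]` the norm is about 0.3742.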


Some functions do the same thing but use different keyword names (for example `axis` in NumPy vs. `dim` in PyTorch).


In [12]:
print("mean along the 0th dimension")
x_numpy = np.array([[1,2],[3,4.]])
x_torch = torch.tensor([[1,2],[3,4.]])
print(np.mean(x_numpy, axis=0), torch.mean(x_torch, dim=0)) # dim vs axis


mean along the 0th dimension
[2. 3.] tensor([2., 3.])
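The same naming difference shows up in other reduction arguments. As a small sketch (the choice of `sum` here is only an example), NumPy spells the "keep the reduced dimension" flag `keepdims`, while PyTorch spells it `keepdim`:

```python
import numpy as np
import torch

x_np = np.array([[1, 2], [3, 4.]])
x_t = torch.tensor([[1, 2], [3, 4.]])

# Reduce over the 0th dimension but keep it as a size-1 dimension.
s_np = np.sum(x_np, axis=0, keepdims=True)   # NumPy: axis / keepdims
s_t = torch.sum(x_t, dim=0, keepdim=True)    # PyTorch: dim / keepdim

print(s_np.shape, s_t.shape)
```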


## Reshaping tensors

`torch.Tensor.view` plays the role of `numpy.ndarray.reshape`

NumPy version: **reshape**

In [95]:
X2 = np.random.rand(10000, 3, 28, 28)
print(X2.shape)

(10000, 3, 28, 28)


In [96]:
N, C = X2.shape[:2]   # batch size and number of channels
print(X2.reshape(N, C, 28**2).shape)

(10000, 3, 784)


PyTorch version: **view**

In [94]:
X = torch.randn((10000, 3, 28, 28))
print(X.shape)

torch.Size([10000, 3, 28, 28])


In [97]:
# reshape torch
print(X.view(10000,3,28**2).shape)   # use "view" built-in func

torch.Size([10000, 3, 784])


Keep in mind that `view` does not modify `X` itself; it returns a new tensor.

In [99]:
# To keep the new shape, assign the result back
X = X.view(10000, 3, 28**2)

> Inferring a dimension with `-1`

In [98]:
print(X.view(-1,3,784).shape)  # -1 lets PyTorch infer that dimension
        # at most one dimension may be -1

torch.Size([10000, 3, 784])
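Two properties of `view` are easy to miss: the returned tensor shares storage with the original, and `view` only works when the requested shape is compatible with the tensor's memory layout. The sketch below illustrates both on a tiny tensor; the names `v`, `t`, and `flat` are illustrative, and `reshape` is shown as the fallback that copies when a view is not possible.

```python
import torch

x = torch.arange(6)          # tensor([0, 1, 2, 3, 4, 5])
v = x.view(2, 3)             # a view: same underlying storage as x
v[0, 0] = 100                # writing through the view...
print(x[0])                  # ...changes x as well

t = v.t()                    # transposing makes the tensor non-contiguous
try:
    t.view(6)                # view cannot flatten this layout
    view_failed = False
except RuntimeError:
    view_failed = True

flat = t.reshape(6)          # reshape copies the data when a view is impossible
print(view_failed, flat)
```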


## Broadcasting in PyTorch

Two tensors are broadcastable if, starting at the trailing dimension and moving backward, each pair of dimension sizes is equal, one of them is 1, or one of them does not exist.

In [101]:
x=torch.empty(5,1,4,1)
y=torch.empty(  3,1,2)
print((x+y).size())

torch.Size([5, 3, 4, 2])
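The rule can be written out as a few lines of plain Python. This is only a sketch of the rule as stated above, not PyTorch's internal implementation; `broadcast_shape` is a hypothetical helper name.

```python
import torch

def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    result = []
    # Walk both shapes from the trailing dimension; a missing dimension counts as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"sizes {da} and {db} are not broadcastable")
        # If one side is 1, the other side determines the output size.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))

shape = broadcast_shape((5, 1, 4, 1), (3, 1, 2))
actual = tuple((torch.empty(5, 1, 4, 1) + torch.empty(3, 1, 2)).shape)
print(shape, actual)
```

Shapes such as `(2, 3)` and `(4,)` fail the rule at the trailing dimension (3 vs. 4), so the helper raises `ValueError` for them.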


# CUDA 

In [29]:
# copy a tensor between CPU and GPU
cpu = torch.device("cpu")
gpu = torch.device("cuda")   # moving data here requires a CUDA-capable GPU

In [30]:
x = torch.rand(10)
print(x)  # on the CPU by default

x = x.to(gpu)
print(x)

tensor([0.0569, 0.5231, 0.5170, 0.9659, 0.3097, 0.8913, 0.7250, 0.3659, 0.5064,
        0.8923])
tensor([0.0569, 0.5231, 0.5170, 0.9659, 0.3097, 0.8913, 0.7250, 0.3659, 0.5064,
        0.8923], device='cuda:0')
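The cell above assumes a GPU is present. A common device-agnostic pattern, sketched here under the assumption that falling back to the CPU is acceptable, is to pick the device once and move tensors to it:

```python
import torch

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(10).to(device)   # move the tensor to the chosen device
y = (x * 2).sum()               # the computation runs on that device

y_cpu = y.cpu()                 # bring the result back, e.g. before calling .numpy()
print(device, y_cpu.item())
```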


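# PyTorch Automatic Differentiation

Consider the function $f(x) = (x-2)^2$.    
Q: Compute the derivative $\frac{d}{dx} f(x)$, then evaluate it at $x = 1$, i.e. compute $f'(1)$.  

`backward()` **computes all the gradients of `y` at once** and stores them in the `.grad` attribute of each leaf tensor created with `requires_grad=True`.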

In [32]:
# define func

def f(x):
    return (x-2)**2

x = torch.tensor([1.0], requires_grad=True)

y = f(x)
y.backward()
print(x.grad)

tensor([-2.])
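The result can be cross-checked without autograd. The sketch below compares the autograd value with the hand-derived derivative $f'(x) = 2(x-2)$ and with a central finite difference; the step `h` and the use of `float64` are choices made here to keep the numerical check tight.

```python
import torch

def f(x):
    return (x - 2) ** 2

# Autograd value at x = 1
x = torch.tensor([1.0], dtype=torch.float64, requires_grad=True)
f(x).backward()
autograd_value = x.grad.item()

# Central finite difference on plain Python floats
x0, h = 1.0, 1e-4
numeric_value = (f(x0 + h) - f(x0 - h)) / (2 * h)

# Hand-derived derivative: f'(x) = 2(x - 2)
analytic_value = 2 * (x0 - 2)

print(autograd_value, numeric_value, analytic_value)
```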


## A multivariate, more complex function
Let $w = [w_1, w_2]^T$

Consider $g(w) = 2w_1w_2 + w_2\cos(w_1)$

Q: Compute $\nabla_w g(w)$ and verify $\nabla_w g([\pi,1]) = [2, 2\pi - 1]^T$

In [33]:
def g(w):
    return 2*w[0]*w[1] + w[1]*torch.cos(w[0])

w = torch.tensor([np.pi, 1], requires_grad=True) # the point at which to evaluate the gradient

z = g(w)
z.backward()  # computes all partial derivatives automatically

print(w.grad)

tensor([2.0000, 5.2832])
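For reference, the partial derivatives are $\partial g/\partial w_1 = 2w_2 - w_2\sin(w_1)$ and $\partial g/\partial w_2 = 2w_1 + \cos(w_1)$, which at $[\pi, 1]$ give $[2, 2\pi - 1] \approx [2, 5.2832]$. The sketch below checks this against autograd; `grad_g` is a hypothetical helper written for this comparison.

```python
import math
import torch

def g(w):
    return 2 * w[0] * w[1] + w[1] * torch.cos(w[0])

def grad_g(w1, w2):
    # dg/dw1 = 2*w2 - w2*sin(w1),  dg/dw2 = 2*w1 + cos(w1)
    return [2 * w2 - w2 * math.sin(w1), 2 * w1 + math.cos(w1)]

w = torch.tensor([math.pi, 1.0], dtype=torch.float64, requires_grad=True)
g(w).backward()

expected = torch.tensor(grad_g(math.pi, 1.0), dtype=torch.float64)
print(w.grad, expected)
```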


## PyTorch and gradient descent

Let $f$ be the same function we defined above.

Q: What is the value of $x$ that minimizes $f$?

In [37]:
def f(x):
    return (x-2)**2

# starting point; x must be a tensor that tracks gradients
x = torch.tensor([5.0], requires_grad=True)   # start at x = 5

step_size = 0.25

print('iter, \tx, \tf(x),  \tf\'(x) pytorch') # header row

# take 15 gradient descent steps
for i in range(15):
    y = f(x)
    y.backward() # compute the gradient and store it in x.grad

    # print with 3 decimal places
    print('{}, \t{:.3f}, \t{:.3f}, \t{:.3f}'.format(i, x.item(), f(x).item(),  x.grad.item()))

    # update x.data rather than x, so the update itself is not tracked by autograd
    x.data = x.data - step_size * x.grad # perform a GD update step
                                # without autograd, x.grad would have to be derived by hand

    # zero the grad variable each iteration since the backward()
    # call accumulates the gradients in .grad instead of overwriting.
    # The detach_() is for efficiency.
    x.grad.detach_()
    x.grad.zero_()
   

iter, 	x, 	f(x),  	f'(x) pytorch
0, 	5.000, 	9.000, 	6.000
1, 	3.500, 	2.250, 	3.000
2, 	2.750, 	0.562, 	1.500
3, 	2.375, 	0.141, 	0.750
4, 	2.188, 	0.035, 	0.375
5, 	2.094, 	0.009, 	0.188
6, 	2.047, 	0.002, 	0.094
7, 	2.023, 	0.001, 	0.047
8, 	2.012, 	0.000, 	0.023
9, 	2.006, 	0.000, 	0.012
10, 	2.003, 	0.000, 	0.006
11, 	2.001, 	0.000, 	0.003
12, 	2.001, 	0.000, 	0.001
13, 	2.000, 	0.000, 	0.001
14, 	2.000, 	0.000, 	0.000


In [59]:
# print the gradient buffer; it is 0 because zero_() was called at the end of the last iteration
print(x.grad.view(1).detach().numpy())

[0.]
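The reason for zeroing the gradient in every iteration is that `backward()` adds to `.grad` rather than overwriting it. A minimal sketch of that behavior, using the same function and starting point as above:

```python
import torch

x = torch.tensor([5.0], requires_grad=True)

((x - 2) ** 2).backward()
first = x.grad.clone()         # f'(5) = 2 * (5 - 2) = 6

((x - 2) ** 2).backward()      # a second call adds to the existing .grad
accumulated = x.grad.clone()   # 6 + 6 = 12

x.grad.zero_()                 # reset the buffer before the next step
print(first, accumulated, x.grad)
```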


In [39]:
# simplified version

# define the function
def f(x):
    return (x-2)**2

# define the tensor
x = torch.tensor([5.0], requires_grad=True) 

# step size
step_size = 0.25

# gradient descent loop
for i in range(15):
    y = f(x)
    y.backward()
    x.data = x.data - step_size * x.grad
    x.grad.detach_()
    x.grad.zero_()

print('x',x,'y',y) 

x tensor([2.0001], requires_grad=True) y tensor([3.3528e-08], grad_fn=<PowBackward0>)
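An alternative to assigning to `x.data` is to perform the update inside a `torch.no_grad()` block, which is the pattern current PyTorch documentation tends to recommend. The following sketch is a swapped-in variant of the loop above, not the notebook's original code; it should converge to the same point.

```python
import torch

def f(x):
    return (x - 2) ** 2

x = torch.tensor([5.0], requires_grad=True)
step_size = 0.25

for _ in range(15):
    y = f(x)
    y.backward()
    with torch.no_grad():          # do not record the update in the autograd graph
        x -= step_size * x.grad    # in-place gradient descent step
    x.grad.zero_()                 # clear the accumulated gradient

print(x)
```

With this step size each iteration halves the distance to the minimizer, since $x - 0.25 \cdot 2(x-2) = 2 + \tfrac{1}{2}(x-2)$.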


> At x = 2, y reaches its minimum value of 0.

# Linear Regression

Now apply the same procedure to a loss function.

In [41]:
# make a simple linear dataset

d = 2
n = 50
X = torch.randn(n, d)   # 2 vars

 # true parameters beta vector
true_beta = torch.tensor([[-1.0], [2.0]])   
        

y = X @ true_beta + torch.randn(n, 1) * 0.1  # add noise


# check
print('X shape', X.shape)
print('y shape', y.shape)
print('w shape', true_beta.shape)

X shape torch.Size([50, 2])
y shape torch.Size([50, 1])
w shape torch.Size([2, 1])


## Sanity check
The gradient of the RSS objective:

$$ \widehat{y} = X\beta$$

$$\nabla_\beta \mathcal{L}_{RSS}(\beta; X) = \nabla_\beta\frac{1}{n} ||y - \widehat{y}||_2^2  $$   
$$= -\frac{2}{n}X^T(y-\widehat{y})$$
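For readers who want the intermediate step, the gradient follows from expanding the squared norm with $\widehat{y} = X\beta$ and differentiating term by term (a standard derivation, written out here for convenience):

$$\frac{1}{n} ||y - X\beta||_2^2 = \frac{1}{n}\left(y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta\right)$$

$$\nabla_\beta \frac{1}{n} ||y - X\beta||_2^2 = \frac{1}{n}\left(-2X^Ty + 2X^TX\beta\right) = -\frac{2}{n}X^T(y - X\beta)$$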



In [83]:
# define a linear model
def model(X, beta):
    return X @ beta

# loss function: residual sum of squares, divided by n
def rss(y, y_hat):
    return torch.norm(y - y_hat)**2 / n

# set the starting point
beta = torch.tensor([[1.],[0]], requires_grad=True)
y_hat = model(X, beta)

# compute the gradient of rss
loss = rss(y, y_hat)
loss.backward()  # populates beta.grad

print('first gradient', beta.grad.view(2).numpy())

first gradient [ 4.251404 -4.610432]


> This is the gradient used in the first update.

#### Analytical Gradient

In [49]:
# closed-form gradient of the RSS objective (see the formula above)
def analytic_rss(X, y, beta):
    return -2 * X.t() @ (y - X @ beta) / n

beta = torch.tensor([[1.],[0]], requires_grad=True) # same starting point: [1, 0]

analytic_rss(X, y, beta).detach().view(2).numpy()

array([ 4.2514043, -4.610432 ], dtype=float32)

> This should match the gradient computed by PyTorch.

## Gradient Descent

We redefine all the functions here to show the whole process in one place.

In [80]:

# define the model that produces the estimate y_hat
def model(X, beta):
    return X @ beta



# define the loss function (rss in this case)
def rss(y, y_hat):
    return torch.norm(y - y_hat)**2 / n


# define the tensor of parameters to estimate
beta = torch.tensor([[1.],[0]], requires_grad=True)


############### GRADIENT DESCENT ###############
step_size = 0.1


# print the column headers
print('iter, \tloss, \tgrad, \tbeta')

# update loop: 20 iterations

for i in range(20):
    
    # recompute the prediction and the loss every iteration
    y_hat = model(X, beta)
    loss = rss(y, y_hat)

    
    
    # step 1: compute the gradient
    loss.backward() # populates beta.grad
    
    # step 2: update the parameters
    beta.data = beta.data - step_size * beta.grad 
    #print(beta.grad.numpy())
    
    # note: the beta printed here is the value after the update
    print('{},\t{:2f}, \t{}, \t{}'.format(i, loss.item(), beta.grad.numpy().transpose(), beta.view(2).detach().numpy().transpose()))
    
    # zero the gradient, since backward() accumulates
    # gradients instead of overwriting them
    
    beta.grad.detach_() # for efficiency
    beta.grad.zero_()
    

    

iter, 	loss, 	grad, 	beta
0,	8.897041, 	[[ 4.251404 -4.610432]], 	[0.5748596 0.4610432]
1,	5.399818, 	[[ 3.3413305 -3.5590112]], 	[0.24072656 0.81694436]
2,	3.280635, 	[[ 2.6282098 -2.745397 ]], 	[-0.02209443  1.0914841 ]
3,	1.996026, 	[[ 2.0690796 -2.1160963]], 	[-0.2290024  1.3030937]
4,	1.216986, 	[[ 1.6304004 -1.6296082]], 	[-0.39204246  1.4660544 ]
5,	0.744307, 	[[ 1.2859839 -1.2537354]], 	[-0.52064085  1.591428  ]
6,	0.457339, 	[[ 1.0153735  -0.96350896]], 	[-0.6221782  1.687779 ]
7,	0.282997, 	[[ 0.80258375 -0.73956835]], 	[-0.70243657  1.7617358 ]
8,	0.176992, 	[[ 0.63511825 -0.5669064 ]], 	[-0.7659484  1.8184265]
9,	0.112476, 	[[ 0.50320435 -0.43389428]], 	[-0.81626886  1.8618159 ]
10,	0.073165, 	[[ 0.39919543 -0.33152306]], 	[-0.8561884  1.8949683]
11,	0.049182, 	[[ 0.3171054 -0.2528169]], 	[-0.887899   1.9202499]
12,	0.034527, 	[[ 0.25224566 -0.1923753 ]], 	[-0.91312355  1.9394875 ]
13,	0.025557, 	[[ 0.2009418  -0.14602035]], 	[-0.9332177  1.9540895]
14,	0.020054, 	[[ 0.1603

> The true value is beta = [-1, 2]; the estimate is already very close!
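Another way to judge the result is to compare gradient descent with the closed-form least-squares solution. The sketch below regenerates a dataset of the same form (the seed and the 200 iterations are assumptions made here, so the numbers will differ from the run above) and uses `torch.linalg.lstsq`, which is available in recent PyTorch releases (1.9 or later, as far as I know).

```python
import torch

torch.manual_seed(0)                       # assumed seed, for reproducibility only
n, d = 50, 2
X = torch.randn(n, d)
true_beta = torch.tensor([[-1.0], [2.0]])
y = X @ true_beta + torch.randn(n, 1) * 0.1

# Gradient descent on the mean squared residual
beta = torch.tensor([[1.0], [0.0]], requires_grad=True)
for _ in range(200):
    loss = torch.mean((y - X @ beta) ** 2)
    loss.backward()
    with torch.no_grad():
        beta -= 0.1 * beta.grad
    beta.grad.zero_()

# Closed-form least-squares solution for comparison
beta_ls = torch.linalg.lstsq(X, y).solution

print(beta.detach().view(2), beta_ls.view(2))
```

Both estimates should agree with each other closely and sit near the true parameters, with the remaining gap coming from the noise added to `y`.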

# NN module

`Module` is PyTorch's way of performing operations on tensors. Modules are implemented as subclasses of the `torch.nn.Module` class. All modules are callable and can be composed together to create complex functions.
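As a small illustration of composing modules, `nn.Sequential` chains them so that the output of one becomes the input of the next. The sketch uses `nn.Linear` and `nn.Sigmoid`, both covered in the sections that follow; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Compose two modules into one callable module.
net = nn.Sequential(
    nn.Linear(2, 3),   # linear map from 2 features to 3
    nn.Sigmoid(),      # element-wise activation
)

x = torch.randn(9, 2)
out = net(x)

# Calling the container is the same as calling the parts in order.
same = torch.allclose(out, net[1](net[0](x)))
print(out.shape, same)
```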

[`torch.nn` docs](https://pytorch.org/docs/stable/nn.html)

Note: most of the functionality implemented for modules can be accessed in a functional form via `torch.nn.functional`, but these require you to create and manage the weight tensors yourself.

[`torch.nn.functional` docs](https://pytorch.org/docs/stable/nn.html#torch-nn-functional).
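To make the difference concrete, here is a sketch of the functional form of a linear layer, where the weight and bias tensors are created and managed by hand. The shapes are chosen for illustration; `F.linear` expects the weight in `(out_features, in_features)` layout.

```python
import torch
import torch.nn.functional as F

# With the functional API you own the parameters yourself.
weight = torch.randn(3, 2, requires_grad=True)   # (out_features, in_features)
bias = torch.zeros(3, requires_grad=True)

x = torch.randn(9, 2)
out = F.linear(x, weight, bias)      # computes x @ weight.T + bias
manual = x @ weight.t() + bias       # the same computation written out

print(out.shape, torch.allclose(out, manual, atol=1e-6))
```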

## Linear Transformation
Unlike how we initialized our $\beta$ manually, the Linear module automatically initializes the **weights** randomly.

[`torch.nn.init` docs](https://pytorch.org/docs/stable/nn.html#torch-nn-init)

The Linear module applies a linear transformation with a bias.    
It takes the input and output dimensions as parameters and creates the weights as attributes of the object.

In [85]:
import torch.nn as nn
import torch.nn.functional as F

See the [documentation](https://pytorch.org/docs/master/generated/torch.nn.Linear.html).

In [92]:


# takes the input and output dimensions as parameters
m = nn.Linear(2, 3) # callable like a function


# an example input: 9 samples with 2 features each
example_tensor = torch.randn(9, 2)

# apply a linear transformation to the data
transformed = m(example_tensor)



print('in shape', example_tensor.shape)
print('out',transformed)

in shape torch.Size([9, 2])
out tensor([[-0.6987, -0.4822,  0.1826],
        [-0.0055,  0.3998, -0.5024],
        [-0.5060, -0.2308,  0.0036],
        [-0.1410,  0.0688, -0.6597],
        [-0.3254,  0.0573, -0.0678],
        [ 0.2328,  0.8916, -0.3916],
        [-1.5062, -1.3886,  1.2027],
        [-0.4364, -0.4086, -0.5541],
        [-0.8600, -0.8621,  0.0213]], grad_fn=<AddmmBackward>)
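The weights that `nn.Linear` creates are ordinary tensors registered as parameters of the module, so autograd fills in their gradients like it did for `beta` earlier. A short sketch, with the sum of the outputs used as a stand-in loss purely for illustration:

```python
import torch
import torch.nn as nn

m = nn.Linear(2, 3)
print(m.weight.shape, m.bias.shape)   # weight is (out_features, in_features)

# Count the learnable parameters: 3*2 weights + 3 biases
n_params = sum(p.numel() for p in m.parameters())

x = torch.randn(9, 2)
loss = m(x).sum()                     # stand-in scalar loss
loss.backward()                       # gradients land in m.weight.grad and m.bias.grad
print(n_params, m.weight.grad.shape, m.bias.grad.shape)
```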


# Activation Functions

PyTorch implements a number of activation functions including but not limited to `ReLU`, `Tanh`, and `Sigmoid`. Since they are modules, they need to be instantiated.

In [93]:
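# define the activation function
activ_fn = nn.Sigmoid()

# input tensor (floating point, since sigmoid is defined on floats)
ex_tensor = torch.tensor([-1., 1., 0.])

# apply the activation element-wise
activated  = activ_fn(ex_tensor)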
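As a quick check of what the module computes, the sketch below compares `nn.Sigmoid` with the formula $\sigma(x) = 1 / (1 + e^{-x})$ applied element-wise. The float input is a choice made here so that the example runs on any PyTorch version.

```python
import torch
import torch.nn as nn

activ_fn = nn.Sigmoid()               # instantiate the module first
x = torch.tensor([-1.0, 1.0, 0.0])

activated = activ_fn(x)               # module call
manual = 1 / (1 + torch.exp(-x))      # sigmoid written out by hand

print(activated)                      # sigmoid(0) is exactly 0.5
print(torch.allclose(activated, manual))
```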

