# Torch Tensors 
### Notes and code based on YouTube videos by user Patrick Loeber: https://www.youtube.com/playlist?list=PLqnslRFeH2UrcDBWF5mfPGpqQDSta6VK4

In [2]:
import torch
import numpy as np

In [3]:
x = torch.randint(20, size=(2,2))
y = torch.randint(6, size=(2,2))

print('x = ', x, 'y = ', y)

x =  tensor([[14, 16],
        [19, 17]]) y =  tensor([[1, 4],
        [1, 4]])


Element-wise operations can be written with operators, $\textit{x+y, x-y, x*y, x/y}$, or with methods such as $\textit{x.add(y)}$. Adding a trailing underscore, e.g. $\textit{x.add\_(y)}$, performs the operation in place and overwrites $x$.

In [4]:
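expected = x + y #compute the sum before y is overwritten
print(expected)
y.add_(x)
print(y)
if torch.equal(y, expected): #all() would only test whether every element is non-zero
    print("Same thing.")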

tensor([[15, 20],
        [20, 21]])
tensor([[15, 20],
        [20, 21]])
Same thing.
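A minimal sketch of the difference between the out-of-place and in-place forms, using small hand-picked tensors (the names `a`, `b`, `expected` and `b_before` are just for illustration):

```python
import torch

a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[10, 20], [30, 40]])

expected = a + b          # out-of-place: a and b are unchanged
same_as_add = a.add(b)    # method form of a + b, also out-of-place

b_before = b.clone()      # keep a copy, since add_ overwrites b
b.add_(a)                 # in-place: b now holds the sum

print(torch.equal(expected, same_as_add))  # True
print(torch.equal(b, expected))            # True
print(torch.equal(b, b_before))            # False
```

$\textit{torch.equal}$ compares shapes and every element, which is what a "same thing" check needs.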


Can use .item() to extract the value of a tensor as a plain Python number. This only works on a tensor with exactly one element, so index down to a single component first.

In [5]:
z = torch.randint(10, (2, 2))
print(z)
scalar = z[0, 1]
print(scalar, scalar.item())
print(type(scalar), type(scalar.item()))

tensor([[2, 1],
        [9, 2]])
tensor(1) 1
<class 'torch.Tensor'> <class 'int'>
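A short sketch of what happens when the tensor has more than one element. The tensor `m` reuses the values printed above; the exact exception type raised by .item() may depend on the torch version, so the sketch catches a generic `Exception`:

```python
import torch

m = torch.tensor([[2, 1], [9, 2]])

one_element = m[0, 1]        # indexing down to a 0-dim tensor
value = one_element.item()   # plain Python int

row = m[0]                   # still has two elements
try:
    row.item()
    item_failed = False
except Exception:            # item() refuses tensors with more than one element
    item_failed = True

as_list = row.tolist()       # tolist() handles any number of elements
print(value, item_failed, as_list)
```

For tensors with several elements, $\textit{.tolist()}$ is the equivalent conversion to plain Python values.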


$\textit{t.view()}$ can be used to reshape a tensor. The new shape must contain the same number of elements as the old one, i.e. $m \times n$ can be viewed as $a \times b$ only when $mn = ab$.

In [6]:
a = torch.rand(4, 4)
print(a)
b = a.view(-1, 2) #-1 tells torch to infer that dimension from the others (here 16/2 = 8)
print(b)

tensor([[0.0340, 0.4746, 0.8292, 0.3899],
        [0.7197, 0.5315, 0.5914, 0.2985],
        [0.3013, 0.6627, 0.2664, 0.9765],
        [0.1281, 0.7961, 0.7034, 0.1579]])
tensor([[0.0340, 0.4746],
        [0.8292, 0.3899],
        [0.7197, 0.5315],
        [0.5914, 0.2985],
        [0.3013, 0.6627],
        [0.2664, 0.9765],
        [0.1281, 0.7961],
        [0.7034, 0.1579]])
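A small sketch of three properties of $\textit{view}$ worth remembering: the inferred dimension, the fact that the result shares storage with the original, and the error for an incompatible shape. It uses `torch.arange` instead of random values so the numbers are predictable:

```python
import torch

a = torch.arange(16.0).view(4, 4)   # 16 elements: 0., 1., ..., 15.
b = a.view(-1, 2)                   # -1 is inferred as 16 / 2 = 8
print(b.shape)                      # torch.Size([8, 2])

# view does not copy: b looks at the same storage as a
b[0, 0] = 100.0
print(a[0, 0].item())               # 100.0

try:
    a.view(3, 5)                    # 3 * 5 = 15 != 16
    bad_shape_accepted = True
except RuntimeError:
    bad_shape_accepted = False
print(bad_shape_accepted)           # False
```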


Can turn a numpy array into a torch tensor and vice versa. Care needs to be taken though: when the tensor lives on the CPU (not the GPU), $\textit{.numpy()}$ and $\textit{torch.from\_numpy()}$ return objects that share the same memory, so an in-place change to one is also a change to the other.

In [7]:
o = torch.ones(5)
print(o)
p = o.numpy() 
print(type(p)) 

s = np.ones(5)
print(s)
t = torch.from_numpy(s)
print(type(t))

tensor([1., 1., 1., 1., 1.])
<class 'numpy.ndarray'>
[1. 1. 1. 1. 1.]
<class 'torch.Tensor'>
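A sketch of the shared-memory behaviour on the CPU, and of one way to get an independent copy ($\textit{torch.tensor}$ copies its input). The variable `independent` is just an illustrative name:

```python
import numpy as np
import torch

o = torch.ones(5)
p = o.numpy()          # shares memory with o on the CPU
o.add_(1)              # in-place change to the tensor
print(p)               # the array now holds 2s as well

s = np.ones(5)
t = torch.from_numpy(s)
s += 1                 # in-place change to the array
print(t)               # the tensor now holds 2s as well

independent = torch.tensor(s)   # torch.tensor copies the data
s += 1                          # s and t become 3s
print(independent)              # still 2s
```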


When initialising a tensor, setting $\textit{requires\_grad}$ to True lets torch know that you will want the gradient with respect to this tensor later on. It needs to be set before the forward pass so that the operations are recorded for back propagation.

In [8]:
x = torch.randn(3, requires_grad=True) 
print(x)

y = x + 2
print(y)

tensor([ 0.0234, -0.6829,  0.8053], requires_grad=True)
tensor([2.0234, 1.3171, 2.8053], grad_fn=<AddBackward0>)
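A brief sketch of how to inspect what torch has recorded. `is_leaf` and `grad_fn` are standard tensor attributes; the tensor `c` is an extra example added here for contrast:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x + 2

print(x.is_leaf, x.grad_fn)          # created by the user: a leaf with no grad_fn
print(y.requires_grad, y.is_leaf)    # result of an operation: tracked, not a leaf
print(type(y.grad_fn).__name__)      # name of the backward node for the addition

c = torch.ones(3)                    # requires_grad defaults to False
print((c + 2).grad_fn)               # None, nothing is tracked
```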


Since requires_grad is True, torch records each operation applied to the tensor and builds a computation graph for back propagation later. The forward pass calculates the output $y = f(x)$ and, because gradients are being tracked, torch automatically attaches a function to the result that is used in back propagation. This is the grad_fn attribute of $y$, which here points to the backward function of the addition (AddBackward0) and is used to compute dy/dx.

In [12]:
z = y*y*2
print(z)
v = torch.tensor([0.1, 1.0, 0.001], dtype = torch.float32)
z.backward(v)
print(x.grad)

tensor([ 8.1884,  3.4694, 15.7394], grad_fn=<MulBackward0>)
tensor([ 1.6187, 10.5366,  0.0224])


$\textit{z.backward()}$ calculates the derivative of z wrt the leaf tensors in the computation graph that have requires_grad=True, i.e. it calculates dz/dx here. However, $\textit{z.backward()}$ doesn't work by itself unless z is a scalar: it throws the error "grad can be implicitly created only for scalar outputs". For a non-scalar output torch computes a vector-Jacobian product, $J^T v$, where $J$ is the Jacobian with entries $\partial z_i / \partial x_j$. This is done by writing $\textit{z.backward(v)}$ where $v$ is a tensor of the same shape as $z$. From a neural network perspective $v$ is usually the gradient of a scalar loss wrt $z$, so the product is one step of the chain rule; passing a tensor of ones gives the same gradient as calling $\textit{z.sum().backward()}$.

The $\textit{.grad}$ attribute of the input ($x$ here) holds the result of the vector-Jacobian product and has the same shape as $x$. Here $z = 2y^2$ with $y = x + 2$, so a single backward pass gives $4 y_i v_i$ for each component. The values printed above are twice that, which is consistent with the cell having been run twice without zeroing $\textit{x.grad}$, since gradients accumulate.
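A sketch that checks the vector-Jacobian product by hand for the same function, $z = 2(x+2)^2$, using fixed values for $x$ rather than random ones. Since each $z_i$ depends only on $x_i$, the Jacobian is diagonal with entries $4(x_i + 2)$. The helper tensors `x2` and `x3` are illustrative copies of `x`:

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.001])

z = 2 * (x + 2) ** 2
z.backward(v)                       # computes J^T v and stores it in x.grad

# Diagonal Jacobian: (J^T v)_i = 4 * (x_i + 2) * v_i
expected = 4 * (x.detach() + 2) * v
print(torch.allclose(x.grad, expected))     # True

# Passing ones gives the same gradient as reducing z to a scalar with sum()
x2 = x.detach().clone().requires_grad_(True)
(2 * (x2 + 2) ** 2).sum().backward()
x3 = x.detach().clone().requires_grad_(True)
(2 * (x3 + 2) ** 2).backward(torch.ones(3))
print(torch.allclose(x2.grad, x3.grad))     # True
```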

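A tensor can also be created or used without being added to the computation graph. This stops torch from tracking history and setting the grad_fn attribute, which is needed, for example, when updating the weights of a neural network during training. Can be done in three ways: $\textit{requires\_grad\_(False)}$, $\textit{detach()}$, or wrapping the code in $\textit{with torch.no\_grad()}$: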

In [13]:
t = torch.randn(3, requires_grad=True)
print(t)

t.requires_grad_(False) #trailing underscore modifies in place
print(t)
t.requires_grad_(True)

o = t.detach()
print(t)
print(o) 

with torch.no_grad():
    y = t + 2
    print(y)

tensor([-0.5792, -0.0888,  0.2685], requires_grad=True)
tensor([-0.5792, -0.0888,  0.2685])
tensor([-0.5792, -0.0888,  0.2685], requires_grad=True)
tensor([-0.5792, -0.0888,  0.2685])
tensor([1.4208, 1.9112, 2.2685])
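The printed tensors only differ in whether `requires_grad=True` appears, so here is a sketch that checks the flags directly. The names `flag_after_off`, `d`, `inside` and `outside` are illustrative:

```python
import torch

t = torch.randn(3, requires_grad=True)

# 1. Flip the flag in place, then restore it
t.requires_grad_(False)
flag_after_off = t.requires_grad
t.requires_grad_(True)

# 2. detach() returns a new tensor that shares data with t but is not tracked
d = t.detach()

# 3. Operations inside no_grad() are not recorded
with torch.no_grad():
    inside = t + 2
outside = t + 2

print(flag_after_off, d.requires_grad, inside.requires_grad, outside.requires_grad)
# False False False True
```

Note that $\textit{detach()}$ leaves $t$ itself unchanged, whereas $\textit{requires\_grad\_(False)}$ modifies $t$.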


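Whenever the backward function is called, the newly computed gradients are added to whatever is already stored in the $\textit{.grad}$ attribute. Need to be careful to zero the gradients (e.g. $\textit{x.grad.zero\_()}$) before the next backward pass so that only the intended contributions are included.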
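A minimal sketch of the accumulation, using a function whose gradient is known in advance (every component of the gradient of $\sum 3w_i$ is 3). The list `grads` is just for recording the intermediate values:

```python
import torch

w = torch.ones(3, requires_grad=True)

grads = []
for _ in range(3):
    out = (w * 3).sum()            # d(out)/dw = 3 for each component
    out.backward()
    grads.append(w.grad.clone())   # 3s, then 6s, then 9s

w.grad.zero_()                     # reset before the next backward pass
out = (w * 3).sum()
out.backward()
print(grads[-1], w.grad)           # tensor([9., 9., 9.]) tensor([3., 3., 3.])
```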

Gradients are important for minimising the loss function: we calculate the gradient of the loss wrt each parameter, e.g. $\frac{\partial}{\partial w} Loss$, and step against it. The three steps of one training iteration are:
- Forward pass: apply the functions and calculate the loss
- Compute the local partial derivatives at each node of the computation graph wrt that node's inputs
- Backward pass: compute the derivative of the loss function wrt each parameter using the chain rule on the local derivatives from the previous step

The loss here is the squared difference between the predicted output and the actual output. The output is modelled as a linear combination of the weights, $\textbf{w}$, and some input, $\textbf{x}$, i.e. $\hat{y} = \textbf{w} \cdot \textbf{x}$, which makes the loss equal to $\left( \hat{y} - y \right)^2 = \left(\textbf{w} \cdot \textbf{x} - y \right)^2$. We minimise the loss using its derivative wrt the weights, $\frac{\partial Loss}{\partial \textbf{w}} = 2\left(\textbf{w} \cdot \textbf{x} - y \right)\textbf{x}$.

In [None]:
#Practice Example with x = 1, y = 2, w = 1
#from pen and paper calculation, Loss = (wx - y)^2 = 1 and
#dLoss/dw = 2*s*x = 2(-1)(1) = -2 where s = wx - y

x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True) #interested in gradient

#forward pass and compute loss
y_hat = w*x
loss = (y_hat - y)**2
print(loss)

#backward pass
loss.backward() #gradient computation
print(w.grad)
#works correctly

#Next steps: update weights
#            do next forward and backward pass
#            repeat

tensor(1., grad_fn=<PowBackward0>)
tensor(-2.)


Next we make a more concrete model by manually implementing prediction, gradient computation, loss computation and parameter updates with a view to moving to automation using torch's in-built features. This should provide a solid background understanding of how neural networks work.

In [21]:
#Linear regression: f = w*x and ignore bias at the moment

#Function to approximate: f = 2*x

#Training sample
X = np.array([1, 2, 3, 4, 5], dtype=np.float32)
Y = np.array([2, 4, 6, 8, 10], dtype=np.float32)

#Initialise weight. Start with 0.0
w = 0.0

#Manual calculation
#Model prediction
def forward(x):
    return w*x

#Loss
def loss(y, y_predicted): #y_predicted is the model output ie what is calculated in the forward pass
    return ((y_predicted - y)**2).mean() #mean squared error

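#Gradient
#mean squared error: J = 1/N * sum((w*x - y)**2)
#dJ/dw = 1/N * sum(2*x*(w*x - y))
#Note: np.dot already sums over the samples, so this returns sum(2*x*(w*x - y)),
#which is N times dJ/dw. Calling .mean() on the resulting scalar would not change it.
def gradient(x, y, y_predicted):
    return np.dot(2*x, y_predicted-y) #scalar product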

print(f'Prediction before training: f(5) = {forward(5):.3f}')

#Training
learning_rate = 0.01
n_iters = 20

for epoch in range(n_iters):
    #prediction = forward pass
    y_pred = forward(X)

    #loss
    l = loss(Y, y_pred)

    #gradients
    dw = gradient(X, Y, y_pred)

    #update weights
    w -= learning_rate*dw #gradient descent algorithm

    if epoch % 1 == 0: #print every step
        print(f'epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}')
print(f'Prediction after training: f(5) = {forward(5):.3f}')


Prediction before training: f(5) = 0.000
epoch 1: w = 2.200, loss = 44.00000000
epoch 2: w = 1.980, loss = 0.44000015
epoch 3: w = 2.002, loss = 0.00440001
epoch 4: w = 2.000, loss = 0.00004400
epoch 5: w = 2.000, loss = 0.00000044
epoch 6: w = 2.000, loss = 0.00000000
epoch 7: w = 2.000, loss = 0.00000000
epoch 8: w = 2.000, loss = 0.00000000
epoch 9: w = 2.000, loss = 0.00000000
epoch 10: w = 2.000, loss = 0.00000000
epoch 11: w = 2.000, loss = 0.00000000
epoch 12: w = 2.000, loss = 0.00000000
epoch 13: w = 2.000, loss = 0.00000000
epoch 14: w = 2.000, loss = 0.00000000
epoch 15: w = 2.000, loss = 0.00000000
epoch 16: w = 2.000, loss = 0.00000000
epoch 17: w = 2.000, loss = 0.00000000
epoch 18: w = 2.000, loss = 0.00000000
epoch 19: w = 2.000, loss = 0.00000000
epoch 20: w = 2.000, loss = 0.00000000
Prediction after training: f(5) = 10.000


So the key steps are:
- Create your input and idealised output arrays (X and Y above).
- Initialise the weights
- Define forward pass, loss calculation and gradient functions
- Pick a learning rate and number of iterations (a small learning rate gives finer adjustment of the weights but can require a larger number of iterations)
- Create a training loop which repeatedly calls the forward pass, loss and gradient functions
- Update the weights in the direction opposite to the gradient so that they gradually minimise the loss function
- Once the loss is close to zero, the model approximates the target function well on the training data
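The steps above can be sketched as one small function, which also makes it easy to see the effect of the learning rate. `fit_linear` is a hypothetical helper, not something from the videos, and it averages the gradient over the samples, which is an assumption that differs from the gradient function in the cell above:

```python
import numpy as np

def fit_linear(X, Y, learning_rate, n_iters, w=0.0):
    """Gradient descent for f(x) = w*x with mean squared error (hypothetical helper)."""
    for _ in range(n_iters):
        y_pred = w * X                          # forward pass
        grad = (2 * X * (y_pred - Y)).mean()    # d(MSE)/dw, averaged over samples
        w -= learning_rate * grad               # weight update
    loss = ((w * X - Y) ** 2).mean()            # loss with the final weight
    return w, loss

# Inputs and target outputs for f = 2*x
X = np.array([1, 2, 3, 4, 5], dtype=np.float32)
Y = 2 * X

results = {}
for lr in (0.001, 0.01):
    results[lr] = fit_linear(X, Y, lr, n_iters=20)
    w_fit, loss_fit = results[lr]
    print(f'lr = {lr}: w = {w_fit:.3f}, loss = {loss_fit:.6f}')
```

With the same 20 iterations, the smaller learning rate ends further from $w = 2$, so it would need more iterations to reach the same loss.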

Above everything is calculated manually. Now to automate using torch, starting with the gradient calculation.


In [None]:
#Linear regression: f = w*x and ignore bias at the moment

#Function to approximate: f = 2*x

#Training sample
X = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32)
Y = torch.tensor([2, 4, 6, 8, 10], dtype=torch.float32) #now tensors

#Initialise weight. Start with 0.0
w = torch.tensor(0.0, dtype=torch.float32, requires_grad=True) #w is also a tensor. We want the
#gradient of the loss wrt w, so we specify requires_grad=True

#Manual calculation
#Model prediction
def forward(x):
    return w*x

#Loss
def loss(y, y_predicted): #y_predicted is the model output ie what is calculated in the forward pass
    return ((y_predicted - y)**2).mean() #mean squared error

print(f'Prediction before training: f(5) = {forward(5):.3f}')

#Training
learning_rate = 0.01
n_iters = 100

for epoch in range(n_iters):
    #prediction = forward pass
    y_pred = forward(X)

    #loss
    l = loss(Y, y_pred)

    #gradients = backward pass
    l.backward() #dl/dw calculated completely by torch now

    #update weights
    #The update must not be recorded in the computation graph: torch would treat it as a tracked
    #operation on w (and an in-place update of a leaf that requires grad raises an error), so...
    with torch.no_grad():
        w -= learning_rate*w.grad #gradient descent algorithm

    #zero gradients. l.backward() accumulates gradients into the w.grad attribute, so reset it each epoch
    w.grad.zero_()

    if epoch % 4 == 0: #print every 4th epoch
        print(f'epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}')
print(f'Prediction after training: f(5) = {forward(5):.3f}')

Prediction before training: f(5) = 0.000
epoch 1: w = 0.440, loss = 44.00000000
epoch 5: w = 1.423, loss = 6.02850342
epoch 9: w = 1.786, loss = 0.82597363
epoch 13: w = 1.921, loss = 0.11316784
epoch 17: w = 1.971, loss = 0.01550541
epoch 21: w = 1.989, loss = 0.00212438
epoch 25: w = 1.996, loss = 0.00029106
epoch 29: w = 1.999, loss = 0.00003988
epoch 33: w = 1.999, loss = 0.00000546
epoch 37: w = 2.000, loss = 0.00000075
epoch 41: w = 2.000, loss = 0.00000010
epoch 45: w = 2.000, loss = 0.00000001
epoch 49: w = 2.000, loss = 0.00000000
epoch 53: w = 2.000, loss = 0.00000000
epoch 57: w = 2.000, loss = 0.00000000
epoch 61: w = 2.000, loss = 0.00000000
epoch 65: w = 2.000, loss = 0.00000000
epoch 69: w = 2.000, loss = 0.00000000
epoch 73: w = 2.000, loss = 0.00000000
epoch 77: w = 2.000, loss = 0.00000000
epoch 81: w = 2.000, loss = 0.00000000
epoch 85: w = 2.000, loss = 0.00000000
epoch 89: w = 2.000, loss = 0.00000000
epoch 93: w = 2.000, loss = 0.00000000
epoch 97: w = 2.000, loss = 0.00000000

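The gradient from back propagation is not less accurate than the analytical one: autograd computes the exact derivative up to floating point error. More iterations were needed here because the manual gradient function sums over the $N = 5$ samples instead of averaging, so each of its steps is five times larger than the autograd step on the mean squared error (compare $w = 2.200$ and $w = 0.440$ after the first epoch). Letting torch handle back propagation seems unnecessary for simple networks but comes in handy for more complicated ones, where many gradients need to be calculated at each node and deriving them by hand is impractical.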
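One way to check how the manual gradient and the autograd gradient relate is to evaluate both at the starting weight $w = 0$. This is a sketch on the same training data; the names `grad_sum` and `grad_mean` are illustrative:

```python
import numpy as np
import torch

X_np = np.array([1, 2, 3, 4, 5], dtype=np.float32)
Y_np = 2 * X_np
w0 = 0.0

# What the manual gradient() computes: np.dot sums over the samples
grad_sum = np.dot(2 * X_np, w0 * X_np - Y_np)
# Analytical gradient of the *mean* squared error
grad_mean = (2 * X_np * (w0 * X_np - Y_np)).mean()

# Autograd gradient of the mean squared error
X = torch.from_numpy(X_np)
Y = torch.from_numpy(Y_np)
w = torch.tensor(w0, requires_grad=True)
((w * X - Y) ** 2).mean().backward()

print(grad_sum, grad_mean, w.grad.item())   # -220.0 -44.0 -44.0
```

These values match the first-epoch updates in the two outputs above: $0.01 \times 220 = 2.2$ for the manual version and $0.01 \times 44 = 0.44$ for the torch version.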

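Adding more data to the inputs and target outputs can change how quickly the network converges. In this example the size of each step scales with the mean of $x^2$ over the samples (or the sum, in the manual version), so larger or more numerous inputs act like a larger learning rate, and too large a step makes training diverge.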