In [3]:
conda install pytorch torchvision -c pytorch

Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /anaconda3

  added / updated specs:
    - pytorch
    - torchvision


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.3.9           |           py37_0         155 KB
    conda-4.7.10               |           py37_0         3.0 MB
    ninja-1.9.0                |   py37h04f5b5a_0          96 KB
    pytorch-1.1.0              |          py3.7_0        49.9 MB  pytorch
    torchvision-0.3.0          |    py37_cuNone_1         1.7 MB  pytorch
    ------------------------------------------------------------
                                           Total:        54.8 MB

The following NEW packages will be INSTALLED:

  ninja              pkgs/main/osx-64::ninja-1.9.0-py37h04f5b5a_0
  pytorch            pytorch/osx-64::pytorch-1.1.0-py3.7_0
  torchvision     

In [1]:
import os
print(os.__file__)

/anaconda3/lib/python3.7/os.py


In [2]:
from __future__ import print_function
from IPython.display import Image
import torch
import numpy as np
# load the autoreload extension
%load_ext autoreload
# Set extension to reload modules every time before executing code
#%autoreload 2
#
## Easy to read version
#%system date
#
## Shorthand with "!!" instead of "%system" works equally well
#!!date
#!!ls
#
## Outputs a list of all interactive variables in your environment
#%who_ls
#
## Reduces the output to interactive variables of type "function"
#%who_ls function

# What is PyTorch and why use it?

It’s a Python-based scientific computing package targeted at two sets of
audiences:

-  those who want **an extensible alternative to NumPy that harnesses the power of GPUs**
-  those who want **a deep learning research platform that provides maximum flexibility
   and speed**
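
To make the "alternative to NumPy" point concrete, here is a minimal sketch, assuming a standard PyTorch install. The tensor names are illustrative only, and the GPU is used only if `torch.cuda.is_available()` reports one; otherwise everything stays on the CPU.

```python
import numpy as np
import torch

# Tensors support NumPy-style creation and elementwise arithmetic.
a = torch.ones(2, 3)
b = torch.arange(6, dtype=torch.float32).reshape(2, 3)
c = a + 2 * b

# Conversion to and from NumPy arrays (CPU tensors share memory with the array).
c_np = c.numpy()
d = torch.from_numpy(np.array([1.0, 2.0, 3.0]))

# The same code runs on a GPU when one is present; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
c_dev = c.to(device)
total = (c_dev * c_dev).sum().item()
print(device, total)
```

The only change needed to move the computation to a GPU is the `.to(device)` call; the arithmetic itself is written the same way as in NumPy.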

In [3]:
# Data Generation
np.random.seed(42)
x = np.random.rand(100, 1)
y = 1 + 2 * x + .1 * np.random.randn(100, 1)

# Shuffles the indices
idx = np.arange(100)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:80]
# Uses the remaining indices for validation
val_idx = idx[80:]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]



# In your own words, what is the relation between backpropagation used in neural networks and automatic differentiation (AD)?

In order to minimise a function, one obvious approach is to use steepest descent: start with random values for the parameters to be estimated, find the direction in which the function decreases most quickly, step a small amount in that direction, and repeat until close enough to a minimum.
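
The steepest descent loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy function `f`, its analytic gradient `grad_f`, the step size, and the stopping tolerance are all hypothetical choices, not part of the original notebook.

```python
import random

def f(w):
    """Toy function to minimise (hypothetical): a bowl with its minimum at (3, -1)."""
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

def grad_f(w):
    """Analytic gradient of f; its negative is the direction of steepest descent."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]

def steepest_descent(grad, w, step=0.1, tol=1e-8, max_iter=10000):
    """Repeatedly step a small amount against the gradient until it is nearly zero."""
    for _ in range(max_iter):
        g = grad(w)
        if sum(gi * gi for gi in g) < tol:  # squared gradient norm is tiny: close enough
            break
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w

random.seed(0)
w0 = [random.uniform(-5, 5), random.uniform(-5, 5)]  # random starting values
w_min = steepest_descent(grad_f, w0)
print(w_min, f(w_min))
```

Here the gradient is available in closed form, which is exactly what is missing in the situation discussed next.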

But we have two problems:

1. We have an algorithm or a computer program that calculates the non-linear function, rather than a closed-form expression for the function itself.

2. The function has a very large number of parameters, hundreds if not thousands.

One thing we could try is bumping each parameter by a small amount to get the partial derivatives numerically:

$$\displaystyle   \frac{\partial E(\ldots, w, \ldots)}{\partial w} \approx \frac{E(\ldots, w + \epsilon, \ldots) - E(\ldots, w, \ldots)}{\epsilon}  $$

But this would mean evaluating our function at least once per parameter, and moreover we could easily get numerical errors as a result of the vagaries of floating point arithmetic.
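
As a rough illustration of the bump-and-difference formula, the following sketch approximates the gradient of a toy error function and compares it with the exact derivative. The function `error`, the helper `finite_difference_grad`, and the value of `eps` are assumptions made for the example.

```python
def error(w):
    """Toy error E(w) = sum_i (w_i^2 - 1)^2, a hypothetical stand-in for a real loss."""
    return sum((wi * wi - 1.0) ** 2 for wi in w)

def finite_difference_grad(f, w, eps=1e-6):
    """Approximate each partial derivative by bumping one parameter at a time."""
    base = f(w)  # one evaluation at the unbumped point
    grad = []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += eps
        grad.append((f(bumped) - base) / eps)  # one extra evaluation per parameter
    return grad

w = [0.5, -1.5, 2.0]
approx = finite_difference_grad(error, w)
exact = [4.0 * wi * (wi * wi - 1.0) for wi in w]  # dE/dw_i = 4 w_i (w_i^2 - 1)
for a, e in zip(approx, exact):
    print("approx %.6f  exact %.6f" % (a, e))
```

The loop needs `len(w) + 1` evaluations of the function, which is what makes this approach impractical for thousands of parameters. The choice of `eps` is also delicate: too large and the approximation is biased, too small and floating point cancellation dominates.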

The standard approach is to use a technique called backpropagation; understanding and applying this technique forms a large part of many machine learning lecture courses.


To elaborate a bit, you can compute the derivative of the loss with respect to each of your weights naively, one by one, but that is very wasteful: for every weight you end up retracing the chain of derivatives from that weight all the way to the loss, without reusing any of the intermediate derivatives you computed for other weights.

A more efficient method, backpropagation, first computes the derivative of the loss with respect to the last layer (its activations and weights), keeps this in memory, and uses those gradients to compute the derivative of the loss with respect to the next-to-last layer using the chain rule. Note that to get the derivative of the loss with respect to layer n-1 you only need the derivative of the loss with respect to layer n. By applying this method until you get to the first layer, you get all the derivatives in a single backward sweep.

AD is a general method of differentiating quantities with respect to others in a computational graph, so backpropagation is indeed a special case of AD: it is reverse-mode AD applied to the scalar loss of a neural network.
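
To illustrate the claim that backpropagation is an instance of AD, here is a minimal sketch of reverse-mode AD on scalars, written in plain Python. The `Value` class and its methods are hypothetical helpers invented for this example (they are not PyTorch's API). The graph it differentiates is a one-weight-per-layer version of the network used in this notebook: multiply by `w1`, apply ReLU, multiply by `w2`, then take the squared error.

```python
class Value:
    """A scalar node in a computational graph (hypothetical helper, not a PyTorch class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Post-order traversal lists every node after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Single backward sweep: each node passes its gradient to its parents (chain rule).
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

x, y = Value(1.5), Value(2.0)
w1, w2 = Value(0.8), Value(-0.5)

h = (w1 * x).relu()   # hidden activation
diff = w2 * h - y     # prediction minus target
error = diff * diff   # squared error
error.backward()

# Hand-derived backpropagation for the same network (the ReLU is active here).
manual_w2 = 2.0 * diff.data * h.data
manual_w1 = 2.0 * diff.data * w2.data * x.data
print(w1.grad, manual_w1)
print(w2.grad, manual_w2)
```

The generic sweep in `backward` knows nothing about layers or networks, yet it reproduces the layer-by-layer gradients that backpropagation derives by hand, because both apply the chain rule from the loss backwards.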


In [4]:
pip install tensorboardX

Note: you may need to restart the kernel to use updated packages.


In [5]:
import torch
import torch.nn.functional as F
from tensorboardX import SummaryWriter
import numpy as np

log_path = './runs/gd/'

# log_path is always set here, so the writer always uses the predefined path
print("accessing predefined path")
writer = SummaryWriter(log_dir=log_path)

dtype = torch.float32
N, D_in, H, D_out = 64, 1000, 100, 10

# Create input and output tensors x and y (no gradients needed for these)
x = torch.randn(N, D_in, dtype=dtype)
y = torch.randn(N, D_out, dtype=dtype)

# Create weights of the hypothesis function; autograd tracks operations on them
w1 = torch.randn(D_in, H, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
errors = []
weights_1_out = []
weights_2_out = []

num_iterations = 5000

for iteration in range(num_iterations):
    # x * w1 -> relu -> * w2
    y_pred = F.relu(x.mm(w1)).mm(w2)

    # sum of squared errors (not RMSE)
    error = (y_pred - y).pow(2).sum()
    error.backward()

    writer.add_scalar(tag="Last run", scalar_value=error.item(), global_step=iteration)
    writer.add_histogram("error distribution", error, global_step=iteration)

    # gradient descent step; no_grad keeps the update out of the autograd graph
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

    if iteration % 50 == 0:
        # print the error at every 50th iteration
        print("Iteration: %d - Error: %.4f" % (iteration, error.item()))
        # copy() is needed because numpy() shares memory with the tensor,
        # which is updated in place on later iterations
        weights_1_out.append(w1.detach().cpu().numpy().copy())
        weights_2_out.append(w2.detach().cpu().numpy().copy())
        errors.append(error.item())

    # reset the gradients, which backward() would otherwise accumulate
    w1.grad.zero_()
    w2.grad.zero_()

    # stopping criterion
    if error.item() < 1e-6:
        print("Stopping gradient descent, algorithm converged, sum of squared errors is smaller than 1E-6")
        break

writer.close()

accessing predefined path
Iteration: 0 - Error: 22613632.0000
Iteration: 50 - Error: 13316.2012
Iteration: 100 - Error: 391.1063
Iteration: 150 - Error: 18.3813
Iteration: 200 - Error: 1.0917
Iteration: 250 - Error: 0.0778
Iteration: 300 - Error: 0.0067
Iteration: 350 - Error: 0.0009
Iteration: 400 - Error: 0.0002
Iteration: 450 - Error: 0.0001
Iteration: 500 - Error: 0.0000
Iteration: 550 - Error: 0.0000
Iteration: 600 - Error: 0.0000
Iteration: 650 - Error: 0.0000
Iteration: 700 - Error: 0.0000
Iteration: 750 - Error: 0.0000
Iteration: 800 - Error: 0.0000
Iteration: 850 - Error: 0.0000
Iteration: 900 - Error: 0.0000
Iteration: 950 - Error: 0.0000
Iteration: 1000 - Error: 0.0000
Iteration: 1050 - Error: 0.0000
Iteration: 1100 - Error: 0.0000
Iteration: 1150 - Error: 0.0000
Iteration: 1200 - Error: 0.0000
Iteration: 1250 - Error: 0.0000
Iteration: 1300 - Error: 0.0000
Iteration: 1350 - Error: 0.0000
Iteration: 1400 - Error: 0.0000
Iteration: 1450 - Error: 0.0000
Iteration: 1500 - Error