# Magic Commands, Helper Functions and Decorators

In [1]:
import os
print(os.__file__)

/home/zak/anaconda3/lib/python3.7/os.py


In [2]:
from __future__ import print_function
from IPython.display import Image
import torch
import numpy as np
# load the autoreload extension
%load_ext autoreload
# Set extension to reload modules every time before executing code
#%autoreload 2
#
## Easy to read version
#%system date
#
## Shorthand with "!!" instead of "%system" works equally well
#!!date
#!!ls
#
## Outputs a list of all interactive variables in your environment
#%who_ls
#
## Reduces the output to interactive variables of type "function"
#%who_ls function


What Is PyTorch and Why Use It?
================

It’s a Python-based scientific computing package targeted at two sets of
audiences:

-  **An extensible alternative for NumPy harnessing the power of GPUs**
-  a **deep learning research platform that provides maximum flexibility
   and speed**

Getting Started
----------------------------

To get started, please follow the [official get-started guide](https://pytorch.org/get-started/locally/)



## Comparison to TensorFlow 1: PyTorch Advantages

Please read the following questions and try to guess the answers.
<ol>
  <li>What language has Torch been written in?</li>
  <li>Where does the name PyTorch come from?</li>
   <li> What <a href="https://en.wikipedia.org/wiki/Programming_paradigm">programming paradigm</a> is PyTorch built upon? Choose between declarative, procedural, imperative and functional. </li>
    <li> What is the main difference between TensorFlow 1 and PyTorch in terms of runtime execution? </li>
    <li> Why is dynamic graph execution in PyTorch better than the static graph execution used e.g. in TensorFlow? </li>
</ol>




Now please watch the [video](https://www.youtube.com/watch?v=nbJ-2G2GXL0) and compare your prior guesses to the answers given in the video. Write down what you learned.

:)

## Why Learn PyTorch?

Learning anything new is much easier once you have gathered the motivation for it. This idea is based on the well-known <a href="https://hoishampark.wordpress.com/2017/04/14/motivation-hacker-a-book-report/">MEVID</a> formula.
By now you know that PyTorch is a versatile deep learning framework based on imperative programming, in which the computations are handled dynamically.

But..

**Why should I learn it ?**

**Simple answer: it suits both short and long research projects!**

But there is much more!

[This article will give you the motivation to learn PyTorch](https://www.analyticsindiamag.com/9-reasons-why-pytorch-will-become-your-favourite-deep-learning-tool/) 

# Setting Up CUDA and PyTorch

## Setting Up CUDA

It is good practice to have the right tools for each job.
- If you have a dedicated NVIDIA GPU at your disposal, PyTorch should be used with CUDA to speed up computation. In this case, please [install CUDA](https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html)


- Otherwise you may use Google Colaboratory, which offers a Tesla K80 GPU for free! Please see [this post](https://medium.com/@ml_kid/google-colab-notebook-with-pytorch-v1-0-stable-lesson-9-46433881da05) on how to set up Google Colaboratory with the Tesla GPU. The first commands you need to run are given in [this notebook](https://colab.research.google.com/drive/18R3Rz639Fa4ByFwzLrY5LmD8t3Ee5IUR). You can run any Jupyter Notebook on Google Colaboratory: just upload the notebook from within Colaboratory.
<div class="alert alert-warning" role="alert">
  Warning: You may run into authentication issues when trying to persist your variables on Colaboratory, especially if you keep your session open to re-run it later. The free instance is periodically turned off and re-authentication may be required. Otherwise Colaboratory might complain about missing files (this may be because the session was disconnected at some point).
</div>


- Since the majority of this course is not GPU-intensive, you may also run the Notebooks on your CPU. 


In [3]:
# let us run this cell only if CUDA is available
# Print out your first tensor. Update the code with one line to print out your first tensor object
cuda0 = None
if torch.cuda.is_available():
    print("Your Pytorch runtime uses GPU")
    cuda0 = torch.device('cuda:0')
    first_tensor_on_cuda = torch.ones([2, 4], dtype=torch.int32, device=cuda0)
    print(first_tensor_on_cuda)
else :
    print("Your Pytorch runtime uses CPU")
    first_tensor_on_cpu = torch.ones([2, 4], dtype=torch.int32)



Your Pytorch runtime uses CPU



If you wish to practice using CUDA, please install <a href="https://wiki.tiker.net/PyCuda/Installation">PyCuda</a>.

See also why you might use <a href="https://devtalk.nvidia.com/default/topic/573367/pucuda-pros-and-cons/">PyCuda instead of C</a>.


In [4]:

import pycuda.driver as cuda
cuda.init()
## Get Id of default device
torch.cuda.current_device()

cuda.Device(0).name()

ModuleNotFoundError: No module named 'pycuda'


## Setting Up TensorBoard -- Visualization Platform (Works with TensorFlow, PyTorch, Keras,...)
It is always good to see the visual outputs of the code you are writing, especially in machine learning. Typically, a machine learning engineer plots the underlying neural network as a graph, the training error of the network over training time, and the visual outputs of the hidden layers.
That is why we give two examples of visualization platforms that can be used with PyTorch.

Popular visualization platforms for PyTorch are [Visdom](https://github.com/facebookresearch/visdom) and [TensorboardX](https://github.com/lanpa/tensorboardX). According to the discussion on [reddit](https://www.reddit.com/r/MachineLearning/comments/8ej2j4/d_facebook_visdom_vs_google_tensorboard/), people use both, each for different purposes.

[How to use TensorboardX](http://www.erogol.com/use-tensorboard-pytorch/)

<div class="alert alert-success" role="alert">
  The commands below assume that you have created a Conda virtual environment called dl; the packages are going to be installed there.

</div>


In [5]:
#!yes | conda install -n dl -c conda-forge tensorboardx
# start Tensorboard instance
#! yes |conda install -n dl -c conda-forge tensorflow 

# ! yes |conda install -n dl -c conda-forge tensorboard 
log_path = './runs/gd/'
!tensorboard --logdir {log_path}
# to add more objects to Tensorboard, please read the manual

/usr/bin/sh: 1: tensorboard: not found


# Basic Data Structures in PyTorch. Tensors.

## Tensor Definition and Some Important Properties
### Definition of a n-th Order Tensor

**Practical Definition of a Tensor** :
An **n-th Order Tensor** is an n-dimensional array of numbers.
In this definition, the dimensions are considered to be independent of one another.

By *practical* we mean that this is the definition used in computer programming and in the software libraries used in industry.
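
As a small illustration of the practical definition, the sketch below builds tensors of order 0 to 3 and checks their order with `dim()`. The variable names are made up for this example.

```python
import torch

scalar = torch.tensor(5.0)                 # 0th-order tensor: a single number
vector = torch.tensor([1.0, 2.0, 3.0])     # 1st-order tensor: one index
matrix = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])        # 2nd-order tensor: two indices
cube = torch.zeros(2, 3, 4)                # 3rd-order tensor: three indices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    # dim() returns the order (number of dimensions), shape the size of each one
    print(name, t.dim(), tuple(t.shape))
```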

The word *tensor* comes from physics, where it was first used to describe the **tension** on materials. To describe the tension on each face of a solid body, a simple 2D matrix is enough: the first dimension denotes the normal direction of the face and the second dimension the direction of the tension.


<div class="alert alert-info"><h4>Comparison to mathematical definition of tensor</h4><p>
This definition differs significantly from the mathematically rigorous definition of a tensor, in which an n=(p+q)-order tensor, or (p,q)-tensor, is defined as a multilinear mapping that is linear with respect to each of its arguments (p vectors and q co-vectors or differential forms) and that retains certain invariants under a coordinate transformation. Thus, in mathematics, only those multilinear mappings that retain their invariants are tensors.
    </p></div>

**Henceforth we are only going to use the practical definition of a tensor.**





## Tensor Interpretation in Programming Context
<div class="alert alert-success" role="alert">
Tensors are similar to NumPy's ndarrays, with the additional property that
a tensor can also be placed on a GPU to accelerate computing.
</div>

[Tensor](https://en.wikipedia.org/wiki/Tensor) is also the **basic data structural unit in PyTorch**.


A 2D matrix is a 2nd-order tensor, which is a special case of an n-th order **tensor**; in general, an n-th order tensor has **n** independent dimensions. Consider the following example:


In [6]:
#Image("assets/img/tensor.jpg")
# Source : https://www.google.com/url?sa=i&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwirxuj998nhAhUqtYsKHfFbAhUQjRx6BAgBEAU&url=https%3A%2F%2Fwww.slideshare.net%2FBertonEarnshaw%2Fa-brief-survey-of-tensors&psig=AOvVaw0tULmnEC2-vr346HuYGbdQ&ust=1555137171132510

We see that to represent a single pixel of an image, we need 3 independent components:

- the 1st component denotes the x-location (width-location)
- the 2nd component denotes the y-location (height-location)
- the 3rd component denotes the colour channel (R, G, B), since any colour displayed on a computer screen is formed from an (R, G, B) triplet of three 8-bit values, each between 0 and 255.

Thus we can think of a cat image on a computer screen as a discrete rectangular prism, that is, a multidimensional array of numbers as shown in the above image. This rectangular n-dimensional array of numbers is called an **n-th order tensor**.
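
To make the image example concrete, here is a minimal sketch of a tiny image stored as a 3rd-order tensor. The sizes and the (x, y, channel) index order follow the description above and are assumptions for illustration; image libraries often use other layouts, such as (channel, height, width).

```python
import torch

width, height, channels = 6, 4, 3  # hypothetical tiny image
# uint8 holds the 8-bit values 0..255 of each colour channel
image = torch.zeros(width, height, channels, dtype=torch.uint8)

# Set the pixel at x = 2, y = 1 to pure red: (R, G, B) = (255, 0, 0)
image[2, 1] = torch.tensor([255, 0, 0], dtype=torch.uint8)

print(image.dim())      # order of the tensor
print(image.shape)      # size along each of the three dimensions
print(image[2, 1])      # the (R, G, B) triplet of one pixel
```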

This conceptually natural connection between an image and a tensor is very important, since it is the reason why GPUs are used in machine learning.
A GPU (graphics processing unit) usually contains many computational cores (many more than a CPU, central processing unit), and these cores are optimized for operations on images, which we now know to be n-dimensional tensors.

*Thus it is plausible that GPUs are well suited to computations with tensors or high-dimensional data.*


### Tensors and Operations with Tensors in Pytorch

Next, let's see how to create tensors in PyTorch.
Tensors can be created from built-in Python data (e.g. a list of lists) as well as from NumPy arrays.
To get started, please check the [documentation](https://pytorch.org/docs/stable/tensors.html)


In [7]:
tensor_from_2d_list = torch.tensor([[1., -1.], [1., -1.]])


# Print the tensor object created from Python List 
# Write your code here ...
print(tensor_from_2d_list)


tensor_from_np_array = torch.tensor(np.array([[1, 2, 3], [4, 5, 6]]))


# Print the tensor object created from Python Numpy array 
# Write your code here ...
print(tensor_from_np_array)

tensor([[ 1., -1.],
        [ 1., -1.]])
tensor([[1, 2, 3],
        [4, 5, 6]])


The following are some common ways to create a [Tensor](#Definition-of-a-n-th-Order-Tensor):

In [8]:
a = torch.empty(2,2)
b = torch.zeros(2, 2, dtype=torch.long)
c = torch.rand(5, 5)

d = a.new_ones(10, 10, dtype=torch.double)
e = torch.randn_like(d, dtype=torch.float) 
print(e)

tensor([[-0.4466,  2.0338, -0.6191,  1.0366, -2.0438,  0.0694, -0.6019, -0.5677,
          0.1001, -1.5915],
        [ 0.7790, -0.5022,  1.0606,  0.9840, -0.9762,  0.9430, -0.0581,  1.7782,
          0.6172, -0.2088],
        [ 0.6142, -0.3272,  1.2155,  0.2993,  1.4952, -2.1090, -1.9115, -1.3554,
          0.6390,  0.3046],
        [ 0.8973, -0.4599, -1.1134,  0.1304, -1.0458, -0.6904, -0.3858, -1.9713,
         -0.3509, -0.2331],
        [ 0.4404,  0.6904, -0.2822,  0.0271, -1.0838,  0.8653,  0.6715,  1.3066,
         -1.0835,  1.9123],
        [-1.1164,  0.6927,  0.4199, -1.1343, -0.9774, -0.2635, -1.3888, -0.0517,
          0.7196, -0.0179],
        [-0.4479, -0.3884,  0.8258, -0.5185,  0.8185, -0.9299, -0.5635,  1.4531,
          1.7526,  0.1443],
        [ 0.8762,  1.1342, -0.8788, -0.7893, -0.2976,  0.9741,  0.2655,  1.2310,
          0.7198,  0.7662],
        [-0.1325,  0.4499, -0.0308,  0.3103,  0.7242,  0.2635, -1.8343, -1.4760,
         -1.1578,  0.4909],
        [-0.2083,  

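Questions:
1. What is the difference between tensors a and b?
A.1: b is a 2x2 tensor filled with zeros, while a is a 2x2 tensor of uninitialized memory, so its values are arbitrary.
2. What about d and e? When would you use randn_like function?
A.2: d is a 10x10 tensor of ones; e has the same shape as d but contains random floats drawn from the standard normal distribution. Use randn_like when you need normally distributed values with the same shape as an existing tensor.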
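
The answers can be checked with a short sketch such as the following. It uses a smaller 3x3 shape than the cell above purely to keep the output short.

```python
import torch

a = torch.empty(2, 2)                       # uninitialised memory: values are arbitrary
b = torch.zeros(2, 2, dtype=torch.long)     # explicitly filled with integer zeros

d = a.new_ones(3, 3, dtype=torch.double)    # ones; dtype overridden to float64
e = torch.randn_like(d, dtype=torch.float)  # shape of d, standard-normal values, float32

print(b)
print(d.dtype, e.dtype)
print(d.shape == e.shape)
```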
    

Next, see the methods available on a tensor object by typing x. and pressing TAB.
To see the docstring of a specific function, select or type the function and then press SHIFT+TAB.

In [9]:
#Image("assets/img/tabcomplete.png")

        

###  Tensor to Numpy array conversion



In [10]:
test_array = np.arange(16)
test_tensor = torch.from_numpy(test_array)
test_array2 = test_tensor.numpy()
print(f"The type of test_array is, {type(test_array)}\n")
print(f"The type of test_tensor is, {type(test_tensor)}\n")
print(f"Are the shapes of the objects equal? : {test_array.shape == test_tensor.shape}\n")
print(f"After converting the tensor back to Numpy array, is the initial array equivalent to the converted array?")
print(f"{all(test_array == test_array2)}")

The type of test_array is, <class 'numpy.ndarray'>

The type of test_tensor is, <class 'torch.Tensor'>

Are the shapes of the objects equal? : True

After converting the tensor back to Numpy array, is the initial array equivalent to the converted array?
True
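
One detail worth knowing about this conversion: `torch.from_numpy` shares memory with the NumPy array, whereas `torch.tensor` copies the data. The following sketch, with made-up variable names, shows the difference.

```python
import numpy as np
import torch

array = np.arange(4)
shared = torch.from_numpy(array)   # shares memory with `array`
copied = torch.tensor(array)       # makes an independent copy

array[0] = 100                     # modify the NumPy array in place

print(shared)   # reflects the change
print(copied)   # keeps the old value
```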


### Tensor Reshaping : View

In [11]:
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8) 
print(x.size(), y.size(), z.size())

torch.Size([4, 4]) torch.Size([16]) torch.Size([2, 8])


View is similar to NumPy's reshape. What is the meaning of the parameter -1?

<div class="alert alert-info"><h4>Note</h4><p>-1 means that the size of that dimension is left unspecified and is inferred from the total number of elements and the sizes of the other dimensions, e.g. if you have a 4 x 4 array x and you use x.view(-1,8), then -1 stands for 2, since 2 x 8 = 16.</p></div>
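
A few more uses of `-1` are sketched below, together with a check that a view shares its data with the original tensor. The tensor contents are chosen only for illustration.

```python
import torch

x = torch.arange(16).view(4, 4)    # 16 elements arranged as 4 x 4

print(x.view(-1, 8).shape)     # -1 is inferred as 2, since 2 * 8 = 16
print(x.view(2, -1, 2).shape)  # -1 is inferred as 4, since 2 * 4 * 2 = 16
print(x.view(-1).shape)        # flatten: -1 is inferred as 16

# A view shares its data with the original tensor
flat = x.view(-1)
flat[0] = 99
print(x[0, 0])
```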



## Tensors on CUDA

Many types of data are naturally represented as high-dimensional arrays.
GPUs are well suited to machine learning. GPU cores were originally designed for physics and graphics computation, which involves many matrix operations. General computing tasks do not require many matrix operations, so CPUs are not optimized for them and are much slower at these. Physics and graphics are also far easier to parallelise than general computing tasks, which led to the high core count of GPUs.

Due to the matrix-heavy nature of machine learning (neural nets), GPUs are a great fit.
Next we are going to define the CUDA device, if available:

In [12]:
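# create_formatted_var("device")
# We will use the CUDA device if one is available, otherwise the CPU
if cuda0 is not None:
    device = cuda0
    print("Cuda device cuda0 loaded before")
elif torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
else:
    device = torch.device("cpu")           # fall back to the CPU so that `device` is always defined
    print("Your system doesn't have CUDA")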

Your system doesn't have CUDA


To create a tensor directly on the GPU, use the ``device`` argument.
If the tensor was created earlier in a memory location other than GPU memory, use the ``tensorobject.to(device)`` syntax to transfer it to GPU memory.
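
Both options can be sketched as follows. The fallback to the CPU is an addition for this example, so that the code also runs on machines without CUDA.

```python
import torch

# Fall back to the CPU when no CUDA device is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

created_on_device = torch.ones(3, device=device)   # allocated directly on `device`

created_on_cpu = torch.ones(3)                     # lives in ordinary (CPU) memory
moved = created_on_cpu.to(device)                  # transferred to `device`

print(created_on_device.device, moved.device)
```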

In [13]:
if cuda0 is not None:
    x = torch.ones(3,device="cuda")
else :
    cuda0 = 'cpu'
    x = torch.ones(3,device=cuda0)
x.device.type

'cpu'

# Basic Transformations in Pytorch

## Addition of Tensors

Before learning any transformations, it is useful to know how quickly a given transformation runs. For that purpose, feel free to use the IPython magic command %%timeit, which runs the cell in a loop and reports its running time in Jupyter Notebook. The related %%time command, used below, times a single run of the cell.

Take 5 minutes to go over the following top 5 magic commands in Jupyter; it might save you a lot of time later.

[TOP 5 Jupyter Magic commands](https://towardsdatascience.com/the-top-5-magic-commands-for-jupyter-notebooks-2bf0c5ae4bb8)
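
Outside a notebook, where magic commands are not available, a similar measurement can be made with the `timeit` module from the Python standard library. This is a minimal sketch; the tensor sizes and the number of runs are arbitrary choices. When timing GPU code, the device should additionally be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import timeit

import torch

x = torch.rand(100, 100)
y = torch.rand(100, 100)

# Run the addition 1000 times and report the average time per call
n_runs = 1000
total_seconds = timeit.timeit(lambda: torch.add(x, y), number=n_runs)
print(f"{total_seconds / n_runs * 1e6:.2f} microseconds per addition")
```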

In [14]:
%%time
x = torch.tensor([1,2,3],device=cuda0)
y = torch.tensor([5,2,1],device=cuda0)

# way 1 :
result = torch.add(x,y)
print(result,y)

print(x)

tensor([6, 4, 4]) tensor([5, 2, 1])
tensor([1, 2, 3])
CPU times: user 1.52 ms, sys: 299 µs, total: 1.82 ms
Wall time: 1.1 ms


Addition: providing an output tensor as argument



In [15]:
# adds x to y
#y.add_(x)
x = torch.tensor([1,2,3],device=cuda0)
print(x)

tensor([1, 2, 3])


<div class="alert alert-info"><h4>Note</h4><p>Any operation that mutates a tensor in-place is post-fixed with an ``_``.
    For example: ``x.copy_(y)``, ``x.t_()``, will change ``x``.</p></div>
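
The two remaining ways of adding tensors, writing into an output tensor and adding in-place, can be sketched as follows. The tensors are created on the CPU here to keep the example self-contained.

```python
import torch

x = torch.tensor([1, 2, 3])
y = torch.tensor([5, 2, 1])

# Way 2: write the sum into a pre-allocated output tensor
result = torch.empty(3, dtype=x.dtype)
torch.add(x, y, out=result)
print(result)

# Way 3: in-place addition, y is overwritten with x + y
y.add_(x)
print(y)
print(x)   # x is unchanged
```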

You can use standard NumPy-like indexing with all bells and whistles!
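
A few common indexing patterns, shown on a small made-up tensor:

```python
import torch

t = torch.arange(12).view(3, 4)   # rows 0..2, columns 0..3

print(t[0])          # first row
print(t[:, 1])       # second column
print(t[1:, :2])     # rows 1..2, columns 0..1
print(t[t > 6])      # boolean mask: all elements greater than 6
print(t[-1, -1])     # last element
```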



### Homework Exercise on Tensor Adding Error Handling:

Write a function ``try_adding_different_locations`` that has 4 input arguments:
- x : the input tensor. Its default value should be a 2nd-order 3x3 tensor of ones, created on the CPU
- device : the CUDA device available on your system, passed as a default argument
- notbtoh : bool with default value True
- output_type : string with default value "cpu"

TASK: **The function should try out all combinations of tensor locations and report in which of them tensor addition works and in which it does not.**

In [16]:
def try_adding_different_locations(x = torch.tensor([1,2,3]) , device = cuda0, notbtoh = True, output_type = 'cpu'):
    A = x
    print(A)
                                   
    

## Matrix Multiplication on Tensor Objects using mm

From a linear algebra course, you may remember the concept of matrix multiplication.
Since an image is a tensor, and a tensor is a generalization of a matrix, matrix multiplication can be extended to tensors.

In **Tensor Calculus**, a course taught in universities, the operations defined between tensors include tensor addition, tensor product and contraction, for example.

**The basic multiplication functions in PyTorch are matrix multiplications, not general tensor products**; more general products and contractions are available through functions such as ``torch.tensordot`` and ``torch.einsum``.
In practice, dimensions beyond two are usually flattened, giving a 2-D tensor (a matrix), whenever multiplication is needed, so the multiplication used here is ordinary matrix multiplication.

To multiply tensors, use ``torch.mm`` or ``torch.matmul``.
<div class="alert alert-warning" role="alert">
Don't confuse ``dot`` and ``mm`` operators, check <a href="https://stackoverflow.com/questions/44524901/how-to-do-product-of-matrices-in-pytorch" class="alert-link">this link.</a>
 
To get element-wise product, use A*B.
</div>
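
The difference between these products can be sketched on small made-up tensors:

```python
import torch

A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[0., 1.], [1., 0.]])

print(torch.mm(A, B))      # matrix product
print(A @ B)               # same matrix product, operator form
print(A * B)               # element-wise product

u = torch.tensor([1., 2., 3.])
v = torch.tensor([4., 5., 6.])
print(torch.dot(u, v))     # inner product of two 1-D tensors: a scalar
```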


In [17]:
a = torch.from_numpy(np.array([[1,2,3],[4,5,6]]))
b = a
print(a.shape)
print(b.shape)
c = torch.mm(a,b)  # doesn't work, do you know why?


torch.Size([2, 3])
torch.Size([2, 3])


RuntimeError: size mismatch, m1: [2 x 3], m2: [2 x 3] at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/TH/generic/THTensorMath.cpp:752

**Can the tensors a and  b be matrix-multiplied? Why / Why not?**

To matrix-multiply two objects, the last dimension of the first object must equal the first dimension of the second object.
If we have two objects with identical shapes, one way to make the multiplication compatible is to transpose either of them. Let's try it.

In [18]:
c = torch.mm(a,b.t())
d = torch.mm(a.t(),b)
print(c)
print(d)

tensor([[14, 32],
        [32, 77]])
tensor([[17, 22, 27],
        [22, 29, 36],
        [27, 36, 45]])


## In-Place Operations

Guess what is the following code going to print:

In [19]:
a = torch.tensor(np.array([[1,2,3],[4,5,6]]))
b = a
try:
    torch.mm(a,b.t())
    torch.add(a,b)
    print("First multiplication and addition succeeded")
except Exception:
    print("First  multiplication failed")

try:
    # note that b refers to the same tensor object as a
    print(f"The result of in-place addition is {a.add_(b)}")
    print("Second addition succeeded")
    (a.t()).mm_(b)
    print("Second multiplication succeeded")
except Exception:
    print("Second  multiplication failed")


First multiplication and addition succeeded
The result of in-place addition is tensor([[ 2,  4,  6],
        [ 8, 10, 12]])
Second addition succeeded
Second  multiplication failed


Was your guess correct?

The crux lies in the meaning of the trailing ``_``: PyTorch often provides underscore-suffixed versions of methods, which stand for in-place operations. That means that the object on which the operation is called is changed in memory, without the result being explicitly saved to a new variable.

Thus a binary operation $$(x , y)  \rightarrow z$$ that takes two arguments and produces a new object can often be written as an in-place method call,
resulting in

$$x.\mathrm{method\_}(y) \rightarrow x$$,

where the result is stored in the object the method is called on (in this case ``x``), which is overwritten in memory, since the operation is done in-place.


<div class="alert alert-warning" role="alert">
    Notice however that multiplication as defined by ``mm`` or ``matmul`` or ``@`` does not have an in-place version, thus it always returns a new tensor.
</div>
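
The difference between the out-of-place and the in-place version of an operation can be checked directly, for instance by comparing the memory address of the data before and after. This is a small sketch with made-up tensors.

```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([10, 20, 30])
address_before = a.data_ptr()      # memory address of a's data

c = a.add(b)                       # out-of-place: a is untouched, c is new
print(a, c)

a.add_(b)                          # in-place: a itself now holds the sum
print(a)
print(a.data_ptr() == address_before)   # the same memory was reused
```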


## Neural Network Definition
[Neural network on Wikipedia](https://en.wikipedia.org/wiki/Neural_network)

A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes.  
The connections of the biological neuron are modeled as weights, denoted by $w$ in the following graph. A positive weight reflects an excitatory connection, while a negative weight means an inhibitory connection. All inputs are multiplied by their weights and summed. Finally, an activation function (denoted by $f$ in the following graph) controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or between −1 and 1. In other words, each neuron in a layer receives a linear combination of the activations (the outputs of $f$ in the image below) of the neurons in the previous layer.
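
A single artificial neuron as described above can be sketched in a few lines. The input and weight values are made up, and the sigmoid and tanh activations are chosen only because their output ranges match the two ranges mentioned in the text.

```python
import torch

inputs = torch.tensor([0.5, -1.0, 2.0])    # activations from the previous layer
weights = torch.tensor([0.8, 0.2, -0.5])   # positive: excitatory, negative: inhibitory

weighted_sum = torch.dot(weights, inputs)  # all inputs are weighted and summed

print(torch.sigmoid(weighted_sum))  # activation with output range (0, 1)
print(torch.tanh(weighted_sum))     # activation with output range (-1, 1)
```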

In [20]:
from IPython.display import Image
#Image("assets/img/neural_network.png")

## Autograd : Automatic Differentiation

<div class="alert alert-success" role="alert">
  For the definition of neural network, please refer to the following link:
</div>



[NEURAL NETWORK DEFINITION](hb#Neural-Network-Definition)

**Autograd**, one of the modern automatic differentiation frameworks, lies at the **core of PyTorch**.

Thus, it is a very good idea to [read more about Autograd](https://pytorch.org/docs/master/notes/autograd.html)

The Autograd package in PyTorch automates the computation of backward passes in computational graphs.

In Autograd: 
- the forward pass of the neural network defines a computational graph
- nodes in the graph are Tensors
- edges are functions producing output Tensors from input Tensors
- the backward pass computes the gradient of some scalar (e.g. a loss) with respect to every tensor ``x`` for which ``x.requires_grad`` is ``True``.

**Thus, to compute gradients with respect to a tensor, create it with ``requires_grad=True``.**
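
The pieces of the graph can be inspected directly. In this sketch with made-up values, `is_leaf` tells whether a tensor was created by the user, and `grad_fn` is the function that produced an intermediate tensor.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)   # leaf tensor: created by the user
y = 3 * x                                   # intermediate node of the graph
z = y ** 2                                  # scalar output, z = 9 * x^2

print(x.is_leaf, y.is_leaf)     # x is a leaf, y is not
print(y.grad_fn, z.grad_fn)     # the functions that produced y and z

z.backward()                    # backward pass: fills x.grad with dz/dx
print(x.grad)                   # dz/dx = 18 * x = 36
```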

In [21]:
x = torch.tensor([2],requires_grad=True,dtype=torch.float)
y = torch.tensor([3],requires_grad=True,dtype=torch.float)
z = x + y 
z.backward()
print(f"Gradient with respect to z is {z.grad}")
print(f"Gradient with respect to x is {x.grad}")
print(f"Gradient with respect to y is {y.grad}")

Gradient with respect to z is None
Gradient with respect to x is tensor([1.])
Gradient with respect to y is tensor([1.])


Here we saw the most basic way to get a derivative or gradient.
`z.grad` is `None` because `z` is an intermediate result (a non-leaf tensor), and by default autograd stores gradients only for leaf tensors such as `x` and `y`.
The gradient with respect to x is available: since the derivative of z = x + y w.r.t. x is 1, the result is tensor([1.]), a floating point value because of the chosen data type.
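
Note that `z` itself does require gradients, because it was computed from tensors that do; it is an intermediate (non-leaf) tensor, and autograd keeps gradients only for leaf tensors by default. If the gradient of such a tensor is needed, `retain_grad()` can be called on it before the backward pass, as in this sketch:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)
z = x + y

z.retain_grad()    # ask autograd to keep the gradient of this non-leaf tensor
z.backward()

print(z.requires_grad)   # True: z was computed from tensors that require gradients
print(z.grad)            # dz/dz = 1
print(x.grad, y.grad)    # dz/dx = 1, dz/dy = 1
```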

<div class="alert alert-warning" role="alert">
 Note: In neural networks, x and y usually denote the input and output data, and we are almost never required to compute gradients w.r.t. the data.
    
The reason is simple:
Training adjusts the parameters (weights) of the model, not the data. When we have a model such as a neural network, which is a black box model, the model itself shouldn't hold any assumptions about or explicit dependencies on the data. It's up to the machine learning engineer to define the model or network structure so that the model learns from the data, and gradients are needed only for the parameters being learned.
</div>

*Let's take a look at another example*

In [22]:
x = torch.tensor([1, 2], requires_grad=True, dtype=torch.float)
y = torch.tensor([1, 3], requires_grad=True, dtype=torch.float)
print(f"At the beginning, x is {x}")
print(f"At the beginning, y is {y}")

x = x ** 2
y = y ** 2

print(f"After squaring, x is {x}")
print(f"After squaring, y is {y}")

x.add_(y)

print(f"After adding y to x in-place, the value of x is {x}")

z = x.sum()
print(f"The sum of the elements in this array is {z}")
z.backward()
print("Finally, after running the backward step to compute the gradients, we get:")
print(z)
# Note that z is not a leaf tensor, so its gradient is not stored and z.grad prints None.
print(f"The gradient of z is {z.grad}")

At the beginning, x is tensor([1., 2.], requires_grad=True)
At the beginning, y is tensor([1., 3.], requires_grad=True)
After squaring, x is tensor([1., 4.], grad_fn=<PowBackward0>)
After squaring, y is tensor([1., 9.], grad_fn=<PowBackward0>)
After adding y to x in-place, the value of x is tensor([ 2., 13.], grad_fn=<AddBackward0>)
The sum of the elements in this array is 15.0
Finally, after running the backward step to compute the gradients, we get:
tensor(15., grad_fn=<SumBackward0>)
The gradient of z is None


### Verifying the correctness of Autograd Output

In [23]:
from __future__ import print_function
from torch.autograd import Variable
import torch
import numpy as np
# important : every time before generating random numbers, you should set the seed!
torch.manual_seed(0)
np.random.seed(0)

x = Variable(torch.randn(1,1), requires_grad = True)
print(f"The value sampled from normal distribution gives {x} \n")
y = 3*x
print(f"When we form a linear function of {x} by multiplying this sample value by 3, we get {y}\n")
z = y**2 # derivative w.r.t. x is:
print(f"After that we form a quadratic function of y, and at this sample value, it takes a value of {z} \n")
print("Now we are going to register the hooks for each variable \n")
x.register_hook(print)
y.register_hook(print)

#z.register_hook(print)
z.backward(retain_graph=True) # triggers the hooks: prints dz/dy = 2y, then dz/dx = 18x
y.backward(retain_graph=True) # triggers the hooks again: prints dy/dy = 1, then dy/dx = 3

print(f"Gradients obtained from the first graph are {z.grad} \n")
#print(f"Gradients obtained from the second graph are {z2.grad} \n")





The value sampled from normal distribution gives tensor([[1.5410]], requires_grad=True) 

When we form a linear function of tensor([[1.5410]], requires_grad=True) by multiplying this sample value by 3, we get tensor([[4.6230]], grad_fn=<MulBackward0>)

After that we form a quadratic function of y, and at this sample value, it takes a value of tensor([[21.3720]], grad_fn=<PowBackward0>) 

Now we are going to register the hooks for each variable 

tensor([[9.2460]])
tensor([[27.7379]])
tensor([[1.]])
tensor([[3.]])
Gradients obtained from the first graph are None 



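As you saw, z.grad prints None: z is an intermediate tensor, and gradients are stored only for leaf tensors. To see the gradient flowing through a particular intermediate variable, register a hook on it.

**The same example with a gradient conflict:** wrapping intermediate results in new `Variable` objects cuts them off from the computational graph, so `z2` becomes a leaf and the only gradient computed is that of `z2` with respect to itself, which is 1.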

In [24]:
torch.manual_seed(0)
x = Variable(torch.randn(1,1), requires_grad = False)
y2 = Variable(3*x,requires_grad=False)
#y2.register_hook(print)
z2 = Variable((y2)**2,requires_grad=True)
z2.register_hook(print)
z2.backward()
print(f"The value of y is {y}")
print(z2.grad)

tensor([[1.]])
The value of y is tensor([[4.6230]], grad_fn=<MulBackward0>)
tensor([[1.]])


In [25]:
print(f"The z value is {9*1.54**2}")
print(f"The grad_y(z) is {6*1.54}")
print(f"The grad_x(z) is {18*1.54}")

The z value is 21.3444
The grad_y(z) is 9.24
The grad_x(z) is 27.72


### Recap : How to properly pass tensors and find their gradients

In [26]:
from __future__ import print_function
from torch.autograd import Variable
import torch
torch.manual_seed(0)
xx = Variable(torch.randn(1,1), requires_grad = True)
yy = 3*xx
zz = yy**2

yy.register_hook(print)
zz.backward()

tensor([[9.2460]])


To be sure that the differentiation is working correctly, let us compute the following:
$$A(x)  = \partial_x z(y(x)) = \partial_x (y^2(x)) = \partial_x (9x^2) = 18x$$.
If $x = 1.5410$ (the value sampled above with seed 0), then $$A(x) = 18 \cdot x = 27.738 \approx 27.7379$$, which agrees with the gradient printed earlier by the hook on x.

We see how PyTorch can be used to compute derivatives automatically, since the chain rule is implemented there internally. Note that this is automatic differentiation at concrete values, not symbolic mathematics. With computationally intensive tasks, an efficient implementation of automatic differentiation can be a huge benefit!

In [27]:
round(18 * 1.5410, 4)

27.738

Strictly speaking, what `backward()` computes is a vector-Jacobian product, which resembles a directional derivative more than a full gradient. This is why it can be called without arguments only on scalar outputs. If we try to take the derivative of the non-scalar tensor x w.r.t. x, we get:

In [28]:
x = torch.tensor([2,4,5],requires_grad=True,dtype=torch.float)
x.register_hook(print)
x.backward()
print(f"Gradient with respect to x is {x.grad}")

RuntimeError: grad can be implicitly created only for scalar outputs

<div class="alert alert-warning" role="error">

RuntimeError: grad can be implicitly created only for scalar outputs
</div>
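
For a non-scalar output, `backward()` needs to be told which vector to multiply the Jacobian with, through its `gradient` argument. The sketch below, with made-up values, shows this next to the common alternative of reducing the output to a scalar first.

```python
import torch

x = torch.tensor([2.0, 4.0, 5.0], requires_grad=True)
y = x ** 2                       # non-scalar output

# Option 1: pass the vector v explicitly; autograd computes v^T J
y.backward(gradient=torch.ones_like(y))
print(x.grad)                    # 2 * x

# Option 2: reduce to a scalar first (equivalent to v being all ones)
x.grad.zero_()                   # gradients accumulate, so reset them first
(x ** 2).sum().backward()
print(x.grad)
```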

## Autograd Application : Gradient Descent in PyTorch

Next we are going to take a look at a basic example, adapted from [Github demo for Autograd](https://github.com/jcjohnson/pytorch-examples#pytorch-autograd)
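
Before the full network, the idea of gradient descent with Autograd can be sketched on a one-parameter toy problem. The function, starting point and learning rate below are made up for illustration and are not part of the referenced demo.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)   # parameter to optimise
learning_rate = 0.1

for step in range(100):
    loss = (w - 3.0) ** 2        # forward pass: the minimum is at w = 3
    loss.backward()              # backward pass: w.grad = 2 * (w - 3)
    with torch.no_grad():        # the update itself must not be tracked
        w -= learning_rate * w.grad
    w.grad.zero_()               # reset, otherwise gradients accumulate

print(w.item())
```

The same three steps (forward pass, backward pass, update under `torch.no_grad()`) apply when the single parameter is replaced by weight matrices.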

Please make sure that the log_path variable is initialized properly
[in setup](#Setting-Up-Visualization-Platform-on-Pytorch)

In [29]:
# install or upgrade tensorboardX for the current user
!pip install --upgrade tensorboardX --user

Requirement already up-to-date: tensorboardX in /home/zak/.local/lib/python3.7/site-packages (1.8)


In [30]:
from tensorboardX import SummaryWriter
log_path = './runs/gd/'

if log_path:
    print("accessing predefined path")
    writer = SummaryWriter(log_dir=log_path)
else :
    print("using new path set")
    writer = SummaryWriter(log_dir='./runs/gd/')
#In addition to SummaryWriter, there are also other writers, please check the manual
# https://tensorboardx.readthedocs.io/en/latest/tutorial.html

# !tensorboard --logdir {log_path} --host localhost --port 8088
# you have to execute the tensorboard command from another shell, otherwise it blocks the notebook
# read this : https://medium.com/@anthony_sarkis/tensorboard-quick-start-in-5-minutes-e3ec69f673af


accessing predefined path


In [31]:
writer

<tensorboardX.writer.SummaryWriter at 0x7f6c0a324c18>

In [32]:
from torch.distributions import normal
m = normal.Normal(4.0, 5.0)
m

Normal(loc: 4.0, scale: 5.0)

In [33]:
# Code in file autograd/two_layer_net_autograd.py
import torch
import torch.nn.functional as F # to use relu function as an activation function

#device = torch.device('cuda') # Uncomment this to run on GPU
device = torch.device('cpu') # Comment this out when running on GPU

# N is batch size;
# D_in is input dimension;
# H is hidden dimension; 
#D_out is output dimension.
N, D_in, H, D_out = 64, 350, 50, 20

# input data : batch of N = 64 samples with D_in = 350 features each
# output data : N x D_out = 64 x 20 continuous values (real scalars)

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device, dtype=torch.float)
# generate normally distributed data of dim NxD_in, store it on device, requires_grad = False
# tensor contains random values as input
y = torch.randn(N, D_out,device=device,dtype=torch.float)
# tensor contains random values as output
# generate normally distributed data of dim NxD_out, store it on device, requires_grad = False

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
# here the gradient will be computed for the variables that are related to the model learning something new,
# i.e. the network weights in this case


# WEIGHTS
#weights for each nodes
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
#generate normally distributed data of dim D_in x H, store it on device, requires_grad = True 
w2 = torch.randn(H, D_out, device=device, requires_grad=True)
#generate normally distributed data of dim H x D_out, store it on device, requires_grad = True 

# initialize errors, w1_array and w2_array as empty lists;
# they collect the error and weight history during training
errors = []
w1_array = []
w2_array = []
# set the network learning rate parameter 'learning_rate' to some small number
learning_rate = 1e-6 # step size of the gradient descent update rule (coefficient of the gradient)
for epoch in range(10000): # update the weights and recompute the gradient in each iteration
    y_pred = F.relu(x.mm(w1)).mm(w2) # forward pass with ReLU activation
    loss_value = ((y_pred - y) ** 2).sum() # sum of squared errors (the MSE without the division by the number of elements)
    print(epoch, loss_value.item())
    loss_value.backward() # autograd computes the gradients and stores them in w1.grad and w2.grad
    with torch.no_grad():
        # update rule: new_weight = current_weight - step_size * gradient
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after updating weights
        w1.grad.zero_() # zero the gradients so that they do not accumulate across iterations
        w2.grad.zero_()

0 10607453.0
1 7779231.0
2 6444332.0
3 5619250.0
4 5012568.5
5 4524032.5
6 4113123.0
7 3759964.75
8 3453118.0
9 3183623.5
10 2945894.0
11 2734977.75
12 2546601.0
13 2377490.75
14 2224908.0
15 2086771.5
16 1961294.5
17 1847046.625
18 1742680.75
19 1646978.5
20 1558809.25
21 1477657.125
22 1402774.25
23 1333690.75
24 1269726.25
25 1210315.625
26 1155024.625
27 1103480.625
28 1055313.0
29 1010295.8125
30 968199.5
31 928716.25
32 891661.3125
33 856832.75
34 824045.9375
35 793136.0
36 763965.5625
37 736425.8125
38 710395.4375
39 685733.9375
40 662371.75
41 640246.25
42 619219.75
43 599225.9375
44 580217.3125
45 562146.125
46 544930.9375
47 528503.25
48 512845.21875
49 497923.0625
50 483661.9375
51 470002.5
52 456932.15625
53 444428.9375
54 432442.25
55 420940.03125
56 409910.5625
57 399321.9375
58 389159.75
59 379384.875
60 369985.4375
61 360955.71875
62 352263.0
63 343895.4375
64 335826.84375
65 328047.3125
66 320543.03125
67 313306.59375
68 306321.34375
69 299573.6875
70 293061.84375
71 2

642 5528.287109375
643 5510.6162109375
644 5493.01220703125
645 5475.47265625
646 5458.02734375
647 5440.67578125
648 5423.40478515625
649 5406.20947265625
650 5389.10888671875
651 5372.0703125
652 5355.1181640625
653 5338.234375
654 5321.42431640625
655 5304.705078125
656 5288.0693359375
657 5271.494140625
658 5255.01025390625
659 5238.59375
660 5222.25732421875
661 5205.98876953125
662 5189.78955078125
663 5173.66943359375
664 5157.63134765625
665 5141.6748046875
666 5125.77490234375
667 5109.9638671875
668 5094.208984375
669 5078.5283203125
670 5062.916015625
671 5047.3779296875
672 5031.91064453125
673 5016.525390625
674 5001.19091796875
675 4985.92529296875
676 4970.7373046875
677 4955.6142578125
678 4940.5556640625
679 4925.58056640625
680 4910.67431640625
681 4895.81494140625
682 4881.02734375
683 4866.2919921875
684 4851.638671875
685 4837.05126953125
686 4822.53173828125
687 4808.0615234375
688 4793.6845703125
689 4779.35009765625
690 4765.0712890625
691 4750.85888671875
692 4

1316 1293.6065673828125
1317 1291.69970703125
1318 1289.797607421875
1319 1287.900146484375
1320 1286.008544921875
1321 1284.1201171875
1322 1282.2371826171875
1323 1280.3590087890625
1324 1278.485107421875
1325 1276.6156005859375
1326 1274.750244140625
1327 1272.8892822265625
1328 1271.0316162109375
1329 1269.1787109375
1330 1267.330078125
1331 1265.485595703125
1332 1263.646240234375
1333 1261.81298828125
1334 1259.982177734375
1335 1258.1553955078125
1336 1256.33203125
1337 1254.51416015625
1338 1252.6995849609375
1339 1250.889892578125
1340 1249.08447265625
1341 1247.283203125
1342 1245.4869384765625
1343 1243.6947021484375
1344 1241.906005859375
1345 1240.1199951171875
1346 1238.3404541015625
1347 1236.56494140625
1348 1234.79296875
1349 1233.0228271484375
1350 1231.259521484375
1351 1229.50048828125
1352 1227.744140625
1353 1225.9913330078125
1354 1224.2470703125
1355 1222.5030517578125
1356 1220.761962890625
1357 1219.027587890625
1358 1217.2950439453125
1359 1215.5687255859375


1697 800.8126220703125
1698 799.954833984375
1699 799.0995483398438
1700 798.2440185546875
1701 797.3919067382812
1702 796.5407104492188
1703 795.6912841796875
1704 794.8425903320312
1705 793.99560546875
1706 793.151611328125
1707 792.3070678710938
1708 791.4654541015625
1709 790.6259765625
1710 789.7869262695312
1711 788.9498291015625
1712 788.1143798828125
1713 787.2801513671875
1714 786.4476318359375
1715 785.6157836914062
1716 784.7863159179688
1717 783.9586791992188
1718 783.1326904296875
1719 782.3076782226562
1720 781.484375
1721 780.662841796875
1722 779.8427124023438
1723 779.0233154296875
1724 778.206298828125
1725 777.3899536132812
1726 776.5755615234375
1727 775.7620849609375
1728 774.9508666992188
1729 774.1400146484375
1730 773.3323974609375
1731 772.52587890625
1732 771.720458984375
1733 770.915771484375
1734 770.1124877929688
1735 769.3125
1736 768.5118408203125
1737 767.7132568359375
1738 766.916748046875
1739 766.1217651367188
1740 765.3281860351562
1741 764.535766601

2379 451.9397277832031
2380 451.64837646484375
2381 451.3575439453125
2382 451.06719970703125
2383 450.7774658203125
2384 450.4879150390625
2385 450.1986389160156
2386 449.9095458984375
2387 449.6214599609375
2388 449.33343505859375
2389 449.0458679199219
2390 448.75823974609375
2391 448.47119140625
2392 448.184814453125
2393 447.89886474609375
2394 447.6134338378906
2395 447.3285217285156
2396 447.04339599609375
2397 446.7586669921875
2398 446.4742431640625
2399 446.19073486328125
2400 445.90753173828125
2401 445.624755859375
2402 445.3421325683594
2403 445.06060791015625
2404 444.7784423828125
2405 444.496826171875
2406 444.21575927734375
2407 443.9353332519531
2408 443.6549377441406
2409 443.37506103515625
2410 443.09539794921875
2411 442.816162109375
2412 442.53741455078125
2413 442.25921630859375
2414 441.9813537597656
2415 441.70379638671875
2416 441.42645263671875
2417 441.14959716796875
2418 440.87261962890625
2419 440.5972595214844
2420 440.3214111328125
2421 440.0458374023437

3040 322.19622802734375
3041 322.0675964355469
3042 321.9387512207031
3043 321.810546875
3044 321.68231201171875
3045 321.5542297363281
3046 321.4261474609375
3047 321.29840087890625
3048 321.17095947265625
3049 321.0433044433594
3050 320.9159240722656
3051 320.7889099121094
3052 320.6618347167969
3053 320.5349426269531
3054 320.4081115722656
3055 320.2811584472656
3056 320.1546936035156
3057 320.02838134765625
3058 319.9019470214844
3059 319.77593994140625
3060 319.6502380371094
3061 319.5244140625
3062 319.3985595703125
3063 319.27301025390625
3064 319.1476745605469
3065 319.0223693847656
3066 318.897216796875
3067 318.772216796875
3068 318.64715576171875
3069 318.5224914550781
3070 318.39788818359375
3071 318.27313232421875
3072 318.148681640625
3073 318.0248107910156
3074 317.9007263183594
3075 317.77679443359375
3076 317.65289306640625
3077 317.5293884277344
3078 317.4058837890625
3079 317.2820739746094
3080 317.1589660644531
3081 317.03594970703125
3082 316.9128723144531
3083 316

3404 283.3742370605469
3405 283.28631591796875
3406 283.1986999511719
3407 283.11083984375
3408 283.0234375
3409 282.9358825683594
3410 282.8485107421875
3411 282.7613220214844
3412 282.67413330078125
3413 282.5869445800781
3414 282.5000915527344
3415 282.4129333496094
3416 282.32635498046875
3417 282.2393798828125
3418 282.1527099609375
3419 282.06640625
3420 281.97998046875
3421 281.89361572265625
3422 281.80755615234375
3423 281.7213439941406
3424 281.63507080078125
3425 281.5491943359375
3426 281.4631652832031
3427 281.37744140625
3428 281.2917785644531
3429 281.20611572265625
3430 281.1204528808594
3431 281.03515625
3432 280.9494323730469
3433 280.86419677734375
3434 280.7790832519531
3435 280.6939697265625
3436 280.6087341308594
3437 280.52386474609375
3438 280.4390563964844
3439 280.3539123535156
3440 280.26934814453125
3441 280.18487548828125
3442 280.1004943847656
3443 280.0159606933594
3444 279.9316101074219
3445 279.84722900390625
3446 279.76312255859375
3447 279.67907714843

4079 239.72975158691406
4080 239.68374633789062
4081 239.6373291015625
4082 239.59146118164062
4083 239.54547119140625
4084 239.49945068359375
4085 239.45358276367188
4086 239.40777587890625
4087 239.36175537109375
4088 239.31594848632812
4089 239.27003479003906
4090 239.22438049316406
4091 239.1786346435547
4092 239.1332244873047
4093 239.08749389648438
4094 239.0418243408203
4095 238.99627685546875
4096 238.9506378173828
4097 238.90528869628906
4098 238.8599090576172
4099 238.8145294189453
4100 238.76907348632812
4101 238.72366333007812
4102 238.67857360839844
4103 238.63320922851562
4104 238.5880126953125
4105 238.5431365966797
4106 238.49766540527344
4107 238.45262145996094
4108 238.40760803222656
4109 238.36285400390625
4110 238.31776428222656
4111 238.27294921875
4112 238.22793579101562
4113 238.18307495117188
4114 238.1382293701172
4115 238.09335327148438
4116 238.04864501953125
4117 238.00390625
4118 237.95932006835938
4119 237.9147491455078
4120 237.8700714111328
4121 237.8254

4459 224.62477111816406
4460 224.59072875976562
4461 224.55673217773438
4462 224.52259826660156
4463 224.488525390625
4464 224.4544219970703
4465 224.42051696777344
4466 224.38673400878906
4467 224.3529052734375
4468 224.31890869140625
4469 224.28488159179688
4470 224.25125122070312
4471 224.21726989746094
4472 224.18357849121094
4473 224.14981079101562
4474 224.1160430908203
4475 224.0823516845703
4476 224.04843139648438
4477 224.014892578125
4478 223.98117065429688
4479 223.94747924804688
4480 223.91403198242188
4481 223.88034057617188
4482 223.84683227539062
4483 223.813232421875
4484 223.77972412109375
4485 223.74624633789062
4486 223.71282958984375
4487 223.67941284179688
4488 223.64596557617188
4489 223.61244201660156
4490 223.5791015625
4491 223.5457763671875
4492 223.51246643066406
4493 223.4792938232422
4494 223.4460906982422
4495 223.4127197265625
4496 223.379638671875
4497 223.34634399414062
4498 223.31301879882812
4499 223.27981567382812
4500 223.24685668945312
4501 223.213

5153 205.77996826171875
5154 205.75833129882812
5155 205.73696899414062
5156 205.71536254882812
5157 205.69383239746094
5158 205.67225646972656
5159 205.65086364746094
5160 205.62940979003906
5161 205.60797119140625
5162 205.58644104003906
5163 205.5650634765625
5164 205.5435791015625
5165 205.522216796875
5166 205.50094604492188
5167 205.47943115234375
5168 205.458251953125
5169 205.4366455078125
5170 205.41551208496094
5171 205.39414978027344
5172 205.372802734375
5173 205.35153198242188
5174 205.33035278320312
5175 205.3091278076172
5176 205.2877655029297
5177 205.26649475097656
5178 205.24525451660156
5179 205.22410583496094
5180 205.20297241210938
5181 205.181884765625
5182 205.16053771972656
5183 205.13938903808594
5184 205.11814880371094
5185 205.09718322753906
5186 205.075927734375
5187 205.05479431152344
5188 205.03382873535156
5189 205.01263427734375
5190 204.9915771484375
5191 204.97044372558594
5192 204.9495849609375
5193 204.92843627929688
5194 204.90745544433594
5195 204.

5529 198.57269287109375
5530 198.5556182861328
5531 198.53846740722656
5532 198.5213623046875
5533 198.50442504882812
5534 198.48712158203125
5535 198.4700164794922
5536 198.4530487060547
5537 198.43605041503906
5538 198.41879272460938
5539 198.40182495117188
5540 198.38479614257812
5541 198.36785888671875
5542 198.350830078125
5543 198.33355712890625
5544 198.31675720214844
5545 198.29966735839844
5546 198.28268432617188
5547 198.26577758789062
5548 198.24859619140625
5549 198.23179626464844
5550 198.2147674560547
5551 198.1977996826172
5552 198.18093872070312
5553 198.16390991210938
5554 198.14720153808594
5555 198.1302490234375
5556 198.11346435546875
5557 198.0964813232422
5558 198.0796661376953
5559 198.0625762939453
5560 198.0460205078125
5561 198.0291748046875
5562 198.01229858398438
5563 197.9954376220703
5564 197.97872924804688
5565 197.9617462158203
5566 197.9449005126953
5567 197.92831420898438
5568 197.9114532470703
5569 197.89454650878906
5570 197.8778533935547
5571 197.86

5895 192.87046813964844
5896 192.85653686523438
5897 192.84219360351562
5898 192.82814025878906
5899 192.81399536132812
5900 192.79974365234375
5901 192.78558349609375
5902 192.77163696289062
5903 192.75753784179688
5904 192.74337768554688
5905 192.72930908203125
5906 192.71524047851562
5907 192.70115661621094
5908 192.68701171875
5909 192.67300415039062
5910 192.65890502929688
5911 192.6448974609375
5912 192.63079833984375
5913 192.61688232421875
5914 192.60264587402344
5915 192.5885772705078
5916 192.57461547851562
5917 192.5607147216797
5918 192.54653930664062
5919 192.53256225585938
5920 192.51864624023438
5921 192.504638671875
5922 192.49061584472656
5923 192.47677612304688
5924 192.46279907226562
5925 192.44871520996094
5926 192.4347381591797
5927 192.42080688476562
5928 192.40692138671875
5929 192.3929443359375
5930 192.37899780273438
5931 192.364990234375
5932 192.35125732421875
5933 192.33731079101562
5934 192.3234100341797
5935 192.30938720703125
5936 192.295654296875
5937 19

6379 186.74473571777344
6380 186.73321533203125
6381 186.72206115722656
6382 186.71080017089844
6383 186.69943237304688
6384 186.68809509277344
6385 186.6768798828125
6386 186.66563415527344
6387 186.6542205810547
6388 186.6429901123047
6389 186.63174438476562
6390 186.62039184570312
6391 186.60916137695312
6392 186.59786987304688
6393 186.58680725097656
6394 186.57546997070312
6395 186.5641326904297
6396 186.55288696289062
6397 186.54176330566406
6398 186.53048706054688
6399 186.519287109375
6400 186.50802612304688
6401 186.49697875976562
6402 186.48562622070312
6403 186.47450256347656
6404 186.4630584716797
6405 186.4520721435547
6406 186.44081115722656
6407 186.4296875
6408 186.4184112548828
6409 186.40740966796875
6410 186.39637756347656
6411 186.385009765625
6412 186.37384033203125
6413 186.36270141601562
6414 186.35157775878906
6415 186.34033203125
6416 186.32928466796875
6417 186.31809997558594
6418 186.3069305419922
6419 186.29580688476562
6420 186.28468322753906
6421 186.27355

7058 180.03472900390625
7059 180.02621459960938
7060 180.0176239013672
7061 180.00888061523438
7062 180.0003204345703
7063 179.99166870117188
7064 179.98306274414062
7065 179.9741973876953
7066 179.96566772460938
7067 179.95713806152344
7068 179.94845581054688
7069 179.93984985351562
7070 179.93138122558594
7071 179.9226531982422
7072 179.9140167236328
7073 179.90554809570312
7074 179.89695739746094
7075 179.88844299316406
7076 179.8797149658203
7077 179.87118530273438
7078 179.8623809814453
7079 179.85382080078125
7080 179.8453826904297
7081 179.83685302734375
7082 179.828369140625
7083 179.81951904296875
7084 179.81103515625
7085 179.80262756347656
7086 179.79396057128906
7087 179.78550720214844
7088 179.77687072753906
7089 179.76828002929688
7090 179.75978088378906
7091 179.7512969970703
7092 179.74270629882812
7093 179.7342071533203
7094 179.72560119628906
7095 179.71714782714844
7096 179.70858764648438
7097 179.70013427734375
7098 179.69149780273438
7099 179.68307495117188
7100 17

7763 174.6110076904297
7764 174.60401916503906
7765 174.59716796875
7766 174.59027099609375
7767 174.5832977294922
7768 174.57669067382812
7769 174.5697479248047
7770 174.56301879882812
7771 174.55609130859375
7772 174.54922485351562
7773 174.5424346923828
7774 174.53565979003906
7775 174.52871704101562
7776 174.5218963623047
7777 174.51487731933594
7778 174.5081329345703
7779 174.50125122070312
7780 174.4944305419922
7781 174.48757934570312
7782 174.48068237304688
7783 174.47384643554688
7784 174.4669952392578
7785 174.46018981933594
7786 174.45346069335938
7787 174.44659423828125
7788 174.43972778320312
7789 174.43299865722656
7790 174.4261474609375
7791 174.4193115234375
7792 174.4125213623047
7793 174.4056396484375
7794 174.39889526367188
7795 174.39205932617188
7796 174.38526916503906
7797 174.37850952148438
7798 174.37155151367188
7799 174.36480712890625
7800 174.3579864501953
7801 174.35113525390625
7802 174.34429931640625
7803 174.33766174316406
7804 174.33079528808594
7805 174

8423 170.45848083496094
8424 170.4526824951172
8425 170.44692993164062
8426 170.441162109375
8427 170.43539428710938
8428 170.42965698242188
8429 170.42385864257812
8430 170.41810607910156
8431 170.41232299804688
8432 170.406494140625
8433 170.40084838867188
8434 170.39503479003906
8435 170.3893280029297
8436 170.3834686279297
8437 170.3777618408203
8438 170.371826171875
8439 170.3662872314453
8440 170.36044311523438
8441 170.35462951660156
8442 170.3488311767578
8443 170.34312438964844
8444 170.33746337890625
8445 170.33163452148438
8446 170.32591247558594
8447 170.32008361816406
8448 170.314453125
8449 170.30868530273438
8450 170.30284118652344
8451 170.29718017578125
8452 170.29144287109375
8453 170.2856903076172
8454 170.2799835205078
8455 170.27415466308594
8456 170.2684326171875
8457 170.26278686523438
8458 170.25701904296875
8459 170.25119018554688
8460 170.2455596923828
8461 170.23980712890625
8462 170.2340545654297
8463 170.22840881347656
8464 170.22259521484375
8465 170.21684

9114 166.7545928955078
9115 166.74966430664062
9116 166.74453735351562
9117 166.73953247070312
9118 166.7345428466797
9119 166.72958374023438
9120 166.7247314453125
9121 166.71974182128906
9122 166.7147216796875
9123 166.70974731445312
9124 166.7047576904297
9125 166.69985961914062
9126 166.69488525390625
9127 166.68975830078125
9128 166.68482971191406
9129 166.6798553466797
9130 166.67486572265625
9131 166.66995239257812
9132 166.66490173339844
9133 166.65994262695312
9134 166.655029296875
9135 166.6499481201172
9136 166.64491271972656
9137 166.64012145996094
9138 166.6351318359375
9139 166.63021850585938
9140 166.625244140625
9141 166.62022399902344
9142 166.61537170410156
9143 166.6103515625
9144 166.60525512695312
9145 166.60028076171875
9146 166.59539794921875
9147 166.59039306640625
9148 166.58555603027344
9149 166.58050537109375
9150 166.57568359375
9151 166.57057189941406
9152 166.5657196044922
9153 166.56068420410156
9154 166.5559844970703
9155 166.55081176757812
9156 166.5458

9744 163.782958984375
9745 163.77850341796875
9746 163.77398681640625
9747 163.7694549560547
9748 163.76507568359375
9749 163.7606658935547
9750 163.7559814453125
9751 163.7516632080078
9752 163.74722290039062
9753 163.74278259277344
9754 163.73818969726562
9755 163.73397827148438
9756 163.7294158935547
9757 163.72482299804688
9758 163.72039794921875
9759 163.71591186523438
9760 163.71148681640625
9761 163.70704650878906
9762 163.70257568359375
9763 163.6981201171875
9764 163.6938018798828
9765 163.689208984375
9766 163.68475341796875
9767 163.68019104003906
9768 163.67587280273438
9769 163.6714324951172
9770 163.6669158935547
9771 163.66244506835938
9772 163.65797424316406
9773 163.65350341796875
9774 163.64910888671875
9775 163.6446533203125
9776 163.64022827148438
9777 163.6357421875
9778 163.63128662109375
9779 163.6268768310547
9780 163.6224365234375
9781 163.61801147460938
9782 163.61337280273438
9783 163.60911560058594
9784 163.60458374023438
9785 163.60015869140625
9786 163.595

In [73]:
for iteration in range(10000):
    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
    # w2 have requires_grad=True, operations involving these Tensors will cause
    # PyTorch to build a computational graph, allowing automatic computation of
    # gradients. Since we are no longer implementing the backward pass by hand we
    # don't need to keep references to intermediate values.
    # predict the values: multiply x by the weight matrix w1, apply the ReLU activation, then multiply the result by the weight matrix w2
    y_pred = F.relu(x.mm(w1)).mm(w2) # ReLU makes the intermediate (hidden) values non-negative

    # calculate the sum of squared errors (the MSE without the division by the number of elements)
    error = ((y_pred - y) ** 2).sum()
    # log the error to TensorBoard

    
    writer.add_scalar(tag="Last run",scalar_value= error, global_step = iteration)
    writer.add_histogram("error distribution",error)
    
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
   
    error.backward()
    # Update weights using gradient descent. For this step we just want to mutate
    # the values of w1 and w2 in-place; we don't want to build up a computational
    # graph for the update steps, so we use the torch.no_grad() context manager
    # to prevent PyTorch from building a computational graph for the updates
    
    
    with torch.no_grad():
        # use w1.grad to update w1 according to the gradient descent formula
        # use w2.grad to update w2 according to the gradient descent formula
        # also use the learning_rate you set before!
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
    if iteration % 50 == 0:
        print("Iteration: %d - Error: %.4f" % (iteration, error))
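        # copy the weights: on the CPU, .numpy() shares memory with the tensor,
        # so without a copy every stored entry would change with the in-place updates
        w1_array.append(w1.detach().cpu().numpy().copy())
        w2_array.append(w2.detach().cpu().numpy().copy())
        errors.append(error.item())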
    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()
    if error < 1e-6:
        print("Stopping gradient descent, algorithm converged, sum of squared errors is smaller than 1E-6")
        break
        
writer.close()

Iteration: 0 - Error: 176.1653
Iteration: 50 - Error: 175.9473
Iteration: 100 - Error: 175.7306
Iteration: 150 - Error: 175.5151
Iteration: 200 - Error: 175.3012
Iteration: 250 - Error: 175.0884
Iteration: 300 - Error: 174.8772
Iteration: 350 - Error: 174.6671
Iteration: 400 - Error: 174.4583
Iteration: 450 - Error: 174.2506
Iteration: 500 - Error: 174.0444
Iteration: 550 - Error: 173.8391
Iteration: 600 - Error: 173.6351
Iteration: 650 - Error: 173.4321
Iteration: 700 - Error: 173.2303
Iteration: 750 - Error: 173.0293
Iteration: 800 - Error: 172.8297
Iteration: 850 - Error: 172.6310
Iteration: 900 - Error: 172.4333
Iteration: 950 - Error: 172.2363
Iteration: 1000 - Error: 172.0405
Iteration: 1050 - Error: 171.8457
Iteration: 1100 - Error: 171.6517
Iteration: 1150 - Error: 171.4587
Iteration: 1200 - Error: 171.2666
Iteration: 1250 - Error: 171.0753
Iteration: 1300 - Error: 170.8851
Iteration: 1350 - Error: 170.6954
Iteration: 1400 - Error: 170.5068
Iteration: 1450 - Error: 170.3191
Ite
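
The manual update inside `torch.no_grad()` can also be delegated to an optimizer. The following is a minimal sketch, assuming `torch.optim.SGD` without momentum and with the same learning rate, which applies the same rule `w -= learning_rate * w.grad`. The seed, the shortened loop of 200 steps and the list `losses` are illustrative choices that do not appear in the original notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # illustrative seed, only for reproducibility

N, D_in, H, D_out = 64, 350, 50, 20
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

# plain SGD performs w <- w - lr * w.grad for every tensor in the list
optimizer = torch.optim.SGD([w1, w2], lr=1e-6)

losses = []
for step in range(200):
    y_pred = F.relu(x.mm(w1)).mm(w2)     # forward pass
    loss = ((y_pred - y) ** 2).sum()     # sum of squared errors
    optimizer.zero_grad()                # clear gradients from the previous step
    loss.backward()                      # compute gradients with autograd
    optimizer.step()                     # apply the gradient descent update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

With an optimizer, zeroing the gradients and updating the weights no longer need an explicit `torch.no_grad()` block, and swapping in a different update rule only changes the line that constructs the optimizer.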

# PyTorch Homework Solutions

## Solution for Homework 1

In [None]:
# https://docs.python.org/3/tutorial/errors.html#raising-exceptions to read more about Exception handling
def location_indicator(tensor_):
    indicatorstring = "CUDA" if tensor_.device.type == "cuda" else "CPU"
    print(f"The value of tensor_ is {tensor_} and the tensor location type is {indicatorstring}")
    return indicatorstring

def try_adding_block(x,y,only_convert_one=True):
    print(f"x is the following {x}")
    print(f"y is the following {y}")
    try:
        if x.device.type == y.device.type:    
            if not only_convert_one:
                z = x.type(torch.DoubleTensor) + y.type(torch.DoubleTensor)
                print("Adding succeeded, objects are in the same memory type") 
            else :
                try :
                    z = x.type(torch.DoubleTensor) + y
                except TypeError:
                    print("Unhandled error thrown because the tensors are of different type!")
                    raise TypeError("Unhandled error thrown because the tensors are of different type!")

                
        else :
             raise TypeError("Adding on different memory banks is not allowed, will result in TypeError!")
            
    except TypeError:
        print("The additives are of different type, addition not implemented for different types of tensors!")
    
    else :
        print("No exception thrown!")
    finally:
        print("End of the function")
                
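def try_adding_different_locations(x=torch.ones(3, device="cpu"),
                                   device=torch.device("cuda"), notboth=True, output_type="cpu"):
    """
    First a tensor x is created for CPU and the default device is set to be CUDA.
    Then x is moved back and forth between the two devices, and after each move
    an addition with a CUDA tensor y is attempted to show when it succeeds and when it fails.
    """
    # only run the demonstration when a CUDA device is actually available
    if device.type == "cuda" and torch.cuda.is_available():
        indicatorstring = location_indicator(tensor_=x)
        if indicatorstring == "CPU":
            x = x.to("cuda", torch.double)     # ``.to`` can also change dtype together!
            print(f"Before the Device type was CPU, but now it is {x.device.type}")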
        y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
        print("First we will enforce the data type of both tensors, adding is going to work!")
        try_adding_block(x,y,only_convert_one=False) # convert both to the same type, adding works
        print("\n")
        
        indicatorstring_cuda = location_indicator(tensor_=x)
        if indicatorstring_cuda == "CUDA":
            x = x.to("cpu", torch.double)     # ``.to`` can also change dtype together!
            print(f"Before the Device type was CUDA, but now it is {x.device.type}")
        
       
        indicatorstring_final = location_indicator(tensor_=x) # here the memory type is CPU for one
        
        print("\n")
        try_adding_block(x,y,only_convert_one = False) # adding doesn't work, different memory locations
        
        print("\n")
        print("Now we are enforcing the data type of only one tensor, adding is not going to work!")
        try_adding_block(x,y,only_convert_one = notboth) # converting only one , adding doesnt work
        
        print("\n\n")
        
        # After we convert the Memory type to CUDA for both, adding will work:
        x = x.to("cuda", torch.double)
        try_adding_block(x,y,only_convert_one = False)

    else :
        print("To run this section, please install CUDA as described in the Setting up Pytorch section")
   
    print("Program ended!")

try_adding_different_locations()

In [None]:
Image("assets/img/wikipedia_example_notation.png")

# THIS IS HOW YOU SET UP AN OPTIMIZATION PROBLEM

# Gradient Descent in Python
## Effect of Parameters in Gradient Descent

In [None]:
# Assume that we have been given a generic two variable polynomial function
def two_variable_function(x, y):
    z = x**3 + 2*(x*y) + 3*(y**2) 
    return z

Our goal is to find the global minimum of this function within the square $-10 \le x \le 10$, $-10 \le y \le 10$.

In [None]:
# evaluate the function at the four corners of the square
boundary_grid_values = [two_variable_function(-10, -10), two_variable_function(-10, 10),
                        two_variable_function(10, -10), two_variable_function(10, 10)]
# evaluate the function at its two stationary points
local_extrema_values = [two_variable_function(0, 0), two_variable_function(6/27, -2/27)]
# report the corner minimum only if no stationary point has a lower value
if min(boundary_grid_values) <= min(local_extrema_values):
    print(f"The minimum amongst the evaluated points is {min(boundary_grid_values)}")
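
The corner evaluation covers only four points of the boundary, while the minimum over the square may lie anywhere along an edge. A minimal sketch, assuming a dense sampling of all four edges with NumPy; the helper name `minimum_on_square_boundary` and the number of sample points are hypothetical choices for illustration.

```python
import numpy as np

def two_variable_function(x, y):
    return x**3 + 2 * (x * y) + 3 * (y**2)

# hypothetical helper: sample the four edges of the square [low, high] x [low, high]
def minimum_on_square_boundary(f, low=-10.0, high=10.0, num_points=2001):
    t = np.linspace(low, high, num_points)
    edges = [
        (np.full_like(t, low), t),   # left edge:   x = low
        (np.full_like(t, high), t),  # right edge:  x = high
        (t, np.full_like(t, low)),   # bottom edge: y = low
        (t, np.full_like(t, high)),  # top edge:    y = high
    ]
    best = None
    for xs, ys in edges:
        values = f(xs, ys)
        i = int(np.argmin(values))
        candidate = (float(values[i]), float(xs[i]), float(ys[i]))
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best  # (minimum value, x location, y location)

edge_min_value, edge_min_x, edge_min_y = minimum_on_square_boundary(two_variable_function)
print(edge_min_value, edge_min_x, edge_min_y)
```

On the left edge $x=-10$ the function reduces to $-1000 - 20y + 3y^2$, which is smallest at $y = 10/3$, so the sampled minimum should sit close to $-1033.33$.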

Setting both first partial derivatives to zero gives two stationary points,
$(x,y) = (0,0)$ and $(x,y) = (6/27,-2/27)$. The first derivatives can be computed symbolically:

In [None]:
from sympy import symbols
x,y = symbols('x y')
# z = x^3 + 2xy + 3y^2
z = two_variable_function(x, y)
derivatives = z.diff(x,1),z.diff(y,1)
print(derivatives)
# expected derivatives:
# dz/dx = 3*x**2 + 2*y
# dz/dy = 2*x + 6*y
# see https://docs.sympy.org/latest/tutorial/calculus.html to verify
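
The stationary points can also be worked out by hand. The derivation below is a short sketch; the classification with the Hessian is an added check and is not part of the original notebook.

$$
\frac{\partial z}{\partial x} = 3x^2 + 2y = 0, \qquad \frac{\partial z}{\partial y} = 2x + 6y = 0 .
$$

The second equation gives $y = -x/3$. Substituting into the first gives $3x^2 - \tfrac{2}{3}x = 0$, so $x = 0$ or $x = 2/9 = 6/27$, with $y = 0$ or $y = -2/27$ respectively.

$$
H = \begin{pmatrix} 6x & 2 \\ 2 & 6 \end{pmatrix}, \qquad \det H = 36x - 4 .
$$

At $(0,0)$ the determinant is $-4 < 0$, so this point is a saddle. At $(6/27,-2/27)$ the determinant is $4 > 0$ and $6x = 4/3 > 0$, so this point is a local minimum, with $z = -4/729$.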



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

startgrid=-10
endgrid=10.05
a = np.arange(startgrid, endgrid, 0.05)
b = np.arange(startgrid, endgrid, 0.05)

x, y = np.meshgrid(a, b) # creating the evaluation domain grid.
# NB! If the global optimum of the function lies outside the grid, 
# this global optimum would never be discovered since the cost function would never be evaluated there.

z = two_variable_function(x, y)


fig, ax = plt.subplots()
z_min, z_max = z.min(),z.max()
print(z.min())

c = ax.pcolormesh(x, y, z, cmap='RdBu', vmin=z_min, vmax=z_max)
ax.set_title('Objective Function Values HeatMap')
# set the limits of the plot to the limits of the data
ax.axis([x.min(), x.max(), y.min(), y.max()])
fig.colorbar(c, ax=ax)
plt.show()

# np.unravel_index converts the flat index returned by np.argmin back into a
# (row, column) index; since this depends on the array shape, the shape has to be passed.
# With np.meshgrid's default indexing, rows run along y and columns run along x.
(y_min_idx, x_min_idx) = np.unravel_index(np.argmin(z), z.shape)

print(f"x minimum location is  {a[x_min_idx]}")
print(f"y minimum location is  {b[y_min_idx]}")


Thus we see that the minimum value on the grid is about -1033.33. It is attained on the left edge of the grid, at $x = -10$ and $y \approx 3.33$, not at one of the stationary points near the origin.

Let's see whether gradient descent also gets close to this value.

In [None]:
%%time 
#!yes | conda install -n dl -c conda-forge matplotlib   -- to install matplotlib into conda env dl
from IPython.display import HTML
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
from sympy import *
import random

# define a 2-variable function z = f(x,y)
def two_variable_function(x, y):
    z = x**3 + 2*x*y + 3*(y**2) 
    return z

def gradient_descent(x_start, y_start, learning_rate, epochs):
    """
    Each following step of the gradient descent depends on the result of the previous step.
    """
    # initialize one history list per variable; writing `x = y = z = []`
    # would bind all three names to the same list object
    x, y, z = [], [], []

    # first run: record the starting point
    x_old = x_start
    y_old = y_start

    x.append(x_old)
    y.append(y_old)
    z_gd = two_variable_function(x_old, y_old)
    z.append(z_gd)
    # further runs

    # begin the loops to update x, y and z
    for i in range(epochs):
        x_gd = x_old - learning_rate*(3*x_old**2 + 2*y_old)
        y_gd = y_old - learning_rate*(2*x_old + 6*y_old)
        x.append(x_gd)
        y.append(y_gd)
        #print(x)
        #print(y)
        #print(two_variable_function(x, y))
        z_gd = two_variable_function(x_gd, y_gd)
        z.append(z_gd)  # appending the values for z
        # for the next iteration, the new values will be the old values
        x_old = x_gd
        y_old = y_gd

    return x_gd, y_gd, z_gd


####### GIVING THE INITIAL VALUES FOR THE GRADIENT DESCENT

xstart = -1
ystart = -0.5


whatindex = -1
def precpr(x,prec=3):
    return round(x,prec)

lr = np.linspace(0.001,1,7)
xs = np.linspace(-9.9,9.9,4)
ys = xs
idx = 0
best_triplet = ''
best_loss = 1e6
best_z = 1e6
best_x = 1e6
best_y = 1e6
z = two_variable_function(xs,ys)
print(f"Before starting gradient descent, objective function value was: {precpr(two_variable_function(x=xs[0],y=ys[0]))}")

for l in lr:
    for xstart in xs:
        for ystart in ys:
            epochs = random.randint(30,100)
            x_gd, y_gd, z_gd = gradient_descent(x_start=xstart, y_start=ystart, learning_rate=l, epochs = epochs)
            
            #print("\n")
            #print("Last value of x is")
            #print(x_gd[-1])
            if abs((z_gd)-(-1033.33))<best_loss:
                best_loss = abs(z_gd -(-1033.33))
                best_triplet = f"{xstart}_{ystart}_{l}"
                whatindex = idx
                best_x = x_gd
                best_y = y_gd
                best_z = z_gd
            print("\n")
            print(f"Iteration {idx},CONFIGURATION l:{l},xstart:{xstart},ystart:{ystart}")
            print("\n")
            print(f"After gradient descent of {epochs} epochs, the values are:")
            print(f"At the optimum, the objective function value is {precpr(best_z)}")
            print(f"At the optimum, the value of x is {precpr(best_x)} and the value of y is {precpr(best_y)}")
            idx += 1

print(f"The optimal loss was achieved in iteration {whatindex}")
print(f"The found loss coordinates (x,y) are ({precpr(best_x)},{precpr(best_y)})")


We note that gradient descent can sometimes converge to a solution outside the feasible region.
This is why, when training neural networks, it is important to understand whether the optimization algorithm treats the problem as unconstrained or constrained.
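
To make the point about the feasible region concrete, here is a minimal sketch. It assumes the objective is $x^3 + 2xy + 3y^2$ (this is what the gradient terms `3*x**2 + 2*y` and `2*x + 6*y` used above correspond to) and that the feasible region is the box $[-10, 10]^2$; neither is stated in this section, and the helper names `plain_gd` and `projected_gd` are hypothetical. Clipping each iterate back into the box is one simple way to respect such a constraint.

```python
def f(x, y):
    # assumed objective, inferred from the gradient terms used above
    return x**3 + 2 * x * y + 3 * y**2

def grad(x, y):
    return 3 * x**2 + 2 * y, 2 * x + 6 * y

def plain_gd(x, y, lr=0.01, max_steps=500, bound=10.0):
    """Unconstrained gradient descent; stops as soon as an iterate leaves the box."""
    for step in range(max_steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        if abs(x) > bound or abs(y) > bound:
            return x, y, step + 1
    return x, y, max_steps

def projected_gd(x, y, lr=0.01, steps=500, bound=10.0):
    """Gradient descent where every iterate is clipped back into [-bound, bound]^2."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x = min(max(x - lr * gx, -bound), bound)
        y = min(max(y - lr * gy, -bound), bound)
    return x, y

x_u, y_u, n_steps = plain_gd(-1.0, -0.5)
print(f"unconstrained run left the box after {n_steps} steps at x={x_u:.2f}, y={y_u:.2f}")

x_p, y_p = projected_gd(-1.0, -0.5)
print(f"projected run ended at x={x_p:.3f}, y={y_p:.3f}, f={f(x_p, y_p):.2f}")
```

Under these assumptions the unconstrained run walks out of the box, while the clipped run settles on the boundary.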

In [None]:
best_loss

## Gradient Descent : Full Example with Dynamic Visualization

In [None]:

def gradient_descent_demo(x_start, y_start, learning_rate, epochs):
    """
    Each following step of the gradient descent depends on the result of the previous step.
    """
    def two_variable_demo_function(x, y):
        #z = -x**4 + 2*(x*y) + 3*(y**2) 
        z = x**2 + 2*(x*y) + 3*(y**2) 

        return z
    # initialize the grid values as empty lists for the variables
    
    # first run
    # three separate lists; chained assignment would make all names share one list
    x_gd, y_gd, z_gd = [], [], []
    x_old = x_start
    y_old = y_start
    
    x_gd.append(x_old)
    y_gd.append(y_old)
    z_gd.append(two_variable_demo_function(x_old, y_old))
    
    # further runs

    # begin the loops to update x, y and z
    for i in range(epochs):
        # partial derivatives of z = x**2 + 2*x*y + 3*y**2: dz/dx = 2x + 2y, dz/dy = 2x + 6y
        x = x_old - learning_rate*(2*x_old+2*y_old)
        y = y_old - learning_rate*(2*x_old+6*y_old)
        x_gd.append(x)
        y_gd.append(y)
        z_gd.append(two_variable_demo_function(x, y))  # appending the values for z
        # for the next iteration, the new values will be the old values
        x_old = x
        y_old = y

    return x_gd, y_gd, z_gd

x_gd, y_gd, z_gd = gradient_descent_demo(x_start=0.5, y_start=0.3, learning_rate=0.02, epochs = 20)



In [None]:
startgrid=-2
endgrid=2.05
a = np.arange(startgrid, endgrid, 0.05)
b = np.arange(startgrid, endgrid, 0.05)

x, y = np.meshgrid(a, b) # creating the evaluation domain grid.

def two_variable_demo_function(x, y):
    z = x**2 + 2*(x*y) + 3*(y**2) 
    return z
z = two_variable_demo_function(x, y)

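# FIND THE ACTUAL MIN coordinates for x and y:
# np.meshgrid uses 'xy' indexing by default, so the rows of z follow b (y) and the columns follow a (x)
(y_min_idx,x_min_idx) = np.unravel_index(np.argmin(z), z.shape)
# Actual minimum coordinate values
print(x_min_idx)
print(a[x_min_idx])
print(b[y_min_idx])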

Thus we see that, up to grid resolution, the global minimum of this convex function in this region is located at $(x,y)=(0,0)$. 
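
The grid search can be cross-checked analytically. The sketch below is an addition, not part of the original notebook flow: it writes down the (constant) Hessian of $z = x^2 + 2xy + 3y^2$, checks that both eigenvalues are positive, which means the function is strictly convex, and solves for the point where the gradient vanishes.

```python
import numpy as np

# Hessian of z = x**2 + 2*x*y + 3*y**2; it is constant because z is quadratic
hessian = np.array([[2.0, 2.0],
                    [2.0, 6.0]])

# Both eigenvalues positive means z is strictly convex
eigenvalues = np.linalg.eigvalsh(hessian)
print(eigenvalues)

# For this quadratic the gradient is hessian @ [x, y], so the stationary
# point solves hessian @ [x, y] = 0
stationary_point = np.linalg.solve(hessian, np.zeros(2))
print(stationary_point)
```

Because the function is strictly convex, this single stationary point is the global minimum.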

In [None]:
x_gd, y_gd, z_gd = gradient_descent_demo(x_start=0.5, y_start=0.3, learning_rate=0.14, epochs = 10)

############ INITIALIZING THE PLOTTING SYSTEM ###############
def init():
    line.set_data([], [])
    point.set_data([], [])
    value_display.set_text('')

    return line, point, value_display

def animate(i):
    # Animate line
    line.set_data(x_gd[:i+1], y_gd[:i+1])  # include the current point so the line reaches the marker
    
    # Animate points (set_data expects sequences, so wrap the scalars in lists)
    point.set_data([x_gd[i]], [y_gd[i]])

    # Animate value display
    value_display.set_text('Min = ' + str(z_gd[i]))

    return line, point, value_display

##############################################################



fig1, ax1 = plt.subplots()

ax1.contour(x, y, z, levels=np.logspace(startgrid, endgrid, 15), cmap='CMRmap')
# Plot target (the minimum of the function)

# PLOT THE ACTUAL MIN POINT 
min_point = np.array([0., 0.])
min_point_ = min_point[:, np.newaxis]

# this is a 2D axes, so only x and y are passed; a third positional array would be plotted as a separate series
ax1.plot(*min_point_, 'r*', markersize=10)
ax1.set_xlabel(r'x')
ax1.set_ylabel(r'y')
''' Animation '''
# Create animation
line, = ax1.plot([], [], 'r', label = 'Gradient Descent on Convex Function', lw = 2.0)
point, = ax1.plot([], [], 'bo')
value_display = ax1.text(0.02, 0.02, '', transform=ax1.transAxes)

ax1.legend(loc = 1)

anim = animation.FuncAnimation(fig1, animate, init_func=init,
                               frames=len(x_gd), interval=120, 
                               repeat_delay=60, blit=True)

HTML(anim.to_jshtml())

So we have seen why gradient descent is not always the best optimizer:
- It finds local optima, not necessarily the global one. Always check whether the optimization problem is convex.
- Plain gradient descent performs **unconstrained** optimization. If the domain has constraints, gradient descent does not respect them! If we have a continuous function on a bounded domain, the boundary values have to be checked as well, because the cost function value there might be more optimal.
- The optimization may not converge: it may start oscillating or diverge.
    - The choice of step size is crucial. Too large a step size makes the iterates overshoot and diverge, while too small a step size makes convergence very slow.
- Using only gradient information, we have only first-order information about the function. Methods that exploit curvature, such as Newton's method, the quasi-Newton method L-BFGS and the conjugate gradient method, have improved convergence properties, but at a higher computational cost per step.
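
The effect of the step size can be seen on the demo function $z = x^2 + 2xy + 3y^2$ from above. The following is a small sketch (the helper `run_gd` is a hypothetical name, not part of the notebook): it runs 100 steps of gradient descent with three learning rates. For this quadratic, gradient descent is stable only when the learning rate is below $2/\lambda_{max} \approx 0.29$, where $\lambda_{max} = 4 + 2\sqrt{2}$ is the largest eigenvalue of the Hessian.

```python
def f(x, y):
    return x**2 + 2 * x * y + 3 * y**2

def run_gd(lr, x=0.5, y=0.3, steps=100):
    """Run plain gradient descent on f and return the final function value."""
    for _ in range(steps):
        gx, gy = 2 * x + 2 * y, 2 * x + 6 * y  # analytic gradient of f
        x, y = x - lr * gx, y - lr * gy
    return f(x, y)

too_small = run_gd(0.001)   # still far from the minimum after 100 steps
reasonable = run_gd(0.1)    # converges close to the minimum value 0
too_large = run_gd(0.35)    # overshoots more on every step and diverges

for name, value in [("lr=0.001", too_small), ("lr=0.1", reasonable), ("lr=0.35", too_large)]:
    print(f"{name}: f after 100 steps = {value:.3e}")
```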

## Binary Classification Loss Function


Choosing the right cost function for achieving the desired result is a critical point of machine learning problems. The basic approach, if you do not know exactly what you want out of your method, is to use [Mean Square Error (Wikipedia)](https://en.wikipedia.org/wiki/Mean_squared_error) for regression problems and Percentage of error for classification problems. However, if you want _good_ results out of your method, you need to _define good_, and thus define the adequate cost function. This comes from both domain knowledge (what is your data, what are you trying to achieve), and knowledge of the tools at your disposal. 

I do not believe I can guide you through the cost functions already implemented in TensorFlow, as I have very little knowledge of the tool, but I can give you an example on how to write and assess different cost functions.

---

To illustrate the various differences between cost functions, let us use the example of the binary classification problem, where we want, for each sample $x_n$, the class $f(x_n) \in \{0,1\}$.

Let us start with **computational properties**: how two functions measuring the "same thing" can lead to different results. Take the following simple cost function, the percentage of error. If you have $N$ samples, $f(x_n)$ is the predicted class and $y_n$ the true class, you want to minimize

* $\frac{1}{N} \sum_n \left\{
\begin{array}{ll}
1 & \text{ if } f(x_n) \neq y_n\\
0 & \text{ otherwise}\\
\end{array} \right. = \frac{1}{N} \sum_n \left( y_n[1-f(x_n)] + [1-y_n]f(x_n) \right)$.

This cost function has the benefit of being easily interpretable. However, it is not smooth; if you have only two samples, the function "jumps" from 0, to 0.5, to 1. This will lead to inconsistencies if you try to use gradient descent on this function. One way to avoid it is to change the cost function to use probabilities of assignment; $p(y_n = 1 | x_n)$. The function becomes

* $\frac{1}{N} \sum_n y_n p(y_n = 0 | x_n) + (1 - y_n) p(y_n = 1 | x_n)$.

This function is smoother, and will work better with a gradient descent approach. You will get a 'finer' model. However, it has another problem. Suppose you have a sample that is ambiguous, so that you do not have enough information to say anything better than $p(y_n = 1 | x_n) = 0.5$. Using gradient descent on this cost function will still lead to a model that pushes this probability as far as possible toward the labelled class, and thus may overfit.

Another problem with this function is that if $p(y_n = 1 | x_n) = 1$ while $y_n = 0$, the model is certain to be right, but it is wrong, and the cost for that sample is still only 1. In order to avoid this issue, you can take the log of the probability, $\log p(y_n | x_n)$. As $\log(0) = -\infty$ and $\log(1) = 0$, the following function (the cross-entropy, or negative log-likelihood) does not have this problem, because a confident wrong prediction incurs an unbounded cost:

* $-\frac{1}{N} \sum_n \left[ y_n \log p(y_n = 1 | x_n) + (1 - y_n) \log p(y_n = 0 | x_n) \right]$.

This should illustrate that, in order to optimize the _same thing_ (the percentage of error), different definitions might yield different results if they are easier to work with computationally.

**It is possible for cost functions $A$ and $B$ to measure the _same concept_, but $A$ might lead your method to better results than $B$.**
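
The difference between these cost functions can be sketched numerically. The example below is illustrative only: the function names, the 0.5 decision threshold, the clipping constant `eps` and the toy data are all assumptions, and the log-based cost is written in the standard binary cross-entropy form. The two sets of predictions make the same single mistake, once mildly and once with near certainty.

```python
import numpy as np

def error_rate(y_true, p1):
    """Percentage of error, thresholding p(y=1|x) at 0.5."""
    y_pred = (p1 >= 0.5).astype(float)
    return np.mean(y_true * (1 - y_pred) + (1 - y_true) * y_pred)

def expected_error(y_true, p1):
    """Smooth version: probability mass assigned to the wrong class."""
    return np.mean(y_true * (1 - p1) + (1 - y_true) * p1)

def cross_entropy(y_true, p1, eps=1e-12):
    """Negative log-likelihood; eps keeps log() away from zero."""
    p1 = np.clip(p1, eps, 1 - eps)
    return -np.mean(y_true * np.log(p1) + (1 - y_true) * np.log(1 - p1))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
p_mild = np.array([0.9, 0.1, 0.8, 0.6])         # last sample mildly wrong
p_confident = np.array([0.9, 0.1, 0.8, 0.999])  # last sample confidently wrong

for name, cost in [("error rate", error_rate),
                   ("expected error", expected_error),
                   ("cross-entropy", cross_entropy)]:
    print(f"{name}: mild={cost(y_true, p_mild):.4f}, confident={cost(y_true, p_confident):.4f}")
```

The error rate cannot tell the two models apart, the expected error separates them slightly, and the cross-entropy penalizes the confident mistake much more strongly.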

---
In conclusion, defining the cost function is defining the goal of your algorithm. The algorithm defines how to get there.


### What is the relation between backpropagation and auto-differentiation?

When training with backpropagation, the gradient of the loss with respect to every parameter has to be computed in each loop iteration, so that gradient descent can reduce the loss. Automatic differentiation computes these gradients for you, so you do not need to derive and implement them by hand.

Backpropagation is the technique used to compute gradients when training neural networks, while automatic differentiation is a general method for computing derivatives of programs. Backpropagation can be viewed as reverse-mode automatic differentiation applied to a neural network's loss. PyTorch's autograd combines the two, which makes training much easier to implement.
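
As a small illustration, the sketch below lets PyTorch's autograd compute the gradient of the demo function $z = x^2 + 2xy + 3y^2$ and compares it with the hand-derived partial derivatives. The starting point and the learning rate are arbitrary choices made for this example.

```python
import torch

# Parameters for which we want gradients
x = torch.tensor(0.5, requires_grad=True)
y = torch.tensor(0.3, requires_grad=True)

# Forward pass: autograd records the operations that produce z
z = x**2 + 2 * x * y + 3 * y**2

# Backward pass: reverse-mode automatic differentiation fills in x.grad and y.grad
z.backward()

# Hand-derived partial derivatives for comparison: dz/dx = 2x + 2y, dz/dy = 2x + 6y
manual_dx = 2 * 0.5 + 2 * 0.3
manual_dy = 2 * 0.5 + 6 * 0.3
print(x.grad.item(), manual_dx)
print(y.grad.item(), manual_dy)

# One gradient descent step that uses the gradients computed by autograd
learning_rate = 0.1
with torch.no_grad():
    x -= learning_rate * x.grad
    y -= learning_rate * y.grad
print(x.item(), y.item())
```

Autograd supplies the gradients, and the update rule that consumes them is still ordinary gradient descent.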