# Magic Commands, Helper Functions and Decorators

In [5]:
import os
print(os.__file__)

/home/zak/anaconda3/lib/python3.7/os.py


In [6]:
from __future__ import print_function
from IPython.display import Image
import torch
import numpy as np
# load the autoreload extension
%load_ext autoreload
# Set extension to reload modules every time before executing code
#%autoreload 2
#
## Easy to read version
#%system date
#
## Shorthand with "!!" instead of "%system" works equally well
#!!date
#!!ls
#
## Outputs a list of all interactive variables in your environment
#%who_ls
#
## Reduces the output to interactive variables of type "function"
#%who_ls function

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload



What Is PyTorch and Why Use It?
================

PyTorch is a Python-based scientific computing package aimed at two
audiences:

-  users who want **an extensible alternative to NumPy that harnesses the power of GPUs**
-  researchers who need a **deep learning research platform that provides maximum flexibility
   and speed**

Getting Started
----------------------------

To get started, please open the [Get Started page of the documentation](https://pytorch.org/get-started/locally/).



## Comparison to TensorFlow 1: PyTorch Advantages

Please read the following questions and try to guess the answers.
<ol>
  <li>What language is Torch written in?</li>
  <li>Where does the name PyTorch come from?</li>
  <li>Which <a href="https://en.wikipedia.org/wiki/Programming_paradigm">programming paradigm</a> is PyTorch built upon? Choose between declarative, procedural, imperative and functional.</li>
  <li>What is the main difference between TensorFlow 1 and PyTorch in terms of runtime execution?</li>
  <li>Why can dynamic graph execution in PyTorch be preferable to the static graph execution used, e.g., in TensorFlow 1?</li>
</ol>




Now please watch the [video](https://www.youtube.com/watch?v=nbJ-2G2GXL0) and compare your prior guesses to the answers given in the video. Write down what you learned.

:)

## Why Learn PyTorch?

Learning something new is much easier once you have gathered motivation for it. This idea is based on the well-known <a href="https://hoishampark.wordpress.com/2017/04/14/motivation-hacker-a-book-report/">MEVID</a> formula.
By now you know that PyTorch is a versatile deep learning framework based on imperative programming, in which computations are handled dynamically.

But...

**Why should I learn it?**

**Simple answer: it works well for both short and long research projects!**

But there is much more!

[This article will give you the motivation to learn PyTorch](https://www.analyticsindiamag.com/9-reasons-why-pytorch-will-become-your-favourite-deep-learning-tool/) 

# Setting Up CUDA and PyTorch

## Setting Up CUDA

It is good practice to use the right tool for each job.
- If you have a dedicated NVIDIA GPU at your disposal, use PyTorch with CUDA to speed up computation. In this case, please [install CUDA](https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html)


- Otherwise you may use Google Colaboratory, which gives you free access to a Tesla K80 GPU. Please see [this post](https://medium.com/@ml_kid/google-colab-notebook-with-pytorch-v1-0-stable-lesson-9-46433881da05) on how to set up Google Colaboratory with the Tesla GPU. The first commands you need to run are given in [this notebook](https://colab.research.google.com/drive/18R3Rz639Fa4ByFwzLrY5LmD8t3Ee5IUR). You can run any Jupyter notebook on Google Colaboratory by uploading it from within Colaboratory.
<div class="alert alert-warning" role="alert">
  Warning: You may run into authentication issues when trying to persist your variables on Colaboratory, especially if you keep a session open in order to re-run it later. The free instance is shut down periodically, after which re-authentication may be required and Colaboratory may complain about missing files, because the session was disconnected.
</div>


- Since the majority of this course is not GPU-intensive, you may also run the Notebooks on your CPU. 


In [7]:
# let us run this cell only if CUDA is available
# Print out your first tensor. Update the code with one line to print out your first tensor object
cuda0 = None
if torch.cuda.is_available():
    print("Your Pytorch runtime uses GPU")
    cuda0 = torch.device('cuda:0')
    first_tensor_on_cuda = torch.ones([2, 4], dtype=torch.int32, device=cuda0)
    print(first_tensor_on_cuda)
else :
    print("Your Pytorch runtime uses CPU")
    first_tensor_on_cpu = torch.ones([2, 4], dtype=torch.int32)



Your Pytorch runtime uses CPU



If you wish to practice using CUDA directly, please install <a href="https://wiki.tiker.net/PyCuda/Installation">PyCuda</a>.

See this discussion of why you might use <a href="https://devtalk.nvidia.com/default/topic/573367/pucuda-pros-and-cons/">PyCuda instead of C</a>.


In [8]:

import pycuda.driver as cuda
cuda.init()
## Get Id of default device
torch.cuda.current_device()

cuda.Device(0).name()

ModuleNotFoundError: No module named 'pycuda'


## Setting Up Tensorboard -- Visualization Platform (Works on Tensorflow, Pytorch, Keras,...)
It is always helpful to see visual output from the code you are writing, especially in machine learning. Typically, a machine learning engineer plots the underlying neural network as a graph, the training error over the course of training, and visualizations of the hidden-layer outputs.
That is why we give two examples of visualization platforms that integrate with PyTorch.

Popular visualization platforms for PyTorch are [Visdom](https://github.com/facebookresearch/visdom) and [TensorboardX](https://github.com/lanpa/tensorboardX). According to the discussion on [reddit](https://www.reddit.com/r/MachineLearning/comments/8ej2j4/d_facebook_visdom_vs_google_tensorboard/), people use both, for different purposes.

[How to use TensorboardX](http://www.erogol.com/use-tensorboard-pytorch/)

<div class="alert alert-success" role="alert">
  The commands below assume that you have a Conda environment called dl, into which we are going to install the packages.

</div>


In [9]:
#!yes | conda install -n dl -c conda-forge tensorboardx
# start Tensorboard instance
#! yes |conda install -n dl -c conda-forge tensorflow 

# ! yes |conda install -n dl -c conda-forge tensorboard 
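log_path = './runs/gd/'
# {log_path} is expanded by IPython to the value of the Python variable;
# a bare "log_path" would be passed to the shell as a literal directory name
!tensorboard --logdir {log_path}
# to add more objects to Tensorboard, please read the manual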

/usr/bin/sh: 1: tensorboard: not found


# Basic Data Structures in PyTorch. Tensors.

## Tensor Definition and Some Important Properties
### Definition of a n-th Order Tensor

**Practical Definition of a Tensor**:
An **n-th order tensor** is an n-dimensional array of numbers.
In this definition, the dimensions (axes) are independent of each other: each one is indexed separately.

By *practical* we mean that this is the definition used in computer programming and in the software libraries used in industry.

The word *tensor* comes from physics, where it was initially used to describe the **tension** (stress) in materials. To describe the stress on each face of a solid body, a simple 2D matrix is enough: the first dimension denotes the normal direction of the face and the second dimension the direction of the force.


<div class="alert alert-info"><h4>Comparison to mathematical definition of tensor</h4><p>
This definition differs significantly from the mathematically rigorous definition of a tensor. There, a tensor of order n = p + q, also called a (p,q)-tensor, is a multilinear map, i.e. a map that is linear in each of its arguments (in the common convention, p covectors and q vectors), whose components transform in a prescribed way under a change of coordinates. Thus, in mathematics, an array of numbers represents a tensor only if it obeys this transformation rule.
    </p></div>

**Henceforth we are only going to use the practical definition of a tensor.**
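
To make the practical definition concrete, here is a minimal sketch that builds tensors of order 0 to 3 and inspects them. The variable names are illustrative only; `dim()` reports the order and `shape` the size of each axis.

```python
import torch

scalar = torch.tensor(3.14)                 # order 0: a single number
vector = torch.tensor([1.0, 2.0, 3.0])      # order 1: one index
matrix = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])         # order 2: two indices
cube = torch.zeros(2, 3, 4)                 # order 3: three indices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    # dim() is the order (number of axes); shape lists the size of each axis
    print(f"{name}: order {t.dim()}, shape {tuple(t.shape)}")
```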





## Tensor Interpretation in Programming Context
<div class="alert alert-success" role="alert">
Tensors are similar to NumPy's ndarrays, with the additional property that
they can be placed on a GPU, where operations on them run in parallel to accelerate computing.
</div>

[Tensor](https://en.wikipedia.org/wiki/Tensor) is also the **basic data structure in PyTorch**.


A 2D matrix is an example of a 2nd-order tensor, which is a special case of an n-th order **tensor**; in general, an element of such a tensor is addressed by **n** independent indices. Consider the following example:


In [10]:
#Image("assets/img/tensor.jpg")
# Source : https://www.google.com/url?sa=i&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwirxuj998nhAhUqtYsKHfFbAhUQjRx6BAgBEAU&url=https%3A%2F%2Fwww.slideshare.net%2FBertonEarnshaw%2Fa-brief-survey-of-tensors&psig=AOvVaw0tULmnEC2-vr346HuYGbdQ&ust=1555137171132510

We see that to address a single number in an image, we need 3 independent indices:

    > the 1st index denotes the x-location (width-location)
    > the 2nd index denotes the y-location (height-location)
    > the 3rd index denotes the colour channel (R, G, B), since any colour displayed on a computer screen is formed from an (R, G, B) triplet of three 8-bit values, each between 0 and 255.
    
Thus we can think of a cat image on a computer screen as a discrete rectangular prism, i.e. a multidimensional array of numbers, as shown in the image above. Such a rectangular n-dimensional array of numbers is called an **n-th order tensor**.
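
As a rough illustration of the image-as-tensor idea, the sketch below builds a tiny black image and colours one pixel red. The image size and the (row, column, channel) axis order are assumptions made for this example; other layouts, such as channels-first, are also common.

```python
import torch

height, width, channels = 4, 6, 3   # hypothetical tiny image
image = torch.zeros(height, width, channels, dtype=torch.uint8)

# Set the pixel at row 1, column 2 to pure red: (R, G, B) = (255, 0, 0)
image[1, 2] = torch.tensor([255, 0, 0], dtype=torch.uint8)

print(image.shape)        # torch.Size([4, 6, 3])
print(image[1, 2])        # the three colour values of one pixel
print(image[1, 2, 0])     # a single number: the red value of that pixel
```

Three indices are needed to reach a single number, which is why a colour image is a 3rd-order tensor.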

This conceptually natural connection between images and tensors is very important, since it is the reason why GPUs are used in machine learning.
A GPU (graphics processing unit) usually contains many computational cores (many more than a CPU, central processing unit), and these are optimized for the operations performed on images, which we now know to be tensors.

*Thus it is plausible that GPUs are well suited to computations with tensors or high-dimensional data.*


### Tensors and Operations with Tensors in Pytorch

Next, let's see how to create tensors in PyTorch.
Tensors can be created from built-in Python data (e.g. a list of lists) as well as from NumPy arrays.
To get started, please check the [documentation](https://pytorch.org/docs/stable/tensors.html).


In [11]:
tensor_from_2d_list = torch.tensor([[1., -1.], [1., -1.]])


# Print the tensor object created from Python List 
# Write your code here ...
print(tensor_from_2d_list)


tensor_from_np_array = torch.tensor(np.array([[1, 2, 3], [4, 5, 6]]))


# Print the tensor object created from Python Numpy array 
# Write your code here ...
print(tensor_from_np_array)

tensor([[ 1., -1.],
        [ 1., -1.]])
tensor([[1, 2, 3],
        [4, 5, 6]])


Following are some common ways to create a [Tensor](#Definition-of-a-n-th-Order-Tensor):

In [12]:
a = torch.empty(2,2)
b = torch.zeros(2, 2, dtype=torch.long)
c = torch.rand(5, 5)

d = a.new_ones(10, 10, dtype=torch.double)
e = torch.randn_like(d, dtype=torch.float) 
print(e)

tensor([[-0.3949, -0.7000, -0.1099, -0.5855, -0.0478, -0.6899,  0.4303, -0.3802,
          1.3416,  0.8997],
        [-1.5337,  0.7441,  0.0947,  1.6688,  2.0297, -0.3358, -0.9187,  0.1260,
         -0.8142,  0.6057],
        [ 0.2501,  0.0669,  0.1261, -1.5834,  0.1206,  0.2612, -1.8007,  0.3009,
         -0.2981, -1.7850],
        [ 0.5373,  0.5813,  0.4109, -0.0329,  1.6040,  0.4584, -0.3362,  0.7265,
         -0.8993, -1.8596],
        [ 2.1138, -0.2286, -0.6182,  0.4350, -0.9528,  0.8274, -0.1448,  1.0237,
         -0.3087,  1.1539],
        [-1.3540, -0.0530,  0.9784, -0.0112,  0.7636, -1.7852, -1.2121,  0.5084,
          0.0698, -1.2917],
        [ 1.5280, -1.0824, -2.4958,  0.5692,  0.5131, -0.3709,  1.6278,  0.3315,
          0.6339, -0.3235],
        [-0.7250,  0.3345, -0.4160, -1.0180, -1.1332,  0.4357,  0.7813,  0.2541,
          0.5535,  0.4031],
        [-1.0871,  0.6357,  0.4488,  0.7153,  2.0557, -0.9471, -1.8353,  0.1855,
         -0.4933,  0.6245],
        [-1.6647,  

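Questions:
1. What is the difference between tensors a and b?
Q.1: b is a 2x2 tensor of zeros, whereas a is a 2x2 tensor of uninitialized memory, so its values are arbitrary.
2. What about d and e? When would you use the randn_like function?
Q.2: d is a 10x10 tensor of ones; e contains random floats drawn from a standard normal distribution, with the same shape as d. Use randn_like when you need random values matching the shape of an existing tensor.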
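
The answers can be checked with a short sketch that reuses the calls from the cell above and inspects the shape, dtype and device of each result.

```python
import torch

a = torch.empty(2, 2)                       # uninitialized memory: values are arbitrary
b = torch.zeros(2, 2, dtype=torch.long)     # every element is exactly 0

d = a.new_ones(10, 10, dtype=torch.double)  # ones; dtype overridden, device taken from a
e = torch.randn_like(d, dtype=torch.float)  # N(0, 1) samples with d's shape; dtype overridden

print(b)
print(d.shape, d.dtype, d.device)
print(e.shape, e.dtype, e.device)
```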
    

Next, see the methods available on a tensor object by typing `x.` and pressing TAB.
To see the docstring of a specific function, select or type the function and then press SHIFT+TAB.

In [13]:
#Image("assets/img/tabcomplete.png")

        

###  Tensor to Numpy array conversion



In [14]:
test_array = np.arange(16)
test_tensor = torch.from_numpy(test_array)
test_array2 = test_tensor.numpy()
print(f"The type of test_array is, {type(test_array)}\n")
print(f"The type of test_tensor is, {type(test_tensor)}\n")
print(f"Are the shapes of the objects equal? : {test_array.shape == test_tensor.shape}\n")
print(f"After converting the tensor back to Numpy array, is the initial array equivalent to the converted array?")
print(f"{all(test_array == test_array2)}")

The type of test_array is, <class 'numpy.ndarray'>

The type of test_tensor is, <class 'torch.Tensor'>

Are the shapes of the objects equal? : True

After converting the tensor back to Numpy array, is the initial array equivalent to the converted array?
True
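
One detail worth knowing about this conversion: `torch.from_numpy` shares memory with the source array, while `torch.tensor` copies the data. The following minimal sketch, with illustrative variable names, shows the difference.

```python
import numpy as np
import torch

array = np.arange(4)
shared = torch.from_numpy(array)   # shares memory with `array`
copied = torch.tensor(array)       # makes an independent copy

array[0] = 100                     # modify the NumPy array in place

print(shared)   # reflects the change
print(copied)   # still holds the original values
```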


### Tensor Reshaping : View

In [15]:
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8) 
print(x.size(), y.size(), z.size())

torch.Size([4, 4]) torch.Size([16]) torch.Size([2, 8])


`view` is similar to NumPy's `reshape`. What is the meaning of the parameter -1?

<div class="alert alert-info"><h4>Note</h4><p>-1 means that the size of that dimension is left unspecified and is inferred from the total number of elements and the other dimensions. For example, if you have a 4 x 4 tensor x and call x.view(-1, 8), then -1 stands for 2, since 16 / 8 = 2.</p></div>
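
The sketch below illustrates three properties of `view` under simple assumptions: the inferred size for -1, the fact that a view shares storage with the original tensor, and the failure on a non-contiguous tensor, where `reshape` can be used instead.

```python
import torch

x = torch.arange(16).view(4, 4)

z = x.view(-1, 8)        # 16 elements / 8 columns -> 2 rows inferred
print(z.shape)

# A view shares storage with the original tensor
z[0, 0] = 99
print(x[0, 0])

# view needs a compatible memory layout; a transposed tensor is not contiguous
xt = x.t()
try:
    xt.view(16)
except RuntimeError as err:
    print("view failed:", err)

flat = xt.reshape(16)    # reshape copies the data when a view is not possible
print(flat.shape)
```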



## Tensors on CUDA

The natural representation of many types of data is a high-dimensional array.
GPUs have long been a good fit for machine learning. GPU cores were originally designed for physics and graphics computations, which involve many matrix operations. General computing tasks do not require many matrix operations, so CPUs are not optimized for them and are much slower at these. Physics and graphics workloads are also far easier to parallelise than general computing tasks, which led to the high core counts of GPUs.

Because machine learning (neural nets in particular) is so matrix-heavy, GPUs are a great fit.
Next we are going to define the CUDA device, if available:

In [16]:
# Select the device: reuse cuda0 if it was created earlier, otherwise look for CUDA
if cuda0 is not None:
    device = cuda0
    print("Cuda device cuda0 loaded before")
elif torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
else:
    device = torch.device("cpu")           # fall back to the CPU so that `device` is always defined
    print("Your system doesn't have CUDA")

Your system doesn't have CUDA


To create a tensor directly on the GPU, pass the ``device`` argument.
If the tensor was created earlier in a memory location other than GPU memory, use ``tensorobject.to(device)`` to transfer it to GPU memory.

In [17]:
if cuda0 is not None:
    x = torch.ones(3,device="cuda")
else :
    cuda0 = 'cpu'
    x = torch.ones(3,device=cuda0)
x.device.type

'cpu'
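
Both ways of placing a tensor on a device can be sketched as follows. This example falls back to the CPU when CUDA is unavailable, so it should run on any machine; the variable names are illustrative.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

created_on_device = torch.ones(3, device=device)   # allocated directly on `device`

created_on_cpu = torch.ones(3)                     # lives in ordinary (CPU) memory
moved = created_on_cpu.to(device)                  # returns a tensor on `device`

print(created_on_device.device, moved.device)
print((created_on_device + moved).device)          # both operands are on the same device
```

Operations between tensors generally require both operands to live on the same device, which is why the transfer step matters.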

# Basic Transformations in Pytorch

## Addition of Tensors

Before learning any transformations, it is useful to know how quickly a given transformation runs. For that purpose, feel free to use the IPython magic command %%timeit, which runs the cell repeatedly in a loop and reports its average running time; the related %%time, used in the next cell, times a single run.

Take 5 minutes to go over the following top 5 magic commands in Jupyter; it might save you a lot of time later.

[TOP 5 Jupyter Magic commands](https://towardsdatascience.com/the-top-5-magic-commands-for-jupyter-notebooks-2bf0c5ae4bb8)
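
Outside a notebook, a similar measurement can be made with Python's standard `timeit` module. This is a minimal sketch; the tensor size and the number of runs are arbitrary choices for illustration.

```python
import timeit
import torch

x = torch.rand(1000)
y = torch.rand(1000)

# Run the statement 1000 times and report the average time per call
n_runs = 1000
total_seconds = timeit.timeit(lambda: torch.add(x, y), number=n_runs)
print(f"torch.add: {total_seconds / n_runs * 1e6:.2f} microseconds per call")
```

Note that GPU operations run asynchronously, so timings of CUDA code are only meaningful if `torch.cuda.synchronize()` is called before the clock is read.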

In [18]:
%%time
x = torch.tensor([1,2,3],device=cuda0)
y = torch.tensor([5,2,1],device=cuda0)

# way 1 :
result = torch.add(x,y)
print(result,y)

print(x)

tensor([6, 4, 4]) tensor([5, 2, 1])
tensor([1, 2, 3])
CPU times: user 3.36 ms, sys: 0 ns, total: 3.36 ms
Wall time: 8.1 ms


Addition: providing an output tensor as argument



In [19]:
# adds x to y
#y.add_(x)
x = torch.tensor([1,2,3],device=cuda0)
print(x)

tensor([1, 2, 3])


<div class="alert alert-info"><h4>Note</h4><p>Any operation that mutates a tensor in-place is post-fixed with an ``_``.
    For example: ``x.copy_(y)``, ``x.t_()``, will change ``x``.</p></div>
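
Besides `torch.add(x, y)`, the two other ways of adding mentioned in this section can be sketched like this: writing into a preallocated output tensor with the `out` argument, and adding in place with `add_`.

```python
import torch

x = torch.tensor([1, 2, 3])
y = torch.tensor([5, 2, 1])

# Way 2: write the sum into a preallocated output tensor
result = torch.empty(3, dtype=x.dtype)
torch.add(x, y, out=result)
print(result)            # tensor([6, 4, 4])

# Way 3: in-place addition, y is overwritten with x + y
y.add_(x)
print(y)                 # tensor([6, 4, 4])
print(x)                 # x is unchanged: tensor([1, 2, 3])
```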

You can use standard NumPy-like indexing with all bells and whistles!
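
A few common indexing patterns, shown on a small example tensor chosen for illustration:

```python
import torch

t = torch.arange(12).view(3, 4)
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]])

print(t[0])          # first row
print(t[:, 1])       # second column
print(t[1:, :2])     # rows 1..2, columns 0..1
print(t[t > 8])      # boolean mask: elements greater than 8
print(t[-1, -1])     # last element
```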



### Homework Exercise on Tensor Addition Error Handling:

Write a function ``try_adding_different_locations`` that takes 4 input arguments:
- x : the input tensor. Its default value should be a 2nd-order 3x3 tensor of ones, created on the CPU
- device : its default value should be the CUDA device available on your system
- notbtoh : bool, with True as the default value
- output_type : string, with "cpu" as the default value

TASK: **The function should try out all combinations of tensor locations and report for which of them tensor addition works and for which it does not.**

In [20]:
def try_adding_different_locations(x = torch.tensor([1,2,3]) , device = cuda0, notbtoh = True, output_type = 'cpu'):
    A = x
    print(A)
                                   
    

## Matrix Multiplication on Tensor Objects using mm

From a linear algebra course, you may remember the concept of matrix multiplication.
Since an image is a tensor, and a tensor is a generalization of a matrix, matrix multiplication also extends to tensors.

In **tensor calculus**, as taught at universities, the operations defined between tensors include, for example, tensor addition, the tensor product and contraction.

**The multiplication routines covered here are matrix multiplications, not general tensor products.**
``torch.mm`` accepts only 2-D tensors (matrices), while ``torch.matmul`` additionally treats any leading dimensions as batch dimensions. More general products and contractions are available through functions such as ``torch.tensordot`` and ``torch.einsum``, but in practice most multiplications reduce to the 2-D case.

To multiply tensors, use ``torch.mm`` or ``torch.matmul``.
<div class="alert alert-warning" role="alert">
Don't confuse ``dot`` and ``mm`` operators, check <a href="https://stackoverflow.com/questions/44524901/how-to-do-product-of-matrices-in-pytorch" class="alert-link">this link.</a>
 
To get element-wise product, use A*B.
</div>
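
The difference between these products can be seen side by side in a minimal sketch with small example tensors:

```python
import torch

A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[0., 1.], [1., 0.]])
u = torch.tensor([1., 2., 3.])
v = torch.tensor([4., 5., 6.])

print(A * B)              # element-wise product
print(torch.mm(A, B))     # matrix product of two 2-D tensors
print(A @ B)              # same result via the @ operator (torch.matmul)
print(torch.dot(u, v))    # inner product of two 1-D tensors: 1*4 + 2*5 + 3*6
```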


In [21]:
a = torch.from_numpy(np.array([[1,2,3],[4,5,6]]))
b = a
print(a.shape)
print(b.shape)
c = torch.mm(a,b)  # doesn't work, do you know why?


torch.Size([2, 3])
torch.Size([2, 3])


RuntimeError: size mismatch, m1: [2 x 3], m2: [2 x 3] at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/TH/generic/THTensorMath.cpp:752

**Can the tensors a and  b be matrix-multiplied? Why / Why not?**

To matrix-multiply two objects, the last dimension of the first object must equal the first dimension of the second object.
If the two objects have identical shapes, one way to make the multiplication compatible is to transpose either of them. Let's try it.

In [22]:
c = torch.mm(a,b.t())
d = torch.mm(a.t(),b)
print(c)
print(d)

tensor([[14, 32],
        [32, 77]])
tensor([[17, 22, 27],
        [22, 29, 36],
        [27, 36, 45]])


## In-Place Operations

Guess what the following code is going to print:

In [23]:
a = torch.tensor(np.array([[1,2,3],[4,5,6]]))
b = a
try:
    torch.mm(a,b.t())
    torch.add(a,b)
    print("First multiplication and addition succeeded")
except :
    print("First  multiplication failed")

try :
    print(f"The result of in-place addition is {a.add_(b)}")
    print("Second addition succeeded")
    (a.t()).mm_(b)
    print("Second multiplication succeeded")
except:
    print("Second  multiplication failed")


First multiplication and addition succeeded
The result of in-place addition is tensor([[ 2,  4,  6],
        [ 8, 10, 12]])
Second addition succeeded
Second  multiplication failed


Was your guess correct?

The crux lies in the meaning of the trailing underscore: many PyTorch methods have an underscored version (e.g. ``add_``), which stands for an in-place operation. This means that the object on which the method is called is changed in memory, without the result having to be saved explicitly.

Thus, a binary operation $$(x , y)  \rightarrow z$$ that takes two arguments and produces a new object often has an in-place counterpart

$$x.method\_(y) \rightarrow x$$

which is called with a single explicit argument, while the object the method is called on (in this case ``x``) is overwritten with the result in memory.


<div class="alert alert-warning" role="alert">
    Notice, however, that matrix multiplication as defined by ``mm``, ``matmul`` or ``@`` does not have an in-place version; it always returns a new tensor.
</div>
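
One way to observe the difference is to compare memory addresses with `data_ptr()`. The sketch below uses small example tensors: the in-place addition keeps the same storage, whereas the matrix product returns a new tensor, here with a different shape than either input.

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6]])
b = torch.tensor([[1, 1, 1], [1, 1, 1]])

address_before = a.data_ptr()   # memory address of a's first element
a.add_(b)                       # in-place: a's existing storage is overwritten
print(a.data_ptr() == address_before)

c = a @ b.t()                   # matrix product allocates a new (2 x 2) result
print(c.shape, c.data_ptr() == a.data_ptr())
```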


## Neural Network Definition
[Neural network on Wikipedia](https://en.wikipedia.org/wiki/Neural_network)

A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes.  
The connections of the biological neuron are modeled as weights, denoted by $w$ in the following graph. A positive weight reflects an excitatory connection, while a negative weight means an inhibitory connection. All inputs are multiplied by their weights and summed. Finally, an activation function (denoted by $f$ in the following graph) controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be −1 and 1. Each neuron in the next layer thus receives a linear combination of the activations (the outputs of $f$ in the image below) of the neurons in the previous layer.
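
A single artificial neuron of this kind can be sketched in a few lines. The input values, the weights, the bias term and the choice of a sigmoid for $f$ are all assumptions made for illustration.

```python
import torch

inputs = torch.tensor([0.5, -1.0, 2.0])    # activations from the previous layer
weights = torch.tensor([0.8, 0.2, -0.5])   # positive = excitatory, negative = inhibitory
bias = torch.tensor(0.1)                   # hypothetical bias term, not shown in the text

weighted_sum = torch.dot(weights, inputs) + bias   # linear combination of the inputs
output = torch.sigmoid(weighted_sum)               # f squashes the result into (0, 1)

print(weighted_sum.item(), output.item())
```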

In [24]:
from IPython.display import Image
#Image("assets/img/neural_network.png")

## Autograd : Automatic Differentiation

<div class="alert alert-success" role="alert">
  For the definition of neural network, please refer to the following link:
</div>



[NEURAL NETWORK DEFINITION](hb#Neural-Network-Definition)

**Autograd**, one of the modern automatic differentiation frameworks, lies at the **core of PyTorch**.

Thus, it is a very good idea to [read more about Autograd](https://pytorch.org/docs/master/notes/autograd.html).

The Autograd package in PyTorch automates the computation of backward passes in computational graphs.

In Autograd: 
- the forward pass of the neural network defines a computational graph
- nodes in the graph are Tensors
- edges are functions producing output Tensors from input Tensors
- the backward pass through this graph computes the gradients of some scalar (typically the loss) with respect to those tensors ``x`` for which ``x.requires_grad`` = ``True``.

**Thus, to compute the gradient with respect to a tensor, create it with ``requires_grad=True``.**
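
The graph that the forward pass records can be inspected through the `is_leaf` and `grad_fn` attributes. Here is a minimal sketch with two illustrative leaf tensors:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)   # leaf tensor created by the user
w = torch.tensor([3.0], requires_grad=True)   # another leaf

y = w * x          # forward pass records a multiplication node
loss = y.sum()     # scalar output

print(x.is_leaf, y.is_leaf)   # True False
print(y.grad_fn)              # the function that produced y (a MulBackward0 node)
print(loss.grad_fn)           # the function that produced loss (a SumBackward0 node)

loss.backward()               # backward pass walks the graph from loss to the leaves
print(x.grad, w.grad)         # d(loss)/dx = w = 3, d(loss)/dw = x = 2
```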

In [25]:
x = torch.tensor([2],requires_grad=True,dtype=torch.float)
y = torch.tensor([3],requires_grad=True,dtype=torch.float)
z = x + y 
z.backward()
print(f"Gradient with respect to z is {z.grad}")
print(f"Gradient with respect to x is {x.grad}")
print(f"Gradient with respect to y is {y.grad}")

Gradient with respect to z is None
Gradient with respect to x is tensor([1.])
Gradient with respect to y is tensor([1.])


Here we saw the most basic way to get a derivative or gradient.
``z.grad`` is ``None`` because z is not a leaf tensor: it was computed from x and y, and by default autograd stores gradients only for leaf tensors created with requires_grad=True (call ``z.retain_grad()`` before ``backward()`` to keep it).
The gradient with respect to x is available, and since the derivative of z = x + y w.r.t. x is 1, the result is tensor([1.]), a floating point value because of the chosen data type.
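
If the gradient of an intermediate tensor is needed, `retain_grad()` asks autograd to keep it. The sketch below repeats the example with that one extra call:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

z = x + y
print(z.requires_grad, z.is_leaf)   # True False: z is tracked but is not a leaf
z.retain_grad()                     # ask autograd to keep z.grad after backward()
z.backward()

print(z.grad)   # dz/dz = 1
print(x.grad)   # dz/dx = 1
print(y.grad)   # dz/dy = 1
```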

<div class="alert alert-warning" role="alert">
 Note: In neural networks, x and y are often the symbols used for input and output data, and we are usually not required to compute gradients w.r.t. the data.
    
The reason is simple:
When we have a model, e.g. a neural network treated as a black-box model, the model itself should not make assumptions about, or depend explicitly on, the data: training adjusts the model's parameters, so gradients are needed with respect to those parameters rather than the data. It is up to the machine learning engineer to define the model or network structure so that the model learns from the data.
</div>

*Let's take a look at another example*

In [26]:
x = torch.tensor([1, 2], requires_grad=True, dtype=torch.float)
y = torch.tensor([1, 3], requires_grad=True, dtype=torch.float)
print(f"At the beginning, x is {x}")
print(f"At the beginning, y is {y}")

x = x ** 2
y = y ** 2

print(f"After squaring, x is {x}")
print(f"After squaring, y is {y}")

x.add_(y)

print(f"After adding y to x in-place, the value of x is {x}")

z = x.sum()
print(f"The sum of the elements in this array is {z}")
z.backward()
print("Finally, after running the backward step to compute the gradients, we get:")
print(z)
# Note that z is a non-leaf tensor, so its gradient is not retained by default and z.grad prints None.
print(f"The gradient of z is {z.grad}")

At the beginning, x is tensor([1., 2.], requires_grad=True)
At the beginning, y is tensor([1., 3.], requires_grad=True)
After squaring, x is tensor([1., 4.], grad_fn=<PowBackward0>)
After squaring, y is tensor([1., 9.], grad_fn=<PowBackward0>)
After adding y to x in-place, the value of x is tensor([ 2., 13.], grad_fn=<AddBackward0>)
The sum of the elements in this array is 15.0
Finally, after running the backward step to compute the gradients, we get:
tensor(15., grad_fn=<SumBackward0>)
The gradient of z is None


### Verifying the correctness of Autograd Output

In [27]:
from __future__ import print_function
from torch.autograd import Variable  # deprecated since PyTorch 0.4: plain tensors accept requires_grad directly
import torch
import numpy as np
# important : every time before generating random numbers, you should set the seed!
torch.manual_seed(0)
np.random.seed(0)

x = Variable(torch.randn(1,1), requires_grad = True)
print(f"The value sampled from normal distribution gives {x} \n")
y = 3*x
print(f"When we form a linear function of {x} by multiplying this sample value by 3, we get {y}\n")
z = y**2 # derivative w.r.t. x is:
print(f"After that we form a quadratic function of y, and at this sample value, it takes a value of {z} \n")
print("Now we are going to register the hooks for each variable \n")
x.register_hook(print)
y.register_hook(print)

#z.register_hook(print)
z.backward(retain_graph=True) # prints all the gradients needed
y.backward(retain_graph=True) # prints all the gradients needed

print(f"Gradients obtained from the first graph are {z.grad} \n")
#print(f"Gradients obtained from the second graph are {z2.grad} \n")





The value sampled from normal distribution gives tensor([[1.5410]], requires_grad=True) 

When we form a linear function of tensor([[1.5410]], requires_grad=True) by multiplying this sample value by 3, we get tensor([[4.6230]], grad_fn=<MulBackward0>)

After that we form a quadratic function of y, and at this sample value, it takes a value of tensor([[21.3720]], grad_fn=<PowBackward0>) 

Now we are going to register the hooks for each variable 

tensor([[9.2460]])
tensor([[27.7379]])
tensor([[1.]])
tensor([[3.]])
Gradients obtained from the first graph are None 



You might have noticed that z.grad prints None: z is not a leaf tensor, so its gradient is not stored. To see the gradients flowing through intermediate tensors, register a hook on that particular tensor, as done for x and y above.

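**The same example with the graph broken:** here every intermediate result is wrapped in a new ``Variable``, which makes it a fresh leaf with no history, so ``backward()`` stops at ``z2`` and only the trivial gradient of ``z2`` with respect to itself is reported.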

In [28]:
torch.manual_seed(0)
x = Variable(torch.randn(1,1), requires_grad = False)
y2 = Variable(3*x,requires_grad=False)
#y2.register_hook(print)
z2 = Variable((y2)**2,requires_grad=True)
z2.register_hook(print)
z2.backward()
print(f"The value of y is {y}")
print(z2.grad)

tensor([[1.]])
The value of y is tensor([[4.6230]], grad_fn=<MulBackward0>)
tensor([[1.]])


In [29]:
print(f"The z value is {9*1.54**2}")
print(f"The grad_y(z) is {6*1.54}")
print(f"The grad_x(z) is {18*1.54}")

The z value is 21.3444
The grad_y(z) is 9.24
The grad_x(z) is 27.72


### Recap : How to properly pass tensors and find their gradients

In [30]:
from __future__ import print_function
from torch.autograd import Variable
import torch
torch.manual_seed(0)
xx = Variable(torch.randn(1,1), requires_grad = True)
yy = 3*xx
zz = yy**2

yy.register_hook(print)
zz.backward()

tensor([[9.2460]])


To be sure that the differentiation is working correctly, let us compute the following:
$$A(x)  = \partial_x z(y(x)) = \partial_x (y^2(x)) = \partial_x (9x^2) = 18x$$.
With the seed used above, $x = 1.5410$ and $$A(x) = 18 \cdot x \approx 27.738,$$ which agrees with the value 27.7379 that the hook on x printed earlier. For another sample, e.g. $x = -1.5298$, the same formula gives $$A(x) = 18 \cdot x = -27.5364 \approx -27.5356$$ 

We see how PyTorch can be used to evaluate derivatives numerically, since the chain rule is implemented internally. Note that this is automatic differentiation at a given point, not symbolic mathematics. For computationally intensive tasks, an efficient implementation of automatic differentiation is a huge benefit!

In [31]:
18*(-1.5298) 

-27.5364
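
The same check can be automated by comparing the autograd result with the hand-derived formula $18x$. This is a minimal sketch that uses a plain tensor with `requires_grad=True` instead of `Variable`.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 1, requires_grad=True)

y = 3 * x
z = y ** 2            # z = 9 x^2
z.backward()          # z has a single element, so no gradient argument is needed

analytic = 18 * x.detach()   # dz/dx = 18 x, derived by hand
print(x.grad, analytic)
print(torch.allclose(x.grad, analytic))
```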

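Note that ``backward()`` does not return a full gradient (Jacobian) for arbitrary outputs: it computes a vector-Jacobian product, i.e. a directional derivative, and it can create the starting vector implicitly only when the output is a scalar. If we call ``backward()`` directly on a non-scalar tensor x, we get: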

In [32]:
x = torch.tensor([2,4,5],requires_grad=True,dtype=torch.float)
x.register_hook(print)
x.backward()
print(f"Gradient with respect to x is {x.grad}")

RuntimeError: grad can be implicitly created only for scalar outputs

<div class="alert alert-warning" role="error">

RuntimeError: grad can be implicitly created only for scalar outputs
</div>
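
There are two common ways around this error, sketched here on an assumed example function `y = x ** 2`: reduce the output to a scalar first, or pass a vector through the `gradient` argument of `backward()`.

```python
import torch

x = torch.tensor([2.0, 4.0, 5.0], requires_grad=True)
y = x ** 2                       # non-scalar output with three elements

# Option 1: reduce to a scalar first
y.sum().backward()
grad_from_sum = x.grad.clone()   # 2 * x
print(grad_from_sum)

# Option 2: pass a vector v; backward computes the vector-Jacobian product
x.grad.zero_()                   # gradients accumulate, so reset before the second pass
y = x ** 2                       # rebuild the graph, which was freed by the first backward
v = torch.tensor([1.0, 0.1, 0.01])
y.backward(gradient=v)
print(x.grad)                    # v * 2 * x, element by element
```

Passing a vector of ones as `v` reproduces the result of the first option.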

## Autograd Application : Gradient Descent in PyTorch

Next we are going to take a look at a basic example, adapted from the [Github demo for Autograd](https://github.com/jcjohnson/pytorch-examples#pytorch-autograd).

Please make sure that the ``log_path`` variable is initialized properly
[in setup](#Setting-Up-Visualization-Platform-on-Pytorch)

In [33]:
# Upgrade tensorboardX for the current user (pip does not read from standard input,
# so nothing needs to be piped into the command)
!pip install --upgrade tensorboardX --user

Requirement already up-to-date: tensorboardX in /home/zak/.local/lib/python3.7/site-packages (1.8)


In [34]:
from tensorboardX import SummaryWriter
log_path = './runs/gd/'

if log_path:
    print("accessing predefined path")
    writer = SummaryWriter(log_dir=log_path)
else :
    print("using new path set")
    writer = SummaryWriter(log_dir='./runs/gd/')
#In addition to SummaryWriter, there are also other writers, please check the manual
# https://tensorboardx.readthedocs.io/en/latest/tutorial.html

# !tensorboard --logdir {log_path} --host localhost --port 8088
# you have to execute tensorboard command from another shell, otherwise you cannot proceed with running the notebook
# read this : https://medium.com/@anthony_sarkis/tensorboard-quick-start-in-5-minutes-e3ec69f673af


accessing predefined path


In [35]:
writer

<tensorboardX.writer.SummaryWriter at 0x7fab87624710>

In [36]:
from torch.distributions import normal
m = normal.Normal(4.0, 5.0)
m

Normal(loc: 4.0, scale: 5.0)

In [72]:
# Code in file autograd/two_layer_net_autograd.py
import torch
import torch.nn.functional as F

#device = torch.device('cuda') # Uncomment this to run on GPU
device = torch.device('cpu') # Comment this out when running on GPU

# N is batch size;
# D_in is input dimension;
# H is hidden dimension; 
# D_out is output dimension.
N, D_in, H, D_out = 64, 350, 50, 20

# input data : batch of 64 samples with 350 features each
# output data : 64 x 20 continuous values (real scalars)

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device, dtype=torch.float)   # normally distributed data of dim N x D_in, stored on device, requires_grad = False
y = torch.randn(N, D_out, device=device, dtype=torch.float)  # normally distributed data of dim N x D_out, stored on device, requires_grad = False

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
# here the gradient will be computed for the variables that are related to the model learning something new,
# i.e. the network weights in this case


# WEIGHTS
w1 = torch.randn(D_in, H, device=device, requires_grad=True)#generate normally distributed data of dim D_in x H, store it on device, requires_grad = True 
w2 = torch.randn(H, D_out, device=device, requires_grad=True)#generate normally distributed data of dim H x D_out, store it on device, requires_grad = True 

# initialize errors, w1_array and w2_array as empty lists; they collect the training history
errors = []
w1_array = []
w2_array = []
# set the network learning rate parameter 'learning_rate' to some small number
learning_rate = 1e-6
for epoch in range(10000):
    # forward pass: linear layer, ReLU (clamp at zero), linear layer
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    # sum of squared errors over the whole batch
    loss_value = ((y_pred - y) ** 2).sum()
    print(epoch, loss_value.item())
    # backward pass: autograd fills w1.grad and w2.grad
    loss_value.backward()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 12302852.0
1 8019552.0
2 6195708.5
3 5222990.0
4 4599080.0
5 4140169.5
6 3773069.75
7 3464335.75
8 3197106.25
9 2961979.75
10 2753143.5
11 2566442.5
12 2398336.0
13 2246993.5
14 2109772.5
15 1984959.25
16 1871135.25
17 1767051.125
18 1671560.75
19 1583762.75
20 1502630.5
21 1427597.875
22 1358164.75
23 1293704.75
24 1233864.75
25 1178325.5
26 1126530.125
27 1078180.25
28 1032857.3125
29 990372.5625
30 950557.875
31 913147.9375
32 877964.125
33 844820.5
34 813598.5
35 784142.625
36 756308.25
37 729942.75
38 704930.3125
39 681209.625
40 658620.375
41 637172.5625
42 616792.75
43 597409.75
44 578932.375
45 561298.625
46 544516.6875
47 528491.125
48 513153.875
49 498518.59375
50 484518.375
51 471135.03125
52 458341.625
53 446055.90625
54 434262.375
55 422929.40625
56 412054.09375
57 401606.375
58 391544.8125
59 381855.1875
60 372519.125
61 363538.9375
62 354895.8125
63 346579.03125
64 338544.8125
65 330788.375
66 323299.15625
67 316044.21875
68 309045.1875
69 302288.40625
70 295752.6875
7

628 6640.92626953125
629 6621.79345703125
630 6602.734375
631 6583.7783203125
632 6564.89599609375
633 6546.09130859375
634 6527.37744140625
635 6508.732421875
636 6490.16796875
637 6471.6943359375
638 6453.29736328125
639 6434.9736328125
640 6416.7412109375
641 6398.580078125
642 6380.49072265625
643 6362.486328125
644 6344.572265625
645 6326.7490234375
646 6308.9697265625
647 6291.275390625
648 6273.65625
649 6256.138671875
650 6238.65087890625
651 6221.2587890625
652 6203.9365234375
653 6186.69189453125
654 6169.509765625
655 6152.40576171875
656 6135.369140625
657 6118.43798828125
658 6101.5517578125
659 6084.7236328125
660 6067.9775390625
661 6051.31396484375
662 6034.71630859375
663 6018.1689453125
664 6001.705078125
665 5985.314453125
666 5968.99267578125
667 5952.7236328125
668 5936.5341796875
669 5920.4267578125
670 5904.37158203125
671 5888.4091796875
672 5872.5
673 5856.66259765625
674 5840.88427734375
675 5825.1708984375
676 5809.51708984375
677 5793.931640625
678 5778.4062

1288 1753.0953369140625
1289 1750.5185546875
1290 1747.9495849609375
1291 1745.3863525390625
1292 1742.82861328125
1293 1740.2767333984375
1294 1737.72900390625
1295 1735.1865234375
1296 1732.654052734375
1297 1730.12353515625
1298 1727.59619140625
1299 1725.075927734375
1300 1722.5648193359375
1301 1720.0557861328125
1302 1717.5531005859375
1303 1715.0545654296875
1304 1712.56201171875
1305 1710.0743408203125
1306 1707.59326171875
1307 1705.1171875
1308 1702.6468505859375
1309 1700.1806640625
1310 1697.7218017578125
1311 1695.2672119140625
1312 1692.817138671875
1313 1690.372802734375
1314 1687.9344482421875
1315 1685.500244140625
1316 1683.0906982421875
1317 1680.6859130859375
1318 1678.2855224609375
1319 1675.890869140625
1320 1673.500732421875
1321 1671.11474609375
1322 1668.73681640625
1323 1666.362548828125
1324 1663.9923095703125
1325 1661.6278076171875
1326 1659.2679443359375
1327 1656.9149169921875
1328 1654.566162109375
1329 1652.221923828125
1330 1649.8831787109375
1331 1647

1934 812.4495849609375
1935 811.6617431640625
1936 810.87646484375
1937 810.091796875
1938 809.3092041015625
1939 808.527099609375
1940 807.7461547851562
1941 806.9656372070312
1942 806.1878051757812
1943 805.4103393554688
1944 804.6337890625
1945 803.858642578125
1946 803.0856323242188
1947 802.3126220703125
1948 801.5418701171875
1949 800.7721557617188
1950 800.00341796875
1951 799.2356567382812
1952 798.4690551757812
1953 797.7034912109375
1954 796.9393310546875
1955 796.1763916015625
1956 795.414794921875
1957 794.654541015625
1958 793.8958740234375
1959 793.1380004882812
1960 792.38134765625
1961 791.6251220703125
1962 790.8704223632812
1963 790.1167602539062
1964 789.3638305664062
1965 788.6136474609375
1966 787.8631591796875
1967 787.1139526367188
1968 786.3663330078125
1969 785.6204833984375
1970 784.8746948242188
1971 784.1306762695312
1972 783.3883056640625
1973 782.646240234375
1974 781.905517578125
1975 781.1657104492188
1976 780.4275512695312
1977 779.6901245117188
1978 77

2594 480.3865966796875
2595 480.0800476074219
2596 479.77386474609375
2597 479.46826171875
2598 479.16278076171875
2599 478.8578796386719
2600 478.5539245605469
2601 478.24957275390625
2602 477.9461364746094
2603 477.6429138183594
2604 477.3397521972656
2605 477.0370178222656
2606 476.7350158691406
2607 476.4331359863281
2608 476.13165283203125
2609 475.8306579589844
2610 475.52984619140625
2611 475.22991943359375
2612 474.9302673339844
2613 474.6304931640625
2614 474.3314208984375
2615 474.03271484375
2616 473.7341003417969
2617 473.4359436035156
2618 473.1387634277344
2619 472.8414306640625
2620 472.544677734375
2621 472.2484130859375
2622 471.9524841308594
2623 471.6571044921875
2624 471.36175537109375
2625 471.0670471191406
2626 470.77264404296875
2627 470.4785461425781
2628 470.1843566894531
2629 469.890869140625
2630 469.5980529785156
2631 469.30572509765625
2632 469.0135498046875
2633 468.72174072265625
2634 468.4300537109375
2635 468.1390380859375
2636 467.84814453125
2637 467.

3260 341.0398254394531
3261 340.90167236328125
3262 340.7636413574219
3263 340.62548828125
3264 340.48779296875
3265 340.3502197265625
3266 340.212646484375
3267 340.07550048828125
3268 339.9383239746094
3269 339.80133056640625
3270 339.66461181640625
3271 339.5278625488281
3272 339.3911437988281
3273 339.25482177734375
3274 339.1184997558594
3275 338.9822998046875
3276 338.8464660644531
3277 338.7105712890625
3278 338.5751037597656
3279 338.4396057128906
3280 338.30426025390625
3281 338.1688537597656
3282 338.0338134765625
3283 337.8990783691406
3284 337.76422119140625
3285 337.6295471191406
3286 337.49505615234375
3287 337.36090087890625
3288 337.2268371582031
3289 337.0927734375
3290 336.9588317871094
3291 336.8255920410156
3292 336.691650390625
3293 336.55828857421875
3294 336.4249267578125
3295 336.2918701171875
3296 336.15875244140625
3297 336.0261535644531
3298 335.8934631347656
3299 335.76080322265625
3300 335.6286315917969
3301 335.496337890625
3302 335.36407470703125
3303 335

3911 276.0351257324219
3912 275.96466064453125
3913 275.8941955566406
3914 275.8233947753906
3915 275.75299072265625
3916 275.68255615234375
3917 275.61236572265625
3918 275.54193115234375
3919 275.4718322753906
3920 275.4017333984375
3921 275.33172607421875
3922 275.2615661621094
3923 275.191650390625
3924 275.1219177246094
3925 275.0521240234375
3926 274.9822998046875
3927 274.9125061035156
3928 274.84295654296875
3929 274.7734680175781
3930 274.7041015625
3931 274.6346435546875
3932 274.5653076171875
3933 274.49615478515625
3934 274.4268798828125
3935 274.3578186035156
3936 274.28863525390625
3937 274.21966552734375
3938 274.1506652832031
3939 274.081787109375
3940 274.0131530761719
3941 273.9443664550781
3942 273.87554931640625
3943 273.8070373535156
3944 273.73870849609375
3945 273.66998291015625
3946 273.60162353515625
3947 273.5334167480469
3948 273.46490478515625
3949 273.396728515625
3950 273.32861328125
3951 273.2604675292969
3952 273.19232177734375
3953 273.1243591308594
395

4548 241.88882446289062
4549 241.8484649658203
4550 241.8083038330078
4551 241.7682647705078
4552 241.7283935546875
4553 241.68820190429688
4554 241.64828491210938
4555 241.60830688476562
4556 241.56832885742188
4557 241.52847290039062
4558 241.48843383789062
4559 241.44879150390625
4560 241.40902709960938
4561 241.369140625
4562 241.32952880859375
4563 241.28968811035156
4564 241.25015258789062
4565 241.21026611328125
4566 241.17086791992188
4567 241.13128662109375
4568 241.0916748046875
4569 241.0520782470703
4570 241.0125732421875
4571 240.97337341308594
4572 240.93394470214844
4573 240.89459228515625
4574 240.85507202148438
4575 240.81582641601562
4576 240.7764434814453
4577 240.7373046875
4578 240.6981201171875
4579 240.6588897705078
4580 240.6197509765625
4581 240.58042907714844
4582 240.54161071777344
4583 240.50244140625
4584 240.46343994140625
4585 240.42446899414062
4586 240.3854522705078
4587 240.34649658203125
4588 240.3076171875
4589 240.26856994628906
4590 240.22981262207

5233 220.53732299804688
5234 220.51339721679688
5235 220.4892578125
5236 220.4654083251953
5237 220.44137573242188
5238 220.4175262451172
5239 220.39364624023438
5240 220.36953735351562
5241 220.3458251953125
5242 220.32211303710938
5243 220.2982177734375
5244 220.2743682861328
5245 220.25045776367188
5246 220.22662353515625
5247 220.20298767089844
5248 220.17909240722656
5249 220.15536499023438
5250 220.13157653808594
5251 220.10781860351562
5252 220.08433532714844
5253 220.06057739257812
5254 220.0368194580078
5255 220.01329040527344
5256 219.98974609375
5257 219.96604919433594
5258 219.9423828125
5259 219.918701171875
5260 219.89524841308594
5261 219.87179565429688
5262 219.84828186035156
5263 219.82460021972656
5264 219.80125427246094
5265 219.77743530273438
5266 219.75424194335938
5267 219.73056030273438
5268 219.70718383789062
5269 219.683837890625
5270 219.66036987304688
5271 219.6370086669922
5272 219.61349487304688
5273 219.5902099609375
5274 219.56698608398438
5275 219.543701

5587 212.97052001953125
5588 212.95150756835938
5589 212.93252563476562
5590 212.91355895996094
5591 212.89453125
5592 212.87562561035156
5593 212.8566436767578
5594 212.83770751953125
5595 212.81878662109375
5596 212.7999725341797
5597 212.78109741210938
5598 212.76220703125
5599 212.74327087402344
5600 212.72439575195312
5601 212.70550537109375
5602 212.6866455078125
5603 212.6679229736328
5604 212.64889526367188
5605 212.63031005859375
5606 212.61131286621094
5607 212.59259033203125
5608 212.5738983154297
5609 212.5550537109375
5610 212.53639221191406
5611 212.51751708984375
5612 212.49899291992188
5613 212.48016357421875
5614 212.46139526367188
5615 212.4427032470703
5616 212.42410278320312
5617 212.40550231933594
5618 212.38681030273438
5619 212.36810302734375
5620 212.349365234375
5621 212.33087158203125
5622 212.3121795654297
5623 212.29359436035156
5624 212.27503967285156
5625 212.25636291503906
5626 212.2379150390625
5627 212.2193603515625
5628 212.2008819580078
5629 212.18228

6044 205.37271118164062
6045 205.358154296875
6046 205.34353637695312
6047 205.3289794921875
6048 205.3145751953125
6049 205.300048828125
6050 205.2855224609375
6051 205.2710418701172
6052 205.2565155029297
6053 205.2421112060547
6054 205.2276153564453
6055 205.2132110595703
6056 205.19873046875
6057 205.18438720703125
6058 205.16981506347656
6059 205.15550231933594
6060 205.14111328125
6061 205.12661743164062
6062 205.11231994628906
6063 205.09788513183594
6064 205.08348083496094
6065 205.0691375732422
6066 205.0548095703125
6067 205.04049682617188
6068 205.02601623535156
6069 205.01177978515625
6070 204.99746704101562
6071 204.98318481445312
6072 204.96875
6073 204.95452880859375
6074 204.9401092529297
6075 204.92584228515625
6076 204.91165161132812
6077 204.89743041992188
6078 204.88308715820312
6079 204.8687744140625
6080 204.8545379638672
6081 204.8401336669922
6082 204.82618713378906
6083 204.81179809570312
6084 204.7977294921875
6085 204.78353881835938
6086 204.769287109375
6087

6668 197.6268310546875
6669 197.6162872314453
6670 197.60569763183594
6671 197.5950927734375
6672 197.58445739746094
6673 197.57391357421875
6674 197.56346130371094
6675 197.55282592773438
6676 197.54232788085938
6677 197.53167724609375
6678 197.52122497558594
6679 197.5106964111328
6680 197.50015258789062
6681 197.4896240234375
6682 197.47891235351562
6683 197.46847534179688
6684 197.45790100097656
6685 197.447509765625
6686 197.43685913085938
6687 197.42649841308594
6688 197.416015625
6689 197.40548706054688
6690 197.39511108398438
6691 197.3843994140625
6692 197.37405395507812
6693 197.36363220214844
6694 197.35316467285156
6695 197.34275817871094
6696 197.33229064941406
6697 197.32188415527344
6698 197.31137084960938
6699 197.3009033203125
6700 197.29042053222656
6701 197.280029296875
6702 197.2694854736328
6703 197.25914001464844
6704 197.24867248535156
6705 197.23831176757812
6706 197.2278289794922
6707 197.21754455566406
6708 197.20700073242188
6709 197.19662475585938
6710 197.1

7316 191.62765502929688
7317 191.61961364746094
7318 191.61155700683594
7319 191.60340881347656
7320 191.59518432617188
7321 191.58724975585938
7322 191.57911682128906
7323 191.5709686279297
7324 191.5628662109375
7325 191.55470275878906
7326 191.54676818847656
7327 191.53857421875
7328 191.5304412841797
7329 191.52236938476562
7330 191.51425170898438
7331 191.50625610351562
7332 191.4980926513672
7333 191.48992919921875
7334 191.48193359375
7335 191.47389221191406
7336 191.4658966064453
7337 191.45765686035156
7338 191.44955444335938
7339 191.4415740966797
7340 191.4335479736328
7341 191.4254150390625
7342 191.4173583984375
7343 191.40940856933594
7344 191.40130615234375
7345 191.39332580566406
7346 191.38514709472656
7347 191.37710571289062
7348 191.36917114257812
7349 191.36106872558594
7350 191.35304260253906
7351 191.34503173828125
7352 191.33702087402344
7353 191.32896423339844
7354 191.3209686279297
7355 191.3129119873047
7356 191.30502319335938
7357 191.2969207763672
7358 191.2

7975 186.825439453125
7976 186.81884765625
7977 186.8123016357422
7978 186.80575561523438
7979 186.79922485351562
7980 186.79254150390625
7981 186.78610229492188
7982 186.77951049804688
7983 186.77296447753906
7984 186.7664031982422
7985 186.760009765625
7986 186.75344848632812
7987 186.74688720703125
7988 186.7402801513672
7989 186.7336883544922
7990 186.72727966308594
7991 186.7206268310547
7992 186.7141571044922
7993 186.70761108398438
7994 186.70098876953125
7995 186.694580078125
7996 186.68795776367188
7997 186.68153381347656
7998 186.67498779296875
7999 186.66851806640625
8000 186.66195678710938
8001 186.65536499023438
8002 186.64881896972656
8003 186.6423797607422
8004 186.6358184814453
8005 186.62924194335938
8006 186.62281799316406
8007 186.61630249023438
8008 186.60980224609375
8009 186.60342407226562
8010 186.5968475341797
8011 186.59031677246094
8012 186.5839080810547
8013 186.5773468017578
8014 186.57080078125
8015 186.56448364257812
8016 186.5579071044922
8017 186.5513153

8656 182.73117065429688
8657 182.72561645507812
8658 182.7200927734375
8659 182.714599609375
8660 182.70913696289062
8661 182.7035369873047
8662 182.69796752929688
8663 182.6925811767578
8664 182.6869659423828
8665 182.68145751953125
8666 182.67581176757812
8667 182.6703643798828
8668 182.6647491455078
8669 182.65924072265625
8670 182.6537322998047
8671 182.64828491210938
8672 182.64285278320312
8673 182.6372833251953
8674 182.63169860839844
8675 182.626220703125
8676 182.62074279785156
8677 182.61514282226562
8678 182.6096954345703
8679 182.60418701171875
8680 182.59878540039062
8681 182.59317016601562
8682 182.58767700195312
8683 182.58213806152344
8684 182.5767059326172
8685 182.5711212158203
8686 182.5655975341797
8687 182.5602264404297
8688 182.55459594726562
8689 182.54920959472656
8690 182.5437469482422
8691 182.5382080078125
8692 182.53265380859375
8693 182.5271759033203
8694 182.5216827392578
8695 182.5162353515625
8696 182.51071166992188
8697 182.50527954101562
8698 182.49974

9349 179.15530395507812
9350 179.15045166015625
9351 179.1456298828125
9352 179.14089965820312
9353 179.1360321044922
9354 179.1311492919922
9355 179.12646484375
9356 179.12167358398438
9357 179.11654663085938
9358 179.11190795898438
9359 179.10720825195312
9360 179.10215759277344
9361 179.09735107421875
9362 179.09251403808594
9363 179.08773803710938
9364 179.08290100097656
9365 179.0782470703125
9366 179.07327270507812
9367 179.06846618652344
9368 179.06362915039062
9369 179.05882263183594
9370 179.053955078125
9371 179.0491485595703
9372 179.0443572998047
9373 179.03958129882812
9374 179.03472900390625
9375 179.0299835205078
9376 179.02503967285156
9377 179.0202178955078
9378 179.01551818847656
9379 179.0106201171875
9380 179.00579833984375
9381 179.00096130371094
9382 178.9962921142578
9383 178.99148559570312
9384 178.98651123046875
9385 178.98182678222656
9386 178.97702026367188
9387 178.97218322753906
9388 178.96739196777344
9389 178.96255493164062
9390 178.95794677734375
9391 17

In [73]:
for iteration in range(10000):
    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
    # w2 have requires_grad=True, operations involving these Tensors will cause
    # PyTorch to build a computational graph, allowing automatic computation of
    # gradients. Since we are no longer implementing the backward pass by hand we
    # don't need to keep references to intermediate values.
    # predict the values by multiplying x with weight matrix w1, then apply the ReLU activation
    # and multiply the result by weight matrix w2
    # (F.relu already sets negative values to zero, so an additional .clamp(min=0) is redundant)
    y_pred = F.relu(x.mm(w1)).mm(w2)

    # calculate the sum of squared errors (the MSE without the division by the number of elements)
    error = ((y_pred - y) ** 2).sum()

    # log the error of this iteration as a plain Python float
    writer.add_scalar(tag="Last run", scalar_value=error.item(), global_step=iteration)
    writer.add_histogram("error distribution",error)
    
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
   
    error.backward()
    # Update weights using gradient descent. For this step we just want to mutate
    # the values of w1 and w2 in-place; we don't want to build up a computational
    # graph for the update steps, so we use the torch.no_grad() context manager
    # to prevent PyTorch from building a computational graph for the updates
    
    
    with torch.no_grad():
        # use w1.grad to update w1 according to the gradient descent formula
        # use w2.grad to update w2 according to the gradient descent formula
        # also use the learning_rate you set before!
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
    if iteration % 50 == 0:
        print("Iteration: %d - Error: %.4f" % (iteration, error))
        w1_array.append(w1.cpu().detach().numpy())
        w2_array.append(w2.cpu().detach().numpy())
        errors.append(error.cpu().detach().numpy())
    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()
    if error < 1e-6:
        print("Stopping gradient descent, algorithm converged, the squared error is smaller than 1E-6")
        break
        
writer.close()

Iteration: 0 - Error: 176.1653
Iteration: 50 - Error: 175.9473
Iteration: 100 - Error: 175.7306
Iteration: 150 - Error: 175.5151
Iteration: 200 - Error: 175.3012
Iteration: 250 - Error: 175.0884
Iteration: 300 - Error: 174.8772
Iteration: 350 - Error: 174.6671
Iteration: 400 - Error: 174.4583
Iteration: 450 - Error: 174.2506
Iteration: 500 - Error: 174.0444
Iteration: 550 - Error: 173.8391
Iteration: 600 - Error: 173.6351
Iteration: 650 - Error: 173.4321
Iteration: 700 - Error: 173.2303
Iteration: 750 - Error: 173.0293
Iteration: 800 - Error: 172.8297
Iteration: 850 - Error: 172.6310
Iteration: 900 - Error: 172.4333
Iteration: 950 - Error: 172.2363
Iteration: 1000 - Error: 172.0405
Iteration: 1050 - Error: 171.8457
Iteration: 1100 - Error: 171.6517
Iteration: 1150 - Error: 171.4587
Iteration: 1200 - Error: 171.2666
Iteration: 1250 - Error: 171.0753
Iteration: 1300 - Error: 170.8851
Iteration: 1350 - Error: 170.6954
Iteration: 1400 - Error: 170.5068
Iteration: 1450 - Error: 170.3191
Ite
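
What `error.backward()` computes can be seen on a scalar toy version of the same network. The sketch below is an illustration under simplifying assumptions, not the notebook's code: the weights `w1`, `w2`, the input `x` and the target `y` are single numbers, and the helper names `forward`, `manual_gradients` and `numerical_gradients` are made up for this example. It applies the chain rule by hand and compares the result with central finite differences.

```python
def forward(w1, w2, x, y):
    """Scalar two-layer network: y_pred = w2 * relu(w1 * x) with squared-error loss."""
    h = max(0.0, w1 * x)          # ReLU activation of the single hidden unit
    y_pred = w2 * h
    loss = (y_pred - y) ** 2
    return loss, h, y_pred


def manual_gradients(w1, w2, x, y):
    """Chain rule by hand, i.e. the job autograd does in the notebook."""
    loss, h, y_pred = forward(w1, w2, x, y)
    dloss_dy_pred = 2.0 * (y_pred - y)
    grad_w2 = dloss_dy_pred * h
    dloss_dh = dloss_dy_pred * w2
    # ReLU passes the gradient only where its input is positive
    grad_w1 = dloss_dh * x if w1 * x > 0 else 0.0
    return grad_w1, grad_w2


def numerical_gradients(w1, w2, x, y, eps=1e-6):
    """Central finite differences as an independent check."""
    grad_w1 = (forward(w1 + eps, w2, x, y)[0] - forward(w1 - eps, w2, x, y)[0]) / (2 * eps)
    grad_w2 = (forward(w1, w2 + eps, x, y)[0] - forward(w1, w2 - eps, x, y)[0]) / (2 * eps)
    return grad_w1, grad_w2


w1, w2, x, y = 0.5, -1.5, 2.0, 1.0           # arbitrary illustrative values
loss_before = forward(w1, w2, x, y)[0]
grad_w1, grad_w2 = manual_gradients(w1, w2, x, y)
num_w1, num_w2 = numerical_gradients(w1, w2, x, y)
print("manual   :", grad_w1, grad_w2)
print("numerical:", num_w1, num_w2)

# one gradient descent step, as in the no_grad() block of the notebook
learning_rate = 0.01
w1_new = w1 - learning_rate * grad_w1
w2_new = w2 - learning_rate * grad_w2
loss_after = forward(w1_new, w2_new, x, y)[0]
print("loss before/after one step:", loss_before, loss_after)
```

The matrix version in the notebook follows the same pattern; autograd only automates the bookkeeping of the intermediate values.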

# Pytorch Homework Solutions

## Solution for Homework 1

In [None]:
# https://docs.python.org/3/tutorial/errors.html#raising-exceptions to read more about Exception handling
def location_indicator(tensor_):
    indicatorstring = "CUDA" if tensor_.device.type == "cuda" else "CPU"
    print(f"The value of tensor_ is {tensor_} and the tensor location type is {indicatorstring}")
    return indicatorstring

def try_adding_block(x,y,only_convert_one=True):
    print(f"x is the following {x}")
    print(f"y is the following {y}")
    try:
        if x.device.type == y.device.type:
            if not only_convert_one:
                z = x.type(torch.DoubleTensor) + y.type(torch.DoubleTensor)
                print("Adding succeeded, objects are in the same memory type")
            else:
                try:
                    z = x.type(torch.DoubleTensor) + y
                except (TypeError, RuntimeError):
                    # PyTorch raises RuntimeError when the operands live on different devices
                    print("Unhandled error thrown because the tensors are of different type!")
                    raise TypeError("Unhandled error thrown because the tensors are of different type!")
        else:
            raise TypeError("Adding on different memory banks is not allowed, will result in TypeError!")
            
    except TypeError:
        print("The additives are of different type, addition not implemented for different types of tensors!")
    
    else :
        print("No exception thrown!")
    finally:
        print("End of the function")
                
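def try_adding_different_locations(x=torch.ones(3, device="cpu"),
                                   device=torch.device("cuda"), notboth=True, output_type="cpu"):
    """
    Demonstrate that two tensors can only be added when they live on the same device.

    First a tensor x is created on the CPU and the default device is set to be CUDA.
    Then x is moved to CUDA, where a matching tensor y is created, and adding works.
    Finally x is moved back to the CPU, where adding fails until x is on CUDA again.
    """
    # only run the demonstration when a CUDA device is actually available
    if device and torch.cuda.is_available():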
        indicatorstring = location_indicator(tensor_=x)
        if indicatorstring == "CPU":
              x = x.to("cuda", torch.double)     # ``.to`` can also change dtype together!
              print(f"Before the Device type was CPU, but now it is {x.device.type}")
        y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
        print("First we will enforce the data type of both tensors, adding is going to work!")
        try_adding_block(x,y,only_convert_one=False) # convert both to the same type, adding works
        print("\n")
        
        indicatorstring_cuda = location_indicator(tensor_=x)
        if indicatorstring_cuda == "CUDA":
            x = x.to("cpu", torch.double)     # ``.to`` can also change dtype together!
            print(f"Before the Device type was CUDA, but now it is {x.device.type}")
        
       
        indicatorstring_final = location_indicator(tensor_=x) # here the memory type is CPU for one
        
        print("\n")
        try_adding_block(x,y,only_convert_one = False) # adding doesn't work, different memory locations
        
        print("\n")
        print("Now we are enforcing the data type of only one tensor, adding is not going to work!")
        try_adding_block(x,y,only_convert_one = notboth) # converting only one , adding doesnt work
        
        print("\n\n")
        
        # After we convert the Memory type to CUDA for both, adding will work:
        x = x.to("cuda", torch.double)
        try_adding_block(x,y,only_convert_one = False)

    else :
        print("To run this section, please install CUDA as described in the Setting up Pytorch section")
   
    print("Program ended!")

try_adding_different_locations()

In [None]:
Image("assets/img/wikipedia_example_notation.png")

# THIS IS HOW YOU SET UP AN OPTIMIZATION PROBLEM

# Gradient Descent in Python
## Effect of Parameters in Gradient Descent

In [None]:
# Assume that we have been given a generic two variable polynomial function
def two_variable_function(x, y):
    z = x**3 + 2*(x*y) + 3*(y**2) 
    return z

Our goal is to find the global minimum of this function within the square $-10 \le x \le 10$, $-10 \le y \le 10$.

In [None]:
# function values at the four corners of the square
boundary_grid_values = [two_variable_function(-10, -10), two_variable_function(-10, 10),
                        two_variable_function(10, -10), two_variable_function(10, 10)]
# function values at the two stationary points
local_extrema_values = [two_variable_function(0, 0), two_variable_function(6/27, -2/27)]
# report the corner minimum if no stationary point has a lower value
if min(boundary_grid_values) <= min(local_extrema_values):
    print(f"The minimum amongst the evaluated points is {min(boundary_grid_values)}")

Analyzing this function, we get two stationary points
$(x,y) = (0,0)$ and $(x,y) = (6/27,-2/27)$, since the first derivatives give:

In [None]:
from sympy import symbols
x,y = symbols('x y')
# z = x^3 + 2xy + 3y^2
z = two_variable_function(x, y)
derivatives = z.diff(x,1),z.diff(y,1)
print(derivatives)
# expected derivatives:
# dz/dx = 3*x**2 + 2*y
# dz/dy = 2*x + 6*y
# https://docs.sympy.org/latest/tutorial/calculus.html use that to verify
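
The two stationary points can be classified with the second-derivative test. The sketch below is a plain-Python illustration under stated assumptions: the helper names `gradient` and `classify` are made up for this example, and the second partial derivatives $z_{xx} = 6x$, $z_{xy} = 2$, $z_{yy} = 6$ are derived by hand from $z = x^3 + 2xy + 3y^2$.

```python
def two_variable_function(x, y):
    return x**3 + 2*x*y + 3*y**2


def gradient(x, y):
    """First partial derivatives (dz/dx, dz/dy)."""
    return 3*x**2 + 2*y, 2*x + 6*y


def classify(x, y):
    """Second-derivative test based on the Hessian determinant."""
    z_xx, z_xy, z_yy = 6*x, 2, 6
    determinant = z_xx * z_yy - z_xy**2
    if determinant < 0:
        return "saddle point"
    if determinant > 0:
        return "local minimum" if z_xx > 0 else "local maximum"
    return "inconclusive"


stationary_points = [(0.0, 0.0), (6/27, -2/27)]
for px, py in stationary_points:
    print((px, py), gradient(px, py), classify(px, py), two_variable_function(px, py))
```

Under these assumptions the origin turns out to be a saddle point and the second point a shallow local minimum, so neither can be the global minimum on the square; the boundary has to be inspected as well.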



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

startgrid=-10
endgrid=10.05
a = np.arange(startgrid, endgrid, 0.05)
b = np.arange(startgrid, endgrid, 0.05)

x, y = np.meshgrid(a, b) # creating the evaluation domain grid.
# NB! If the global optimum of the function lies outside the grid, 
# this global optimum would never be discovered since the cost function would never be evaluated there.

z = two_variable_function(x, y)


fig, ax = plt.subplots()
z_min, z_max = z.min(),z.max()
print(z.min())

c = ax.pcolormesh(x, y, z, cmap='RdBu', vmin=z_min, vmax=z_max)
ax.set_title('Objective Function Values HeatMap')
# set the limits of the plot to the limits of the data
ax.axis([x.min(), x.max(), y.min(), y.max()])
fig.colorbar(c, ax=ax)
plt.show()

# np.unravel_index converts the flat index returned by argmin into a (row, column) index.
# Since this depends on the array shape, the shape has to be passed as well.
# With the default indexing of np.meshgrid, rows of z follow y (values in b) and columns follow x (values in a).
(y_min_idx, x_min_idx) = np.unravel_index(np.argmin(z), z.shape)

print(f"x minimum location is  {a[x_min_idx]}")
print(f"y minimum location is  {b[y_min_idx]}")


Thus we see that the minimum value on the grid is about -1033.33. It is attained on the boundary of the domain, at $x=-10$ and $y \approx 3.3$, and not at either of the two stationary points.
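
The value of about -1033 can be cross-checked by hand. On the edge $x=-10$ the function reduces to the parabola $z(y) = -1000 - 20y + 3y^2$, whose derivative $-20 + 6y$ vanishes at $y = 10/3$. The sketch below compares this with a brute-force scan; it is a plain-Python illustration, and the grid built from `steps` only approximates the `np.arange` grid used in the notebook.

```python
def two_variable_function(x, y):
    return x**3 + 2*x*y + 3*y**2


# analytic minimum along the edge x = -10
y_star = 10 / 3
edge_min = two_variable_function(-10, y_star)

# brute-force scan of the square with a spacing of 0.05
steps = [-10 + 0.05 * k for k in range(401)]
grid_min, best_x, best_y = min(
    (two_variable_function(gx, gy), gx, gy) for gx in steps for gy in steps
)

print("analytic edge minimum:", edge_min, "at y =", y_star)
print("grid minimum         :", grid_min, "at", (best_x, best_y))
```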

Let's see if we get close to -1000 also with gradient descent.

In [None]:
%%time 
#!yes | conda install -n dl -c conda-forge matplotlib   -- to install matplotlib into conda env dl
from IPython.display import HTML
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
from sympy import *
import random

# define a 2-variable function z = f(x,y)
def two_variable_function(x, y):
    z = x**3 + 2*x*y + 3*(y**2) 
    return z

def gradient_descent(x_start, y_start, learning_rate, epochs):
    """
    Each following step of the gradient descent depends on the result of the previous step.
    """
    # initialize separate history lists for x, y and z
    # (x = y = z = [] would bind all three names to one shared list)
    x, y, z = [], [], []
    x_old = x_start
    y_old = y_start

    # record the starting point
    x.append(x_old)
    y.append(y_old)
    z_gd = two_variable_function(x_old, y_old)
    z.append(z_gd)
    # further runs

    # begin the loops to update x, y and z
    for i in range(epochs):
        x_gd = x_old - learning_rate*(3*x_old**2 + 2*y_old)
        y_gd = y_old - learning_rate*(2*x_old + 6*y_old)
        x.append(x_gd)
        y.append(y_gd)
        #print(x)
        #print(y)
        #print(two_variable_function(x, y))
        z_gd = two_variable_function(x_gd, y_gd)
        z.append(z_gd)  # appending the values for z
        # for the next iteration, the new values will be the old values
        x_old = x_gd
        y_old = y_gd

    return x_gd, y_gd, z_gd


####### GIVING THE INITIAL VALUES FOR THE GRADIENT DESCENT

xstart = -1
ystart = -0.5


whatindex = -1
def precpr(x,prec=3):
    return round(x,prec)

lr = np.linspace(0.001,1,7)
xs = np.linspace(-9.9,9.9,4)
ys = xs
idx = 0
best_triplet = ''
best_loss = 1e6
best_z = 1e6
best_x = 1e6
best_y = 1e6
z = two_variable_function(xs,ys)
print(f"Before starting gradient descent, objective function value was: {precpr(two_variable_function(x=xs[0],y=ys[0]))}")

for l in lr:
    for xstart in xs:
        for ystart in ys:
            epochs = random.randint(30,100)
            x_gd, y_gd, z_gd = gradient_descent(x_start=xstart, y_start=ystart, learning_rate=l, epochs = epochs)
            
            #print("\n")
            #print("Last value of x is")
            #print(x_gd[-1])
            if abs((z_gd)-(-1033.33))<best_loss:
                best_loss = abs(z_gd -(-1033.33))
                best_triplet = f"{xstart}_{ystart}_{l}"
                whatindex = idx
                best_x = x_gd
                best_y = y_gd
                best_z = z_gd
            print("\n")
            print(f"Iteration {idx},CONFIGURATION l:{l},xstart:{xstart},ystart:{ystart}")
            print("\n")
            print(f"After gradient descent of {epochs} epochs, the values are:")
            print(f"At the optimum, the objective function value is {precpr(best_z)}")
            print(f"At the optimum, the value of x is {precpr(best_x)} and the value of y is {precpr(best_y)}")
            idx += 1

print(f"The optimal loss was achieved in iteration {whatindex}")
print(f"The found loss coordinates (x,y) are ({precpr(best_x)},{precpr(best_y)})")


We note that gradient descent may sometimes converge to a solution outside the feasible region.
This is why, when training neural networks, it is important to understand whether the optimization algorithm in use treats the problem as unconstrained or as constrained.
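
One common way to keep the iterates inside a box-shaped feasible region is to project them back after every update. This projected variant is not used in the notebook; it is shown here only as a hedged illustration, and the helper names `clip` and `projected_gradient_descent` as well as the starting point and learning rate are assumptions made for this example.

```python
def gradient(x, y):
    """Gradient of z = x**3 + 2*x*y + 3*y**2."""
    return 3*x**2 + 2*y, 2*x + 6*y


def clip(value, low, high):
    """Project a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def projected_gradient_descent(x, y, learning_rate, epochs, low=-10.0, high=10.0):
    for _ in range(epochs):
        grad_x, grad_y = gradient(x, y)
        # ordinary gradient step followed by projection onto the square
        x = clip(x - learning_rate * grad_x, low, high)
        y = clip(y - learning_rate * grad_y, low, high)
    return x, y


x_end, y_end = projected_gradient_descent(x=-1.0, y=-0.5, learning_rate=0.01, epochs=500)
print(x_end, y_end)
```

Without the projection the same iteration would keep moving towards ever more negative `x`, because the cubic term is unbounded below.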

In [None]:
best_loss

## Gradient Descent : Full Example with Dynamic Visualization

In [None]:

def gradient_descent_demo(x_start, y_start, learning_rate, epochs):
    """
    Each following step of the gradient descent depends on the result of the previous step.
    """
    def two_variable_demo_function(x, y):
        #z = -x**4 + 2*(x*y) + 3*(y**2) 
        z = x**2 + 2*(x*y) + 3*(y**2) 

        return z
    # initialize separate history lists for x, y and z
    # (x_gd = y_gd = z_gd = [] would make all three names refer to the same list)
    x_gd, y_gd, z_gd = [], [], []
    x_old = x_start
    y_old = y_start

    # record the starting point
    x_gd.append(x_old)
    y_gd.append(y_old)
    z_gd.append(two_variable_demo_function(x_old, y_old))

    # update x, y and z for the given number of epochs
    for i in range(epochs):
        # partial derivatives of z = x**2 + 2*x*y + 3*y**2:
        # dz/dx = 2*x + 2*y and dz/dy = 2*x + 6*y
        x = x_old - learning_rate*(2*x_old + 2*y_old)
        y = y_old - learning_rate*(2*x_old + 6*y_old)
        x_gd.append(x)
        y_gd.append(y)
        z_gd.append(two_variable_demo_function(x, y))  # appending the values for z
        # for the next iteration, the new values will be the old values
        x_old = x
        y_old = y

    return x_gd, y_gd, z_gd

x_gd, y_gd, z_gd = gradient_descent_demo(x_start=0.5, y_start=0.3, learning_rate=0.02, epochs = 20)



In [None]:
startgrid=-2
endgrid=2.05
a = np.arange(startgrid, endgrid, 0.05)
b = np.arange(startgrid, endgrid, 0.05)

x, y = np.meshgrid(a, b) # creating the evaluation domain grid.

def two_variable_demo_function(x, y):
    z = x**2 + 2*(x*y) + 3*(y**2) 
    return z
z = two_variable_demo_function(x, y)

# FIND THE ACTUAL MIN coordinates for x and y
# (with the default indexing of np.meshgrid, rows of z follow y and columns follow x):
(y_min_idx, x_min_idx) = np.unravel_index(np.argmin(z), z.shape)
# Actual minimum coordinate values
print(x_min_idx)
print(a[x_min_idx])
print(b[y_min_idx])

Thus we see that the global minimum of this convex function in the region is located at $(x,y)=(0,0)$.
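
That the demo function is convex can be checked from its Hessian. The sketch below is a plain-Python illustration; the Hessian entries are derived by hand from $z = x^2 + 2xy + 3y^2$, and the helper name `symmetric_2x2_eigenvalues` is made up for this example.

```python
import math


def symmetric_2x2_eigenvalues(h_xx, h_xy, h_yy):
    """Eigenvalues of the symmetric matrix [[h_xx, h_xy], [h_xy, h_yy]]."""
    trace = h_xx + h_yy
    determinant = h_xx * h_yy - h_xy**2
    root = math.sqrt(trace**2 - 4 * determinant)
    return (trace - root) / 2, (trace + root) / 2


# Hessian of z = x**2 + 2*x*y + 3*y**2 is constant: [[2, 2], [2, 6]]
lambda_min, lambda_max = symmetric_2x2_eigenvalues(2.0, 2.0, 6.0)
is_strictly_convex = lambda_min > 0
print(lambda_min, lambda_max, is_strictly_convex)
```

Both eigenvalues are positive, so the function is strictly convex and the stationary point at the origin is its only minimum.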

In [None]:
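# imports needed by this cell (harmless if already imported earlier in the notebook)
import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML

x_gd, y_gd, z_gd = gradient_descent_demo(x_start=0.5, y_start=0.3, learning_rate=0.14, epochs = 10)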

############ INITIALIZING THE PLOTTING SYSTEM ###############
def init():
    line.set_data([], [])
    point.set_data([], [])
    value_display.set_text('')

    return line, point, value_display

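def animate(i):
    # Animate line: path travelled up to and including step i
    line.set_data(x_gd[:i+1], y_gd[:i+1])
    
    # Animate point: set_data expects sequences, so wrap the scalars in lists
    point.set_data([x_gd[i]], [y_gd[i]])

    # Animate value display: current function value at step i
    value_display.set_text('z = ' + str(z_gd[i]))

    return line, point, value_display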

##############################################################



fig1, ax1 = plt.subplots()

ax1.contour(x, y, z, levels=np.logspace(startgrid, endgrid, 15), cmap='CMRmap')
# Plot target (the minimum of the function)

# PLOT THE ACTUAL MIN POINT 
min_point = np.array([0., 0.])
min_point_ = min_point[:, np.newaxis]

# 2D axes: pass only the x and y coordinates of the minimum
ax1.plot(*min_point_, 'r*', markersize=10)
ax1.set_xlabel(r'x')
ax1.set_ylabel(r'y')
# Animation: create the artists that init() and animate() update
line, = ax1.plot([], [], 'r', label = 'Gradient Descent on Convex Function', lw = 2.0)
point, = ax1.plot([], [], 'bo')
value_display = ax1.text(0.02, 0.02, '', transform=ax1.transAxes)

ax1.legend(loc = 1)

anim = animation.FuncAnimation(fig1, animate, init_func=init,
                               frames=len(x_gd), interval=120, 
                               repeat_delay=60, blit=True)

HTML(anim.to_jshtml())

So we have seen why gradient descent is not always the best optimizer:

- It finds local optima, not necessarily the global one. Always check whether the optimization problem is convex (for minimization) or concave (for maximization).
- Plain gradient descent performs **unconstrained** optimization. If the domain is constrained, gradient descent does not respect those constraints. For a continuous function on a bounded domain, the boundary values also have to be checked, because the cost function value there might be more optimal.
- The optimization may not converge; it may oscillate or diverge.
    - The choice of step size is crucial. Too large a step size makes the iterates overshoot and possibly diverge; too small a step size makes convergence very slow.
- Gradient descent uses only first-order information about the function. Methods that exploit curvature information, such as Newton's method, L-BFGS and conjugate gradient, have better convergence properties, but at a higher computational cost per step.
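
The step-size point can be checked numerically on the same demo function $z = x^2 + 2xy + 3y^2$. The following is a minimal sketch, not part of the original notebook: `run_gd` is a hypothetical helper, and the three learning rates are chosen only for illustration.

```python
def run_gd(learning_rate, epochs=50, start=(0.5, 0.3)):
    """Plain gradient descent on z = x**2 + 2*x*y + 3*y**2; returns the final z."""
    x, y = start
    for _ in range(epochs):
        grad_x = 2 * x + 2 * y   # dz/dx
        grad_y = 2 * x + 6 * y   # dz/dy
        # update both coordinates from the same (old) point
        x, y = x - learning_rate * grad_x, y - learning_rate * grad_y
    return x**2 + 2 * x * y + 3 * y**2

for lr in (0.001, 0.1, 0.35):
    print(f"learning_rate={lr}: final z = {run_gd(lr):.3e}")
```

With the smallest rate the value has barely moved after 50 steps, with the middle rate it is close to the minimum of 0, and with the largest rate the iterates overshoot further on every step and the value blows up.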

## Binary Classification Loss Function


Choosing the right cost function for achieving the desired result is a critical point of machine learning problems. The basic approach, if you do not know exactly what you want out of your method, is to use [Mean Square Error (Wikipedia)](https://en.wikipedia.org/wiki/Mean_squared_error) for regression problems and Percentage of error for classification problems. However, if you want _good_ results out of your method, you need to _define good_, and thus define the adequate cost function. This comes from both domain knowledge (what is your data, what are you trying to achieve), and knowledge of the tools at your disposal. 

I do not believe I can guide you through the cost functions already implemented in TensorFlow, as I have very little knowledge of the tool, but I can give you an example on how to write and assess different cost functions.

---

To illustrate the various differences between cost functions, let us use the example of the binary classification problem, where we want, for each sample $x_n$, the class $f(x_n) \in \{0,1\}$.

Let us start with **computational properties**: how two functions measuring the "same thing" can lead to different results. Take the following simple cost function, the percentage of error. If you have $N$ samples, $f(x_n)$ is the predicted class and $y_n$ the true class, you want to minimize

* $\frac{1}{N} \sum_n \left\{
\begin{array}{ll}
1 & \text{ if } f(x_n) \not= y_n\\
0 & \text{ otherwise}\\
\end{array} \right. = \frac{1}{N} \sum_n \left( y_n[1-f(x_n)] + [1-y_n]f(x_n) \right)$.

This cost function has the benefit of being easily interpretable. However, it is not smooth; if you have only two samples, the function "jumps" from 0, to 0.5, to 1. Between the jumps it is flat, so its gradient is zero almost everywhere and gradient descent gets no useful signal from it. One way to avoid this is to change the cost function to use probabilities of assignment, $p(y_n = 1 | x_n)$. The function becomes

* $\frac{1}{N} \sum_n \left[ y_n \, p(y_n = 0 | x_n) + (1 - y_n) \, p(y_n = 1 | x_n) \right]$.

This function is smoother and works better with a gradient descent approach; you will get a 'finer' model. However, it has another problem. Suppose a sample is ambiguous: you do not have enough information to say anything better than $p(y_n = 1 | x_n) = 0.5$. Gradient descent on this cost function will still push that probability as far toward 0 or 1 as possible, and the model may therefore overfit.
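
The difference in smoothness can be seen on a toy problem. The sketch below is an illustration under stated assumptions, not code from the original notebook: the two samples, the one-parameter model $p(y_n = 1 | x_n) = \sigma(x_n + \text{bias})$ and the function names are all hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy data (hypothetical): two samples, both of true class 1
x = np.array([-1.0, 1.0])
y = np.array([1.0, 1.0])

def error_rate(bias):
    """Percentage of error: hard class decisions at probability 0.5."""
    predicted_class = (sigmoid(x + bias) > 0.5).astype(float)
    return np.mean(predicted_class != y)

def expected_error(bias):
    """Probabilistic cost: mean probability assigned to the wrong class."""
    p1 = sigmoid(x + bias)  # p(y_n = 1 | x_n)
    return np.mean(y * (1 - p1) + (1 - y) * p1)

for bias in (-2.0, -0.5, 0.5, 2.0):
    print(f"bias={bias:+.1f}  error rate={error_rate(bias):.2f}  "
          f"expected error={expected_error(bias):.3f}")
```

The error rate only takes the values 1, 0.5 and 0 and does not change between `bias=-0.5` and `bias=0.5`, whereas the probabilistic cost decreases steadily as the bias grows, which is the signal gradient descent needs.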

Another problem with this function is that if $p(y_n = 1 | x_n) = 1$ while $y_n = 0$, the model is certain of its answer, yet wrong, and the cost for that sample is still only 1. To penalize such confident mistakes much more heavily, you can take the negative log of the probability assigned to the true class, $-\log p(y_n | x_n)$. As $-\log(p) \to \infty$ when $p \to 0$ and $-\log(1) = 0$, the following function (the binary cross-entropy) does not have this problem:

* $-\frac{1}{N} \sum_n \left[ y_n \log p(y_n = 1 | x_n) + (1 - y_n) \log p(y_n = 0 | x_n) \right]$.

This should illustrate that in order to optimize the _same thing_, the percentage of error, different definitions might yield different results if they are easier to make sense of, computationally.

**It is possible for cost functions $A$ and $B$ to measure the _same concept_, but $A$ might lead your method to better results than $B$.**
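
To make this concrete, the sketch below compares the probabilistic cost with a log-based cost on a single sample that the model gets confidently wrong. It assumes the log-based cost is the standard binary cross-entropy (the negative log-probability of the true class); the function names and the clipping constant `eps` are illustrative choices, not part of the original text.

```python
import numpy as np

def expected_error(y, p1):
    """Mean probability assigned to the wrong class; p1 is p(y_n = 1 | x_n)."""
    return np.mean(y * (1 - p1) + (1 - y) * p1)

def cross_entropy(y, p1, eps=1e-12):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p1 = np.clip(p1, eps, 1 - eps)
    return -np.mean(y * np.log(p1) + (1 - y) * np.log(1 - p1))

y = np.array([0.0])  # a single sample whose true class is 0
for p in (0.5, 0.9, 0.99, 0.999):
    p1 = np.array([p])
    print(f"p(y=1|x)={p}: expected error={expected_error(y, p1):.3f}, "
          f"cross-entropy={cross_entropy(y, p1):.3f}")
```

The probabilistic cost can never exceed 1, however confident the wrong prediction is, while the cross-entropy keeps growing as the predicted probability of the wrong class approaches 1.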

---
In conclusion, defining the cost function is defining the goal of your algorithm. The algorithm defines how to get there.


### What is the relation between backpropagation and auto-differentiation?

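Backpropagation computes the gradient of the loss with respect to every parameter of the network by applying the chain rule backwards, from the output to the inputs. Gradient descent then uses these gradients to update the parameters in each loop iteration. Automatic differentiation builds this gradient computation for you, so you do not need to derive or implement the derivatives by hand.

Backpropagation is essentially reverse-mode automatic differentiation applied to neural networks, while automatic differentiation itself is a general technique for differentiating functions expressed as programs. PyTorch's autograd engine brings the two together: it records the operations of the forward pass and replays them backwards to obtain the gradients, which is easier and less error-prone than a hand-written implementation.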
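
As a small illustration, the sketch below lets PyTorch's autograd compute the gradient of the demo function $z = x^2 + 2xy + 3y^2$ used earlier and compares it with the hand-derived partial derivatives. The starting point and the learning rate are arbitrary choices made for this example.

```python
import torch

# Leaf tensors that autograd should track
x = torch.tensor(0.5, requires_grad=True)
y = torch.tensor(0.3, requires_grad=True)

z = x**2 + 2 * x * y + 3 * y**2  # the forward pass records the computation graph
z.backward()                     # reverse-mode autodiff fills x.grad and y.grad

# Hand-derived partial derivatives for comparison
manual_dx = 2 * 0.5 + 2 * 0.3    # dz/dx = 2x + 2y
manual_dy = 2 * 0.5 + 6 * 0.3    # dz/dy = 2x + 6y
print(x.grad.item(), manual_dx)
print(y.grad.item(), manual_dy)

# One gradient descent step that uses the autograd gradients
learning_rate = 0.1
with torch.no_grad():
    x -= learning_rate * x.grad
    y -= learning_rate * y.grad
print(x.item(), y.item())
```

Autograd supplies the gradients; the update rule itself is still a separate step, written by hand here and normally handled by an optimizer from `torch.optim`.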