
## Part 2 - Getting Started with PyTorch

**Required Time:  20 minutes**

This notebook builds on the first part, covering basic concepts needed to understand the PyTorch deep learning framework, such as objective functions, non-linearities, and affine maps.

---

### Journey
- Computation Graph
- **Exercise:** Computing the derivative
- Input
- **Exercise:** Tensor transformation
- Linear Transformation
- Non-linear Transformation
- **Exercise:** Chaining Linear Layers

### Computation Graph - An Overview
A simplified definition of a neural network is a chain of differentiable functions that we can compose to build more complicated functions. An intuitive way to express this composition is through a computation graph. PyTorch provides efficient functionality for **automatic differentiation** over such graphs.

![alt txt](https://camo.githubusercontent.com/c665d8f2e5bf67bc93573cc3a7fb7d26028596ca/687474703a2f2f636f6c61682e6769746875622e696f2f706f7374732f323031352d30382d4261636b70726f702f696d672f747265652d6576616c2d6465726976732e706e67)



In [1]:
import torch

### FORWARD
print("---------------------------------------------------")
print("FORWARD: ")

# layer 1
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# layer 2
c = a + b
c.retain_grad() # retain gradients for non-leaf tensors

d = b + 1.0
d.retain_grad() # retain gradients for non-leaf tensors

# layer 3
e = c * d

print("e: ", e)

### BACKWARD
print("---------------------------------------------------")
print("BACKWARD: ")

e.backward()

print("c.grad: ", c.grad.detach().item()) # de/dc
print("d.grad: ", d.grad.detach().item()) # de/dd

---------------------------------------------------
FORWARD: 
e:  tensor(6., grad_fn=<MulBackward0>)
---------------------------------------------------
BACKWARD: 
c.grad:  2.0
d.grad:  3.0
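
The gradients of the leaf tensors follow from the chain rule. Since `b` feeds into both `c` and `d`, its gradient is the sum over both paths: de/db = de/dc * dc/db + de/dd * dd/db = 2 * 1 + 3 * 1 = 5, while de/da = de/dc * dc/da = 2. The following is a minimal sketch that rebuilds the same graph and reads these values from `.grad`; the names `grad_a` and `grad_b` are introduced here only for illustration.

```python
import torch

# Rebuild the same graph as in the cell above
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

c = a + b    # c = 3
d = b + 1.0  # d = 2
e = c * d    # e = 6

# Backpropagate from e to the leaf tensors a and b
e.backward()

# Leaf tensors keep their gradients without calling retain_grad()
grad_a = a.grad.item()  # de/da = d = 2
grad_b = b.grad.item()  # de/db = d + c = 5 (two paths through the graph)

print("a.grad: ", grad_a)
print("b.grad: ", grad_b)
```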


---
### EXERCISE - Computing the derivative

Compute the derivatives of `c` with respect to `a` and `b`. This is just the left part of the figure above.

---

In [0]:
### YOUR CODE HERE


### YOUR CODE HERE

Here is a nice [tutorial](https://towardsdatascience.com/getting-started-with-pytorch-part-1-understanding-how-automatic-differentiation-works-5008282073ec) discussing in detail how gradients are handled and computed in PyTorch.

### Input
The first component of a neural network is the input. Inputs must be represented as tensors, the main data structure in PyTorch. The previous part of this tutorial introduced tensors and a few of their operations, so here we only briefly review the kinds of inputs that are common in NLP. This is the first part of the tutorial where we introduce NLP concepts and how they integrate with the other components of the PyTorch ecosystem. Inputs can be scalars, vectors, or multi-dimensional matrices; whichever the type, they are all represented as tensors in PyTorch. Typically, the inputs are built from publicly available datasets.
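
As a small illustration of the point that every input type is a tensor, the sketch below builds a scalar, a vector, a matrix, and a 3-dimensional tensor and prints the number of dimensions of each. The variable names and shapes are made up for this example.

```python
import torch

# Illustrative tensors of increasing rank (names and shapes are arbitrary)
scalar = torch.tensor(3.0)               # 0 dimensions
vector = torch.tensor([1.0, 2.0, 3.0])   # 1 dimension
matrix = torch.zeros(2, 3)               # 2 dimensions
batch = torch.zeros(4, 2, 3)             # 3 dimensions, e.g. a batch of matrices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    # dim() is the number of dimensions, size() the extent of each one
    print(name, t.dim(), tuple(t.size()))
```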

Inputs to an NLP deep learning model usually have the following dimensions: `batch_size X max_sequence_length X vocab_size`. Let's assume our batch size is 64, the sequence length is 60, and the vocabulary size is 10000. Let's see how this looks below:

In [0]:
sample = torch.rand(64, 60, 10000)
print(sample.size())

torch.Size([64, 60, 10000])


The first thing you will notice is the huge vocabulary dimension, which is typical when using so-called one-hot encodings. Another option is to encode words and sentences as dense embeddings. This gives a more compact representation in which words can be placed so that they reflect semantic relationships.

In that case, the last dimension is reduced, and the inputs typically look like the following:

In [0]:
sample_with_embeddings = torch.rand(64, 60, 100)
print(sample_with_embeddings.size())

torch.Size([64, 60, 100])


Note that the third dimension is now much smaller because we use embeddings to represent the sequences. This not only gives a representation that captures meaning more efficiently, but also lets the network train more efficiently because the dimensionality is reduced.
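
To make the contrast concrete, here is a minimal sketch that turns the same token ids into one-hot vectors and into embeddings. The vocabulary size, embedding size, and token ids are small made-up values chosen for readability, and the embedding weights are randomly initialized rather than trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 10     # hypothetical, tiny vocabulary for illustration
embedding_dim = 4   # hypothetical embedding size

# A batch of 2 sequences, each with 3 token ids
token_ids = torch.tensor([[1, 5, 9],
                          [2, 0, 7]])

# One-hot: the last dimension equals the vocabulary size
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
print(one_hot.size())   # torch.Size([2, 3, 10])

# Embedding lookup: the last dimension equals the embedding size
embedding = nn.Embedding(vocab_size, embedding_dim)
embedded = embedding(token_ids)
print(embedded.size())  # torch.Size([2, 3, 4])
```

Each one-hot vector contains a single 1 at the position of its token id, whereas each embedding is a dense vector that a model can learn during training.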

----
### EXERCISE - Tensor Transformation

Sometimes we need to permute the dimensions of a tensor. How do we do this in PyTorch? Please visit the PyTorch documentation to find out how to rearrange the dimensions of a tensor. Hint: `A.permute()`. Try to permute the `sample_with_embeddings` tensor above so that it has the following dimensions instead: `max_sequence_length X batch_size X embedding_size`.

---

In [0]:
### YOUR CODE GOES HERE


### YOUR CODE GOES HERE

### Linear Transformation
A fundamental operation in a neural network is the affine map, often loosely called a linear transformation: the input tensor is multiplied by a weight matrix and a bias is added. PyTorch already packages such transformations, so we don't need to implement them manually.

Let's look at the example below. We wish to output the hidden representation using randomly initialized weights and biases. In other words, we wish to compute the following:


$$
y = Wx + b
$$


In [0]:
import torch
import torch.nn as nn

# using Linear unit in PyTorch

sample_x = torch.rand(64, 60, 100)

fc = nn.Linear(100, 50) # Wx + b

# the layer acts on the last dimension (100 -> 50); leading dimensions are kept
out = fc(sample_x)
print(out.size())

print(sample_x)

torch.Size([64, 60, 50])
tensor([[[0.2218, 0.8871, 0.7491,  ..., 0.2637, 0.0226, 0.5044],
         [0.4265, 0.7755, 0.3018,  ..., 0.0461, 0.7320, 0.3729],
         [0.8693, 0.9141, 0.4076,  ..., 0.2277, 0.2975, 0.9917],
         ...,
         [0.4049, 0.9269, 0.6229,  ..., 0.5807, 0.5033, 0.3958],
         [0.1866, 0.5652, 0.2095,  ..., 0.2580, 0.9312, 0.8549],
         [0.1859, 0.9515, 0.8794,  ..., 0.4411, 0.2722, 0.1611]],

        [[0.9518, 0.6685, 0.6872,  ..., 0.3222, 0.5740, 0.3342],
         [0.3053, 0.3433, 0.8695,  ..., 0.0735, 0.6718, 0.6849],
         [0.2689, 0.7600, 0.0173,  ..., 0.2909, 0.6432, 0.5380],
         ...,
         [0.1556, 0.8891, 0.0601,  ..., 0.2868, 0.2580, 0.2419],
         [0.8643, 0.3586, 0.9642,  ..., 0.9187, 0.3414, 0.5323],
         [0.6241, 0.1440, 0.1963,  ..., 0.1102, 0.8434, 0.0396]],

        [[0.2949, 0.2038, 0.7452,  ..., 0.0172, 0.2478, 0.5980],
         [0.8738, 0.4244, 0.5385,  ..., 0.9600, 0.8836, 0.9857],
         [0.9358, 0.0936, 0.6502,
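
To connect `nn.Linear` with the equation $y = Wx + b$, the following sketch compares the layer output with a manual matrix multiplication using the layer's own `weight` and `bias` parameters. The smaller batch shape and the names `out_layer` and `out_manual` are chosen here only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

x = torch.rand(2, 3, 100)   # smaller batch than above, same feature size
fc = nn.Linear(100, 50)

# nn.Linear stores W with shape (out_features, in_features) and b with shape (out_features,)
print(fc.weight.size(), fc.bias.size())

out_layer = fc(x)                         # what the layer computes
out_manual = x @ fc.weight.t() + fc.bias  # the same affine map written out by hand

print(out_layer.size())
print(torch.allclose(out_layer, out_manual, atol=1e-5))
```

Because the weight matrix is stored as `(out_features, in_features)`, the manual version multiplies by its transpose.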

### Non-linear Transformation

We can then apply a non-linear transformation to the result of the previous linear transformation, computed as follows:

$$
h = \sigma(Wx + b)
$$

where $\sigma$ refers to the `Sigmoid` activation function in our example below:

In [0]:
sample_x = torch.rand(64, 60, 100)

fc = nn.Linear(100, 50)
sig = nn.Sigmoid()

out = fc(sample_x)
out = sig(out) # values in (0, 1)

print(out.size())

torch.Size([64, 60, 50])


There are other popular non-linear transformations, or activation functions, available for use, such as `ReLU` and `tanh`.
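
As a rough comparison, the sketch below applies `nn.Sigmoid`, `nn.Tanh`, and `nn.ReLU` to the same small input so their output ranges can be inspected side by side. The input values are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

# Seven evenly spaced values: -3, -2, -1, 0, 1, 2, 3
x = torch.linspace(-3.0, 3.0, steps=7)

sig_out = nn.Sigmoid()(x)   # squashes values into (0, 1)
tanh_out = nn.Tanh()(x)     # squashes values into (-1, 1)
relu_out = nn.ReLU()(x)     # sets negative values to 0, keeps positive values

print("sigmoid:", sig_out)
print("tanh:   ", tanh_out)
print("relu:   ", relu_out)
```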

### Softmax Classifier
This component of the neural network is the classifier, which usually makes the final prediction from a normalized representation of the output layer. As the equation below shows, this only requires applying a softmax function. The returned values lie in the range (0, 1) and sum to 1 along the dimension the softmax is applied to.

$$
\text{output} = \mathrm{softmax}(x)
$$

In [0]:
m = nn.Softmax(dim=1)
x = torch.randn(4, 5)
out = m(x)
print(out)

tensor([[0.0635, 0.0867, 0.1149, 0.5673, 0.1677],
        [0.0936, 0.0555, 0.5116, 0.1520, 0.1873],
        [0.1764, 0.3617, 0.0753, 0.1635, 0.2230],
        [0.2519, 0.1645, 0.2851, 0.0536, 0.2448]])
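
The `dim` argument decides which dimension is normalized. The sketch below, using a freshly sampled input and names introduced only for this example, checks that `dim=1` makes each row sum to 1, that `dim=0` makes each column sum to 1, and that the result matches the softmax formula written out by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable
x = torch.randn(4, 5)

row_softmax = nn.Softmax(dim=1)(x)  # normalize across the 5 columns of each row
col_softmax = nn.Softmax(dim=0)(x)  # normalize across the 4 rows of each column

row_sums = row_softmax.sum(dim=1)   # one sum per row
col_sums = col_softmax.sum(dim=0)   # one sum per column
print(row_sums)
print(col_sums)

# Softmax by hand: exponentiate, then divide by the sum over the chosen dimension
manual = torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)
print(torch.allclose(manual, row_softmax))
```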



---
### EXERCISE - Chaining Linear Layers

Go ahead and try to chain a few linear transformations; make the chain deep if you like. Revisit the previous notebook to help you build a chain of operations.

Feel free to explore the PyTorch documentation to familiarize yourself with more of the basic linear and non-linear transformations. In addition, try changing the sizes of the `Linear` layers and combining a series of them.

---


In [0]:
### YOUR CODE HERE


### YOUR CODE HERE

## Beyond
The following notebooks are helpful to review before proceeding to the next section. They are optional, however, and you can skip this part for the purpose of this tutorial.

- [NN from scratch with PyTorch](https://colab.research.google.com/drive/109gHWFUlUzuwhgXROpzIuVoSPZA_qeoy) - This notebook shows you how to implement a neural network from scratch using PyTorch
- [RNN from scratch](https://colab.research.google.com/drive/1NVuWLZ0cuXPAtwV4Fs2KZ2MNla0dBUas) - This notebook shows you how to implement recurrent neural networks (RNNs) using PyTorch.


---

### References
- [NN from scratch with PyTorch](https://colab.research.google.com/drive/109gHWFUlUzuwhgXROpzIuVoSPZA_qeoy)
- [RNN from scratch](https://colab.research.google.com/drive/1NVuWLZ0cuXPAtwV4Fs2KZ2MNla0dBUas)
- [Emotion Recognition with PyTorch](https://github.com/omarsar/appworks_meetup_2018/blob/master/Deep%20Learning%20Emotion%20Recognition%20PyTorch.ipynb)