# "Pytorch Topics"

> "Everyday Learnings about ML. Random topics"

- toc: true
- branch: master
- badges: false
- comments: true
- categories: [Machine Learning]
- hide: false
- search_exclude: false
- image: images/post-thumbnails/pytorch.png
- metadata_key1: pytorch
- metadata_key2: 

# PyTorch

**Model Eval**

```python
import torch

learn.model.eval()             # eval mode: dropout is disabled, batch norm uses its running statistics
with torch.no_grad():          # switch off automatic differentiation
    output = learn.model(x)    # inference mode
```
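
To see what the two switches do, here is a minimal sketch. It assumes a small hypothetical `nn.Sequential` model standing in for `learn.model`; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for learn.model: a linear layer followed by dropout
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(2, 4)

model.train()
train_out = model(x)              # dropout is active; the output tracks gradients

model.eval()                      # dropout becomes a no-op
with torch.no_grad():             # no computational graph is built
    eval_out_1 = model(x)
    eval_out_2 = model(x)

print(train_out.requires_grad)                # True
print(eval_out_1.requires_grad)               # False
print(torch.equal(eval_out_1, eval_out_2))    # True: eval mode is deterministic
```

`eval()` changes how layers behave, while `no_grad()` only stops gradient tracking, so the two are usually used together for inference.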

**Detach**

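`detach()` returns a tensor (here from "x") that is cut off from the computational graph. No gradients are tracked for it, which reduces the memory footprint. The detached tensor still shares its data with "x".

Adding `clone()` detaches and also copies the tensor, so the result has its own memory.

```python

x.detach()               # detached tensor: shares data with x, no gradient tracking

y = x.detach().clone()   # detached copy of x with its own memory

```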
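
A small sketch of the difference between the two calls. The tensors here are made up for illustration.

```python
import torch

x = torch.ones(3, requires_grad=True)
z = x * 2                    # z is part of the graph (it has a grad_fn)

d = z.detach()               # same data, no graph, no gradient tracking
c = z.detach().clone()       # independent copy with its own memory

print(z.requires_grad, d.requires_grad, c.requires_grad)   # True False False
print(d.data_ptr() == z.data_ptr())   # True: detach() shares storage
print(c.data_ptr() == z.data_ptr())   # False: clone() allocates new storage
```

Because `detach()` shares storage, an in-place change to `d` would also change `z`. Use `detach().clone()` when you need a safe copy.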

**Model**

```python

learn.model[0]    # the model is an nn.Sequential container, so it can be indexed like a list

                  # Sequential(Sequential(...), Sequential(...)): neural networks stacked on top of each other

```
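
As an illustration, here is a hypothetical model built the same way, with a convolutional "body" and a classification "head" nested inside an outer `nn.Sequential`. The layers are assumptions, not the actual contents of `learn.model`.

```python
import torch.nn as nn

# Hypothetical model: a convolutional "body" followed by a classification "head"
body = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
model = nn.Sequential(body, head)

print(type(model[0]).__name__)      # Sequential (the body)
print(type(model[0][0]).__name__)   # Conv2d (first layer inside the body)
print(len(model[1]))                # 3 (layers in the head)
```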

**einsum**

einsum, or Einstein summation, is a shorthand notation for expressing operations on tensors (typically matrices) such as transpose, multiplication, sum, dot product etc.


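```python

torch.einsum('ab,bc->ac', A, B)

# 'ab,bc->ac' is the notation. a, b, c are dimension labels; "," separates the inputs.
# You are saying: take 2 inputs of dimensions ab and bc and generate an output of dimensions ac.
# b appears in both inputs but not in the output, so the values are multiplied and summed over b.
# A, B are the matrices.

```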
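
A few of the operations mentioned above, written with `torch.einsum`. The matrices are small made-up examples so the results can be checked by hand.

```python
import torch

A = torch.arange(6.0).reshape(2, 3)    # shape (a=2, b=3): [[0, 1, 2], [3, 4, 5]]
B = torch.arange(12.0).reshape(3, 4)   # shape (b=3, c=4)

matmul = torch.einsum('ab,bc->ac', A, B)     # b is summed over: matrix product
transpose = torch.einsum('ab->ba', A)        # swap the two dimensions
total = torch.einsum('ab->', A)              # no output indices: sum everything
dot = torch.einsum('i,i->', A[0], A[1])      # dot product of two vectors

print(torch.allclose(matmul, A @ B))   # True
print(transpose.shape)                 # torch.Size([3, 2])
print(total.item())                    # 15.0
print(dot.item())                      # 14.0
```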


**Class Activation Map**

**Areas used to determine class = activations of the last layer of conv * weights of the fully connected layer**


- The activations of the last conv layer (before the fully connected layer) show where the model is focusing.
- Needs a global average pooling layer in the network (as in ResNet)

- Why before the global average pooling layer?
The global pooling layer, unlike the earlier max-pool layers, squashes every feature map into a single number, producing one flat vector.
The pooling layer before the fully connected layer averages away all local activations and feeds the result to the FC layer. Until then you still have the localized features that the model is looking at. In other words, you still have the location in the image where the model is focusing.

- How does dot product help?

![](https://abhisheksreesaila.github.io/blog/images/general/cnn1.png "CNN - Training Phase")

Training: the network learns features and stores weights.

![](https://abhisheksreesaila.github.io/blog/images/general/cnn2.png "CNN - Inference Phase")

Inference: take the dot product of those weights with the activations to figure out which feature maps to focus on and which to omit.
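
The CAM computation can be sketched with random tensors. The shapes (512 feature maps of 7 x 7, 10 classes) and the bias-free FC layer are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)

# Assumed shapes: 512 feature maps of size 7x7 from the last conv layer,
# and a fully connected layer (no bias) mapping 512 pooled features to 10 classes
activations = torch.rand(512, 7, 7)
fc_weights = torch.randn(10, 512)

# Forward pass of the head: global average pooling, then the FC layer
pooled = activations.mean(dim=(1, 2))          # shape (512,)
logits = fc_weights @ pooled                   # shape (10,)
predicted = logits.argmax().item()

# CAM: weight each feature map by the FC weight of the predicted class
cam = torch.einsum('k,kij->ij', fc_weights[predicted], activations)

print(cam.shape)                               # torch.Size([7, 7])
# Averaging the CAM recovers the class score, because pooling and FC are both linear
print(torch.allclose(cam.mean(), logits[predicted], atol=1e-3))   # True
```

The 7 x 7 map is then usually upsampled to the input image size and drawn as a heatmap over the image.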



## Drawbacks

- The architecture needs to have a global average pooling layer. Only then can we take the layer before that.
- The method can only look at the final conv layer of the CNN to show why the model predicted what it did. It cannot show any earlier layer.
- These drawbacks are addressed by Grad-CAM, described below.


**Grad-CAM**

- Calculate the gradients by running the `.backward()` function on the class score. (PyTorch does not keep the gradients of intermediate activations by default, so they have to be computed and captured again during inference.)
- Average the GRADIENTS of the feature maps of the last conv layer (= weights)
- Multiply the WEIGHTS with the ACTIVATIONS (as in CAM) to get the map to display.
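
The three steps above can be sketched as follows. The tiny CNN is hypothetical, `retain_grad()` is just one way to keep the gradients of an intermediate tensor, and the final ReLU comes from the original Grad-CAM paper rather than from the notes above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical tiny CNN; the head deliberately has no global average pooling
conv = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 8 * 8, 4))

x = torch.randn(1, 3, 8, 8)

activations = conv(x)              # feature maps of the last conv layer: (1, 16, 8, 8)
activations.retain_grad()          # keep gradients for this intermediate tensor
logits = head(activations)
class_idx = logits.argmax(dim=1).item()

# Step 1: backpropagate the score of the predicted class
logits[0, class_idx].backward()
gradients = activations.grad       # same shape as the activations

# Step 2: average the gradients over the spatial dimensions -> one weight per map
weights = gradients.mean(dim=(2, 3))                    # (1, 16)

# Step 3: weighted sum of the feature maps; ReLU keeps positive evidence only
grad_cam = F.relu(torch.einsum('nk,nkij->nij', weights, activations))

print(gradients.shape == activations.shape)   # True
print(grad_cam.shape)                         # torch.Size([1, 8, 8])
```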

### Pros
 - overcomes all the issues of vanilla CAM
 - works for any image task (classification, segmentation, VQA)
 
### Cons
 - cannot locate multiple objects within the image.
 
 
![](https://abhisheksreesaila.github.io/blog/images/general/cnn3.png "CNN - GRAD CAM")

- Why are the gradients the same size as the activation maps?

A gradient is calculated for each pixel in the feature maps. For example, if the activation map is 512 x 7 x 7, then the number of gradients is also 512 x 7 x 7.

- Why does averaging gradients yield weights?

CAM uses the WEIGHTS of the fully connected layer to choose the "feature maps" that are more relevant and squash the ones that are not. So it is highly dependent on a (CONV Layer ==> Global Average Pooling Layer ==> FC Layer) network. Grad-CAM uses this concept but makes it more general.
  It uses GRADIENTS to provide the weights. We take the GRADIENTS at the last conv layer, do the global average pooling ourselves, and now we have our weights! We don't have to depend on a specific GAP layer or on the weights of the fully connected layer. GRADIENTS provide a good enough "weighting mechanism" to pick the feature maps that are relevant and squash the ones that are not.
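
A quick check of this idea, assuming a CAM-style head (global average pooling followed by one bias-free linear layer) and made-up shapes. For such a network the averaged gradients equal the FC weights up to a constant factor, so Grad-CAM and CAM select the same feature maps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed CAM-style head: global average pooling followed by one linear layer
fc = nn.Linear(16, 4, bias=False)
activations = torch.rand(1, 16, 7, 7, requires_grad=True)   # stand-in for conv output

logits = fc(activations.mean(dim=(2, 3)))   # GAP, then FC
class_idx = 2                               # arbitrary class chosen for illustration
logits[0, class_idx].backward()

# Grad-CAM weights: spatial average of the gradients of each feature map
grad_cam_weights = activations.grad.mean(dim=(2, 3))[0]     # (16,)
# CAM weights: the FC weights of the same class
cam_weights = fc.weight[class_idx].detach()                 # (16,)

# Each pixel contributes w_k / (7 * 7) to the score, so the two differ by a constant
print(torch.allclose(grad_cam_weights * 49, cam_weights, atol=1e-5))   # True
```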
<font size="3">  
Deep neural networks also act as an information distillation pipeline: the input image is converted into a representation that is visually less interpretable (irrelevant information is removed) but mathematically useful for the convnet to choose among the output classes in its last layer.
</font>
 
 
# References
 
 https://glassboxmedicine.com/2020/05/29/grad-cam-visual-explanations-from-deep-networks/
 


## Matplotlib Basics


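```python
import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(2, 100)   # sample data (assumed): two series of 100 points

# Rows = 3; Columns = 2; Total = 6 plots === Set the figure size to 5 inches by 5 inches
# Note that the size is defined in inches, not pixels
fig, axs = plt.subplots(3, 2, figsize=(5, 5))

axs[0, 0].hist(data[0])               # row 0, column 0: histogram
axs[1, 0].scatter(data[0], data[1])   # row 1, column 0: scatter plot
axs[0, 1].plot(data[0], data[1])      # row 0, column 1: line plot
axs[1, 1].hist2d(data[0], data[1])    # row 1, column 1: 2D histogram

# imshow shows an image (here a random 8x8 array, assumed for illustration)
# interpolation = estimate values between known data points when the image is resampled
axs[2, 0].imshow(np.random.rand(8, 8), interpolation="bilinear")

plt.show()  # show the plot
```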
[Check out the various types of interpolation here](https://matplotlib.org/stable/gallery/images_contours_and_fields/interpolation_methods.html)


## Python Decode Function

When you encode using a class (string class, data loader class etc.), you can use decode to undo it. This is useful for bringing an image back to its original form for display while "interpreting" the test results.

- IMAGE ==> RESIZE ==> ENCODE (normalize to ImageNet stats or something similar) ==> Output

- Output ==> DECODE ==> Original image (but still includes the resize) ==> Display (-able)
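
The encode/decode round trip can be sketched in plain PyTorch. The `encode` and `decode` functions below are hypothetical helpers written for this example, not a library API. The mean and std values are the commonly used ImageNet statistics.

```python
import torch

# ImageNet channel statistics (RGB), shaped to broadcast over (C, H, W)
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def encode(img):
    """Normalize an image tensor with values in [0, 1]."""
    return (img - mean) / std

def decode(img):
    """Undo the normalization so the image can be displayed."""
    return img * std + mean

image = torch.rand(3, 224, 224)        # stand-in for an already resized image
restored = decode(encode(image))

print(torch.allclose(restored, image, atol=1e-5))   # True
```

Note that decode only reverses the normalization. The resize is not undone, which matches the flow above.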

