# Ex7 - Deep Learning Questions - Computer Vision

## 1. Explain each of these terms in a sentence or two:

### 1. Cross entropy
Cross-entropy measures how well a classification model's predicted probabilities match the true labels: the higher the probability the model assigns to the correct class, the lower the cross-entropy.
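
As a minimal sketch (the function name `cross_entropy` and the example probabilities are illustrative assumptions, not part of the exercise), the cross-entropy of a single sample with a one-hot label is the negative log of the probability given to the true class:

```python
import math

def cross_entropy(probs, true_class):
    """Cross-entropy of one sample: -log of the probability given to the true class."""
    return -math.log(probs[true_class])

# A confident correct prediction gives a small loss ...
confident = cross_entropy([0.05, 0.90, 0.05], true_class=1)
# ... while an unsure prediction for the same label gives a larger one.
unsure = cross_entropy([0.40, 0.30, 0.30], true_class=1)
print(confident, unsure)
```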

### 2. Residual connections
A residual connection adds the output of an earlier convolutional layer to the output of a layer several layers later, so the intermediate convolutional layers are skipped. 
This gives the gradients a direct path through the network and makes very deep models easier to train.

### 3. Adam optimizer
Adam is an adaptive optimization algorithm that is commonly used in place of plain stochastic gradient descent for training deep learning models. It combines ideas from the AdaGrad and RMSProp algorithms with momentum: it keeps running averages of the gradients and of their squares, which gives every parameter its own step size and helps with sparse gradients on noisy problems.
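
The update rule can be sketched for a single scalar weight as follows. This is a simplified illustration, not a full optimizer: the function name `adam_step` and the toy objective are assumptions, while the default hyperparameters are the commonly quoted ones.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t, lr=0.05)
print(w)  # ends up close to the minimum at 3
```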

### 4. Cyclic learning rate
Cyclic learning rate is a schedule that varies the LR during training instead of keeping it fixed.
The LR changes after every batch and moves cyclically between a lower and an upper bound, rather than staying constant or changing only once per epoch.
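
One common shape for such a schedule is a triangular wave. The sketch below assumes that shape; the function name `triangular_lr`, the bounds and the cycle length are illustrative values only.

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.01, step_size=100):
    """Triangular cyclic LR: rises linearly for step_size batches, then falls back."""
    cycle_pos = step % (2 * step_size)                  # position inside the current cycle
    distance = abs(cycle_pos - step_size) / step_size   # 1 at the lower bound, 0 at the peak
    return base_lr + (max_lr - base_lr) * (1 - distance)

for step in (0, 50, 100, 150, 200):
    print(step, triangular_lr(step))
```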

### 5. Dropout
Dropout is a regularization method in which randomly selected neurons are ignored (“dropped out”) during training. Their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to them on the backward pass.
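
A minimal sketch of the forward pass, assuming the "inverted dropout" variant in which the kept activations are rescaled by 1 / (1 - p) so that nothing has to change at test time (the function name and the values are illustrative):

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p and rescale the rest."""
    if not training or p == 0.0:
        return list(activations)          # at test time nothing is dropped
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(0)
hidden = [1.0] * 10
print(dropout(hidden, p=0.5))                  # roughly half of the units are zeroed
print(dropout(hidden, p=0.5, training=False))  # unchanged at test time
```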

### 6. Bottleneck layer
A bottleneck layer is a layer that contains few nodes compared to the previous layers. It can be used to obtain a representation of the input with reduced dimensionality. An example of this is the use of autoencoders with bottleneck layers for nonlinear dimensionality reduction.

### 7. 1x1 convolution
A 1x1 convolution applies F filters of size 1 x 1 x K to an input with K channels: at every spatial position, each filter computes a weighted sum across the K channels, which produces F activation maps.<br> F > K increases the channel dimension, whereas F < K reduces it.
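
Because the filter covers a single pixel, a 1x1 convolution is a matrix product over the channel axis. The NumPy sketch below illustrates this; the helper name `conv1x1`, the channels-last layout and the sizes are assumptions, and biases are omitted.

```python
import numpy as np

def conv1x1(x, weights):
    """Apply F filters of size 1 x 1 x K to an input of shape (H, W, K)."""
    # weights has shape (K, F): each output channel is a weighted sum of the K input channels
    return x @ weights

x = np.random.rand(8, 8, 16)          # H=8, W=8, K=16 input channels
reduce_w = np.random.rand(16, 4)      # F=4 < K: channel reduction
expand_w = np.random.rand(16, 32)     # F=32 > K: channel expansion
print(conv1x1(x, reduce_w).shape)     # (8, 8, 4)
print(conv1x1(x, expand_w).shape)     # (8, 8, 32)
```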

### 8. DenseNet
A DenseNet is a type of convolutional neural network that utilises dense connections between layers, through Dense Blocks, where we connect all layers (with matching feature-map sizes) directly with each other.

## 2. 
Explain the pros and cons of using small and large batch sizes.<br><br>
**Answer:**<br>
- larger batch sizes tend to reach a lower asymptotic test accuracy
- the test accuracy lost with a larger batch size can often be recovered by increasing the learning rate
- starting with a large batch size doesn’t “get the model stuck” in some neighbourhood of bad local optima. The model can switch to a smaller batch size or a higher learning rate at any time to achieve better test accuracy
- too large a batch size leads to poor generalization
- large batches give less noisy gradient estimates and use parallel hardware efficiently, but need more memory
- small batches give noisier gradient estimates and more weight updates per epoch; the noise acts as a regularizer and often helps generalization, at the cost of less efficient hardware use.

## 3. 
How many 3x3 filters are needed to replace a 7x7 kernel? Compare the number of parameters in each option.<br>
**Answer:**<br>
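Three 3x3 filters applied sequentially can replace one 7x7 filter, because together they cover the same 7x7 receptive field.
To apply three 3x3 kernels, we need (3x3 + 3x3 + 3x3) = 27 weights. Using one 7x7 kernel, we need 49 weights (both counts are per input/output channel pair, ignoring biases).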
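
The comparison can be checked with a few lines of arithmetic. This sketch assumes stride-1 convolutions, a single input and output channel, and no biases; the helper name is illustrative.

```python
def stacked_conv_weights(kernel_size, num_layers):
    """Weights and receptive field of num_layers stacked stride-1 convolutions."""
    weights = num_layers * kernel_size * kernel_size
    receptive_field = 1 + num_layers * (kernel_size - 1)
    return weights, receptive_field

print(stacked_conv_weights(3, 3))  # (27, 7): three 3x3 kernels
print(stacked_conv_weights(7, 1))  # (49, 7): one 7x7 kernel
```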

## 4. 
You are training a neural network for classifying images on a custom dataset and it doesn't seem to learn anything. Describe your approach to solving the issue.<br>
**Answer:**

- Look for variables that are created but never used (usually because of copy-paste errors).
- Look for incorrect expressions for the gradient updates.
- Look for weight updates that are not applied.
- Look for loss functions that are not measured on the correct scale (for example, cross-entropy loss can be expressed in terms of probabilities or logits).
- Look for a loss that is not appropriate for the task (for example, using categorical cross-entropy loss for a regression task).
- Look for dropout that is used during testing, instead of only during training.
- Make sure you're minimizing the loss function L(x), instead of minimizing −L(x).
- Make sure your loss is computed correctly.

## 5. 
Mention the problems imbalanced datasets can cause to Deep Learning problems, and suggest a few ways to avoid them.<br>
**Answer:** <br>
The main problem with an imbalanced dataset is that the model fails to predict the minority class accurately, even though the overall accuracy can still look high. <br>
**Ways to avoid problems:**
- Changing the performance metric
- Resampling the dataset by adding copies of instances from the under-represented class (over-sampling) or deleting instances from the over-represented class (under-sampling)
- Generating synthetic samples, for example by randomly sampling attribute values from instances in the minority class.
- Reworking the problem itself: the classifier and the decision rule have to be set with respect to a well-chosen goal, for example minimising a cost that penalises minority-class errors more heavily
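
Two of these points can be illustrated with a short sketch: why accuracy is a misleading metric, and how random over-sampling balances the classes. The 95:5 split, the function name `oversample` and the integer stand-ins for samples are assumptions made for the example.

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

labels = [0] * 95 + [1] * 5                 # hypothetical 95:5 imbalance
samples = list(range(len(labels)))          # stand-ins for real feature vectors

# Always predicting the majority class already scores 95% accuracy,
# which is why accuracy alone is a misleading metric here.
majority_accuracy = labels.count(0) / len(labels)

_, balanced_labels = oversample(samples, labels)
print(majority_accuracy, Counter(balanced_labels))
```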

## 6. 
Given two networks: SRCNN (https://arxiv.org/pdf/1501.00092.pdf) and Unet (https://arxiv.org/pdf/1505.04597.pdf) . What is the receptive field of a pixel in each network? You should consider one pixel (let’s say the center pixel of the output) and go back to see what neighbors at the input image influence this pixel.<br><br>
**Answer:**<br><br>
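Algorithm for calculating receptive field size (k_l and s_l are the kernel size and stride of layer l, and S is the product of the strides of all earlier layers):<br>
r = 1<br>
S = 1<br>
for l = 1 to L do<br>
>  r = r + (k_l - 1) * S<br>
>  S = S * s_l<br><br>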
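
A minimal Python sketch of this kind of receptive-field calculation, assuming a plain chain of layers in which each layer adds (kernel size - 1) times the product of the strides of the layers before it (the function name and the example stacks are illustrative):

```python
def receptive_field(layers):
    """Receptive field of a chain of layers given as (kernel_size, stride) pairs."""
    r, jump = 1, 1                      # jump = product of the strides seen so far
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * jump
        jump *= stride
    return r

# Two 3x3 convolutions: each one adds 2 pixels.
print(receptive_field([(3, 1), (3, 1)]))           # 5
# A 2x2 stride-2 pooling doubles the contribution of the following 3x3 convolution.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 8
```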


**SRCNN:**<br>

```python
import torch.nn as nn

# None of the nn.Conv2d calls below sets `stride`, so every layer uses the default stride of 1.
stride1 = 1
stride2 = 1
stride3 = 1

class SRCNN(nn.Module):
    def __init__(self, num_channels=1):
        super(SRCNN, self).__init__()
        self.conv1 = nn.Conv2d(num_channels, 64, kernel_size=9, padding=9 // 2)
        self.conv2 = nn.Conv2d(64, 32, kernel_size=5, padding=5 // 2)
        self.conv3 = nn.Conv2d(32, num_channels, kernel_size=5, padding=5 // 2)
        self.relu = nn.ReLU(inplace=True)
```

r1 = 1 + (9-1) * 1 = 9 <br>
r2 = 9 + (5-1) * 1 = 13 <br>
r3 = 13 + (5-1) * 1 = 17 <br>

Hence, the receptive field size of this SRCNN implementation is: 17 (a 17 x 17 window of input pixels) <br>

**U-NET:**<br>
```python
class UNet(nn.Module):
    def __init__(self, n_channels, n_classes, bilinear=False):
        super(UNet, self).__init__()
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.bilinear = bilinear

        self.inc = DoubleConv(n_channels, 64)
        self.down1 = Down(64, 128)
        self.down2 = Down(128, 256)
        self.down3 = Down(256, 512)
        factor = 2 if bilinear else 1
        self.down4 = Down(512, 1024 // factor)
        self.up1 = Up(1024, 512 // factor, bilinear)
        self.up2 = Up(512, 256 // factor, bilinear)
        self.up3 = Up(256, 128 // factor, bilinear)
        self.up4 = Up(128, 64, bilinear)
        self.outc = OutConv(64, n_classes)
```
<br>
The network has 23 convolutional layers in total, plus 4 pooling layers:<br> 
18 conv layers (3x3) with stride of 1 <br>
4 up-convolution layers (2x2) with stride of 2 <br>
1 conv layer (1x1) with stride of 1 <br>
4 max pooling (2x2) layers with stride of 2 <br>
We follow the deepest path (through all four poolings) and add (k-1) * S for every layer, where S doubles after each max pooling and, as a simplifying assumption, halves after each up-convolution (which itself adds nothing): <br>
receptive field size = <br>
1 + <br>
(3-1) * 1 + (3-1) * 1 + <br>
(2-1) * 1 + (3-1) * 2 + (3-1) * 2 + <br>
(2-1) * 2 + (3-1) * 4 + (3-1) * 4 + <br>
(2-1) * 4 + (3-1) * 8 + (3-1) * 8 + <br>
(2-1) * 8 + (3-1) * 16 + (3-1) * 16 + <br>
(3-1) * 8 + (3-1) * 8 + <br>
(3-1) * 4 + (3-1) * 4 + <br>
(3-1) * 2 + (3-1) * 2 + <br>
(3-1) * 1 + (3-1) * 1 + <br>
(1-1) * 1 = <br>
= 200 <br>
Hence, under these assumptions U-Net's receptive field size is approximately: 200 (the exact extent depends on how the output pixel is aligned with the pooling grid) <br>

## 7. 
Given a standard CNN that consists of convolutional layers followed by fully-connected layers, explain which layer in the CNN can hold a lot of parameters. How can we avoid it?

**Answer:**<br>
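Fully-connected (FC) layers usually hold most of the parameters: every input is connected to every output, so the first FC layer after the flattened convolutional feature map is typically the largest.
We can avoid getting lots of parameters by shrinking the feature map before it reaches the FC layers (more convolution/pooling stages, or global average pooling) or by decreasing the number and size of the FC layers.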
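
A rough parameter count shows the scale of the difference. The layer sizes below are hypothetical and chosen only for illustration (a 3x3 convolution with 256 to 512 channels, and an FC layer with 4096 outputs fed by a 7x7x512 feature map):

```python
def conv_params(in_ch, out_ch, k):
    """Weights plus biases of a k x k convolution."""
    return in_ch * k * k * out_ch + out_ch

def fc_params(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features

print(conv_params(256, 512, 3))        # 1180160
print(fc_params(7 * 7 * 512, 4096))    # 102764544
# Global average pooling first shrinks the FC input to 512 features.
print(fc_params(512, 4096))            # 2101248
```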


## 8. 
What is the result of convolving an image X with a filter h = [-1 -1 -1; 0 0 0; 1 1 1].

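```python
import scipy.ndimage as ndi
import numpy as np

x = np.array([[-1, -3, -4,  0, -1],
              [ 2, -2, -4,  0, -2],
              [-3, -2,  2,  2,  3],
              [ 0, -3, -4, -4, -2],
              [-4, -2,  2,  0,  1]])

k1 = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]])

# ndi.correlate slides the kernel without flipping it (the deep-learning convention
# for "convolution"); borders use scipy's default 'reflect' mode.
# A strict convolution (ndi.convolve) flips the kernel, which for this h negates the result.
print('x * k1 =')
print(ndi.correlate(x, k1), end='\n\n')
```

```
x * k1 =
[[ 7  4  1 -1 -2]
 [-3  5  9 12 10]
 [-5 -3 -5 -4 -4]
 [-2 -1 -2 -4 -6]
 [-7  3 11 13 10]]
```

The filter subtracts the row above each pixel from the row below it, so it responds to horizontal edges (intensity changes in the vertical direction).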



## 9. 
(a) What if all the weights are initialized with the same value?<br><br>
**Answer:**<br>
If all weights are initialized with the same value, all units in a hidden layer compute the same output. <br>
Their gradients are then equal too, so the weights are updated by the same amount and the units stay identical throughout training. The hidden units remain symmetric, which is known as the symmetry problem: the layer behaves like a single unit and the network can't learn useful features.<br><br>
(b) What happens if we set all the biases to be zero (ignoring the biases)?<br><br>
**Answer:**<br>
Setting the biases to 0 does not create any problems: the non-zero weights take care of breaking the symmetry, so even with zero biases the values in every neuron will still be different (as long as different weights were used upon initialization).
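
The symmetry argument can be checked numerically. The sketch below uses a tiny bias-free network with one tanh hidden layer and a squared-error loss; the architecture, sizes and values are assumptions chosen only for illustration.

```python
import numpy as np

def hidden_gradients(w1, w2, x, y):
    """Gradient of 0.5 * (prediction - y)^2 w.r.t. the first-layer weights of a tiny tanh MLP."""
    h = np.tanh(w1 @ x)                 # hidden activations, shape (hidden,)
    err = w2 @ h - y                    # scalar prediction error
    # backprop: dL/dw1[j, i] = err * w2[j] * (1 - h[j]^2) * x[i]
    return np.outer(err * w2 * (1 - h ** 2), x)

x, y = np.array([0.5, -1.0, 2.0]), 1.0
same = hidden_gradients(np.full((4, 3), 0.1), np.full(4, 0.1), x, y)
rng = np.random.default_rng(0)
rand = hidden_gradients(rng.normal(size=(4, 3)), rng.normal(size=4), x, y)
print(np.allclose(same, same[0]))   # True: every hidden unit gets the same update
print(np.allclose(rand, rand[0]))   # False: random weights break the symmetry, even without biases
```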

## 10. 
Write down two NN models (they can also be convolutional networks) that have (more or less) the same number of parameters, but a different computational cost. Explain your answer.

**Answer:**<br>

params = weights + biases = i × (f×f) × o + o <br>

**CNN1:** Greyscale image 10x10 with 2×2 filter, output 3 channels <br>
i = 1 (greyscale has only 1 channel) <br>
f = 2 <br>
o = 3 <br>
params = 1 × (2×2) × 3 + 3 = 15<br>
calculations =  3 × (2×2) × 1 × 2 × 10×10 = 2400 <br>

**CNN2:** RGB image 20x20 with 2×2 filter, output of 1 channel <br>
i = 3 (RGB image has 3 channels)<br>
f = 2<br>
o = 1<br>
params = 3 × (2×2) × 1 + 1 = 13 <br>
calculations =  1 × (2×2) × 3 × 2 × 20×20 = 9600 <br>

We can see that, despite the fact that the number of parameters is almost the same (15 vs 13), the computational cost (number of calculations) differs significantly due to the difference in the input image size.
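
The two counts can be reproduced with a small helper. It follows the formulas used above and assumes, as they do, that the output has the same height and width as the input; the function name is illustrative.

```python
def conv_cost(in_ch, out_ch, k, height, width):
    """Parameters and multiply/add operations of one conv layer (output assumed height x width)."""
    params = in_ch * k * k * out_ch + out_ch
    # one multiply and one add per weight, for every output position
    operations = out_ch * k * k * in_ch * 2 * height * width
    return params, operations

print(conv_cost(1, 3, 2, 10, 10))   # (15, 2400)  CNN1
print(conv_cost(3, 1, 2, 20, 20))   # (13, 9600)  CNN2
```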

## 11.
You moved into a new apartment and you would like to make your front door smarter. Write an algorithm that alerts every time there is someone outside your door who is not a part of your family (you decide who is considered to be family).

**Answer:**<br>
We need to train our model on greyscale images of our family members and of some strangers. We'll label family members with 1 and strangers with 0.<br>
We'll initialise our weight vector with zeros and choose a learning rate. <br>
We'll iterate over the training data and, for each image sample, multiply the weights by the input pixels and sum the results. That's how we get the dot product. <br>
Next we compare the dot product with the predefined threshold to get a prediction (1 if it is above the threshold, 0 otherwise), update the weights by the learning rate times the prediction error times the input, and then keep going. If our data is linearly separable, the Perceptron will converge. <br>
After training, we validate the model on new greyscale images of both family members and strangers.<br>
The output will be 1 for family members and 0 for strangers, so the door raises an alert whenever the output is 0.
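
The training loop described above can be sketched in plain Python. The two-pixel "images" are toy stand-ins for flattened greyscale photos, and the function names, the added bias term, the learning rate and the threshold are illustrative assumptions:

```python
def train_perceptron(samples, labels, lr=0.1, epochs=20, threshold=0.0):
    """Perceptron rule on flattened greyscale images (lists of pixel values)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if activation > threshold else 0
            error = target - prediction          # 0 when the sample is classified correctly
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

def predict(weights, bias, x, threshold=0.0):
    """1 = family member, 0 = stranger (raise an alert)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > threshold else 0

# Toy data: 1 = family member, 0 = stranger.
samples = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = [1, 1, 0, 0]
weights, bias = train_perceptron(samples, labels)
print([predict(weights, bias, x) for x in samples])   # [1, 1, 0, 0]
```

A single perceptron only separates linearly separable data, so on real face images this should be read as a toy illustration of the procedure rather than a working recogniser.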

## 12. 
These days, most of the parking lots have a system that recognizes the license plate number of an entering car. This helps to automatically open the gate in case that the driver paid before arriving at the exit gate. For this purpose, two cameras are needed - one at the entrance and one at the exit. Both capture the license plate number and translate it into a series of numbers.

Write an algorithm that for a given image containing a car with a license plate number,
recognizes all the numbers in this plate. Pay attention that the angle and the location of the
license plate can vary between images.

For simplicity: all the license plates have the same size (in the real world, not in the image), have 7 digits and have the same font. The plate is yellow with black digits. Some pictures are attached.

Moreover, no trucks or motorcycles can enter the parking lot. If you have more assumptions, please write them down.

**Answer:**<br>
1. Pre-processing:<br>
      1.1 Convert the grayscale image into a binary image using Otsu’s algorithm to calculate the threshold. <br>
      1.2 Remove all objects containing fewer than 30 pixels. Apply a median filter to remove the noise.<br>
      1.3 Calculate the connected components of the image by scanning it pixel by pixel (from top to bottom and left to right) in order to identify connected pixel regions. <br>
      1.4 Search for connected components in the image; each connected component is assigned a unique label in order to distinguish between the different connected components in the image.<br>
      1.5 Resize each character from the previous step to the standard height and width used in the recognition process.<br>
      1.6 Measure the properties of the image regions by plotting bounding boxes to separate the individual characters and numbers for the recognition process.<br>
2. Number plate localization - a number of algorithms are suggested for number plate localization, such as the multiple interlacing algorithm, Fourier domain filtering, and colour image processing.  <br>
      2.1 A statistical median filter is used to remove salt-and-pepper noise from the grayscale image before binarizing. We use a 3 x 3 masking sub-window for this purpose.  <br>
      2.2 Connected components - connected components labeling scans an image and groups its pixels into components based on pixel connectivity, i.e. all pixels in a connected component share similar pixel intensity values and are in some way connected with each other. Once all groups have been determined, each pixel is labeled with a gray level or a color (color labeling) according to the component it was assigned to.<br>
3. Character segmentation -   <br>
      3.1 Working with bounding boxes - the minimum bounding box of a point set in N dimensions is the box with the smallest measure within which all the points lie. In other words, it has the minimum height and width that cover all the pixels present in a particular connected component or region.<br>
      3.2 Selecting the best bounding box by <br>
      * Contrast present in the bounding box
      * Aspect ratio
      * Width of the license plate
      3.3 Cropping the bounding box - after identifying the best bounding box candidate for the license plate, the coordinates of the bounding box are noted, and the box is cropped from the image and sent to the character segmentation module for further processing.
4. Character recognition - the image obtained after segmentation is grayscale. Follow the same preprocessing steps that were used for the character templates. Then calculate the matching score of the segmented character against each stored character template with the following algorithm: compare the pixel values of the segmented character matrix and the template matrix; for every match add 1 to the matching score, and for every mismatch subtract 1. This is done for all 225 pixels. A match score is generated for every template, and the template with the highest score is taken to be the recognized character.
5. We got the number :)
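
The matching score from step 4 can be sketched as follows. The 3x3 binary grids are tiny stand-ins for the 225-pixel character images, and the function names and templates are made up for illustration:

```python
def match_score(character, template):
    """+1 for every pixel that agrees with the template, -1 for every pixel that differs."""
    score = 0
    for char_row, tmpl_row in zip(character, template):
        for a, b in zip(char_row, tmpl_row):
            score += 1 if a == b else -1
    return score

def recognise(character, templates):
    """Return the label of the template with the highest matching score."""
    return max(templates, key=lambda label: match_score(character, templates[label]))

# Tiny binary stand-ins for the real character templates.
templates = {
    "1": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "7": [[1, 1, 1], [0, 0, 1], [0, 0, 1]],
}
noisy_seven = [[1, 1, 1], [0, 0, 1], [0, 1, 1]]   # a "7" with one wrong pixel
print(recognise(noisy_seven, templates))          # 7
```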
