$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 3: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

**Answer:**<br><br>
In a neural network context, the receptive field is defined as the size of the region in the input that produces the feature (Wikipedia).<br>
In other words, it is the portion of the input that is needed in order to compute a specific feature at a given convolutional layer.<br>
The receptive fields of neighbouring features partially overlap, and together they cover the entire input space.<br>
When convolutional layers are stacked, the receptive fields are merged: each feature takes its input from a region of the previous layer's feature map, which in turn corresponds to a larger area of pixels in the input image.<br>
As an intuition, it can be compared to our eyes, where each cell sees only a part of the visual field: the receptive field starts as a small portion of the input and grows as the convolutions combine these portions in order to make sense of what is seen.<br>
The receptive field size is affected by the kernel size, stride and dilation of each layer (padding changes the output size and the handling of the borders, but not the size of the receptive field).

---

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

**Answer:**<br><br>
Growing receptive fields from layer to layer depends on the following:<br>
1. <u>Pooling -</u> Reduces the spatial dimension of the feature map by combining the features of each region into a single value (e.g. their maximum or average). Each feature in the following layers therefore corresponds to an increasingly larger part of the input image, which results in a rapid increase in the receptive field size.<br><br>

2. <u>Stride -</u> The distance the filter moves at every step along each direction, so it determines how much the receptive fields of neighbouring features overlap. Larger strides give a smaller overlap and downsample the feature map, so every subsequent layer adds more input pixels to its receptive field. Unlike pooling, the input features are combined by a learned weighted sum.<br><br>

3. <u>Dilation -</u> Increasing the dilation factor spreads the kernel weights apart at fixed intervals (i.e. the kernel becomes sparse), so the effective kernel size grows without adding parameters. Therefore, by monotonically increasing the dilation factors through the layers, the receptive fields can be expanded effectively without loss of resolution. The features inside the covered region are sampled sparsely rather than all being combined.

---

3. Imagine a CNN with three convolutional layers, defined as follows:

In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

**Answer:**<br><br>
Each receptive field size is derived from the layer before it, hence we can calculate the network's receptive field recursively.<br>The recursive formula for the receptive field size of the output tensor is: $$ r_k = r_{k-1} + (g_k-1)\cdot \prod_{i=1}^{k-1}s_i, \qquad r_0 = 1$$
$r_k$ - receptive field at layer $k$<br>
$g_k$ - kernel size of layer $k$<br>
$s_k$ - stride of layer $k$<br>
* A dilated convolution is treated as a convolution with a larger effective kernel size, $d\cdot(g-1)+1$. So $Dilation_{2}^{7x7} = 2\cdot(7-1)+1 = 13$
* Activation functions don't change the receptive field
* $Pooling_{[2x2]}$ is treated as a Conv2d with kernel size = 2, stride = 2, padding = 0, dilation = 1
<br><br>
Applying the equation above we get the following:
first layer:$$r_{1} = 1+(3-1)\cdot1 = 3$$
pooling layer:$$r_{2} = 3+(2-1)\cdot1 = 4$$
conv layer:$$r_{3} = 4+(5-1)\cdot1\cdot2 = 12$$
pooling layer:$$r_{4} = 12+(2-1)\cdot1\cdot2\cdot2 = 16$$
last layer:$$r_{5} = 16+(13-1)\cdot1\cdot2\cdot2\cdot2 = \Large{112}$$
So each output "pixel" has a receptive field of $112\times112$ input pixels.
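
The recursion above can be checked with a few lines of plain Python. This is a minimal sketch: the function name `receptive_field` and the `(kernel_size, stride, dilation)` tuple layout are assumptions made for illustration, and pooling is treated as a strided convolution as described above.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers.

    Each layer is a (kernel_size, stride, dilation) tuple. Activations are
    left out because they do not change the receptive field.
    """
    r, jump = 1, 1  # r_0 = 1; jump is the product of the strides seen so far
    for kernel_size, stride, dilation in layers:
        effective_kernel = dilation * (kernel_size - 1) + 1
        r += (effective_kernel - 1) * jump
        jump *= stride
    return r


layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
print(receptive_field(layers))  # 112
```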

---

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

**Answer:**<br><br>
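The main reason that the original and residual networks produce different filters is that the filters represent different functions in the two cases. In the residual network, the filters of each layer only need to learn the difference (the residual) between the output and the input of the layer, as can be seen by re-arranging the given formula: $$f_l(\vec{x};\vec{\theta}_l)=\vec{y}_l-\vec{x}$$ In the original network the same filters must represent the full mapping $\vec{y}_l$, including the part that merely passes the input through, so the learned filters look completely different.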

---

### Dropout

1. Consider the following neural network:

In [2]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

**Answer:**<br><br>
Simplified in words:
- With the first dropout layer we drop 10% of the neurons and keep 90%.
- With the second dropout layer, 20% of the remaining 90% of the neurons are dropped,<br> hence we keep 80% of 90% of the neurons, which is $0.9\cdot0.8 = 0.72 \rightarrow$ 72% of the neurons.<br>
<br>In general, a neuron survives both layers with probability $(1-p_1)(1-p_2)$, so a single dropout layer that keeps the same 72% of the neurons needs the following drop probability: $$q = 1 - (1-p_1)(1-p_2) = p_1 + p_2 - p_1 p_2 = 1 - 0.9\cdot0.8 = \Large{0.28}$$
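
The value of `q` can also be checked empirically. The following is a minimal sketch in plain Python (no PyTorch), assuming the two dropout masks are sampled independently; the helper name `combined_drop_probability` is made up for illustration.

```python
import random


def combined_drop_probability(p1, p2):
    """Drop probability of one dropout layer equivalent to two stacked ones."""
    return 1 - (1 - p1) * (1 - p2)


p1, p2 = 0.1, 0.2
q = combined_drop_probability(p1, p2)

# Monte Carlo check: a unit survives only if both independent masks keep it.
rng = random.Random(0)
n = 200_000
kept = sum(rng.random() >= p1 and rng.random() >= p2 for _ in range(n))
print(f"q = {q:.2f}, simulated keep fraction = {kept / n:.3f}")
```

The scaling factors agree as well: stacking the two layers scales the surviving activations by $\frac{1}{(1-p_1)(1-p_2)}$, which is exactly $\frac{1}{1-q}$.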

---

2. **True or false**: dropout must be placed only after the activation function.

**Answer:**<br><br>
**<u>False</u> <br>**
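Usually dropout is applied after the activation function. Nevertheless, and in particular when using ReLU, dropout can be applied before the activation with the same result, since ReLU maps zero to zero and commutes with the positive scaling factor $1/(1-p)$,
and in that case it may even be slightly more efficient computationally.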

---

3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

**Answer:**<br><br>
Let $x$ denote an activation with expectation $\mathbb{E}[x]$, and let $p$ be the drop probability. <br>
Dropout multiplies the activation by a random mask $m\sim\mathrm{Bernoulli}(1-p)$ that is independent of $x$, i.e. $m=0$ with probability $p$ and $m=1$ with probability $1-p$, so the activation after dropout is $\hat{x} = m\cdot x$.<br>
Using the properties of the expectation (independence of $m$ and $x$), the expectation of the activation after dropout is: $$\mathbb{E}[\hat{x}] = \mathbb{E}[m\cdot x] = \mathbb{E}[m]\cdot\mathbb{E}[x] = (1-p)\cdot \mathbb{E}[x]$$
Now, it's easy to see that in order to keep the value of each activation unchanged in expectation, **we need to scale the dropout activation by $1/(1-p)$**: $$\mathbb{E}\left[\frac{\hat{x}}{1-p}\right] = \frac{(1-p)\cdot \mathbb{E}[x]}{1-p} = \mathbb{E}[x]$$
Q.E.D
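
The claim can be sanity-checked by simulation. This is a minimal sketch in plain Python rather than `nn.Dropout`; the function `inverted_dropout` and the values of `x` and `p` are assumptions chosen for illustration.

```python
import random


def inverted_dropout(x, p, rng):
    """Zero x with probability p, otherwise scale it by 1 / (1 - p)."""
    return 0.0 if rng.random() < p else x / (1.0 - p)


rng = random.Random(0)
x, p, n = 3.0, 0.3, 200_000
mean_scaled = sum(inverted_dropout(x, p, rng) for _ in range(n)) / n
print(f"x = {x}, mean after dropout with scaling = {mean_scaled:.3f}")
```

Without the division by `1 - p`, the same simulation would average to roughly `(1 - p) * x` instead of `x`.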

---

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**Answer:**<br><br>
An L2 loss is suited for regression tasks, whereas here we have a classification task. Hence, binary cross-entropy (BCE) is the better choice: it penalizes uncertain and wrong predictions much more strongly, which forces the model to keep learning in order to eventually predict better.<br>
Let's demonstrate how BCE penalizes more than L2 in a case of uncertainty:<br>
Suppose the image is a dog and the model predicts $p_{dog} = 0.45$, which is quite an uncertain classification score, almost 50-50. Let's see which loss penalizes more:
$$L_2(p_{dog} = 0.45)= (1-0.45)^2  \approx 0.30 \\ L_{BCE}(p_{dog} = 0.45)= -\log(0.45) \approx 0.798 $$
We see that $L_{BCE} > L_2$, so BCE penalizes the uncertain prediction more: a higher loss pushes the model harder than a lower loss. The gap grows for confidently wrong predictions, where L2 is bounded by 1 while BCE is unbounded.
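
The comparison can be reproduced for a few predicted probabilities of the true class. This is a small sketch in plain Python; the function names `l2_loss` and `bce_loss` are made up for illustration, and the natural logarithm is assumed.

```python
import math


def l2_loss(p_true_class):
    """Squared error between the target (1) and the predicted probability."""
    return (1.0 - p_true_class) ** 2


def bce_loss(p_true_class):
    """Binary cross-entropy when the true class is predicted with p_true_class."""
    return -math.log(p_true_class)


for p in (0.45, 0.1, 0.01):
    print(f"p = {p:<4}  L2 = {l2_loss(p):.4f}  BCE = {bce_loss(p):.4f}")
```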

---

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [3]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**Answer:**<br><br>
The chosen architecture is very deep (25 hidden layers with sigmoid activations), probably deeper than necessary, and has no batch normalization or skip connections.<br>
Skip connections in deep architectures, as the name suggests, skip some layers in the neural network and feed the output of one layer directly as an input to later layers.<br>
When none of the above is applied, we can witness a phenomenon called "vanishing gradients".<br>
The gradient becomes very small as we approach the earlier layers of a deep architecture, because it is a product of many per-layer factors, and the derivative of the sigmoid is at most 0.25. In some cases the gradient becomes practically zero, meaning that we do not update the early layers at all, hence eventually the model stops training.
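
To get a feel for the magnitudes involved, here is a rough back-of-the-envelope sketch in plain Python. It only multiplies the sigmoid derivatives along one path and ignores the weight matrices, which is a simplifying assumption; the pre-activation values 0 and 4 are arbitrary choices for illustration.

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


num_sigmoids = 25  # 1 + 24 sigmoid activations in the model above
best_case = sigmoid_derivative(0.0) ** num_sigmoids  # every unit at z = 0
saturated = sigmoid_derivative(4.0) ** num_sigmoids  # every unit at z = 4
print(f"best case: {best_case:.2e}, saturated: {saturated:.2e}")
```

Even in the best case the product of the activation derivatives is of the order of $10^{-16}$, so the first layers receive practically no gradient.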

---

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

**Answer:**<br><br>
`tanh` saturates as well: its gradient approaches zero quickly as the input moves away from zero, same as `sigmoid`. Its derivative, $\mathrm{sech}^2$, is bounded in $(0,1]$ (compared to $(0,0.25]$ for `sigmoid`), so the gradients shrink somewhat more slowly, but in a network this deep the change won't make a significant difference.<br>
`ReLU`, on the other hand, stays linear with a gradient of 1 in the positive range where `sigmoid` and `tanh` saturate towards 1, and thus deals better with vanishing gradients.
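
The derivatives of the three activations can be compared directly. This is a small sketch in plain Python; the helper names and the sample inputs are assumptions for illustration.

```python
import math


def d_sigmoid(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2  # sech^2(z)


def d_relu(z):
    return 1.0 if z > 0 else 0.0


for z in (0.0, 2.0, 5.0):
    print(f"z = {z}: sigmoid' = {d_sigmoid(z):.4f}, "
          f"tanh' = {d_tanh(z):.4f}, relu' = {d_relu(z):.1f}")
```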

---

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:<br>
    4.1. In a model using exclusively ReLU activations, there can be no vanishing gradients.<br>
    4.2. The gradient of ReLU is linear with its input when the input is positive.<br>
    4.3. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.<br>

**Answer:**<br><br>
4.1. **False** - Activation functions are not the only possible cause of vanishing gradients. Vanishing gradients could also be the result of a network that is too deep without skip connections, and ReLU itself has a gradient of exactly zero for negative inputs.<br><br>
4.2. **False** - The gradient of ReLU is constant (equal to 1) whenever the input is positive. It is the output, not the gradient, that is linear in the input.<br><br>
4.3. **True** - Feeding a negative input into ReLU gives an output of 0 and a gradient of 0 as well. If this happens for all inputs, the weights of that specific neuron are no longer updated, causing a dead neuron. This is why LeakyReLU comes in handy, as it keeps a small non-zero gradient for exactly these cases.

---

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**Answer:**<br><br>
The difference between the optimizers above lies in the number of samples used for each update.<br>
`GD` uses all of the training samples in each update, `SGD` uses one random sample per update, and `mini-batch SGD` uses a small fixed-size 'batch' of samples in each update.<br>
For very large datasets `GD` is too expensive, or even infeasible, in terms of computation time and memory, while `SGD` is more feasible and makes progress towards a local minimum faster. `mini-batch SGD` is usually preferable to single-sample `SGD`, whose updates are very noisy, because averaging over a batch reduces the variance of the gradient estimate.
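
The three variants differ only in how the batch for each update is chosen, which the following sketch tries to make explicit. It is a toy example in plain Python under stated assumptions: a one-parameter model $y = w x$, made-up noiseless data, and hypothetical helper names `gradient` and `train`.

```python
import random


def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, steps, lr=0.1, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # batch_size == len(data) -> GD, 1 -> SGD, in between -> mini-batch SGD
        batch = data if batch_size == len(data) else rng.sample(data, batch_size)
        w -= lr * gradient(w, batch)
    return w


# Toy data generated from y = 3x (a made-up example, not from the course).
data = [(x / 10, 3 * x / 10) for x in range(1, 11)]
for name, batch_size in (("GD", len(data)), ("mini-batch SGD", 4), ("SGD", 1)):
    print(f"{name:>14}: w = {train(data, batch_size, steps=500):.4f}")
```

All three runs approach $w = 3$ here because the data is noiseless. The cost per step, however, grows with the batch size.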

---

2. Regarding SGD and GD:<br><br>
    2.1. Provide at least two reasons for why SGD is used more often in practice compared to GD.<br><br>
    2.2. In what cases can GD not be used at all?<br><br>

**Answer:**<br><br>
2.1. <br>
    i. Slow training: in `GD`, each update can take a lot of time because the gradient is computed over all training samples in every iteration.<br>
    ii. `SGD` might generalize better because the randomly selected samples add noise to the model updates, while `GD` might overfit since every update uses the full training data.<br>
2.2. <br>
    i. When the dataset is too large to fit in the memory of the machine we train on. Using a huge number of samples per update will train really slowly in the best case, or cause the memory to run out in the worst case.
    
---

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

**Answer:**<br><br>
We would expect the number of iterations to decrease: there are more samples for the model to learn from in each update, so the gradient direction is more accurate and less noisy, resulting in a larger reduction of the loss in each iteration.<br>
Nevertheless, even if fewer iterations are needed for convergence, it doesn't mean that the total training time will be shorter: each iteration now processes twice as many samples, so training may take just as long or even longer.<br>

---


4. For each of the following statements, state whether they're **true or false** and explain why.<br>
    4.1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.<br>
    4.2. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.<br>
    4.3. SGD is less likely to get stuck in local minima, compared to GD.<br>
    4.4. Training  with SGD requires more memory than with GD.<br>
    4.5. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.<br>
    4.6. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.<br>

**Answer:**<br><br>
4.1. **True** - With plain SGD every update uses a single sample, so in each epoch we perform one optimization step per sample. (With mini-batch SGD we would perform one step per batch instead.)<br><br>
4.2. **False** - SGD estimates the gradient from only one sample, so its gradients have higher variance than those of GD. Each step is much cheaper, which often leads to faster progress in practice, but the convergence is less stable.<br><br>
4.3. **True** - Each update is computed from new random samples, so the noise in the gradient can help the optimizer get out of local minima towards better ones, unlike classical GD, which follows the exact gradient in a stable manner and settles in a local minimum.<br><br>
4.4. **False** - As already stated above, GD consumes more memory since every update processes all of the data samples, instead of a single sample or a smaller batch as in SGD.<br><br>
4.5. **False** - Neither method is guaranteed to reach the global minimum of a non-convex loss. GD has an even higher chance of ending up in a local minimum than SGD, because using all training samples gives a stable, consistent gradient direction.<br>
The SGD gradient is less stable and its direction may vary between iterations, due to a different 'chunk' of training samples in each iteration, which gives it more chances to get out of a local minimum if it stumbles upon one.<br><br>
4.6. **True** - Newton's method uses second-order derivatives, which are computationally more expensive than the first-order updates of SGD with momentum. Furthermore, Newton's method tends to get stuck at saddle points, which can occur on narrow-ravine surfaces, whereas momentum dampens the oscillations across the ravine that vanilla SGD suffers from.

---

5. **Bonus** (we didn't discuss this at class):  We can use bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  <br>
    6.1. Explain the concepts of "vanishing gradients", and "exploding gradients".<br>
    6.2. How can each of these problems be caused by increased depth?<br>
    6.3. Provide a numerical example demonstrating each.<br>
    6.4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?<br>

**Answer:**<br><br>
6.1. <br>
<u>vanishing gradients</u> - the gradients shrink as they are propagated backwards through the network, until they become so small that they are effectively 0.<br>
<u>exploding gradients</u> - the gradients grow as they are propagated backwards through the network, until they become so large that the update steps are too big and the optimizer can't find a minimum.<br>
To sum up both phenomena: because the chain rule multiplies the per-layer gradients, large gradients get drastically increased and small gradients get drastically decreased.<br>
6.2. <br>
As written above, when propagating the gradients backwards, the current gradient value is multiplied by the gradient of the next layer, and so on. Increasing the number of layers means increasing the number of multiplications, which turns large numbers into even greater numbers and small numbers into even smaller numbers. With a network that is too deep, per-layer gradients that are consistently above or below 1 in magnitude rapidly grow towards infinity or decay to 0, respectively.<br>
6.3. <br>
Let's assume we have a deep-ish NN with `n` layers. Each layer multiplies its input `x` by a weight `w`, and the result then goes through the activation function.<br>
Let's assume that all weights throughout the network have a similar magnitude, and in our particular example are exactly equal: $$ w= 0.5$$
Let's assume we have $n=10$ layers and, for simplicity, an activation with a derivative of 1. The gradient is multiplied by 0.5 in each layer, so after 10 layers it is scaled by $0.5^{10} \approx 0.00098$, which tends to 0 as more layers are added (vanishing gradients).
Now if we switch to a large weight, let's say $w=5$, after 10 layers the gradient is scaled by $5^{10} = 9765625$, which keeps growing as we go deeper through the net (exploding gradients).<br><br>
6.4. <br>
It can be inferred by looking at the values and the curve of the loss function:
* Constant loss - the loss remains practically the same, meaning the weights are not being updated, a sign of **vanishing gradients**
* Unstable loss - the loss oscillates dramatically with large magnitudes, meaning the weights are getting very large values, or even overflowing, which are all signs of **exploding gradients**
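
The numerical example from 6.3 can be reproduced in a few lines. This is a minimal sketch that assumes, as in the example, one scalar weight per layer and an activation derivative of 1; the function name `gradient_scale` is made up for illustration.

```python
def gradient_scale(w, num_layers):
    """Factor by which a gradient is scaled after passing through num_layers
    layers that each multiply it by w (activation derivative assumed to be 1)."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= w
    return scale


for w in (0.5, 5.0):
    for n in (10, 50):
        print(f"w = {w}, n = {n}: scale = {gradient_scale(w, n):.3e}")
```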

---

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, which is defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

**Answer:**<br><br>
We first denote the following, for each sample $i$: $$\vec{z}^{(i)} = \mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1, \qquad \delta^{(i)} = \pderiv{\ell}{\hat{y}^{(i)}} = -\frac{y^{(i)}}{\hat{y}^{(i)}} + \frac{1-y^{(i)}}{1-\hat{y}^{(i)}}$$
Here $\hat{y}^{(i)}$ is a scalar, so $\mat{W}_2$ is a row vector, $\varphi'$ is applied elementwise and $\odot$ denotes the elementwise product. The regularization term contributes $\pderiv{}{\mat{W}}\frac{\lambda}{2}\norm{\mat{W}}_F^2 = \lambda\mat{W}$.<br>
Now, deriving by $\vec{x}^{(i)}$ (only the $i$-th term of the sum depends on it):
$$ \pderiv{L_{\mathcal{S}}}{\vec{x}^{(i)}} = \frac{1}{N}\pderiv{\ell}{\hat{y}^{(i)}}\pderiv{\hat{y}^{(i)}}{\vec{x}^{(i)}} = \frac{1}{N}\,\delta^{(i)}\,\mattr{W}_1\left(\mattr{W}_2 \odot \varphi'(\vec{z}^{(i)})\right)$$
Deriving by $\vec{b}_1$:
$$ \pderiv{L_{\mathcal{S}}}{\vec{b}_1} = \frac{1}{N}\sum_{i=1}^{N}\pderiv{\ell}{\hat{y}^{(i)}}\pderiv{\hat{y}^{(i)}}{\vec{b}_1} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\left(\mattr{W}_2 \odot \varphi'(\vec{z}^{(i)})\right) $$
Deriving by $\vec{b}_2$:
$$ \pderiv{L_{\mathcal{S}}}{\vec{b}_2} = \frac{1}{N}\sum_{i=1}^{N}\pderiv{\ell}{\hat{y}^{(i)}}\pderiv{\hat{y}^{(i)}}{\vec{b}_2} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)} $$
Deriving by $\mat{W}_1$:
$$ \pderiv{L_{\mathcal{S}}}{\mat{W}_1} = \frac{1}{N}\sum_{i=1}^{N}\pderiv{\ell}{\hat{y}^{(i)}}\pderiv{\hat{y}^{(i)}}{\mat{W}_1} + \lambda\mat{W}_1 = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\left(\mattr{W}_2 \odot \varphi'(\vec{z}^{(i)})\right){\vec{x}^{(i)}}^\top + \lambda\mat{W}_1 $$
Deriving by $\mat{W}_2$:
$$ \pderiv{L_{\mathcal{S}}}{\mat{W}_2} = \frac{1}{N}\sum_{i=1}^{N}\pderiv{\ell}{\hat{y}^{(i)}}\pderiv{\hat{y}^{(i)}}{\mat{W}_2} + \lambda\mat{W}_2 = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\,\varphi(\vec{z}^{(i)})^\top + \lambda\mat{W}_2 $$

---

2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

**Answer:**<br><br>
- Let $\epsilon = \Delta\vec{x}$ be a small positive number. <br> By plugging in $\epsilon$ instead of taking the limit, we can approximate the derivative stated above at a certain $x_0$. In particular, for a neural network we perturb one parameter at a time by $\epsilon$, re-evaluate the loss, and divide the change in the loss by $\epsilon$. Repeating this for every parameter gives the gradient of the loss without using AD.<br><br>

The resulting drawbacks:
1. The result depends on our choice of $\epsilon$: too big gives a poor approximation, too small causes floating-point round-off errors. Both options lead to an inaccurate derivative.
2. Stability and computational cost - the operation might not be numerically stable, since it subtracts two nearly equal numbers and divides the result by a very small number. It is also computationally expensive: every parameter requires at least one extra forward pass, so a large network with many parameters takes a very long time to differentiate, whereas AD obtains all gradients with a single backward pass.
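
Both drawbacks show up even on a tiny example. The sketch below is plain Python with a hypothetical toy loss $w^3 + 2b$ whose exact gradient is known; `numerical_gradient` and `loss` are names made up for illustration. It needs one extra function evaluation per parameter, and its error depends strongly on `eps`.

```python
def numerical_gradient(f, params, eps=1e-6):
    """Forward-difference estimate of the gradient of f at params (list of floats)."""
    base = f(params)
    grad = []
    for i in range(len(params)):
        shifted = list(params)
        shifted[i] += eps  # perturb one parameter at a time
        grad.append((f(shifted) - base) / eps)
    return grad


def loss(params):
    """Hypothetical toy loss: w**3 + 2*b, with exact gradient (3*w**2, 2)."""
    w, b = params
    return w ** 3 + 2 * b


exact = [3 * 2.0 ** 2, 2.0]
for eps in (1e-1, 1e-6, 1e-15):
    approx = numerical_gradient(loss, [2.0, 1.0], eps)
    errors = [abs(a - e) for a, e in zip(approx, exact)]
    print(f"eps = {eps:.0e}: max error = {max(errors):.2e}")
```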

---

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [4]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# Calculate gradients numerically for W and b with forward differences
grad_W = torch.zeros_like(W)
grad_b = torch.zeros_like(b)
eps = 1e-6
with torch.no_grad():  # the numerical estimate needs no autograd graph
    f = foo(W, b)  # unperturbed loss, shared by all finite differences
    for i in range(d):  # perturb one element of b at a time
        b_tag = torch.clone(b)
        b_tag[i] += eps
        grad_b[i] = (foo(W, b_tag) - f) / eps

    for i in range(d):  # perturb one element of W at a time
        for j in range(d):
            w_tag = torch.clone(W)
            w_tag[i, j] += eps
            grad_W[i, j] = (foo(w_tag, b) - f) / eps

# Calculate the same gradients with autograd
loss.backward()
autograd_W = W.grad
autograd_b = b.grad

# Compare with autograd using torch.allclose()
assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)

loss=tensor(1.5294, dtype=torch.float64, grad_fn=<MeanBackward0>)


### Sequence models

1. Regarding word embeddings:
  1. Explain this term and why it's used in the context of a language model.
  1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

**Answer:**<br><br>
- A word embedding is a representation of a word (token) as a dense, real-valued vector in a continuous vector space. The idea is that these vectors also encode the meaning of the word, so that words with similar meaning and semantics are represented by vectors that are close to each other after the mapping is done. In a language model it is used so that similar words are processed similarly and can be compared.
- Yes, it can be done, but it will not work as well. Without an embedding the tokens are represented by a one-hot encoding, which has two main disadvantages. With such models, every word is encoded as a vector that is orthogonal to all others. Firstly, the model becomes very large, as the dimension of the vectors grows to the number of different input words. Secondly, we fail to capture the semantics of each word, resulting in longer training time and lower performance.

---

2. Considering the following snippet, explain:
  1. What does `Y` contain? why this output shape?
  2. **Bonus**: How you would implement `nn.Embedding` yourself using only torch tensors. 

In [5]:
import torch
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])


**Answer:**<br><br>
- `Y` contains the embedding of each token in `X`, i.e. its mapping to a vector of dimension 42000. `X` contains $5\cdot6\cdot7\cdot8$ tokens with integer values from 0 to 41, and each of them is represented in `Y` by a vector of dimension 42000. This is why `Y` has the same shape as `X` plus an extra trailing dimension that holds each token's 42k-dimensional encoding.
- To solve this we need to represent the tokens and then create the mapping into the desired space. For the representation, we can use a one-hot encoding of the tokens in `X`, which adds a trailing dimension of size 42. Then, we can map it to the higher dimension (42k) by multiplying it with a learnable weight tensor of shape (42, 42000), i.e. a linear layer without bias. This is equivalent to selecting, for each token, the corresponding row of the weight tensor.
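
A possible sketch of the bonus part is shown below. It is not the actual `nn.Embedding` implementation: the class name `MyEmbedding` and its method `one_hot_forward` are hypothetical, and a small embedding dimension is used to keep the example light. It shows that indexing the weight tensor and multiplying a one-hot encoding by it give the same result.

```python
import torch
import torch.nn.functional as F


class MyEmbedding:
    """Hypothetical minimal embedding built from a plain weight tensor."""

    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn(num_embeddings, embedding_dim, requires_grad=True)

    def __call__(self, idx):
        # Row lookup: the output shape is idx.shape + (embedding_dim,)
        return self.weight[idx]

    def one_hot_forward(self, idx):
        # Same result via a one-hot encoding followed by a matrix product
        one_hot = F.one_hot(idx, num_classes=self.weight.shape[0]).to(self.weight.dtype)
        return one_hot @ self.weight


torch.manual_seed(0)
emb = MyEmbedding(num_embeddings=42, embedding_dim=16)  # small dim to keep it light
X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
Y = emb(X)
print(Y.shape)
```

The lookup version avoids materializing the one-hot tensor, which matters when the vocabulary is large.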

---

3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
    1. TBPTT uses a modified version of the backpropagation algorithm.
    2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
    3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

**Answer:**<br><br>
3.1. **False**. The backpropagation algorithm remains the same. TBPTT only limits the calculation so that gradients are propagated through at most S timesteps. This technique is useful for dealing with vanishing gradients.

3.2. **False**. As stated, TBPTT limits the number of timesteps through which we backpropagate, so that the number of derivatives required for a weight update is controlled. Limiting the length of the sequence alone doesn't guarantee that the backward pass is truncated. The TBPTT implementation is based on limiting the number of timesteps that each run covers.

3.3. **True**. As the algorithm truncates to S timesteps, it means that during the forward pass and backpropagation the dedicated memory only stores those steps. For any new input, relations can be found within the previous S timesteps that are considered in the current run.

---

### Attention

1. In tutorial 5 we learned how to use attention to perform alignment between a source and target sequence in machine translation.
  1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections, like in the tutorial..). What influence do you expect this will have on the learned hidden states?


**Answer:**<br><br>
- The attention mechanism gives the decoder the ability to search over all the hidden states saved during the encoding process, in order to locate the most suitable word to output. This requires keeping the hidden states of the entire input sequence (to produce a context vector), so that they act as feedback and as a tool for the decoder to "pay attention" to the "most correct" word at each timestep. In models without attention, the context vector is based only on the last hidden state of the encoder, which must therefore summarize the entire sentence. The result is reduced accuracy and a reduced ability to learn from long sequences.
- We expect this change to restrict each of the hidden states to producing outputs that are based on the other words of the source sentence. With self-attention the queries no longer come from the decoder, so instead of the decoder choosing among all the encoder hidden states according to its own state, the context can only reflect relations inside the source sequence.

---


### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

  1. Images reconstructed by the model during training ($x\to z \to x'$)?
  1. Images generated by the model ($z \to x'$)?

**Answer:**<br><br>
The KL-divergence term acts as a regularizer: it pulls the latent distribution produced by the encoder towards the prior $\mathcal{N}(\vec{0},\vec{I})$ and helps avoid overfitting to the training set. Not including it will have the following effects:
- Reconstruction during training: the input training images will simply be compressed and decompressed, as in a regular autoencoder, recreating very similar images. There is no negative qualitative impact on the reconstructions.
- Generation: without the KL-divergence term, the latent codes of the training images are not constrained to follow $\mathcal{N}(\vec{0},\vec{I})$. New images are still generated by sampling from $\mathcal{N}(\vec{0},\vec{I})$, which now has no relation to the latent codes of the original dataset, so the model will not be able to generate good-quality images that resemble the training set.
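
As a rough illustration of what the missing term measures, the sketch below computes the closed-form KL divergence between a diagonal Gaussian and $\mathcal{N}(\vec{0},\vec{I})$. It assumes the encoder outputs a mean and a log-variance per latent dimension, which is the usual setup but is not stated above; the function name `kl_to_standard_normal` is hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


# A posterior that already matches the prior pays no penalty.
matched = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# A posterior whose mean is far from the origin is penalized heavily.
# Without the KL term nothing stops the encoder from placing codes here.
far_away = kl_to_standard_normal([3.0, -3.0], [0.0, 0.0])

print(matched, far_away)
```

When this penalty is dropped, codes such as the second one cost nothing, which is why samples from $\mathcal{N}(\vec{0},\vec{I})$ may land in regions the decoder never saw during training.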

---

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
    1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
    2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
    3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

**Answer:**<br><br>
- 2.3: **False**. For a specific input image, the encoder outputs the mean and variance of a posterior distribution, which in general is not $\mathcal{N}(\vec{0},\vec{I})$. The standard normal appears only as the prior and in the reparametrization trick: we sample from $\mathcal{N}(\vec{0},\vec{I})$ and then compute the latent vector using the mean and variance of the posterior.
- 2.4: **False**. Because of the random sampling described above, feeding the encoder the same image multiple times gives a different latent vector each time, so the decoded results will differ from one another.
- 2.5: **True**. The real loss is the negative log-likelihood of the images, which is intractable because it depends on the actual posterior distribution, which we cannot compute. Instead we derive the evidence lower bound (ELBO) on the log-likelihood. Its negative is an upper bound on the real loss, and that is what we minimize, in the hope that the bound is tight.
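
The sampling step behind the first two answers can be sketched as follows. This is only an illustration in plain Python: the values of `mu` and `log_var` stand in for the encoder output for one fixed image and are made up, and `reparameterize` is a hypothetical helper name.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Reparametrization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


rng = random.Random(0)

# Hypothetical encoder output for one fixed input image.
mu = [1.5, -0.5]
log_var = [-2.0, -1.0]

# Encoding the same image twice yields two different latent vectors.
z_first = reparameterize(mu, log_var, rng)
z_second = reparameterize(mu, log_var, rng)
print(z_first, z_second)

# Over many draws the samples centre on mu, not on the origin.
samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
sample_mean = [sum(s[d] for s in samples) / len(samples) for d in range(len(mu))]
print(sample_mean)
```

The noise is standard normal, but the resulting latent vectors follow the posterior for that image, which is why the reconstructions vary between runs.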

---

2. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
    1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
    2. It's crucial to backpropagate into the generator when training the discriminator.
    3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
    4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
    5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

**Answer:**<br><br>
- 2.3: **True**. Ideally the claim is true: the generator loss needs to be low so that it generates convincing images, and the discriminator loss needs to be high, meaning that it is fooled and cannot tell what is real from what is fake. In practice, the two models need to reach an equilibrium between the losses. In particular, the discriminator loss has to decrease at times (meaning it gets better at finding generated images) in order to push the generator to improve its performance.
- 2.4: **False**. There is no need to backpropagate into the generator while training the discriminator. The discriminator should treat the generated images as constant inputs while it learns to discriminate. Backpropagating would also update the generator in the same step and feed changing inputs to the discriminator. This applies to the discriminator update on its own, not to the training of the whole GAN.
- 2.5: **True**. The generator maps latent vectors sampled from a chosen distribution to images, and $\mathcal{N}(\vec{0},\vec{I})$ is a valid choice as long as it is the distribution used during training. This was done in the guided implementation in Part2_GAN.
- 2.6: **True**. It might be beneficial: the discriminator will first learn (and initially overfit to) the real images, and then the generator will need to work harder and improve in order to fool the discriminator.<br>
- 2.7: **False**. When the discriminator reaches a stable state of 50% accuracy, it cannot tell which images are real and which are fake. This gives the generator no information on how to improve the quality of its images, so training the generator more will not improve the results and may reduce their quality.
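
As a small worked example of the 50% case, the sketch below evaluates a discriminator loss for a discriminator that outputs 0.5 everywhere and for one that is confident and correct. It assumes the standard binary cross-entropy GAN loss, which may differ from the loss used in the course implementation, and the probabilities are made-up values.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy discriminator loss.

    d_real: discriminator outputs (probability of "real") on real images.
    d_fake: discriminator outputs (probability of "real") on generated images.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 for everything.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator has a much lower loss.
confident = discriminator_loss([0.9, 0.95], [0.1, 0.05])

print(confused, 2 * math.log(2))
print(confident)
```

Under this assumption the confused discriminator sits at a loss of $2\log 2$, and its output carries no signal about which generated images are better than others.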

---

### Detection and Segmentation 

1. What is the difference between IoU and Dice score? What is the difference between IoU and mAP?
    Briefly explain when you would use each evaluation metric.

**Answer:**<br><br>
The Intersection-over-Union (IoU) is the area of overlap between the predicted segmentation and the ground truth divided by the area of their union. It gives a numerical score for how close a predicted segment is to the ground truth, from 0 (no match) to 1 (perfect prediction).<br>
The Dice score, on the other hand, is twice the area of overlap between the predicted segmentation and the ground truth divided by the combined area of the prediction and the ground truth.<br>
Dice can be used in similar circumstances as IoU, and the two are often used together.<br>
There is a subtle difference between them, though: **the Dice score tends to reflect average performance, whereas IoU helps you understand worst-case performance**.<br>
In general, we can use IoU to determine whether each predicted segment is a TP, FP or FN.
Afterwards we can build the precision-recall curve and use mAP to summarize it into a single value that averages the precision over the curve and over all classes.<br>
Nowadays, using mAP makes more sense, as it is a better representation of the quality of the model than using F1 (Dice) to understand the imbalance between the precision and recall of the segments.
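
The two overlap scores defined above can be sketched on a toy example. Here masks are represented as sets of pixel coordinates, which is a simplification chosen for illustration; the masks `pred` and `target` are made up.

```python
def iou(pred, target):
    """Intersection over union of two masks given as sets of pixel coordinates."""
    union = len(pred | target)
    return len(pred & target) / union if union else 1.0


def dice(pred, target):
    """Dice score: twice the overlap divided by the sum of both areas."""
    total = len(pred) + len(target)
    return 2 * len(pred & target) / total if total else 1.0


# Two 2x2 squares that overlap in one column of two pixels.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
target = {(0, 1), (1, 1), (0, 2), (1, 2)}

iou_score = iou(pred, target)    # 2 / 6
dice_score = dice(pred, target)  # 4 / 8

print(iou_score, dice_score)
# The two scores are tied by Dice = 2 * IoU / (1 + IoU).
print(2 * iou_score / (1 + iou_score))
```

Because Dice is a monotonic function of IoU for a single prediction, the two rank individual predictions identically. They differ in how scores average across many predictions, since IoU penalizes poor overlaps more strongly.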

---

2. Regarding YOLO and Mask R-CNN, which one is a one-stage detector? Describe the RPN outputs and the YOLO output; address how each network produces its output and the shape of each output.

**Answer:**<br><br>
YOLO: one-stage detector. It requires a single pass through the network to predict all bounding boxes.<br>
Mask R-CNN: two-stage detector. It first uses an RPN to generate regions of interest, which a second stage then classifies and refines.<br>
<u>RPN outputs</u>: 
A Region Proposal Network (RPN) is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals: it takes an image as input and outputs a set of bounding-box proposals, each with a respective objectness score.<br>
<u>YOLO outputs</u>: 
YOLO is formed of 27 layers: 24 convolutional layers, two fully connected layers, and a final detection layer.
YOLO divides the input image into an N by N grid of cells and, in a single pass, predicts several bounding boxes for each cell in order to locate the objects to be detected.
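
The output shapes can be illustrated with a short sketch. It follows the formulations in the original YOLO (v1) and Faster R-CNN papers, which is an assumption here: later YOLO versions and some Mask R-CNN implementations lay out their outputs differently. The function names and the default values (7x7 grid, 2 boxes per cell, 20 classes, 9 anchors) come from those papers, not from the answer above.

```python
def yolo_v1_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Each grid cell predicts boxes_per_cell boxes (x, y, w, h, confidence)
    plus one set of class probabilities shared by the cell."""
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)


def rpn_output_shapes(feature_h, feature_w, num_anchors=9):
    """At every feature-map position the RPN scores num_anchors anchors:
    2 objectness scores (object / not object) and 4 box offsets per anchor."""
    cls_shape = (feature_h, feature_w, 2 * num_anchors)
    reg_shape = (feature_h, feature_w, 4 * num_anchors)
    return cls_shape, reg_shape


print(yolo_v1_output_shape())     # (7, 7, 30)
print(rpn_output_shapes(38, 50))  # ((38, 50, 18), (38, 50, 36))
```

The feature-map size 38x50 is only an example. The RPN is fully convolutional, so its output grid scales with the input image, while the YOLO v1 output has a fixed size because of its fully connected layers.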

---