$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 3: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

3. Imagine a CNN with three convolutional layers, defined as follows:

In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

### Dropout

1. Consider the following neural network:

In [2]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

 **ANSWER**
 
1. q = 1 - (1 - p1)*(1 - p2)
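
The expression for `q` can be sanity-checked with a small simulation. The sketch below is a plain-Python stand-in for inverted dropout, not `nn.Dropout` itself; the helper `dropout`, the sample count and the seed are assumptions made for illustration.

```python
import random

def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]

p1, p2 = 0.1, 0.2
q = 1 - (1 - p1) * (1 - p2)  # an element survives only if both layers keep it

rng = random.Random(0)
x = [1.0] * 100_000

two_stage = dropout(dropout(x, p1, rng), p2, rng)  # Dropout(p1) followed by Dropout(p2)
one_stage = dropout(x, q, rng)                     # a single Dropout(q)

drop_rate_two = sum(v == 0.0 for v in two_stage) / len(x)
drop_rate_one = sum(v == 0.0 for v in one_stage) / len(x)
mean_two = sum(two_stage) / len(x)
mean_one = sum(one_stage) / len(x)

print(f"q = {q:.2f}")
print(f"drop rate: two-stage {drop_rate_two:.3f}, single-stage {drop_rate_one:.3f}")
print(f"mean output: two-stage {mean_two:.3f}, single-stage {mean_one:.3f}")
```

Both versions should drop roughly the same fraction of elements, and both scale the survivors by the same factor, since $\frac{1}{(1-p_1)(1-p_2)} = \frac{1}{1-q}$.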

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. **True or false**: dropout must be placed only after the activation function.

**ANSWER** 

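2. False - dropout is usually placed after the activation function, but this is not required. With ReLU, for example, the order does not matter: ReLU maps zero to zero and commutes with the positive scaling factor $1/(1-p)$, so applying dropout before or after the activation gives the same result.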

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

**ANSWER**

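3. Let $x$ be an activation and let $m\sim\mathrm{Bernoulli}(1-p)$ be its dropout mask, so the activation is kept ($m=1$) with probability $1-p$ and dropped ($m=0$) with probability $p$.

Without scaling, the expected output after dropout with a drop-probability of $p$ is

$\mathbb{E}\left[m x\right] = (1-p)\cdot x + p\cdot 0 = (1-p)\,x$

After scaling by $1/(1-p)$ we get:

$\mathbb{E}\left[\frac{m x}{1-p}\right] = \frac{(1-p)\,x}{1-p} = x$

so the scaling is exactly what keeps each activation unchanged in expectation.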


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**ANSWER:** 

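1. I would not use L2, because it is meant for regression and not classification; instead I would use (binary) cross-entropy.
Here's a numeric example of why L2 would not be used in this case:

We have 2 images that are dogs, and our model predicts that image 1 is a dog and image 2 is a hotdog, assigning the following probabilities to the correct class (dog):

p(image 1) = .6

p(image 2) = .4

The L2 losses are:

(1-.6)^2 = .16

(1-.4)^2 = .36

while the cross-entropy losses are -log(.6) ≈ .51 and -log(.4) ≈ .92. The L2 loss can never exceed 1, so even a confident mistake is penalized weakly: for p = .01 the L2 loss is only (1-.01)^2 ≈ .98, while the cross-entropy loss is -log(.01) ≈ 4.61 and keeps growing as p goes to 0.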
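
To make the comparison concrete, the following sketch evaluates both losses as a function of the probability the model assigns to the correct class. The helper names and the extra value `0.01` (a confidently wrong prediction) are assumptions added for illustration.

```python
import math

def l2_loss(p_correct):
    """Squared error between the target 1 and the probability of the correct class."""
    return (1.0 - p_correct) ** 2

def cross_entropy_loss(p_correct):
    """Negative log-likelihood of the correct class."""
    return -math.log(p_correct)

# 0.6 and 0.4 are the probabilities from the example; 0.01 is a confident mistake.
for p in (0.6, 0.4, 0.01):
    print(f"p(correct)={p:<5} L2={l2_loss(p):.4f}  CE={cross_entropy_loss(p):.4f}")
```

The L2 loss saturates below 1 while the cross-entropy loss grows without bound as the probability of the correct class approaches 0, so confident mistakes are penalized much more strongly.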

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [4]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**ANSWER** 

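2. The most likely cause is the vanishing gradient problem. The network stacks 25 sigmoid layers, and the derivative of the sigmoid is at most 0.25, so by the chain rule the gradient is multiplied by such a small factor at every layer on the way back. The gradients that reach the early layers become too small to update their weights, and the model stops training.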

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

**ANSWER** 

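3. This would not solve the problem. The derivative of tanh lies between 0 and 1 and reaches 1 only at zero; for inputs of large magnitude tanh saturates just like the sigmoid and its derivative is close to 0. The product of many such factors still shrinks toward zero in a deep network, so tanh can reduce the vanishing gradient problem but does not remove it.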
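
As a rough illustration for questions 2 and 3, the sketch below multiplies the activation derivative over 25 layers (the number of activations in `mlpirate`). It ignores the weight matrices entirely, which is a simplifying assumption, so the numbers only indicate the trend.

```python
import math

def d_sigmoid(z):
    """Derivative of the sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh; its maximum is 1 at z = 0."""
    return 1.0 - math.tanh(z) ** 2

depth = 25  # 1 + 24 activation layers in mlpirate

# Product of the activation derivatives over all layers, for a few pre-activation values.
for z in (0.0, 1.0, 2.0):
    print(f"z={z}: sigmoid {d_sigmoid(z) ** depth:.2e}, tanh {d_tanh(z) ** depth:.2e}")
```

Even in the best case the sigmoid factor is $0.25^{25} = 2^{-50}$, while tanh only avoids shrinking the gradient when every pre-activation is exactly zero.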

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
  1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
  1. The gradient of ReLU is linear with its input when the input is positive.
  1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

**ANSWER**

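1. False - the gradient of ReLU is 0 for inputs smaller than 0, so gradients can still vanish through inactive units (and through repeated multiplication by small weights).
2. False - if the input is > 0 then the gradient of ReLU is constant and equal to 1; it is the output, not the gradient, that is linear in the input.
3. True - if a neuron's pre-activation is negative for all training inputs, it always outputs 0 and its gradient is 0, so its weights stop updating and it stays "dead".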

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**ANSWER**

Stochastic Gradient Descent - Uses a single randomly chosen sample from the training set to update the parameters in each iteration.

Mini-Batch SGD - Uses a small random subset (mini-batch) of the samples from the training set to update the parameters in each iteration.

Gradient Descent - Uses all of the samples in the training set to compute a single update of the parameters in each iteration.
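
The three variants differ only in how many samples contribute to each update. A minimal sketch, assuming a toy noise-free 1D regression problem ($y = 3x$) and a hypothetical `train` helper; the learning rate and epoch count were chosen just for this example.

```python
import random

def train(xs, ys, batch_size, lr=0.05, epochs=100, seed=0):
    """Fit y ~ w * x by gradient descent on the mean squared error.

    batch_size=1 gives SGD, 1 < batch_size < N gives mini-batch SGD,
    and batch_size=N gives regular (full-batch) GD.
    Returns the fitted weight and the number of parameter updates performed.
    """
    rng = random.Random(seed)
    w, updates = 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only.
            grad = sum(2 * xs[i] * (w * xs[i] - ys[i]) for i in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates

xs = [i / 10 for i in range(1, 21)]
ys = [3.0 * x for x in xs]

results = {}
for name, batch_size in [("SGD", 1), ("mini-batch SGD", 5), ("GD", len(xs))]:
    results[name] = train(xs, ys, batch_size)
    w, updates = results[name]
    print(f"{name:>15}: w={w:.4f} after {updates} updates")
```

All three recover the same weight here, but SGD performs one update per sample, mini-batch SGD one per batch, and GD a single update per epoch.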

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. Regarding SGD and GD:
  1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
  2. In what cases can GD not be used at all?

**ANSWER**

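1. If the number of samples is very large then each GD update takes too long, since it requires a pass over the whole dataset, while an SGD update is much cheaper because it uses only one sample. Another reason to use SGD (when the dataset is large) is that it typically makes faster progress than GD for the same amount of computation, and the approximation that SGD returns is generally good enough.

2. GD cannot be used when the full dataset cannot be processed in a single update, e.g. when it does not fit in memory or when the samples arrive as a stream.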

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

**ANSWER**

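3. We would expect mini-batch SGD to converge in fewer iterations: averaging over twice as many samples gives a gradient estimate with lower variance (closer to the full-batch gradient), so each step is more accurate. Therefore we expect the number of iterations to decrease, although each iteration now processes twice as many samples.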
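
The effect of the batch size on the gradient noise can be illustrated with a small simulation. In this sketch the per-sample "gradients" are simply drawn from a Gaussian, which is an assumption made for illustration, as are the batch sizes 16 and 32 standing in for $B$ and $2B$.

```python
import random
import statistics

def batch_gradient_variance(batch_size, trials=5000, seed=0):
    """Variance of the mini-batch mean of noisy per-sample gradients."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Hypothetical per-sample gradients: true gradient 1.0 plus Gaussian noise.
        batch = [rng.gauss(1.0, 2.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pvariance(means)

var_b = batch_gradient_variance(16, seed=0)   # batch size B
var_2b = batch_gradient_variance(32, seed=1)  # batch size 2B
ratio = var_b / var_2b

print(f"variance with B: {var_b:.4f}, with 2B: {var_2b:.4f}, ratio: {ratio:.2f}")
```

Doubling the batch size roughly halves the variance of the gradient estimate, which is why fewer (but more expensive) iterations are expected.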

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

4. For each of the following statements, state whether they're **true or false** and explain why.
  1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
  1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
  1. SGD is less likely to get stuck in local minima, compared to GD.
  1. Training  with SGD requires more memory than with GD.
  1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
  1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

**Answer**

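1. True - with plain SGD (one sample per update) we perform an optimization step for each sample in our dataset during every epoch (as seen in answer 1). With mini-batch SGD there is one step per batch instead.
2. False - while SGD may make faster progress on large datasets, its gradients have more variance than those of GD, which causes the loss to fluctuate heavily.
3. True - SGD's fluctuation makes it less likely to get stuck in local minima, since the noise in its updates can push it out of them.
4. False - SGD uses less memory because it only processes one sample per update, while GD goes through the whole dataset.
5. False - neither guarantee holds in general. It depends on whether the loss is convex: for a convex loss both converge to the global minimum, while for a non-convex loss both can only be expected to converge to a local minimum (or another stationary point).
6. False - Newton's method uses second-order (curvature) information to rescale the step in each direction, so a narrow ravine does not slow it down; if Newton's method converges then it will be quicker than SGD with momentum.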

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

5. **Bonus** (we didn't discuss this at class):  We can use bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

**ANSWER**


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  1. Explain the concepts of "vanishing gradients", and "exploding gradients".
  2. How can each of these problems be caused by increased depth?
  3. Provide a numerical example demonstrating each.
  4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

**ANSWER**

1. Vanishing gradients - the gradients become so small as they are propagated back through the layers that the weights of the early layers are barely updated, which causes the network to stop training.

   Exploding gradients - the opposite: the gradients become so large that each update moves the weights far away from their previous values, so training becomes unstable and does not converge.

2. By the chain rule, the gradient of the loss w.r.t. an early layer is a product of the derivatives of all the layers after it. With more layers there are more factors: if they are typically smaller than 1 the product decreases exponentially with depth (vanishing), and if they are typically larger than 1 it increases exponentially (exploding).

3. The derivative of the sigmoid at $-10$ is approximately $4.5\cdot 10^{-5}$, already very close to 0, and even its maximal value is only 0.25, so in a deep network the product of such factors vanishes (e.g. $0.25^{15}\approx 9.3\cdot 10^{-10}$).

   If each layer multiplies the gradient by a factor larger than 1 we run into the exploding gradient problem. For example, with a weight of 1.6 in every layer the factor after 15 layers is $1.6^{15}\approx 1152.92$.

4. With exploding gradients the loss changes by large amounts at each update and may diverge (e.g. become `inf` or `NaN`). With vanishing gradients the loss stops improving early and plateaus, since the weight updates are (close to) 0.
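
The quantities in the numerical example can be evaluated directly. This is a minimal sketch that treats each layer as contributing a single scalar factor to the chain-rule product, which is a simplification of a real network.

```python
import math

def d_sigmoid(z):
    """Derivative of the sigmoid function."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

print(f"sigmoid'(-10) = {d_sigmoid(-10.0):.2e}")

# Chain-rule product of one scalar factor per layer, for growing depth.
for depth in (1, 5, 12, 15):
    vanishing = 0.25 ** depth  # best-case sigmoid derivative at every layer
    exploding = 1.6 ** depth   # a weight of 1.6 at every layer
    print(f"depth {depth:>2}: vanishing factor {vanishing:.2e}, exploding factor {exploding:.2f}")
```

The shrinking factor falls below $10^{-9}$ and the growing factor exceeds $10^{3}$ after only 15 layers.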

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

**ANSWER**

For simplicity, denote for each sample (dropping the sample index $(i)$ where it is clear):
$$
\vec{z} = \mat{W}_1 \vec{x}+ \vec{b}_1, \qquad \vec{h} = \varphi(\vec{z}), \qquad \hat{y} = \mat{W}_2 \vec{h} + \vec{b}_2
$$

The derivative of the pointwise loss w.r.t. the network output is
$$
\delta = \pderiv{\ell}{\hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}
$$

The derivatives of the output w.r.t. each tensor are as follows, where $\odot$ is the elementwise product, $\varphi'$ is applied elementwise, and $\vec{g} = \pderiv{\hat{y}}{\vec{z}}$:
$$
\vec{g} = \mattr{W}_2 \odot \varphi'(\vec{z}), \qquad
\pderiv{\hat{y}}{\mat{W}_1} = \vec{g}\,\vectr{x}, \qquad
\pderiv{\hat{y}}{\vec{b}_1} = \vec{g}, \qquad
\pderiv{\hat{y}}{\mat{W}_2} = \vectr{h}, \qquad
\pderiv{\hat{y}}{\vec{b}_2} = 1, \qquad
\pderiv{\hat{y}}{\vec{x}} = \mattr{W}_1 \vec{g}
$$

Applying the chain rule, and using $\pderiv{}{\mat{W}}\frac{\lambda}{2}\norm{\mat{W}}_F^2 = \lambda \mat{W}$ for the regularization term:
$$
\pderiv{L_{\mathcal{S}}}{\mat{W}_1} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\, \vec{g}^{(i)} \left(\vec{x}^{(i)}\right)^\top + \lambda \mat{W}_1
$$

$$
\pderiv{L_{\mathcal{S}}}{\mat{W}_2} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)} \left(\vec{h}^{(i)}\right)^\top + \lambda \mat{W}_2
$$

$$
\pderiv{L_{\mathcal{S}}}{\vec{b}_1} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\, \vec{g}^{(i)}
$$

$$
\pderiv{L_{\mathcal{S}}}{\vec{b}_2} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}
$$

Each input sample $\vec{x}^{(i)}$ appears in only one term of the sum, so
$$
\pderiv{L_{\mathcal{S}}}{\vec{x}^{(i)}} = \frac{1}{N}\,\delta^{(i)}\, \mattr{W}_1 \vec{g}^{(i)}
$$

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

**ANSWER**

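1. We can use it to approximate the gradient: for each parameter, we evaluate the loss at the current value and again after adding a very small $\Delta$ to that parameter, and divide the difference between the two losses by $\Delta$.

2. The first drawback is that it's just an approximation, so it is not exact: a large $\Delta$ gives a poor approximation of the limit, while a very small one suffers from floating-point round-off. The second drawback is the computational time: it needs a separate forward pass for every single parameter, so it takes much longer to compute than AD, which obtains all the gradients with one backward pass.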
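
The accuracy drawback can be seen on a function whose derivative is known exactly. The sketch below uses $f(x)=e^x$ at $x=1$; the function and the step sizes are assumptions chosen for illustration.

```python
import math

def forward_difference(f, x, delta):
    """Approximate f'(x) from two function evaluations, as in the limit formula."""
    return (f(x + delta) - f(x)) / delta

exact = math.exp(1.0)  # the derivative of exp(x) is exp(x)

errors = {}
for delta in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15):
    approx = forward_difference(math.exp, 1.0, delta)
    errors[delta] = abs(approx - exact)
    print(f"delta={delta:.0e}  error={errors[delta]:.2e}")
```

The error first shrinks as `delta` decreases and then grows again once round-off dominates, so the step size has to be tuned. For a model with $P$ parameters this procedure also needs $P+1$ loss evaluations per gradient.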

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [15]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# TODO: Calculate gradients numerically for W and b

delta = 1e-7  # finite-difference step size
grad_W = torch.zeros_like(W)
grad_b = torch.zeros_like(b)

with torch.no_grad():  # finite differences do not need the autograd graph
    W_clone = W.clone()
    b_clone = b.clone()

    for i in range(d):
        for j in range(d):
            W_clone[i, j] += delta
            grad_W[i, j] = (foo(W_clone, b) - loss) / delta
            W_clone[i, j] = W[i, j]  # restore the perturbed entry

    for i in range(d):
        b_clone[i] += delta
        grad_b[i] = (foo(W, b_clone) - loss) / delta
        b_clone[i] = b[i]  # restore the perturbed entry

loss.backward()
# TODO: Compare with autograd using torch.allclose()
autograd_W = W.grad
autograd_b = b.grad
assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)

loss=tensor(2.0116, dtype=torch.float64, grad_fn=<MeanBackward0>)


### Sequence models

1. Regarding word embeddings:
  1. Explain this term and why it's used in the context of a language model.
  1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

**ANSWER**

1. "A word embedding is a learned representation for text where words that have the same meaning have a similar representation." It is used in the context of language models because it maps each token to a dense vector such that similar words are close to each other in the vector space, which gives the model a meaningful input representation.

2. Yes, but it won't be as good as with word embeddings: raw token indices (or one-hot vectors) carry no notion of similarity between words, so the model cannot share what it learns between related words, which will result in lower accuracy.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. Considering the following snippet, explain:
  1. What does `Y` contain? why this output shape?
  2. **Bonus**: How you would implement `nn.Embedding` yourself using only torch tensors. 

In [16]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])


**ANSWER**

1. `Y` contains the embedding of `X`: each entry of `X` is an integer token index in the range 0 to 41, and the embedding layer replaces it with its learned vector of size 42000. The output therefore keeps the 4 dimensions of `X` and adds a last dimension of size 42000.
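
For the bonus part, an embedding layer is essentially a lookup table indexed by the token ids. The sketch below shows the idea with plain Python lists rather than torch tensors; the helper names and the small sizes are assumptions made to keep it short. With a weight tensor of shape `(num_embeddings, embedding_dim)` the same lookup would amount to indexing `weight[X]`.

```python
import random

def make_embedding_table(num_embeddings, embedding_dim, seed=0):
    """One randomly initialized row (vector) per possible token index."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(num_embeddings)]

def embed(indices, table):
    """Recursively replace every integer index with its row of the table."""
    if isinstance(indices, int):
        return table[indices]
    return [embed(item, table) for item in indices]

def shape(nested):
    """Shape of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

table = make_embedding_table(num_embeddings=42, embedding_dim=4)
X = [[3, 41, 0], [7, 7, 12]]  # token indices with shape (2, 3)
Y = embed(X, table)

print(shape(Y))  # (2, 3, 4): the shape of X plus the embedding dimension
```

Equal indices map to the same vector, and the output shape is always the input shape with the embedding dimension appended.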

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
  1. TBPTT uses a modified version of the backpropagation algorithm.
  2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
  3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

**ANSWER**

1. True - TBPTT uses a modified version of backpropagation through time, where the sequence is processed in segments and the backward pass is performed for only a fixed number of timesteps.

2. False - You also need to limit the backward pass to S timesteps, i.e. stop the gradient from flowing through the hidden state that is carried over from the previous segment.

3. True - With S you specify the number of timesteps that the gradients are propagated through in each update.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Attention

1. In tutorial 5 we learned how to use attention to perform alignment between a source and target sequence in machine translation.
  1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections, like in the tutorial..). What influence do you expect this will have on the learned hidden states?


**ANSWER**

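1. Without attention the performance on long inputs and outputs is poor, because the whole source sequence must be compressed into a fixed-size representation: the decoder only sees the encoder's last hidden state. With attention the decoder can access all of the encoder's hidden states and pick the parts of the input sequence that are relevant at each decoding step, so each encoder hidden state can focus on representing its own position rather than summarizing the entire sequence.

2. I expect that the hidden states will capture relations within the source sequence more accurately, because self-attention lets each position focus on the most important parts of the sequence.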

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

  1. Images reconstructed by the model during training ($x\to z \to x'$)?
  1. Images generated by the model ($z \to x'$)?

**ANSWER**

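1. We expect the images reconstructed by the model during training to be very close to the inputs, because without the KL-divergence term the model is trained only to reconstruct, like a regular autoencoder, and it can overfit the training images.

2. Here we expect the model to produce bad results: because we left out the KL-divergence term, nothing pushes the latent distribution toward the prior, so a $z$ sampled from the prior may fall in a region the decoder was never trained on, and the generated images won't resemble the training images.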

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
  1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
  2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
  3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

**ANSWER**

1. False - The latent-space distribution generated for a specific input has its own mean and variance, produced by the encoder for that input. The KL-divergence term only pushes these distributions toward $\mathcal{N}(\vec{0},\vec{I})$.
2. False - We won't get the same results, because each latent vector is sampled randomly from the distribution produced by the encoder.
3. True - Since we cannot compute the true posterior, we maximize the ELBO, i.e. minimize an upper bound on the negative log-likelihood, and hope the bound is tight.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
  1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
  1. It's crucial to backpropagate into the generator when training the discriminator.
  1. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
  1. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
  1. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

**ANSWER** 

1. False - The two losses are adversarial: as the discriminator's loss gets smaller the generator's loss increases, and vice versa. We do not want a discriminator that has a high loss because it is weak; ideally the generator fools a well-trained discriminator.

2. False - The discriminator updates its weights by backpropagating the discriminator loss through the discriminator network only, therefore we do not have to backpropagate into the generator.

3. True - We can sample the latent vector from $\mathcal{N}(\vec{0},\vec{I})$, as long as this is the same distribution that was used throughout the training.

4. True - A discriminator that can already tell, to some extent, whether an image is real or fake gives the generator a more informative training signal.

5. False - At that point the discriminator no longer provides a useful signal, so training more won't necessarily lead to better images and it might just cause overfitting.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Detection and Segmentation 

1. What is the difference between IoU and Dice score? What's the difference between IoU and mAP?
    Briefly explain when you would use each evaluation metric.

**ANSWER**

1. 

Dice = 2 |A∩B| / (|A|+|B|) = 2 TP / (2 TP + FP + FN) -> it is the harmonic mean of the precision and recall (the F1 score)

Intersection over Union (IoU) = |A∩B| / |A∪B| = TP / (TP + FP + FN) -> it tells us the amount of overlap between the actual and predicted regions, relative to their union

Both scores are similar and highly correlated, but it is customary to use the Dice score for segmentation and the IoU score for detection.

Mean average precision (mAP) - The mAP scores the model's ability to predict the bounding boxes and their classes: a detection counts as correct if its IoU with a ground-truth box exceeds a threshold, the average precision is computed per class from the precision-recall curve, and the mean is taken over the classes. A high score tells us that the model is more accurate in its detections.

mAP is generally used for tasks that require both localization and classification of objects, i.e. object detection.
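
The two overlap scores can be computed directly from the set definitions above. A minimal sketch, assuming masks are represented as sets of pixel coordinates; the two example masks are made up for illustration.

```python
def iou(pred, target):
    """Intersection over Union of two sets of pixel coordinates."""
    union = len(pred | target)
    return len(pred & target) / union if union else 1.0

def dice(pred, target):
    """Dice score of two sets of pixel coordinates."""
    total = len(pred) + len(target)
    return 2 * len(pred & target) / total if total else 1.0

# Two 4x4 masks that overlap on 2 of their 4 rows.
target = {(r, c) for r in range(0, 4) for c in range(4)}
pred = {(r, c) for r in range(2, 6) for c in range(4)}

iou_score = iou(pred, target)
dice_score = dice(pred, target)
print(f"IoU = {iou_score:.3f}, Dice = {dice_score:.3f}")
print(f"2*IoU/(1+IoU) = {2 * iou_score / (1 + iou_score):.3f}")
```

The last line checks the identity Dice = 2·IoU / (1 + IoU), which is why the two scores always rank a pair of predictions for the same target in the same order, even though Dice is never smaller than IoU.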

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. Regarding YOLO and Mask R-CNN, which one is a one-stage detector? Describe the RPN outputs and the YOLO output; address how the network produces the output and the shapes of each output.

**ANSWER**

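2. You only look once (YOLO) is considered a one-stage detector, because it predicts the boxes and classes in a single pass through the neural network. Mask R-CNN is a two-stage detector: an RPN first proposes regions, which are then classified and refined.

The neural network produces a four-dimensional output (batch, channels, height, width); YOLO then flattens the last two dimensions to obtain a three-dimensional tensor, in which each grid cell holds the box coordinates, a confidence score and the class probabilities of its predicted boxes.

The RPN slides over the feature map and, for each anchor box at each location, outputs an objectness score and the offsets of the predicted (rectangular) bounding box relative to that anchor.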