$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 4: Summary Questions
<a id=part4></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

**Answer**:
The receptive field of a unit (feature) in a CNN is the region of the input that affects the value of that unit.
For example, if the input is an image, the receptive field of a feature is the set of input pixels that take part in computing it.


2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

**Answer**:

There are several ways to control the rate at which the receptive field grows from layer to layer. Here are three of them:

a. Dilation: this parameter sets the spacing between the elements of the kernel. Increasing it lets a filter combine input features that are further apart, which enlarges the receptive field without increasing the number of kernel weights.

b. Kernel size: increasing the kernel size enlarges the area that affects each feature, and therefore the receptive field. A larger (or smaller) kernel means that more (or fewer) adjacent pixels are combined in each filter computation.

c. Stride: the stride is the distance between two consecutive filter applications. Increasing the stride (in a convolution or a pooling layer) makes the receptive field grow faster in all subsequent layers, because neighboring features in the output then correspond to input regions that are further apart. As a result, relatively distant regions of the input are combined in later layers.


3. Imagine a CNN with three convolutional layers, defined as follows:

In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

**Answer**:

The size of the receptive field after $N$ layers, where $s_j$ is the stride and $k_i$ is the kernel size of each layer, is: $1 + \sum_{i=1}^{N} \left((k_i-1) \cdot \prod_{j=1}^{i-1} s_j\right)$

To use this formula we also need to account for the ReLU activations, the dilation parameter and the pooling layers.

- ReLU: these layers can be ignored, since they are element-wise and do not affect the size of the receptive field.
- Dilation: we replace the kernel size with the effective extent of the dilated kernel, $1+2\cdot(7-1) = 13$, since adjacent kernel elements are 2 pixels apart.
- Pooling: a pooling layer with a kernel of size 2 is, for this purpose, equivalent to a convolutional layer with kernel size 2 and stride 2.

Now we can apply the formula to all of the model's layers, in order (conv, pool, conv, pool, dilated conv):

$1 + (3-1)\cdot 1 + (2-1)\cdot 1 + (5-1)\cdot 2 + (2-1)\cdot 2\cdot 2 + (13-1)\cdot 2\cdot 2\cdot 2 = 1 + 2 + 1 + 8 + 4 + 96 = 112$

That means that every pixel in the output tensor has a receptive field of $112 \times 112$ input pixels.
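
The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula above: the helper name `receptive_field` and the `(kernel_size, stride, dilation)` tuple layout are assumptions made for this sketch, not part of the course code.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples in forward order."""
    rf = 1      # receptive field of a single input pixel
    jump = 1    # distance, in input pixels, between adjacent features of the current layer
    for kernel_size, stride, dilation in layers:
        effective_kernel = 1 + dilation * (kernel_size - 1)
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


# Spatial layers of the model in the question; ReLU layers are element-wise and omitted.
layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
print(receptive_field(layers))  # 112
```

Tracking the running product of strides in `jump` avoids recomputing the product for every layer.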


4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

**Answer**:
    
With residual connections the model is easier to optimize, because the skip connections pass gradients that would otherwise have vanished. Moreover, because of the different structure, each layer now learns a different function than before: if the original layer had to learn some mapping $h_l(\vec{x})$, the residual layer only needs to learn the residual $f_l(\vec{x};\vec{\theta}_l)=h_l(\vec{x})-\vec{x}$. Since the filters now represent a different function, the learned filters are different as well.


### Dropout

1. **True or false**: dropout must be placed only after the activation function.

**Answer**:

False.

With ReLU, for example, placing dropout before or after the activation gives identical output: ReLU satisfies $f(0) = 0$, so a zeroed unit stays zero, and scaling by the positive factor $1/(1-p)$ commutes with ReLU.


2. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

**Answer**:
    
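Let $x$ be an activation and let $m\sim\mathrm{Bernoulli}(1-p)$ be the dropout mask, so the unit is kept with probability $1-p$ and zeroed with probability $p$.
Without scaling, the expected output is $\E{m\cdot x} = (1-p)\cdot x + p\cdot 0 = (1-p)\cdot x$.
Scaling by $1/(1-p)$ therefore gives $\E{\frac{m\cdot x}{1-p}} = \frac{(1-p)\cdot x}{1-p} = x$, so the value of the activation is unchanged in expectation.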
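
As a sanity check, the expectation can also be verified numerically. The sketch below uses only the standard library; the helper `dropout`, the values of `p` and `x`, the seed and the sample count are all arbitrary choices made for this illustration.

```python
import random


def dropout(x, p, rng):
    """Inverted dropout on one activation: zero with probability p, else scale by 1/(1-p)."""
    keep = rng.random() >= p
    return x / (1.0 - p) if keep else 0.0


p, x = 0.3, 2.0

# Exact expectation over the two possible outcomes of the mask.
exact_expectation = (1.0 - p) * (x / (1.0 - p)) + p * 0.0

# Monte Carlo estimate with a fixed seed.
rng = random.Random(0)
num_samples = 200_000
empirical_mean = sum(dropout(x, p, rng) for _ in range(num_samples)) / num_samples

print(exact_expectation, empirical_mean)
```

Both numbers should be close to the original activation `x`, while without the `1/(1-p)` factor they would be close to `(1 - p) * x`.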

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**Answer**:

Using the L2 loss is a **bad** idea in this case.
A prediction with more correct classifications may receive a larger loss than one with fewer.
For example, consider two pairs of dog images (label 0), classified with a threshold of 0.5:

first pair: score1 = 0.4, score2 = 1 -----> L2_error = 0.5(0.4^2 + 1^2) = 0.58, one correct classification

second pair: score1 = 0.6, score2 = 0.6 ------> L2_error = 0.5(0.6^2 + 0.6^2) = 0.36, no correct classification

The first pair gets a larger loss than the second, even though the first pair contains a correct classification and the second contains none.

A better loss function for this problem is binary cross-entropy, which is designed for binary classification, as we saw in the course.
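
The arithmetic of the example above can be reproduced with a short script. The helper names `l2_loss` and `num_correct` and the 0.5 decision threshold are assumptions made for this sketch.

```python
def l2_loss(scores, labels):
    """Mean squared error between scores and binary labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def num_correct(scores, labels, threshold=0.5):
    """Number of scores that fall on the correct side of the threshold."""
    return sum(int((s > threshold) == bool(y)) for s, y in zip(scores, labels))


labels = [0, 0]              # both images are dogs
first_scores = [0.4, 1.0]    # one correct, one badly wrong
second_scores = [0.6, 0.6]   # both wrong, but only slightly

for scores in (first_scores, second_scores):
    print(scores, round(l2_loss(scores, labels), 2), num_correct(scores, labels))
```

The case with more correct classifications ends up with the larger L2 loss.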

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [2]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H),
        nn.Sigmoid(),
    ]*N,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**Answer**:

The most likely cause is **vanishing gradients**. The model has N = 42 consecutive hidden linear layers, each followed by a sigmoid activation. This is a deep network, and since the derivative of the sigmoid is at most 0.25, the gradient shrinks as it is backpropagated through each layer.
In addition, the number-of-pirates axis in the figure contains large values, so the inputs push the sigmoid into its flat (saturated) region, where the gradients are close to zero.
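
To get a feeling for the scale, the sketch below multiplies the sigmoid derivative along a chain of 43 activations (one after the first linear layer plus the 42 in the repeated block). It deliberately ignores the weight matrices, so it is only a rough illustration of the effect and not an exact gradient; the pre-activation value 5 is an arbitrary choice.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


num_sigmoids = 43

# Best case: every unit sits at x = 0, where the sigmoid slope is maximal (0.25).
best_case_factor = sigmoid_grad(0.0) ** num_sigmoids

# Saturated case: every unit has a pre-activation of 5.
saturated_factor = sigmoid_grad(5.0) ** num_sigmoids

print(best_case_factor, saturated_factor)
```

Even in the best case the factor contributed by the activations alone is many orders of magnitude below 1.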

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

**Answer**:

Our friend's advice is not very helpful.
The tanh activation is just a rescaled and shifted sigmoid, $\tanh(x)=2\sigma(2x)-1$, so it also saturates for large inputs and is prone to vanishing gradients, as explained in question 2.
Although tanh has a larger gradient around zero, which delays the effect of vanishing gradients, this improvement is likely to be negligible given the large scale of the number-of-pirates inputs and the depth of the network.

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
    1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
    1. The gradient of ReLU is linear with its input when the input is positive.
    1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

**Answer**:

A. **False**.
Although ReLU itself does not normally cause vanishing gradients the way sigmoid and tanh do (because its gradient is 1 for positive inputs and 0 for non-positive inputs),
other factors, such as the network depth or other layers in the network, may still cause vanishing gradients.

B. **False**. The gradient of ReLU for positive inputs is the constant 1, so it does not change with the input.
It is the output of ReLU, not its gradient, that is linear in the input when the input is positive.

C. **True**. "Dead" neurons can be created during the forward pass, since any negative pre-activation is zeroed by the activation (by definition of ReLU).
If this happens for all inputs, the neuron always outputs zero and is effectively useless.
The gradient through such a neuron is also 0, so its incoming weights stop being updated, which keeps it "dead" for the rest of the training process.

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**Answer**:

All three algorithms update the parameters in the direction of the steepest loss decrease. They differ in how the loss, and therefore the gradient, is calculated.

SGD: The loss is calculated using **one** randomly chosen data sample.

GD: The loss is calculated as the average loss over **all** the data samples.

Mini-Batch SGD: A combination of the two. The loss is calculated as the average over a randomly chosen **batch of data**.
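
The difference comes down to which samples are used in each step. A minimal sketch on a toy one-parameter regression problem follows; the data, the learning rate, the batch size of 4 and the helper names are all assumptions made for this illustration.

```python
import random

# Toy data: y = 3 * x, fitted with the model y_hat = w * x.
xs = [0.1 * i for i in range(1, 11)]
dataset = [(x, 3.0 * x) for x in xs]


def grad(w, samples):
    """Gradient of the mean squared error over `samples` with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in samples) / len(samples)


def train(pick_samples, steps=500, lr=0.1):
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w, pick_samples())
    return w


rng = random.Random(0)
w_gd = train(lambda: dataset)                        # GD: all samples in every step
w_sgd = train(lambda: [rng.choice(dataset)])         # SGD: one random sample per step
w_minibatch = train(lambda: rng.sample(dataset, 4))  # mini-batch SGD: 4 random samples per step

print(w_gd, w_sgd, w_minibatch)
```

The update rule is identical in all three cases; only the `pick_samples` function changes.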

2. Regarding SGD and GD:
    1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
    2. In what cases can GD not be used at all?

**Answer**: 

A. Two reasons why SGD is used more often:

1. As explained above, SGD uses one data sample per step while GD uses all of the data, so each SGD step is faster and requires less memory.

2. SGD has the ability to escape from local minima, because the loss function changes with each randomly chosen sample.
Also, SGD updates are cheaper since their calculations are simpler and shorter. These factors usually make SGD converge faster than GD.

B.
When the dataset is very large, it may be impossible to run backpropagation over all of it, because the losses and gradients of all the samples may not fit in memory,
making regular GD impossible.

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

**Answer**: 

We expect the number of iterations to decrease. Each update is computed from a larger batch, so the gradient estimate is less noisy and closer to the true gradient.
Therefore we expect each update to be more precise, which results in a decrease in the number of iterations needed to reach $l_0$.
Note that fewer iterations does not necessarily mean less time, because each batch is larger, so each iteration requires more computation and memory.

4. For each of the following statements, state whether they're **true or false** and explain why.
    1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
    1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
    1. SGD is less likely to get stuck in local minima, compared to GD.
    1. Training  with SGD requires more memory than with GD.
    1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
    1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

**Answer**:

A. False. In SGD each step uses a sample drawn at random from the data, so an epoch does not necessarily perform a step on every sample.

B. False. SGD gradients have a larger variance than GD gradients: GD always computes the gradient from all of the data, so it always uses the same loss function, while SGD estimates the gradient from a single random sample.

C. True. As explained above, regular GD always uses the same loss function, so it can get stuck in a local minimum. The randomness and larger variance of the SGD gradients may help it escape from local minima.

D. False. SGD uses only one sample per step, while GD requires more memory since it computes the loss gradients of all the samples and sums them up.

E. False. As explained above, GD may get stuck in a local minimum, since it has no random factor to help it escape.

F. False. With momentum we may take larger steps and overshoot the narrow ravine, unlike Newton's method, where we would take smaller steps in the high-curvature direction and probably not miss the lowest point of the ravine.

5. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
    1. **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

**Answer**:

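False.
For a polynomial objective we can find the minimum or maximum in closed form by setting the derivative to zero, so we don't need a gradient-based method. For example, for $f(z)=\frac{1}{2}az^2+bz$ with $a>0$, solving $f'(z)=az+b=0$ gives $z^*=-b/a$.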

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
    1. Explain the concepts of "vanishing gradients", and "exploding gradients".
    2. How can each of these problems be caused by increased depth?
    3. Provide a numerical example demonstrating each.
    4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

**Answer**:

A. Exploding gradients are gradients that become too large to convey meaningful information for training the network.
Vanishing gradients are gradients that shrink as they are backpropagated from the output towards the earlier layers, until they become negligible.

B. In deeper networks, backpropagation multiplies the gradients of all the layers along the way (chain rule). If many layers have large (or small) gradients, this chain effect makes the total value grow (or shrink) exponentially with depth.

C. Let's say we have a network with $K$ layers, where each layer is a $1\times 1$ matrix $w_i$. Using L2 as the loss function, for a sample $x$ with label $y = 0$ we get a loss value of:
$l_2=(w_1w_2\cdots w_{K-1}w_Kx)^2$
In backpropagation, we calculate the derivative with respect to the weight of the last layer:
$\frac{\partial l_2}{\partial w_K} = 2(w_1w_2\cdots w_{K-1}w_Kx)\cdot w_1w_2\cdots w_{K-1}x$
Assuming that all the weights are equal to some value $V$, this becomes $2V^{2K-1}x^2$. For a deep enough network (large $K$), choosing $V>1$ gives an exploding gradient and choosing $V<1$ gives a vanishing gradient. For example, with $x=1$ and $K=50$: $V=2$ gives $2\cdot 2^{99}\approx 1.3\cdot 10^{30}$, while $V=0.5$ gives $2\cdot 0.5^{99}\approx 3.2\cdot 10^{-30}$.

D. If the weights are relatively stable and the loss stops improving, we assume vanishing gradients, because the gradients are not big enough to change the current weights. But if the loss is unstable with no trend of improvement and the weights change dramatically, we can expect exploding gradients, which are too large to "carry" relevant information.
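
The scalar example from answer C is easy to evaluate numerically. The function name `last_layer_grad` and the chosen values of `V` and `K` are assumptions made for this sketch.

```python
def last_layer_grad(V, K, x=1.0):
    """d/dw_K of (w_1 * ... * w_K * x)^2 when every weight equals V."""
    output = (V ** K) * x                   # network output w_1 * ... * w_K * x
    d_output_d_wK = (V ** (K - 1)) * x      # derivative of the output w.r.t. w_K
    return 2.0 * output * d_output_d_wK     # chain rule through the square


for K in (5, 20, 50):
    print(K, last_layer_grad(2.0, K), last_layer_grad(0.5, K))
```

As `K` grows, the gradient for `V = 2` blows up and the gradient for `V = 0.5` shrinks towards zero, while `V = 1` would keep it constant.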



### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

**Answer**:
We will start by computing the derivatives of $\ell$ and $\hat{y}$ for a single sample. Denote $z = W_1 x + b_1$, and let $\odot$ be the element-wise product.

$\frac{\partial \hat{y}}{\partial x} = W_1^\top \left(W_2^\top \odot \varphi'(z)\right)$

$\frac{\partial \hat{y}}{\partial b_2} = 1$

$\frac{\partial \hat{y}}{\partial b_1} = W_2^\top \odot \varphi'(z)$

$\frac{\partial \hat{y}}{\partial W_2} = \varphi(z)^\top$

$\frac{\partial \hat{y}}{\partial W_1} = \left(W_2^\top \odot \varphi'(z)\right) x^\top$

$\frac{\partial \ell}{\partial \hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}$

Now using the chain rule, $\frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial k} = \frac{\partial \ell}{\partial k}$, and writing $\delta^{(i)} = \frac{\hat{y}^{(i)}-y^{(i)}}{\hat{y}^{(i)}(1-\hat{y}^{(i)})}$ and $z^{(i)} = W_1 x^{(i)} + b_1$, we get:

$\frac{\partial L_{\mathcal{S}}}{\partial W_1} = \frac{1}{N} \sum_{i=1}^{N} \delta^{(i)} \left(W_2^\top \odot \varphi'(z^{(i)})\right) {x^{(i)}}^\top + \lambda W_1$

$\frac{\partial L_{\mathcal{S}}}{\partial W_2} = \frac{1}{N} \sum_{i=1}^{N} \delta^{(i)} \varphi(z^{(i)})^\top + \lambda W_2$

$\frac{\partial L_{\mathcal{S}}}{\partial b_1} = \frac{1}{N} \sum_{i=1}^{N} \delta^{(i)} \left(W_2^\top \odot \varphi'(z^{(i)})\right)$

$\frac{\partial L_{\mathcal{S}}}{\partial b_2} = \frac{1}{N} \sum_{i=1}^{N} \delta^{(i)}$

$\frac{\partial L_{\mathcal{S}}}{\partial x^{(i)}} = \frac{1}{N} \delta^{(i)} W_1^\top \left(W_2^\top \odot \varphi'(z^{(i)})\right)$, since only the $i$-th term of the sum depends on $x^{(i)}$.
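
Analytic gradients like these can be checked against autograd. The sketch below is one way to do that under several assumptions that are not part of the question: $\varphi$ is taken to be a sigmoid, the sizes are arbitrary, and `W2` and `b2` are initialized to small positive values so that $\hat{y}$ stays strictly inside $(0, 1)$ and the logarithms are defined.

```python
import torch

torch.manual_seed(0)
N, d, H, lam = 8, 5, 4, 0.1
dtype = torch.float64

X = torch.rand(N, d, dtype=dtype)
y = torch.randint(0, 2, (N,)).to(dtype)
W1 = torch.rand(H, d, dtype=dtype, requires_grad=True)
b1 = torch.rand(H, dtype=dtype, requires_grad=True)
# Small positive values keep y_hat in (0.1, 0.7), so log(y_hat) and log(1 - y_hat) exist.
W2 = ((0.1 + 0.5 * torch.rand(1, H, dtype=dtype)) / H).requires_grad_()
b2 = torch.full((1,), 0.1, dtype=dtype, requires_grad=True)

# Forward pass and loss, with autograd tracking.
Z = X @ W1.T + b1                       # (N, H)
Phi = torch.sigmoid(Z)                  # phi(z) for every sample
y_hat = (Phi @ W2.T + b2).squeeze(1)    # (N,)
bce = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))
loss = bce.mean() + 0.5 * lam * (W1.pow(2).sum() + W2.pow(2).sum())
loss.backward()

# Analytic gradients, written per sample and averaged over the batch.
with torch.no_grad():
    delta = (y_hat - y) / (y_hat * (1 - y_hat))   # dl/dy_hat, shape (N,)
    dPhi = Phi * (1 - Phi)                        # sigmoid'(z), shape (N, H)
    G = delta[:, None] * (W2 * dPhi)              # delta * (W2 elementwise phi'(z)), shape (N, H)
    grad_W1 = G.T @ X / N + lam * W1
    grad_b1 = G.mean(dim=0)
    grad_W2 = (delta[:, None] * Phi).mean(dim=0, keepdim=True) + lam * W2
    grad_b2 = delta.mean().reshape(1)

print(torch.allclose(W1.grad, grad_W1), torch.allclose(W2.grad, grad_W2))
```

Using `float64` keeps the comparison with `torch.allclose` well within its default tolerances.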




2. Given the following code snippet, implement the custom backward function `part4_affine_backward` in `hw4/answers.py` so that it passes the `assert`s.

In [3]:
from torch.autograd import Function

from hw4.answers import part4_affine_backward

N, d_in, d_out = 100, 11, 7
dtype = torch.float64
X = torch.rand(N, d_in, dtype=dtype)
W = torch.rand(d_out, d_in, requires_grad=True, dtype=dtype)
b = torch.rand(d_out, requires_grad=True, dtype=dtype)

def affine(X, W, b):
    return 0.5 * X @ W.T + b

class AffineLayerFunction(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        result = affine(X, W, b)
        ctx.save_for_backward(X, W, b)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        return part4_affine_backward(ctx, grad_output)

l1 = torch.sum(AffineLayerFunction.apply(X, W, b))
print(l1.backward())
W_grad1 = W.grad
b_grad1 = b.grad

l2 = torch.sum(affine(X, W, b))
W.grad = b.grad = None
l2.backward()
W_grad2 = W.grad
b_grad2 = b.grad

assert torch.allclose(W_grad1, W_grad2)
assert torch.allclose(b_grad1, b_grad2)

None
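
The actual implementation lives in `hw4/answers.py` and is not shown here. One possible implementation, given as a hedged sketch, follows directly from the forward pass `0.5 * X @ W.T + b`. To keep the example self-contained, the function is defined locally instead of being imported, and a seed is set for reproducibility.

```python
import torch
from torch.autograd import Function


def part4_affine_backward(ctx, grad_output):
    """Backward pass for Y = 0.5 * X @ W.T + b; grad_output has shape (N, d_out)."""
    X, W, b = ctx.saved_tensors
    grad_X = 0.5 * grad_output @ W       # (N, d_in)
    grad_W = 0.5 * grad_output.T @ X     # (d_out, d_in)
    grad_b = grad_output.sum(dim=0)      # (d_out,): b is broadcast over the batch
    return grad_X, grad_W, grad_b


class AffineLayerFunction(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        ctx.save_for_backward(X, W, b)
        return 0.5 * X @ W.T + b

    @staticmethod
    def backward(ctx, grad_output):
        return part4_affine_backward(ctx, grad_output)


torch.manual_seed(0)
X = torch.rand(100, 11, dtype=torch.float64)
W = torch.rand(7, 11, dtype=torch.float64, requires_grad=True)
b = torch.rand(7, dtype=torch.float64, requires_grad=True)

# Gradients from the custom backward.
AffineLayerFunction.apply(X, W, b).sum().backward()
W_grad_custom, b_grad_custom = W.grad.clone(), b.grad.clone()

# Reference gradients from plain autograd.
W.grad = b.grad = None
(0.5 * X @ W.T + b).sum().backward()

grads_match = torch.allclose(W_grad_custom, W.grad) and torch.allclose(b_grad_custom, b.grad)
print(grads_match)
```

The backward function returns one gradient per input of `forward`, in the same order, which is what `torch.autograd.Function` expects.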


### Sequence models

1. Regarding word embeddings:
    1. Explain this term and why it's used in the context of a language model.
    1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

**Answer**:

A. A word embedding is a learned representation of text in which words with similar meaning have similar representations. This is achieved by encoding the words of a language as numerical vectors, in such a way that semantically closer words have closer encodings. This approach not only represents the words in a compact form but also preserves their semantic meaning. When the input of a model is natural language, preserving the semantic meaning is very significant, because very different words may have a similar meaning, and we want their embeddings to be close too.

B. Training a model without word embeddings would be very challenging, because sentiment can be conveyed by very specific words. Without an embedding we lose the representation that lets us place words on a spectrum of emotions, so it would be hard, if not impossible, to train a network that reacts to specific words without using each word's encoded value relative to other words.

2. Considering the following snippet, explain:
    1. What does `Y` contain? why this output shape?
    2. How you would implement `nn.Embedding` yourself using only torch tensors. 

In [4]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])


**Answer**:


A. `Y` contains the embedding vectors of the tokens in `X`. `X` has shape (5, 6, 7, 8) and contains integer values in the range [0, 41]. The embedding dimension is 42000, which is the size of each embedding vector, so every value in `X` is replaced by a vector of size 42000. This is why the output shape is the shape of `X` with an extra dimension of size 42000 appended.

B. We can implement `nn.Embedding` using a matrix of size vocabulary-size times embedding-dimension. Every word is represented by an index between zero and the number of words in the dictionary minus 1, and its embedding is the corresponding row of the matrix.
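
A minimal sketch of this idea with plain tensors is shown below. It uses a much smaller `embedding_dim` than the snippet in the question to keep the example light, and it compares the result with `nn.Embedding.from_pretrained` initialized from the same matrix; the variable names are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_embeddings, embedding_dim = 42, 16

# The embedding matrix: one row per token index. In a real model this would be a learned parameter.
weight = torch.randn(num_embeddings, embedding_dim)

X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7, 8))
Y_manual = weight[X]  # indexing with an integer tensor selects one row per entry of X

# Reference: nn.Embedding that uses the same matrix.
Y_reference = nn.Embedding.from_pretrained(weight)(X)

print(Y_manual.shape, torch.equal(Y_manual, Y_reference))
```

To make the matrix trainable, it could be wrapped in `nn.Parameter`; the lookup itself stays the same.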


3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of $S$: State whether the following sentences are **true or false**, and explain.
    1. TBPTT uses a modified version of the backpropagation algorithm.
    2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length $S$.
    3. TBPTT allows the model to learn relations between input that are at most $S$ timesteps apart.

**Answer**:


a. The answer is T. The gradient computation is the same in both algorithms, but TBPTT modifies backpropagation so that the gradients are propagated back only for a fixed number of timesteps.

b. The answer is F. Limiting the length of the sequence provided to the model is not enough. We also have to set the truncation length, i.e. how many timesteps to go back when backpropagating.

c. The answer is F. The hidden state still carries information across sequences, so each output is indirectly affected by all of the previous inputs that passed through the state.


### Attention

1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
    1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?


**Answer**:


A. Adding attention between the encoder and decoder makes their hidden states more likely to focus on the important parts of the sequence,
because the decoder receives the attention context as an additional input.
Also, unlike the model without attention, we no longer have the problem of hidden states losing context over long sequences, because the past hidden states of the encoder are kept.

B. The influence on the encoder is that all of its states are kept, so each hidden state can represent a part of the sequence.
The influence on the decoder is that nothing is kept for it to attend to with its own states, so each of its hidden states has to incorporate the meaning of the whole sequence.

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

    1. Images reconstructed by the model during training ($x\to z \to x'$)?
    1. Images generated by the model ($z \to x'$)?

**Answer**:

A. The KL-divergence term is used to make the posterior $q(z|x)$ as similar as possible to the prior $p(z)$. If we drop it and use only the reconstruction loss, the reconstructed images will be very similar to the input images, since the reconstruction loss is the only thing being minimized.

B. During generation, we sample from the latent distribution that was assumed during training. Because the KL-divergence term was not used to shape the learned latent distribution, the distribution we sample from will be very different from the one the decoder was trained on, so the generated images will most likely be of poor quality.


2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
    1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
    2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
    3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

**Answer**:

A. The answer is False. The latent-space distribution generated by the model for a specific input image is
$\mathcal{N}(\mat{\mu}_{\alpha}(\mat{x}), \mat{\Sigma}_{\alpha}(\mat{x}))$

B. The answer is False. We will not get the same result, since the latent code is sampled from a probability distribution, so the reconstructed images will most likely differ from one pass to the next.

C. The answer is True. Computing the exact VAE loss requires an intractable integral, because it requires knowing the evidence distribution. Instead, we minimize the negative evidence lower bound (ELBO), which is an upper bound on the exact loss.


3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
    1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
    2. It's crucial to backpropagate into the generator when training the discriminator.
    3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
    4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
     5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

**Answer**:

A. The answer is F. A high discriminator loss means that the discriminator is not good enough to train in parallel with the generator: if it cannot discriminate well relative to what the generator produces, its feedback does not push the generator to improve.

B. The answer is F. We do not update one model's parameters while training the other. More specifically, while training the discriminator the generator is kept fixed, so backpropagating into it is not necessary.

C. The answer is T. This is how we generate images. During training, the generator learns to map latent vectors sampled from a normal distribution to images, so to generate a new image we sample a latent vector from that distribution and map it to an image.

D. The answer is T. If the discriminator is trained on the dataset beforehand, it can already differentiate between relevant images and random ones. Instead of a coin-flip success rate in the first few epochs, we get a better assessment of our generator, which accelerates the training process.

E. The answer is F. When the discriminator reaches a stable state with an accuracy of fifty percent, it is merely guessing the authenticity of each image. Since the discriminator can no longer differentiate between the generated and the original images, the generator cannot improve any further using this discriminator, so further training is futile.


### Graph Neural Networks

1. You have implemented a graph convolutional layer based on the following formula, for a graph with $N$ nodes:
$$
\mat{Y}=\varphi\left( \sum_{k=1}^{q} \mat{\Delta}^k \mat{X} \mat{\alpha}_k + \vec{b} \right).
$$
    1. Assuming $\mat{X}$ is the input feature matrix of shape $(N, M)$: what does $\mat{Y}$ contain in its rows?
    1. Unfortunately, due to a bug in your calculation of the Laplacian matrix, you accidentally zeroed the row and column $i=j=5$ (assume more than 5 nodes in the graph).
What would be the effect of this bug on the output of your layer, $\mat{Y}$?

**Answer**:

A. Each row of $\mat{Y}$ contains the output feature vector of the node in the corresponding row of $\mat{X}$.
It is the activation $\varphi$ applied to a weighted sum of the terms $\mat{\Delta}^k \mat{X}$ for that node, plus the bias.

B. Zeroing the 5th row of the Laplacian matrix means that the 5th row of $\mat{\Delta}^k$ is zero as well for every $k \geq 1$, which makes the output features of the 5th node always equal to $\varphi\left( \vec{b} \right)$.
This makes the output feature map of node 5 meaningless, since it no longer depends on the input.
Also, zeroing the 5th column of the Laplacian matrix means that node 5 does not affect the outputs of the other nodes: when we aggregate with the powers of the Laplacian, the features of the 5th node are not taken into account.
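
The effect of the bug can be illustrated on a small random example. Everything in the sketch below is an assumption made for illustration: the sizes, the use of ReLU as $\varphi$, and a random symmetric matrix standing in for the real Laplacian.

```python
import torch

torch.manual_seed(0)
N, M, M_out, q = 6, 3, 2, 3
bug = 4  # zero-based index of the 5th node

A = torch.rand(N, N, dtype=torch.float64)
laplacian = A + A.T        # symmetric stand-in for the Laplacian
laplacian[bug, :] = 0.0    # the bug: zero row 5 ...
laplacian[:, bug] = 0.0    # ... and column 5

alphas = [torch.rand(M, M_out, dtype=torch.float64) for _ in range(q)]
b = torch.rand(M_out, dtype=torch.float64)


def gcn_layer(X):
    """Y = relu(sum_{k=1..q} Laplacian^k @ X @ alpha_k + b)."""
    total = sum(torch.linalg.matrix_power(laplacian, k + 1) @ X @ alphas[k] for k in range(q))
    return torch.relu(total + b)


X = torch.rand(N, M, dtype=torch.float64)
Y = gcn_layer(X)

# Change only the input features of node 5 and recompute.
X_changed = X.clone()
X_changed[bug] += 10.0
Y_changed = gcn_layer(X_changed)

print(torch.allclose(Y[bug], torch.relu(b)), torch.allclose(Y, Y_changed))
```

The first check shows that the output of node 5 is just the activation of the bias, and the second shows that the features of node 5 no longer influence any output row.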

2. We have discussed the notion of a Receptive Field in the context of a CNN. How would you define a similar concept in the context of a GCN (i.e. a model comprised of multiple graph convolutional layers)?

**Answer**:

While receptive fields in a CNN are based on the closeness of pixels in the image, receptive fields in a GCN are based on the structure of the graph, i.e. we take into account the connections between nodes.
The receptive field of an output feature of a node is the set of input nodes whose features affect it, i.e. the neighborhood of the node within k hops (the k-ring of the node), which grows with the number of layers.
