$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 4: Summary Questions
<a id=part4></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

**Answer:**

Generally speaking, a receptive field is the region of the input space that a neuron responds to.

In artificial neural networks, and specifically in CNNs, the receptive field of a feature is the region of the input that can influence its value.
In other words, it maps an output feature of a neuron in some layer back to the input region that produced it.

In a fully connected layer, each neuron is connected to every input, so the receptive field is the whole input.
In a convolutional layer, each neuron is connected only to a local window of neurons in the previous layer, so its receptive field is usually only part of the input.

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

**Answer:**

One way to control the receptive field is the kernel size of the convolution. The kernel is the block of adjacent input neurons that is combined into a single output value; it is a hypercube whose number of dimensions matches that of the convolution (e.g. $k \times k$ for a 2D convolution). A larger kernel gives a larger receptive field because more input neurons enter each computation. It combines the input densely: every input inside the window contributes, at the cost of more parameters.

A second way is the dilation, which is the spacing between the values in a kernel. A dilation of 1 means no spacing, i.e. the kernel taps are adjacent; a dilation greater than 1 leaves gaps between them. A larger dilation therefore gives a larger receptive field with the same number of parameters, because the effective kernel extent becomes $d(k-1)+1$. The inputs that fall in the holes are skipped, so the features are combined sparsely.

A third way is the stride, which is the step by which the kernel moves across the input. A larger stride makes the output smaller and reduces the overlap between neighbouring windows. The receptive field of the strided layer itself does not change, but every later layer sees inputs that are spaced further apart in the original image, so its receptive field grows faster. Unlike a larger kernel, a larger stride adds no parameters, but it reduces the spatial resolution.

3. Imagine a CNN with three convolutional layers, defined as follows:

In [2]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

**Answer:**

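For every layer we track the receptive field $r$ (in input pixels, per spatial dimension) and the jump $j$ (the distance, in input pixels, between neighbouring neurons of that layer). A layer with effective kernel size $k$ and stride $s$ gives $r_{out} = r_{in} + (k-1) \cdot j_{in}$ and $j_{out} = s \cdot j_{in}$, starting from $r=1$, $j=1$ at the input. ReLU and padding do not change the receptive field.

Layer 1 Conv2d (kernel 3, stride 1): each neuron sees a block of 3x3 pixels in the input, so 9 pixels; $j=1$.

Layer 2 MaxPool2d (kernel 2, stride 2): each neuron sees a block of 2x2 from layer 1. Each such neuron sees 9 pixels, but most of these pixels are shared between neighbouring windows, so each neuron here sees a block of $3+(2-1) \cdot 1 = 4$, i.e. 4x4 = 16 pixels; $j=2$.

Layer 3 Conv2d (kernel 5, stride 2): each neuron sees a block of 5x5 neurons in layer 2, which are 2 pixels apart. So each neuron here sees $4+(5-1) \cdot 2 = 12$, i.e. 12x12 = 144 pixels; $j=4$.

Layer 4 MaxPool2d (kernel 2, stride 2): each neuron sees a block of 2x2 from layer 3, whose neurons are 4 pixels apart because of the stride. So each neuron here sees $12+(2-1) \cdot 4 = 16$, i.e. 16x16 = 256 pixels; $j=8$.

Layer 5 Conv2d (kernel 7, dilation 2): the effective kernel size is $2 \cdot (7-1)+1 = 13$, and the neurons of layer 4 are 8 pixels apart. So each neuron here sees $16+(13-1) \cdot 8 = 112$, i.e. 112x112 pixels.

So the receptive field of each pixel in the output tensor is 112x112 = 12,544 input pixels.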
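
The layer-by-layer bookkeeping can be sketched in a few lines of plain Python. This is a minimal sketch using the standard recurrence $r \leftarrow r + (k_{\text{eff}}-1) \cdot j$, $j \leftarrow j \cdot s$; the function name `receptive_field` and the tuple layout are illustrative choices, not part of the assignment code.

```python
def receptive_field(layers):
    """Receptive field (per spatial dimension) of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples, in order.
    """
    r, jump = 1, 1  # one input pixel sees itself; neighbours are 1 pixel apart
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        r += (effective_kernel - 1) * jump  # grow by the span of the window
        jump *= stride                      # later layers step further apart
    return r

layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
rf = receptive_field(layers)
print(rf)  # 112
```

Padding and ReLU are left out on purpose, since neither changes the size of the receptive field.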

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

**Answer:**

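The learned filters differ because each layer now has to learn a different function. If the mapping that a layer should ideally implement (as determined by the loss backpropagated to it) is $A(\vec{x})$, the original layer must learn $f_l(\vec{x};\vec{\theta}_l) = A(\vec{x})$, whereas the residual layer only has to learn the residual $f_l(\vec{x};\vec{\theta}_l) = A(\vec{x}) - \vec{x}$, since the skip connection already supplies $\vec{x}$.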

### Dropout

1. **True or false**: dropout must be placed only after the activation function.

**Answer:**

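False. Dropout sets a neuron's output to 0 with probability $p$. With ReLU the order does not matter, because ReLU(0)=0 and ReLU commutes with the positive scaling factor, so dropout before or after the activation gives the same result. For an activation with $f(0) \neq 0$, such as the sigmoid (which maps 0 to 0.5) or the exponential (which maps 0 to 1), dropout applied before the activation does not actually zero the neuron's output, so it is better placed after the activation. The best place therefore depends on the specific activation function.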

2. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

**Answer:**

Let $x$ be the value of an activation and let $m \sim \mathrm{Bernoulli}(1-p)$ be the dropout mask, sampled independently of $x$, so that the activation is kept ($m=1$) with probability $1-p$ and zeroed ($m=0$) with probability $p$. Without scaling, the expected output is $\E{m x} = \E{m}\E{x} = (1-p)\E{x}$, i.e. only a $(1-p)$ fraction of the original expectation. Scaling by $1/(1-p)$ gives $\E{\frac{m x}{1-p}} = \frac{(1-p)\E{x}}{1-p} = \E{x}$, which restores the original expectation.
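
As a sanity check, the expectation argument can be simulated. The sketch below is framework-free and only illustrative: the helper `dropout`, the fixed activation value and the sample count are assumptions made for the example.

```python
import random

def dropout(x, p, rng, scale=True):
    """Drop a single activation with probability p; optionally rescale survivors."""
    keep = rng.random() >= p  # kept with probability 1 - p
    if not keep:
        return 0.0
    return x / (1.0 - p) if scale else x

rng = random.Random(0)  # fixed seed so the run is reproducible
x, p, n = 2.0, 0.3, 200_000

mean_scaled = sum(dropout(x, p, rng, scale=True) for _ in range(n)) / n
mean_unscaled = sum(dropout(x, p, rng, scale=False) for _ in range(n)) / n

print(mean_scaled)    # close to x
print(mean_unscaled)  # close to (1 - p) * x
```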

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**Answer:**

An L2 loss corresponds to maximum likelihood when the targets are assumed to be Gaussian-distributed. In binary classification there are only two possible labels, so the distribution is not Gaussian but Bernoulli, and L2 is a poor fit.

The expected results are 0 or 1, and will be encoded as one-hot vectors: [1;0] for dog and [0;1] for hotdog.

The last layer will be a softmax, which normalizes the output to a vector whose entries sum to 1.

For example, for a target of [0;1] (using the sum of squared errors):

Output of [0.5;0.5] - L2=0.5.

Output of [0.8;0.2] - L2=1.28.

Output of [1;0] - L2=2.

We would like the loss of [1;0] to be much higher, because it is the exact opposite of the desired probability, yet with L2 it is only four times the loss of the uninformative output [0.5;0.5].

A better loss function for this case is the cross entropy, which punishes confident wrong results much more, but still gives small values for results close to the truth. For the same three outputs it gives $-\log 0.5 \approx 0.69$, $-\log 0.2 \approx 1.61$ and $-\log 0 = \infty$.
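
The comparison can be reproduced numerically. This is a small sketch in plain Python; `l2_loss` (sum of squared errors) and `cross_entropy` are illustrative helpers rather than library functions, and the predictions are assumed to be softmax outputs.

```python
import math

def l2_loss(pred, target):
    """Sum of squared errors between two probability vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def cross_entropy(pred, target):
    """Cross entropy for a one-hot target; infinite if the true class gets probability 0."""
    total = 0.0
    for p, t in zip(pred, target):
        if t > 0:
            total += -t * math.log(p) if p > 0 else math.inf
    return total

target = [0.0, 1.0]  # hotdog
for pred in ([0.5, 0.5], [0.8, 0.2], [1.0, 0.0]):
    print(pred, round(l2_loss(pred, target), 2), round(cross_entropy(pred, target), 2))
```

The L2 loss saturates at 2, while the cross entropy grows without bound as the predicted probability of the true class approaches 0.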

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [3]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H),
        nn.Sigmoid(),
    ]*N,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**Answer:**

One possible reason is that the optimization is stuck in a local minimum (or on a flat region of the loss surface) and cannot get out of it.

The most likely cause, however, is the vanishing gradient problem: the network is very deep (43 sigmoid layers), and the derivative of the sigmoid is small (at most 0.25). Backpropagation multiplies these derivatives through the chain rule, so each layer makes the gradient smaller. The gradients therefore get smaller and smaller as they propagate to the earlier layers, until they barely change the weights.

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

**Answer:**

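He is incorrect. Tanh has the same problem as the sigmoid: it saturates, so its derivative $1-\tanh^2(x)$ is close to 0 for inputs of large magnitude. The derivative does reach 1 at $x=0$ (compared with at most 0.25 for the sigmoid), so the decay may be slower, but across this many layers the gradient can still vanish.

Other activation functions such as ReLU, whose derivative is exactly 1 for positive inputs, largely avoid this problem.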
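
To make the comparison concrete, the two derivatives can be evaluated at a few points. This is only an illustrative sketch; the sample inputs are arbitrary.

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 2.0, 5.0):
    print(f"x={x}: sigmoid'={sigmoid_grad(x):.5f}, tanh'={tanh_grad(x):.5f}")
```

At $x=0$ tanh passes the gradient through unchanged, but once the unit saturates its derivative is even smaller than the sigmoid's.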

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
    1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
    1. The gradient of ReLU is linear with its input when the input is positive.
    1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

**Answer:**

A. False - gradients can vanish even with ReLU. For negative inputs the derivative is 0, so no gradient flows back through that neuron and the weights feeding it do not change in the gradient descent step.

B. False - when the input is positive ReLU(x) = x, so the derivative is 1, which is constant rather than linear in the input.

C. True - as described in A, when the input is negative or zero, ReLU(x) = 0 and the derivative is 0. If the input of a neuron is negative for every example, its weights receive no gradient and never change, so it keeps outputting 0 after the activation and is "dead".

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**Answer:**

GD performs the forward pass over the entire training set, computes the loss over all of it, and only then backpropagates, so each optimization step uses the exact gradient of the training loss.

SGD performs the forward pass on a single example from the training set, and backpropagates the loss and updates the weights after each example.

Mini-batch SGD is the middle ground: it performs the forward pass on a subset of the examples (a batch), backpropagates their average loss, updates the weights, and then continues to the next batch.
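
The three variants differ only in how many examples feed each update, which a single training loop with a `batch_size` parameter can illustrate. The sketch below fits a one-parameter model $y = w x$ in plain Python; the data, learning rate and function name `train` are assumptions made for the example.

```python
import random

def train(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x with squared error; returns (w, number of weight updates)."""
    rng = random.Random(seed)
    w, updates = 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # gradient of the mean squared error over the current batch
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates

xs = [0.1 * i for i in range(1, 21)]
ys = [3.0 * x for x in xs]  # noiseless data with true w = 3

w_gd, steps_gd = train(xs, ys, batch_size=len(xs))  # GD: one update per epoch
w_mb, steps_mb = train(xs, ys, batch_size=5)        # mini-batch SGD
w_sgd, steps_sgd = train(xs, ys, batch_size=1)      # SGD: one update per sample
print(steps_gd, steps_mb, steps_sgd)
```

With the same number of epochs, GD performs 200 updates, mini-batch SGD 800 and SGD 4000, each computed from progressively fewer examples.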

2. Regarding SGD and GD:
    1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
    2. In what cases can GD not be used at all?

**Answer:**

A.
One reason is that GD is very slow: it has to go over the entire training set, which is often very large, for a single optimization step. SGD updates the weights after each example (or after each batch in the case of mini-batch SGD), so the network trains much faster with it.
A second reason is that with SGD only one example (or one batch, in the case of mini-batch SGD) has to fit in memory during the forward pass, in contrast to GD, where all the examples have to fit in memory together.

B.
Following the second reason in section A: when the training set is too big to fit in memory together with the network, GD cannot be used, and the data has to be split at least into mini-batches.

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

**Answer:**

We would expect the number of required iterations to decrease. Each iteration now averages the loss over twice as many examples, so the loss describes the state of the network better and the gradient estimate has lower variance. Each update step therefore points in a more accurate direction, and fewer steps are needed to reach $l_0$. The number of iterations does not necessarily halve, though, so the total number of examples processed may grow.
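
The variance claim can be checked with a toy simulation. In this sketch the per-sample gradients are modelled as independent Gaussian draws around a true gradient of 1.0, which is an assumption for illustration only; real per-sample gradients are neither scalar nor Gaussian.

```python
import random
import statistics

def minibatch_gradient(batch_size, rng):
    """Average of `batch_size` noisy per-sample gradients (toy model)."""
    return sum(rng.gauss(1.0, 1.0) for _ in range(batch_size)) / batch_size

rng = random.Random(0)
trials = 4000

var_b = statistics.pvariance([minibatch_gradient(16, rng) for _ in range(trials)])
var_2b = statistics.pvariance([minibatch_gradient(32, rng) for _ in range(trials)])
ratio = var_b / var_2b
print(var_b, var_2b, ratio)  # the ratio should be close to 2
```

Doubling the batch size roughly halves the variance of the gradient estimate, i.e. the noise in the step direction shrinks by a factor of about $\sqrt{2}$.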

4. For each of the following statements, state whether they're **true or false** and explain why.
    1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
    1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
    1. SGD is less likely to get stuck in local minima, compared to GD.
    1. Training  with SGD requires more memory than with GD.
    1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
    1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

**Answer:**

A. True - in (pure) SGD each example in the training set goes through a forward pass on its own, is then backpropagated, and an optimization step is performed, so each epoch has one step per sample.

B. False - gradients obtained with GD have less variance, because they average over all the examples rather than relying on a single example, and individual examples can differ greatly from one another.

C. True - getting stuck in a local minimum depends on the step size (learning rate) and the direction (gradient). In SGD the direction is noisier, so near a local minimum SGD does not always step towards it and has a better chance of escaping instead of getting stuck. It may even avoid some minima altogether, since shallow minima affect it less.

D. False - training with SGD requires less memory, because fewer examples have to fit in memory together with the network: only one, instead of the entire training set as in GD.

E. False - neither has such a guarantee on a non-convex loss. GD is not guaranteed to converge to the global minimum, because it can get stuck in a local minimum just like SGD, and is even more likely to. Both will most likely converge to a local minimum, which may or may not be the global one.

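F. False - momentum is an exponentially weighted average of the past gradients. It damps the oscillations across the ravine and reduces the noise, so it helps SGD converge more quickly than plain SGD. Newton's method, however, uses second-order (curvature) information: it rescales each direction by the inverse of the Hessian, which removes the ill-conditioning of the ravine. On such a surface it therefore typically needs far fewer iterations than SGD with momentum (a single step if the loss is exactly quadratic).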
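
Statement F can be probed on a toy ravine. The sketch below minimizes $f(x,y) = \frac{1}{2}(x^2 + 100 y^2)$ with heavy-ball momentum and with Newton's method; it uses exact gradients (no sampling noise) to stay deterministic, and the learning rate, momentum coefficient and tolerance are illustrative choices.

```python
import math

A, B = 1.0, 100.0  # curvatures of f(x, y) = 0.5 * (A*x^2 + B*y^2): a narrow ravine

def grad(p):
    return (A * p[0], B * p[1])

def steps_momentum(p, lr=0.01, beta=0.9, tol=1e-6, max_steps=10_000):
    """Gradient descent with heavy-ball momentum; returns steps until |p| < tol."""
    v = (0.0, 0.0)
    for step in range(1, max_steps + 1):
        g = grad(p)
        v = (beta * v[0] + g[0], beta * v[1] + g[1])
        p = (p[0] - lr * v[0], p[1] - lr * v[1])
        if math.hypot(*p) < tol:
            return step
    return max_steps

def steps_newton(p, tol=1e-6, max_steps=10_000):
    """Newton's method; the Hessian is diag(A, B), so its inverse is applied per axis."""
    for step in range(1, max_steps + 1):
        g = grad(p)
        p = (p[0] - g[0] / A, p[1] - g[1] / B)
        if math.hypot(*p) < tol:
            return step
    return max_steps

start = (1.0, 1.0)
momentum_steps = steps_momentum(start)
newton_steps = steps_newton(start)
print(momentum_steps, newton_steps)
```

On this quadratic, Newton's method lands on the minimum in one step, while the momentum run needs many more, even without gradient noise.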

5. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
    1. **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

**Answer:**

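False - as mentioned in the tutorial, the inner optimization problem is solved in the forward pass, and any method can be used for that: a descent-based method, some other solver, or a closed-form expression. The backward pass does not differentiate through the solver's iterations; it only relies on the optimality condition of the inner problem.

Let $y = \arg\min_{y'} f(y', z)$ be the solution of the inner problem, where $z$ is the input of the layer. At the optimum $\nabla_y f(y,z) = 0$. Differentiating this condition gives $K\,dy + R\,dz = 0$, so the gradient is backpropagated through the layer as

$\delta z = -R^T K^{-1} \delta y$,

where:

$K = \nabla ^2 _{yy} f(y,z)$

$R = \nabla ^2 _{yz} f(y,z)$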

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
    1. Explain the concepts of "vanishing gradients", and "exploding gradients".
    2. How can each of these problems be caused by increased depth?
    3. Provide a numerical example demonstrating each.
    4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

**Answer:**

A. The vanishing gradient problem occurs when the network is very deep and the derivative of the activation function has small values. Each layer makes the gradient smaller, because backpropagation multiplies the derivatives through the chain rule. The gradients therefore shrink exponentially as they propagate to the earlier layers, until they barely change the weights.

The exploding gradient problem is the opposite: the gradients grow as backpropagation proceeds. The chain rule multiplies the derivatives, so if the per-layer factors are large, the gradients get bigger and bigger until they "explode" and become too large to give stable updates.

B. In both cases, the deeper the network, the more steps there are in the backpropagation and the more factors are multiplied in the chain rule. A deep network therefore causes the problem in both cases: when the per-layer factors are smaller than 1, the gradients keep shrinking and vanish; when they are larger than 1, the gradients keep growing and explode.

C. Vanishing: a network whose activation functions are logistic sigmoids. The derivative of the sigmoid is $e^x/(1+e^x)^2$, which is at most $1/4$. Ignoring the weights, after $n$ layers the gradient is scaled by at most $(1/4)^n$, which goes to 0 as $n$ increases; for $n=10$ this is already about $10^{-6}$.

Exploding: a network whose activation functions are ReLU. The derivative is 1 for positive inputs, but the weights may be very large. If every weight is 2 (taking one unit per layer for simplicity), after $n$ layers the gradient is scaled by $2^n$, which goes to infinity as $n$ increases; for $n=10$ this is 1024.
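
Both numerical examples boil down to repeatedly multiplying by a per-layer factor, which is easy to tabulate. This is a minimal sketch; `backprop_scale` is an illustrative helper, and the single-unit-per-layer setting is a simplifying assumption.

```python
def backprop_scale(per_layer_factor, depth):
    """Factor by which a gradient is scaled after passing back through `depth` layers."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_factor
    return scale

for depth in (10, 50):
    vanishing = backprop_scale(0.25, depth)  # sigmoid: derivative is at most 0.25
    exploding = backprop_scale(2.0, depth)   # ReLU (derivative 1) times a weight of 2
    print(depth, vanishing, exploding)
```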

D. With vanishing gradients the loss barely changes, because the gradients are close to 0 and the weights stay almost constant. With exploding gradients the network is very unstable, so the loss often changes dramatically from step to step (and may diverge). So by examining the loss curve we can tell which of the two problems we have.

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, which is defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

**Answer:**

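Notation: for sample $i$ let $\vec{z}^{(i)} = \mat{W}_1 \vec{x}^{(i)} + \vec{b}_1$ and $\vec{h}^{(i)} = \varphi(\vec{z}^{(i)})$, so that $\hat{y}^{(i)} = \mat{W}_2 \vec{h}^{(i)} + \vec{b}_2$. The output is a scalar, so $\mat{W}_2$ is a row vector. $\odot$ denotes the elementwise product.

Derivative of the pointwise loss w.r.t. the prediction:

$\delta^{(i)} = \pderiv{\ell}{\hat{y}^{(i)}} = -\frac{y^{(i)}}{\hat{y}^{(i)}} + \frac{1-y^{(i)}}{1-\hat{y}^{(i)}}$

Gradient at the hidden pre-activation (chain rule through $\mat{W}_2$ and $\varphi$):

$\vec{g}^{(i)} = \delta^{(i)} \, \mat{W}_2^\top \odot \varphi'(\vec{z}^{(i)})$

Derivatives of the final loss:

$\pderiv{L_{\mathcal{S}}}{\mat{W}_2} = \frac{1}{N}\sum_{i=1}^{N} \delta^{(i)} \, {\vec{h}^{(i)}}^\top + \lambda \mat{W}_2$

$\pderiv{L_{\mathcal{S}}}{\vec{b}_2} = \frac{1}{N}\sum_{i=1}^{N} \delta^{(i)}$

$\pderiv{L_{\mathcal{S}}}{\mat{W}_1} = \frac{1}{N}\sum_{i=1}^{N} \vec{g}^{(i)} \, {\vec{x}^{(i)}}^\top + \lambda \mat{W}_1$

$\pderiv{L_{\mathcal{S}}}{\vec{b}_1} = \frac{1}{N}\sum_{i=1}^{N} \vec{g}^{(i)}$

$\pderiv{L_{\mathcal{S}}}{\vec{x}^{(i)}} = \frac{1}{N} \mat{W}_1^\top \vec{g}^{(i)}$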
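
Analytic gradients of this model can be checked against central finite differences. The sketch below does so for a single sample ($N=1$) in plain Python, assuming $\varphi = \tanh$, a 2-dimensional input and 2 hidden units; all numeric values are arbitrary choices that keep $\hat{y}$ inside $(0,1)$, and the gradient expressions are the standard chain-rule results for $\ell$, $\mat{W}_2$, $\vec{b}_2$, $\mat{W}_1$, $\vec{b}_1$ and $\vec{x}$.

```python
import math

D, H = 2, 2  # input size and hidden size (illustrative)

def unpack(theta):
    """Split the flat parameter list into W1 (row-major), b1, W2, b2 and x."""
    return [theta[0:2], theta[2:4]], theta[4:6], theta[6:8], theta[8], theta[9:11]

def loss_and_cache(theta, y, lam):
    W1, b1, W2, b2, x = unpack(theta)
    z = [W1[r][0] * x[0] + W1[r][1] * x[1] + b1[r] for r in range(H)]
    h = [math.tanh(v) for v in z]
    y_hat = W2[0] * h[0] + W2[1] * h[1] + b2
    bce = -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    reg = 0.5 * lam * (sum(w * w for row in W1 for w in row) + sum(w * w for w in W2))
    return bce + reg, (W1, W2, x, h, y_hat)

def analytic_grad(theta, y, lam):
    _, (W1, W2, x, h, y_hat) = loss_and_cache(theta, y, lam)
    delta = -y / y_hat + (1 - y) / (1 - y_hat)                # d loss / d y_hat
    g = [delta * W2[r] * (1 - h[r] ** 2) for r in range(H)]   # d loss / d z
    dW1 = [g[r] * x[c] + lam * W1[r][c] for r in range(H) for c in range(D)]
    dW2 = [delta * h[r] + lam * W2[r] for r in range(H)]
    dx = [sum(W1[r][c] * g[r] for r in range(H)) for c in range(D)]
    return dW1 + g + dW2 + [delta] + dx  # same order as theta

def numeric_grad(theta, y, lam, eps=1e-6):
    grads = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
        down = theta[:k] + [theta[k] - eps] + theta[k + 1:]
        grads.append((loss_and_cache(up, y, lam)[0] - loss_and_cache(down, y, lam)[0]) / (2 * eps))
    return grads

theta = [0.3, -0.2, 0.1, 0.4,  # W1
         0.05, -0.1,           # b1
         0.2, -0.3,            # W2
         0.5,                  # b2
         0.7, -0.4]            # x
y, lam = 1.0, 0.1
max_err = max(abs(a - n) for a, n in zip(analytic_grad(theta, y, lam), numeric_grad(theta, y, lam)))
print(max_err)
```

A tiny `max_err` indicates that the analytic expressions agree with the numerical derivative for every parameter and for the input.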


2. Given the following code snippet, implement the custom backward function `part4_affine_backward` in `hw4/answers.py` so that it passes the `assert`s.

In [4]:
from torch.autograd import Function

from hw4.answers import part4_affine_backward

N, d_in, d_out = 100, 11, 7
dtype = torch.float64
X = torch.rand(N, d_in, dtype=dtype)
W = torch.rand(d_out, d_in, requires_grad=True, dtype=dtype)
b = torch.rand(d_out, requires_grad=True, dtype=dtype)

def affine(X, W, b):
    return 0.5 * X @ W.T + b

class AffineLayerFunction(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        result = affine(X, W, b)
        ctx.save_for_backward(X, W, b)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        return part4_affine_backward(ctx, grad_output)

l1 = torch.sum(AffineLayerFunction.apply(X, W, b))
l1.backward()
W_grad1 = W.grad
b_grad1 = b.grad

l2 = torch.sum(affine(X, W, b))
W.grad = b.grad = None
l2.backward()
W_grad2 = W.grad
b_grad2 = b.grad

assert torch.allclose(W_grad1, W_grad2)
assert torch.allclose(b_grad1, b_grad2)

### Sequence models

1. Regarding word embeddings:
    1. Explain this term and why it's used in the context of a language model.
    1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

**Answer:**

A. A word embedding is a representation of words for text analysis, usually as vectors of real numbers. The embedding should capture the word's meaning, so that words with similar meanings have nearby embeddings. In a language model it turns discrete tokens into dense, learnable inputs, so that what the model learns about one word carries over to similar words.

B. It is possible, but the results will not be as good as with embeddings, because there is no notion of similarity in meaning between tokens: two words that are close in meaning but different in spelling are treated as unrelated. This is especially problematic for words that rarely appear in the training set, since the model has no way to relate them to more common words with a similar meaning.

2. Considering the following snippet, explain:
    1. What does `Y` contain? why this output shape?
    2. How you would implement `nn.Embedding` yourself using only torch tensors. 

In [5]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])


**Answer:**

A. Y contains the embedding vectors of the indices in X. The number of embeddings is 42, which means there are 42 "words" in the vocabulary, each with an embedding vector of size 42000. There are 5 * 6 * 7 * 8 = 1680 items in X, and each of them is replaced by its embedding vector of size 42000. The output shape is therefore the shape of X with the embedding dimension appended: (5, 6, 7, 8, 42000).

B. We would create a weight tensor of shape (42, 42000), initialized randomly and marked with `requires_grad=True` so that it is learned during training. Row $i$ of this tensor is the embedding vector of index $i$. The forward pass is then just indexing the weight tensor with the index tensor, `weight[X]`, which returns a tensor of shape `X.shape + (42000,)`. This is equivalent to multiplying the one-hot encoding of each index by the weight matrix, without materializing the one-hot vectors.
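
The lookup semantics of an embedding layer can be illustrated without any framework. The sketch below uses nested Python lists in place of tensors (an assumption made to keep it dependency-free); with torch tensors the same lookup is expected to be a plain indexing operation on a learnable weight matrix. The helper names and the tiny embedding dimension are illustrative.

```python
import random

def make_embedding(num_embeddings, embedding_dim, seed=0):
    """Random table with one row (embedding vector) per vocabulary index."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embedding_dim)] for _ in range(num_embeddings)]

def embed(weight, indices):
    """Replace every integer in a nested list of indices with its row of `weight`."""
    if isinstance(indices, int):
        return weight[indices]
    return [embed(weight, item) for item in indices]

def one_hot_lookup(weight, index):
    """Same lookup written as a one-hot vector times the weight matrix."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(weight))]
    dim = len(weight[0])
    return [sum(one_hot[i] * weight[i][d] for i in range(len(weight))) for d in range(dim)]

weight = make_embedding(num_embeddings=42, embedding_dim=4)
X = [[3, 41], [0, 7]]  # index "tensor" of shape (2, 2)
Y = embed(weight, X)   # shape (2, 2, 4): one vector per index
print(len(Y), len(Y[0]), len(Y[0][0]))
```

The output has the shape of the index structure with the embedding dimension appended, and repeated indices share the same row, which is what makes the table learnable per vocabulary entry.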

3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of $S$: State whether the following sentences are **true or false**, and explain.
    1. TBPTT uses a modified version of the backpropagation algorithm.
    2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length $S$.
    3. TBPTT allows the model to learn relations between input that are at most $S$ timesteps apart.

**Answer:**

A. True - TBPTT is a modified version of BPTT (backpropagation through time), which is backpropagation applied to a recurrent neural network unrolled over sequence data such as a time series. BPTT unrolls the network across all the time steps of the input and output, computes the gradients, and updates the weights. TBPTT truncates this: the gradients are propagated back through only a fixed number of timesteps. In that sense it is a modified version of backpropagation.

B. False - TBPTT has two parameters: (1) the number of forward timesteps after which we perform an update, and (2) the number of timesteps to backpropagate through. Parameter (2) has to be limited to a constant $S$, but we would also limit (1) to be less than or equal to (2), so that every forwarded step is covered by the backpropagation. Limiting the input length alone is not enough: the hidden state is still passed from one chunk to the next, so the computation graph has to be cut between chunks as well.

C. False - the model has an internal state: the hidden state is carried forward across chunks, and the weights of the layers are influenced by every timestep since the beginning of training. Information from inputs more than $S$ timesteps back can therefore still reach the current output, even though gradients are not propagated that far.

### Attention

1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
    1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?


**Answer:**

A. Attention allows the decoder to look at different parts of the source sequence while it is generating the target sequence. We treat the encoder's outputs as both keys and values, and use the decoder's hidden state as the query at each time step. The attention output is therefore a weighted average of the encoder outputs, weighted by how well each one matches the current decoder state. This output is appended to the decoder input at the next timestep. In that way, the decoder can focus on different parts of the source sequence.

Generally speaking, the decoder's state becomes part of the input to the attention, whose output is later fed back to the decoder, so the decoder influences its own input; that is how it focuses on different parts of the source sequence. Without attention we have the regular encoder-decoder Seq2Seq translator, which performs worse because the decoder has no effect on its input: the encoder alone computes it, and has to compress the whole source sentence into a single context vector.
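
The "weighted average of encoder outputs" can be written out for a single decoder step. This sketch uses dot-product scoring, which is an assumption (the tutorial's model may score queries and keys differently); the vectors are made-up toy values.

```python
import math

def attention(query, keys, values):
    """Dot-product attention for one query: softmax(query . keys) applied to values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Three encoder hidden states serve as both keys and values.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
decoder_state = [0.0, 3.0]  # the query: most similar to the second encoder state

weights, context = attention(decoder_state, encoder_states, encoder_states)
print(weights)
print(context)
```

The decoder state decides which source positions dominate the context vector, which is the "focus" described above.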

B. It will harm the results. The purpose of the attention here is to let the decoder focus on different parts of the source sequence. If all the inputs to the attention come from the encoder and none from the decoder, the encoder's hidden states become aware of their context within the source sentence, but the decoder no longer has any influence on the attention and cannot focus on different parts of the source while generating.

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

    1. Images reconstructed by the model during training ($x\to z \to x'$)?
    1. Images generated by the model ($z \to x'$)?

**Answer:**

The reconstruction term drives the network to reconstruct the original input after encoding and decoding. The lower it is, the closer the output is to the input.

The KL-divergence term is the regularization part of the loss: it keeps the latent distribution close to a Gaussian prior. The lower it is, the closer the model is to the target distribution.

A. If there is no KL term, the only term in the loss is the reconstruction term. The reconstructions will therefore be very close to the original inputs, but the model behaves like a plain autoencoder rather than a generative model, because it only learns to reproduce the original values.

B. If there is no KL term, the latent space is not pushed towards any known distribution. Given a sampled z, the output Decoder(z) is therefore not well defined: it may resemble the training data if z happens to be close to the latent codes of training examples, and it may look essentially random if z is far from them.

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
    1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
    2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
    3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

**Answer:**

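A. False - for a specific input image $\vec{x}$ the encoder outputs the parameters of a distribution over each latent attribute, $q(\vec{z}|\vec{x}) = \mathcal{N}(\vec{\mu}(\vec{x}), \diag(\vec{\sigma}^2(\vec{x})))$. This distribution is not necessarily $\mathcal{N}(\vec{0},\vec{I})$; it depends on the input. $\mathcal{N}(\vec{0},\vec{I})$ is only the prior that the KL term pulls it towards.

B. False - the model is probabilistic. Each time the image is fed to the encoder, a new latent vector $\vec{z}$ is sampled from $q(\vec{z}|\vec{x})$, so the same input generally leads to a different $\vec{z}$ each time, and the decoder models $p(\vec{x}|\vec{z})$ for whichever $\vec{z}$ it receives. The reconstructions will therefore not always be the same.

C. True - the real loss is the negative log-likelihood $-\log p(\vec{x})$, which is intractable because it requires integrating over all $\vec{z}$ (equivalently, the posterior $p(\vec{z}|\vec{x})$ is intractable). We approximate the posterior by a tractable distribution $q(\vec{z}|\vec{x})$ and obtain $-\log p(\vec{x}) \leq \E[\vec{z}\sim q]{-\log p(\vec{x}|\vec{z})} + \DKL{q(\vec{z}|\vec{x})}{p(\vec{z})}$. The gap is $\DKL{q(\vec{z}|\vec{x})}{p(\vec{z}|\vec{x})}$, so the bound is tight when $q$ is as close as possible to the true posterior.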

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
    1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
    2. It's crucial to backpropagate into the generator when training the discriminator.
    3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
    4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
    5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

**Answer:**

A. False - we want the discriminator to be unable to tell authentic items from fake ones, so its ideal accuracy is not 0% but 50%, i.e. no better than chance. The generator is trained on the discriminator's outputs, so ideally about 50% of the fake items it generates are still "identified" as fake. In the ideal case, therefore, neither loss is driven to an extreme: the generator's loss does not keep going down and the discriminator's loss does not keep going up; both settle at the level that corresponds to 50% accuracy.

B. False - backpropagating into the generator is crucial when training the generator, not when training the discriminator. In the discriminator's update the generated samples are just inputs, so the gradient does not need to flow back into the generator. The discriminator can also be trained for a while by itself, to give it a general direction before involving the generator.

C. True, under one condition - the input to the generator is a random vector, and the vectors used for generating new items have to come from the same distribution that was used during training. So we can sample from $\mathcal{N}(\vec{0},\vec{I})$ if that is the distribution of the vectors the generator was trained on; otherwise we have to sample from the training-time distribution.

D. True - as mentioned in B, we may train the discriminator for a few epochs first to give it a general direction and make it more successful. This can help because the generator will then fool the discriminator less often at the start, receive a more meaningful signal, and therefore train faster.

E. False - if the discriminator reaches a stable state where it has 50% accuracy, it is guessing, so the generator cannot learn from it anymore and training the generator more will not improve it.

### Graph Neural Networks

1. You have implemented a graph convolutional layer based on the following formula, for a graph with $N$ nodes:
$$
\mat{Y}=\varphi\left( \sum_{k=1}^{q} \mat{\Delta}^k \mat{X} \mat{\alpha}_k + \vec{b} \right).
$$
    1. Assuming $\mat{X}$ is the input feature matrix of shape $(N, M)$: what does $\mat{Y}$ contain in its rows?
    1. Unfortunately, due to a bug in your calculation of the Laplacian matrix, you accidentally zeroed the row and column $i=j=5$ (assume more than 5 nodes in the graph).
What would be the effect of this bug on the output of your layer, $\mat{Y}$?

**Answer:**

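A. $\mat{Y}$ is a vector-valued node field: it has $N$ rows, one per node, and each row holds the output feature vector of that node, computed from the features of the nodes within $q$ hops of it. The columns are the output features.

B. With row and column 5 of $\mat{\Delta}$ zeroed, row 5 and column 5 of every power $\mat{\Delta}^k$ ($k \geq 1$) are zero as well. Row 5 of the sum inside the activation function therefore vanishes, so row 5 of $\mat{Y}$ is $\varphi(\vec{b})$ regardless of the input. In addition, the features of node 5 no longer contribute to any other row of $\mat{Y}$, so the bug effectively disconnects vertex 5 from the other vertices.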
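
The effect of zeroing a row and column of the Laplacian can be reproduced on a tiny graph. The sketch below is a plain-Python illustration under several assumptions: a 6-node path graph, the combinatorial Laplacian $D - A$, scalar coefficients in place of the matrices $\mat{\alpha}_k$, and ReLU as $\varphi$.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def path_laplacian(n):
    """Combinatorial Laplacian D - A of a path graph with n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L

def gcn_layer(delta, X, alphas, b):
    """Y = relu(sum_k alpha_k * delta^k X + b), with scalar alpha_k for simplicity."""
    n, m = len(X), len(X[0])
    acc = [list(b) for _ in range(n)]
    power_x = X
    for alpha in alphas:                  # k = 1..q
        power_x = matmul(delta, power_x)  # delta^k X
        for i in range(n):
            for j in range(m):
                acc[i][j] += alpha * power_x[i][j]
    return [[max(0.0, v) for v in row] for row in acc]

n, bug = 6, 4  # node 5 in 1-based numbering is index 4
delta = path_laplacian(n)
for i in range(n):  # the bug: zero row and column 5
    delta[bug][i] = 0.0
    delta[i][bug] = 0.0

X = [[float(i + 1), float(2 * i + 1)] for i in range(n)]
alphas, b = [0.5, 0.25], [0.1, 0.2]
Y = gcn_layer(delta, X, alphas, b)

X_changed = [row[:] for row in X]
X_changed[bug] = [100.0, -100.0]  # change only the features of node 5
Y_changed = gcn_layer(delta, X_changed, alphas, b)

print(Y[bug])          # equals relu(b)
print(Y == Y_changed)  # node 5's features no longer influence any output row
```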

2. We have discussed the notion of a Receptive Field in the context of a CNN. How would you define a similar concept in the context of a GCN (i.e. a model comprised of multiple graph convolutional layers)?

**Answer:**

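The receptive field of a vertex $v$ in the output can be defined as the set of vertices whose input features affect its result during the forward pass. A single layer of the form above mixes features over at most $q$ hops (one hop per power of $\mat{\Delta}$), so after $L$ such layers the receptive field of $v$ is its $L \cdot q$-hop neighbourhood in the graph.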
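
Under the assumption that each graph convolutional layer propagates information over a bounded number of hops, the receptive field of a node can be computed with a breadth-first search. This is a minimal sketch; the adjacency-list format and the function name `gcn_receptive_field` are illustrative.

```python
from collections import deque

def gcn_receptive_field(adj, node, hops):
    """Set of nodes within `hops` edges of `node` (including the node itself)."""
    distance = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        if distance[u] == hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    return set(distance)

# Path graph 0 - 1 - 2 - 3 - 4 - 5 - 6 as an adjacency list.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}

num_layers, hops_per_layer = 2, 1  # e.g. two layers that each use only the first power
field = gcn_receptive_field(adj, node=3, hops=num_layers * hops_per_layer)
print(sorted(field))  # [1, 2, 3, 4, 5]
```

Unlike in a CNN, the size of this set depends on the graph structure around the node, not only on the architecture.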