<a href="https://colab.research.google.com/github/artman-industries/objet-detection-on-TACO-dataset.-deep-learning-project-technion/blob/main/Part2_SummaryQuestions%5B1%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
$$

# Part 2: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

The receptive field of a neuron is the region of the input image that can affect that neuron's activation.
In a CNN, the input image is convolved with a set of learnable filters to produce feature maps. Each element of a feature map is the activation of one neuron, which is connected only to a small local region of the previous layer. Traced back to the input image, that region is the neuron's receptive field, and it grows as more layers are stacked.


2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

The size of the receptive field depends, among other things, on the kernel size of the convolutional layers, the stride of the convolution, pooling layers and dilation. As requested in the question, we explain three of them:
1.	Kernel size: as the kernel size grows, the receptive field grows as well, because each neuron depends on more neurons of the previous layer. In the extreme case of a kernel that covers the whole input (equivalent to a fully connected layer), the receptive field is the entire input. Every input feature inside the window is combined through its own learned weight, so the number of parameters grows with the kernel size.
2.	Stride: a stride larger than 1 reduces the overlap between the windows of neighbouring neurons and downsamples the feature map, so every following layer covers a larger region of the input. The features are still combined with learned weights, but no parameters are added.
3.	Pooling layers: max pooling, for example, replaces each 2x2 region of the feature map with its maximum value, reducing the spatial size of the feature map by a factor of 2. Therefore each neuron in the following layers covers a region of the input that is twice as large in each dimension. Unlike a convolution, pooling combines the features with a fixed rule that has no parameters (in our example only the maximal value is passed on).


3. Imagine a CNN with three convolutional layers, defined as follows:

In [None]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

We track two quantities after each layer: the receptive field size $r$ and the distance $j$ (in input pixels) between neighbouring outputs. A layer with effective kernel size $k$ and stride $s$ updates them as $r \leftarrow r + (k-1)\cdot j$ and $j \leftarrow j\cdot s$. Padding does not change the receptive field size. At the input, $r=1$ and $j=1$.
The first convolutional layer has a kernel size of 3x3 and, since no stride is specified, a stride of 1. Therefore its receptive field is $1 + (3-1)\cdot 1 = 3$, i.e. 3x3, and $j=1$.
The first max pooling layer has a pooling size of 2x2 and a stride of 2. The receptive field grows to $3 + (2-1)\cdot 1 = 4$, i.e. 4x4, and the spacing doubles to $j=2$.
The second convolutional layer has a kernel size of 5x5 and a stride of 2. The receptive field becomes $4 + (5-1)\cdot 2 = 12$, i.e. 12x12, and $j=4$.
The second max pooling layer again has a pooling size of 2x2 and a stride of 2. The receptive field becomes $12 + (2-1)\cdot 4 = 16$, i.e. 16x16, and $j=8$.
The third convolutional layer has a kernel size of 7x7 and a dilation of 2, so its effective kernel size is $(7-1)\cdot 2 + 1 = 13$. The receptive field becomes $16 + (13-1)\cdot 8 = 112$. Therefore each "pixel" in the output tensor has a receptive field of 112x112 input pixels (the ReLU layers are pointwise and do not change it).
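
The calculation can be checked with a short script. This is a minimal sketch of the standard receptive field recurrence, $r \leftarrow r + (k_\text{eff}-1)\cdot j$ and $j \leftarrow j\cdot s$; the helper name `receptive_field` and the tuple encoding of the layers are our own assumptions, not part of the course code.

```python
def receptive_field(layers):
    """Receptive field size along one spatial axis for a stack of layers.

    Each layer is a (kernel_size, stride, dilation) tuple, listed from
    input to output. Padding is ignored because it does not change the size.
    """
    r, j = 1, 1  # receptive field size, spacing between neighbouring outputs
    for kernel_size, stride, dilation in layers:
        k_eff = dilation * (kernel_size - 1) + 1  # effective kernel size
        r += (k_eff - 1) * j
        j *= stride
    return r


# The network from the question (ReLU layers are pointwise and omitted).
layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
print(receptive_field(layers))  # 112
```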


4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

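We assume that there are several reasons for this phenomenon.
1.	The layers of the two networks learn different functions. In the residual network the convolution only has to model the residual, i.e. the difference between the desired output and the input, $f_l(\vec{x};\vec{\theta}_l)=\vec{y}_l-\vec{x}$, while in the original network it has to model the full mapping. For example, to pass its input through unchanged, a plain layer needs an identity (delta) filter, whereas a residual layer needs an all-zero filter.
2.	Residual skip connections help prevent vanishing gradients. Therefore the new network trains with more stable gradients and can learn more meaningful representations.
3.	Even if the gradient calculation was already stable (no vanishing gradients), the residual network can still learn different representations, because the skip connections pass the features of earlier layers directly to later layers, which lets the network learn both short-range and long-range dependencies in the data.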


### Dropout

1. Consider the following neural network:

In [None]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

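After the first dropout, a fraction $p_1$ of the input is dropped. The second dropout acts on the remaining $1-p_1$ of the input and drops a fraction $p_2$ of it. Therefore:
$$
q = p_1 + p_2\,(1-p_1) = 1-(1-p_1)(1-p_2) = 0.1+0.2\cdot 0.9 = 0.28
$$
The scaling is consistent as well: the two layers scale the kept activations by $\frac{1}{(1-p_1)(1-p_2)} = \frac{1}{1-q}$.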
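
A quick simulation can be used to check the combined drop probability. This is only a sketch in plain Python (no `nn.Dropout`): it assumes the two dropout masks are independent, and the sample count and seed are arbitrary choices of ours.

```python
import random

p1, p2 = 0.1, 0.2
q = p1 + p2 * (1 - p1)  # combined drop probability, equal to 1 - (1-p1)*(1-p2)

rng = random.Random(0)
n = 200_000
dropped = 0
for _ in range(n):
    kept_first = rng.random() >= p1   # survives the first dropout layer
    kept_second = rng.random() >= p2  # survives the second dropout layer
    if not (kept_first and kept_second):
        dropped += 1

empirical_q = dropped / n
print(f"analytic q = {q:.4f}, empirical q = {empirical_q:.4f}")
```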


2. **True or false**: dropout must be placed only after the activation function.

False.

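For an activation such as ReLU, which maps zero to zero and commutes with scaling by a positive constant, applying dropout before or after the activation gives mathematically the same result. Placing the dropout layer before the activation function may slightly reduce the runtime (fewer activations need to be calculated), but since there is no mathematical difference, placing it after the activation is not mandatory.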


3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

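Let $x$ be the input to a dropout layer with drop probability $p$, and let $y$ be the output before scaling. Each activation is kept independently with probability $1-p$, so $y = m\,x$ where $m\sim\mathrm{Bernoulli}(1-p)$.
Then, for a given $x$: $\E{y}=\E{m}\,x=(1-p)\,x$.
Therefore $\E{\frac{1}{1-p}\,y}=\frac{1}{1-p}\,(1-p)\,x=x$, so scaling by $1/(1-p)$ is exactly what keeps the value of each activation unchanged in expectation. Any other constant would change it.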
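
The expectation argument can also be checked empirically. The following is a minimal sketch in plain Python, assuming a single scalar activation `x` and an arbitrary drop probability; the values and seed are illustrative only.

```python
import random

rng = random.Random(0)
x, p, n = 2.0, 0.3, 200_000

# Draw n independent dropout masks: the activation is kept with probability 1 - p.
kept = sum(1 for _ in range(n) if rng.random() >= p)

mean_unscaled = x * kept / n           # estimate of E[m * x], should be near (1 - p) * x
mean_scaled = mean_unscaled / (1 - p)  # estimate of E[m * x / (1 - p)], should be near x

print(f"unscaled mean = {mean_unscaled:.3f}, scaled mean = {mean_scaled:.3f}")
```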


### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

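I wouldn't choose an L2 (MSE) loss for this scenario and would use the binary cross-entropy loss instead. Cross-entropy treats the model output as the probability of the "hotdog" class and corresponds to the log-likelihood of the labels, while MSE does not. As a numerical example, take a hotdog image ($y=1$) for which the model confidently predicts "dog", $\hat{y}=0.01$: the L2 loss is $(1-0.01)^2\approx 0.98$ and can never exceed 1, while the cross-entropy loss is $-\log(0.01)\approx 4.6$ and grows without bound as $\hat{y}\to 0$. With a sigmoid output, the L2 gradient w.r.t. the logit is also multiplied by $\hat{y}(1-\hat{y})\approx 0.01$, so the confidently wrong prediction is barely corrected.
That said, MSE can still work as a loss function for a single-output classifier like this one, and it is possible to check empirically whether it reaches the same results as the cross-entropy loss.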
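
The numbers for such an example can be computed directly. This is a sketch under the assumption that the model ends with a sigmoid over a single logit `z`; the specific prediction of 0.01 for a hotdog image is an illustrative choice of ours.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


y = 1.0                      # true label: hotdog
z = math.log(0.01 / 0.99)    # logit chosen so that the prediction is 0.01
y_hat = sigmoid(z)

l2_loss = (y_hat - y) ** 2
bce_loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Gradients of each loss with respect to the logit z.
grad_l2 = 2 * (y_hat - y) * y_hat * (1 - y_hat)
grad_bce = y_hat - y

print(f"L2 loss  = {l2_loss:.4f}, gradient w.r.t. logit = {grad_l2:.4f}")
print(f"BCE loss = {bce_loss:.4f}, gradient w.r.t. logit = {grad_bce:.4f}")
```

The cross-entropy gradient stays large for a confidently wrong prediction, while the L2 gradient is damped by the saturated sigmoid.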


2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/PiratesVsTemp%28en%29.svg/1200px-PiratesVsTemp%28en%29.svg.png?20110518040647" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [None]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

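Vanishing gradients. The network is very deep and uses the sigmoid function as its activation, so the gradients have most likely vanished. The derivative of the sigmoid is at most 0.25 and backpropagation multiplies one such factor per layer, so after 25 sigmoid layers the gradient that reaches the first layers is scaled by at most $0.25^{25}\approx 10^{-15}$. In addition, around the values of our data (numbers of pirates) the sigmoid saturates, its slope is numerically zero, and therefore the gradients vanish.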

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

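He is a bad friend. His suggestion won't solve the problem, for the same reason as before: tanh also saturates, so its slope is numerically zero at these values and the gradients still vanish. tanh is somewhat better than the sigmoid, since its derivative reaches 1 at zero (compared to 0.25) and its output is zero-centered, but in a network this deep that is not enough.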
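
To illustrate the saturation of both activations, here is a small sketch that evaluates their derivatives at a few inputs. The input values are arbitrary illustrations and are not taken from the pirate data.

```python
import math


def d_sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


for x in (0.0, 2.0, 10.0):
    print(f"x = {x:5.1f}: sigmoid' = {d_sigmoid(x):.2e}, tanh' = {d_tanh(x):.2e}")

# Best-case gradient scaling through 25 stacked sigmoid layers (weights ignored).
best_case_sigmoid = 0.25 ** 25
print(f"0.25 ** 25 = {best_case_sigmoid:.2e}")
```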

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
  1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
  1. The gradient of ReLU is linear with its input when the input is positive.
  1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

1.	In a model using exclusively ReLU activations, there can be no vanishing gradients: F.
By the chain rule, the gradient is a product of the weight matrices and the ReLU derivatives along the path. The ReLU derivative is either 0 or 1, so it does not shrink the gradient for positive inputs, but it zeroes the gradient for negative inputs. Another cause of vanishing gradients with ReLU is small weights (e.g. a small initialization), which shrink the product layer after layer.
2.	The gradient of ReLU is linear with its input when the input is positive: F.
The derivative of the ReLU function for a positive input is 1, which is a constant function of the input and therefore not linear. (The ReLU itself is linear for positive inputs.)
3.	ReLU can cause "dead" neurons, i.e., activations that remain at a constant value of zero: T.
A negative input to the ReLU function zeroes its output, and therefore the gradient is zero as well (by the chain rule). If the input of a neuron is negative for all samples, its weights receive no updates and its activation stays at zero (the neuron "dies").


### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

GD computes the gradient of the loss function with respect to the model parameters over the entire training set, and then updates the parameters in the direction of the negative gradient.
SGD is a variation of GD that randomly selects a single training example from the dataset at each iteration and updates the model parameters based on the gradient of the loss function with respect to that example.
Mini-batch SGD is a compromise between GD and SGD. At each iteration it randomly selects a subset of training examples, called a mini-batch, and updates the model parameters based on the gradient of the loss function with respect to that mini-batch.
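
The three variants differ only in how many samples are used per update, which can be sketched with one training loop and a `batch_size` argument. This is a toy example in plain Python: the 1D least-squares data, the learning rate and the function names are assumptions made for illustration.

```python
import random

# Toy data: y = 3 * x exactly (no noise), so every variant can reach w = 3.
rng = random.Random(0)
data = [(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(64))]


def grad(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train(batch_size, epochs=200, lr=0.5):
    w = 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)  # random order of the samples in every epoch
        for i in range(0, len(samples), batch_size):
            w -= lr * grad(w, samples[i:i + batch_size])
    return w


w_gd = train(batch_size=len(data))  # GD: one step per epoch on the full dataset
w_sgd = train(batch_size=1)         # SGD: one step per sample
w_mb = train(batch_size=8)          # mini-batch SGD: one step per batch of 8
print(w_gd, w_sgd, w_mb)
```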


2. Regarding SGD and GD:
  1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
  2. In what cases can GD not be used at all?

Provide at least two reasons for why SGD is used more often in practice compared to GD:

1.	For large datasets, GD is computationally expensive and slow, since every single update requires a pass over the entire dataset.
2.	The noise in the SGD steps acts as a form of regularization that can help avoid overfitting, and the noisy descent steps can help escape local minima.

In what cases can GD not be used at all?

For tasks without a fixed dataset (like RL tasks or streaming data), or for tasks with huge datasets (like ImageNet) where the entire dataset is too big to be used in a single gradient computation.

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

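Increasing the mini-batch size from $B$ to $2B$ can have different effects on the number of iterations required to converge to the same loss value $l_0$, depending on several factors.
We expect the number of iterations to decrease. Each gradient is averaged over twice as many samples, so its variance is roughly halved and each step points more accurately in the descent direction, which also allows a larger learning rate. In addition, a bigger batch size means fewer iterations for the same number of epochs.
However, there are more things to take into consideration. The number of iterations will usually not be halved: once the gradient estimate is accurate enough, adding samples to the batch gives diminishing returns. Larger batches also lose some of the regularizing noise of small batches, which may hurt generalization even if the training loss $l_0$ is reached in fewer iterations.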


4. For each of the following statements, state whether they're **true or false** and explain why.
  1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
  1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
  1. SGD is less likely to get stuck in local minima, compared to GD.
  1. Training  with SGD requires more memory than with GD.
  1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
  1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

1.	When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset. T.
For pure SGD (one sample per step) this is true by definition: an epoch is one pass over the dataset, and every sample triggers one update. (With mini-batch SGD there is one step per batch instead.)
2.	Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD. F.
SGD gradients have more variance, because at each step the loss is calculated on a different random sample instead of the entire dataset.

3.	SGD is less likely to get stuck in local minima, compared to GD. T.
For the same reason as mentioned in the previous statement: the higher variance (noise) of the SGD steps can push the parameters out of a local minimum.
4.	Training with SGD requires more memory than with GD. F.
It requires less memory, because each step uses only one example, while GD uses the entire dataset.
5.	Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum. F.
Because the loss function is not convex, neither method is guaranteed to reach the global minimum. GD stops when the gradient is zero, which only means that the point is a stationary point (a local or global minimum, or a saddle point).
6.	Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum. F.

In a narrow ravine, the curvature of the loss surface is high in one direction and low in the other directions. A plain gradient step that is large enough to make progress along the ravine overshoots in the high-curvature direction and oscillates between the walls, while a small step makes the optimizer crawl along the ravine and take a long time to converge.
SGD with momentum reduces this problem: the accumulated previous gradients cancel the oscillations across the ravine and build up speed along it, so it moves more smoothly than plain SGD.
Newton's method, however, handles exactly this case better. It multiplies the gradient by the inverse Hessian, which rescales every direction by its curvature, so the ravine does not slow it down. On a quadratic ravine Newton's method reaches the minimum in a single step, while momentum still needs many iterations (each Newton step is more expensive, since it requires the Hessian).
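
The comparison can be sketched on a toy quadratic ravine. The function $f(x,y)=\tfrac{1}{2}(x^2+100y^2)$, the learning rate and the momentum coefficient are our own illustrative assumptions, and the momentum variant uses exact (full) gradients with no stochastic noise.

```python
def grad(p):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = p
    return (x, 100.0 * y)


def norm(p):
    return (p[0] ** 2 + p[1] ** 2) ** 0.5


start, tol = (1.0, 1.0), 1e-6

# Newton's method: the Hessian is diag(1, 100), so H^{-1} g divides each
# gradient component by the curvature of its direction.
p, newton_steps = start, 0
while norm(p) > tol:
    g = grad(p)
    p = (p[0] - g[0] / 1.0, p[1] - g[1] / 100.0)
    newton_steps += 1

# Gradient descent with (heavy-ball) momentum.
lr, beta = 0.01, 0.9
p, v, momentum_steps = start, (0.0, 0.0), 0
while norm(p) > tol and momentum_steps < 10_000:
    g = grad(p)
    v = (beta * v[0] - lr * g[0], beta * v[1] - lr * g[1])
    p = (p[0] + v[0], p[1] + v[1])
    momentum_steps += 1

print(f"Newton steps: {newton_steps}, momentum steps: {momentum_steps}")
```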


5. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

False.
The gradient of the layer can be calculated given only the minimizer and the layer's input. For $y^*(z)=\arg\min_{y} f(y,z)$, the optimality condition $\nabla_{y}f(y^*,z)=0$ together with the implicit function theorem gives
$$
\nabla_{z}\,y^*(z) = -\left(\nabla^2_{yy}f(y^*,z)\right)^{-1}\nabla^2_{yz}f(y^*,z).
$$
This means that it doesn't really matter how we get to the solution: a closed-form solution and a steepest descent method are both viable.
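
As a sanity check of this formula, consider a scalar toy problem of our own choosing (not from the tutorial), $f(y,z)=\tfrac{1}{2}a y^2 - yz$ with $a>0$, where the minimizer has a closed form:

$$
\begin{aligned}
\nabla_y f = a y - z = 0 \quad &\Rightarrow \quad y^*(z) = \frac{z}{a}, \qquad \nabla_z\, y^*(z) = \frac{1}{a}, \\
\nabla^2_{yy} f = a, \quad \nabla^2_{yz} f = -1 \quad &\Rightarrow \quad -\left(\nabla^2_{yy} f\right)^{-1}\nabla^2_{yz} f = -\frac{1}{a}\cdot(-1) = \frac{1}{a}.
\end{aligned}
$$

Both routes give the same derivative, regardless of how $y^*$ was obtained.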


6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  1. Explain the concepts of "vanishing gradients", and "exploding gradients".
  2. How can each of these problems be caused by increased depth?
  3. Provide a numerical example demonstrating each.
  4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

Explain the concepts of "vanishing gradients", and "exploding gradients".
Vanishing gradients: they are encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, in each training iteration every weight of the network receives an update proportional to the partial derivative of the error function with respect to that weight. The problem is that in some cases the gradient becomes vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.
Exploding gradients: this is the same phenomenon in the opposite direction. The gradients used to update the weights are computed as a product of many per-layer terms. If these terms are large, their product becomes huge. The resulting updates are so large that the model is unable to learn and its behavior becomes unstable.



How can each of these problems be caused by increased depth?
Both vanishing gradients and exploding gradients can be caused by increased depth in a neural network.
Vanishing gradients are more likely to occur in deep neural networks, where the gradient signal becomes weaker and weaker as it is propagated backwards through the network during training. By the chain rule, the gradients of the lower layers are a product of the weight matrices and activation derivatives of all the higher layers, so when these factors are smaller than 1 the gradients shrink exponentially with depth as they move closer to the input layer. When the gradients become too small, they can no longer effectively update the weights of the lower layers, leading to slow convergence or to training that stalls altogether.
On the other hand, exploding gradients occur in deep neural networks due to the same phenomenon in reverse: when the factors are larger than 1, the product grows exponentially with depth. The weights are then updated with excessively large steps, leading to instability or divergence during training. This can happen when the weight matrices are initialized with large values or when the activation functions have large derivatives, and a high learning rate makes the effect worse.
Both vanishing and exploding gradients are detrimental to the learning process of a neural network and can prevent it from converging to an optimal solution.



Provide a numerical example demonstrating each.
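Let's look at an MLP with depth $L$ and width 1 in each layer, without activation functions.
Let's denote the parameter of layer $i$ by $a_i$, so that the output is $y = a_L\cdots a_1 x$.
The gradient with respect to $x$ is $\pderiv{y}{x} = a_1\cdots a_L$.
So for a large $L$ and all parameters between 0 and 1 the gradient is numerically zero (vanishing gradient), e.g. $a_i=0.5$ and $L=50$ give $0.5^{50}\approx 8.9\cdot 10^{-16}$.
For all parameters larger than 1 the gradient grows exponentially and explodes, e.g. $a_i=2$ and $L=50$ give $2^{50}\approx 1.1\cdot 10^{15}$.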
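
The example can be reproduced in a few lines. This sketch assumes the width-1, activation-free MLP described above; the depth of 50 and the weight values 0.5 and 2 are illustrative choices.

```python
def input_gradient(weights):
    """dy/dx for y = a_L * ... * a_1 * x: the product of all layer weights."""
    g = 1.0
    for a in weights:
        g *= a
    return g


L = 50
vanishing = input_gradient([0.5] * L)  # every factor is below 1
exploding = input_gradient([2.0] * L)  # every factor is above 1
print(f"vanishing: {vanishing:.2e}, exploding: {exploding:.2e}")
```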





Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

Vanishing gradients occur when the gradient values become very small during the backpropagation process, making it difficult for the model to learn. Some signs that may indicate a vanishing gradient problem are:
- The model's training loss stops decreasing or plateaus early in the training process, suggesting that the model is not learning.
- The model's accuracy is not improving, or it is improving very slowly.
- The model's output saturates at a certain point and becomes insensitive to changes in the input.

Exploding gradients, on the other hand, occur when the gradient values become very large during the backpropagation process, causing the model to diverge. Some signs that may indicate an exploding gradient problem are:
- The model's training loss is unstable, oscillating or increasing rapidly, suggesting that the model is not converging.
- The model's accuracy drops abruptly, suggesting that the model is diverging.
- The model's weights or loss become very large or NaN, indicating that the model is unstable.


### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [None]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# TODO: Calculate gradients numerically for W and b
# grad_W =...
# grad_b =...

# TODO: Compare with autograd using torch.allclose()
# autograd_W = ...
# autograd_b = ...
# assert torch.allclose(grad_W, autograd_W)
# assert torch.allclose(grad_b, autograd_b)

loss=tensor(1.9033, dtype=torch.float64, grad_fn=<MeanBackward0>)
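
One possible way to fill in the TODOs is sketched below. It is not the official solution: the helper `numerical_grad` is a name we introduce, it uses a central difference (a symmetric variant of the forward-difference formula in question 2), and the step size `eps=1e-6` and the seed are assumptions.

```python
import torch

torch.manual_seed(0)
N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W = torch.rand(d, d, requires_grad=True, dtype=dtype)
b = torch.rand(d, requires_grad=True, dtype=dtype)


def foo(W, b):
    return torch.mean(X @ W + b)


def numerical_grad(f, t, eps=1e-6):
    """Central-difference estimate of the gradient of the scalar f(t) w.r.t. t."""
    t = t.detach().clone()  # work on a copy outside the autograd graph
    grad = torch.zeros_like(t)
    flat_t, flat_grad = t.view(-1), grad.view(-1)  # views sharing memory
    for i in range(flat_t.numel()):
        original = flat_t[i].item()
        flat_t[i] = original + eps
        f_plus = f(t).item()
        flat_t[i] = original - eps
        f_minus = f(t).item()
        flat_t[i] = original  # restore the perturbed entry
        flat_grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad


# Numerical gradients: perturb one tensor while the other is held fixed.
grad_W = numerical_grad(lambda w: foo(w, b.detach()), W)
grad_b = numerical_grad(lambda v: foo(W.detach(), v), b)

# Autograd gradients.
loss = foo(W, b)
loss.backward()
autograd_W, autograd_b = W.grad, b.grad

assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)
```

Note that this needs two evaluations of `foo` per parameter entry, which is the main reason the approach does not scale to real networks.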


### Sequence models

1. Regarding word embeddings:
  1. Explain this term and why it's used in the context of a language model.
  1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

Answers:
1. A word embedding is a way to represent words as dense vectors, where each word is represented by a unique vector.
It is used in language models because it turns discrete tokens into continuous vectors that can serve as the input of the NN model.
2. Yes, but the performance would probably be worse, because the embedding layer helps to capture semantic relationships between words. With it, the model generalizes better to words that it has not seen often during training.


2. Considering the following snippet, explain:
3. What does `Y` contain? why this output shape?
4. How you would implement `nn.Embedding` yourself using only torch tensors.

Answers:
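3. `Y` contains the embeddings of the tokens in `X`. Its shape is `(5, 6, 7, 8, 42000)`: the first four dimensions are the shape of the input tensor `X`, and the last dimension is the size of the embedding vector, since every index in `X` is replaced by its embedding vector.
4. Create a lookup table, i.e. a learnable weight tensor of shape `(num_embeddings, embedding_dim)`, and use the input indices to index the corresponding rows (the embedding vectors).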
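
A minimal sketch of such a lookup table is shown below. The class name `MyEmbedding` is hypothetical, and the small `embedding_dim` is chosen only to keep the example light. It relies on integer-tensor indexing, which returns one row of the table per index.

```python
import torch


class MyEmbedding:
    """Lookup-table embedding built from a plain tensor (hypothetical helper)."""

    def __init__(self, num_embeddings, embedding_dim):
        # One learnable row per token index.
        self.weight = torch.randn(num_embeddings, embedding_dim, requires_grad=True)

    def __call__(self, indices):
        # Integer-tensor indexing: the output shape is indices.shape + (embedding_dim,).
        return self.weight[indices]


torch.manual_seed(0)
X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
my_embedding = MyEmbedding(num_embeddings=42, embedding_dim=16)
Y = my_embedding(X)
print(f"{Y.shape=}")

# Same result as PyTorch's functional embedding with the same weight table.
reference = torch.nn.functional.embedding(X, my_embedding.weight)
```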

In [None]:
!pip install torch





In [None]:
import torch
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])

5. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
6. TBPTT uses a modified version of the backpropagation algorithm.
7. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
8. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

Answers:
6. True: TBPTT is a modified version of backpropagation through time. The difference is that TBPTT computes the gradients and updates the parameters of the model based on a truncated part of the sequence (limited to a fixed number of timesteps instead of the entire sequence). This reduces the computational cost and makes it possible to perform more frequent gradient updates.
7. False: limiting the input length is not enough. Instead, we divide longer sequences into chunks of length S and feed them to the model sequentially, carrying the hidden state from chunk to chunk and stopping the gradient at the chunk boundaries.
8. True: TBPTT limits the length of the sequence used for the gradients to S timesteps, which limits the model's ability to learn dependencies beyond that length. (Information from earlier chunks can still reach the model through the carried hidden state, but no gradient connects it to the inputs that produced it.)
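
The chunking and the gradient stop can be sketched as follows. This is an illustrative training loop of our own on random data, with an `nn.RNN` and a linear head; the sizes, the loss and the optimizer settings are assumptions and are not the tutorial's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, T, batch, features, hidden = 4, 12, 2, 3, 5  # chunk length S, full length T

rnn = nn.RNN(input_size=features, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x = torch.randn(batch, T, features)
y = torch.randn(batch, T, 1)

h = None  # initial hidden state (zeros)
num_updates = 0
for start in range(0, T, S):
    x_chunk, y_chunk = x[:, start:start + S], y[:, start:start + S]
    out, h = rnn(x_chunk, h)  # hidden state is carried over from the last chunk
    loss = nn.functional.mse_loss(head(out), y_chunk)

    optimizer.zero_grad()
    loss.backward()  # gradients flow through at most S timesteps
    optimizer.step()

    h = h.detach()  # cut the graph at the chunk boundary
    num_updates += 1

print(f"{num_updates=}")
```

Without the `detach()` call, the second `backward()` would try to propagate into the graph of the previous chunk, which is exactly what truncation is meant to avoid.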

### Attention

1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
  1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?


Answers:
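1. The addition of the attention mechanism allows the decoder to selectively give more weight to different parts of the input sentence when generating each token of the output sentence, resulting in context-dependent hidden states. Without attention, the decoder's hidden states are based only on the encoder's final hidden state and the previously generated tokens, so the encoder has to compress the whole source sentence into a single vector.
2. Self-attention with the encoder hidden states as keys, queries, and values would help the model learn context-dependent representations of the source sentence, and it could also capture longer-range dependencies within it. However, since the queries no longer come from the decoder, the attention weights no longer depend on the target position, so this attention by itself does not perform the alignment between the source and target sentences.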


### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

  1. Images reconstructed by the model during training ($x\to z \to x'$)?
  1. Images generated by the model ($z \to x'$)?

Answers:
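1. The images can still be reconstructed during training, and the reconstructions are likely to be as good as or sharper than with the KL-divergence term, because the model is trained only to minimize the reconstruction error and effectively becomes a regular autoencoder.
2. Images generated by the model may be of lower quality or less diverse. Without the KL-divergence term nothing pushes the latent distribution towards the prior, so latent vectors sampled from $\mathcal{N}(\vec{0},\vec{I})$ may fall in regions that the decoder never saw during training. The generated images might be implausible, or repetitive.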

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
  1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
  2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
  3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

Answers:
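1. False: the standard normal distribution $\mathcal{N}(\vec{0},\vec{I})$, with a mean of zero and a variance of one, is the prior that is assumed for the latent space. For a specific input image the encoder outputs a different Gaussian, whose mean and variance depend on the image, and the KL-divergence term only pushes it towards the standard normal.
2. False: the process is stochastic. Because of the random noise in the encoding process (sampling from the encoder's distribution), the latent representation of the same image is different each time, so the output of the decoder is different as well.
3. True: the loss includes the negative log-likelihood of the data given the model, which is intractable. Instead we maximize a lower bound on the log-likelihood, which is the same as minimizing an upper bound on the loss.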
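
The stochastic encoding can be illustrated with a scalar toy example. The `encode` function below is a hypothetical stand-in for a trained encoder (its formulas are arbitrary); only the sampling step $z=\mu+\sigma\epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$ reflects how a VAE draws a latent vector.

```python
import random


def encode(x):
    """Hypothetical encoder: maps an input to the (mu, sigma) of its latent Gaussian."""
    mu = 0.5 * x
    sigma = 0.1 + 0.05 * abs(x)
    return mu, sigma


def sample_latent(x, rng):
    mu, sigma = encode(x)
    eps = rng.gauss(0.0, 1.0)  # fresh noise on every call
    return mu + sigma * eps    # reparameterization: z = mu + sigma * eps


rng = random.Random(0)
x = 2.0
z1 = sample_latent(x, rng)
z2 = sample_latent(x, rng)
print(z1, z2)  # two different latent codes for the same input

samples = [sample_latent(x, rng) for _ in range(10_000)]
mean_z = sum(samples) / len(samples)  # close to mu(x) = 1.0, not to 0
```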

6. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
7. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
8. It's crucial to backpropagate into the generator when training the discriminator.
9. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
10. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
11. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

Answers:
7. False: in a GAN, the generator's objective is to create images that are similar enough to real images so that the discriminator is fooled and classifies them as real, so we do want the generator's loss to be low. However, a high discriminator loss is not a goal by itself: a discriminator that is simply weak is fooled even by implausible images and gives the generator no useful training signal. Ideally the discriminator is as accurate as possible and the generator still manages to fool it.
8. False: in GANs the generator and the discriminator are trained in separate steps. When training the discriminator, the generated images are treated as fixed inputs (detached from the computation graph), so no gradients are propagated back into the generator. Gradients flow through the discriminator into the generator only in the generator's own update step, where the generator loss is calculated based on the discriminator output.
9. True: in a GAN, the generator creates new images by sampling a random vector from a latent space, usually a Gaussian distribution with zero mean and unit variance. The generator then maps the latent vector to an image in the data space.
10. True: at the beginning of training the discriminator's output is almost random, so its gradients carry little information, which makes the generator's learning difficult.
11. False: if the discriminator reaches a stable state where it has 50% accuracy, it means that the discriminator is basically guessing between real and fake images. Its feedback then carries no useful signal, so continuing to train the generator will not lead to more realistic images.
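
The gradient flow in the two update steps can be checked with a tiny sketch. The linear "generator" and "discriminator", the sizes and the random data are placeholders of our own; only the use of `detach()` in the discriminator step is the point of the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim, batch = 4, 2, 8
G = nn.Linear(latent_dim, data_dim)  # placeholder generator
D = nn.Linear(data_dim, 1)           # placeholder discriminator (outputs logits)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(batch, data_dim)
z = torch.randn(batch, latent_dim)   # latent vectors sampled from N(0, I)
fake = G(z)

# Discriminator step: fake samples are detached, so G receives no gradient.
d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
d_loss.backward()
g_grads_after_d_step = [p.grad for p in G.parameters()]

# Generator step: gradients flow through D into G (D's own grads are not used here).
D.zero_grad()
g_loss = bce(D(fake), torch.ones(batch, 1))
g_loss.backward()
g_grads_after_g_step = [p.grad for p in G.parameters()]

print(all(g is None for g in g_grads_after_d_step))      # True
print(all(g is not None for g in g_grads_after_g_step))  # True
```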