$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 2: Summary Questions
<a id=part2></a>

### Presenters:
    

|              Name |             Id |             email |
|-------------------|----------------|------------------ |
|  Shachar Zafran | 319002721 | shaharzafran@campus.technion.ac.il |
|  Dorin Shteyman | 206721102 | dorin.sh@campus.technion.ac.il |

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

**Answer:**

In a CNN, the receptive field of a neuron (a single "pixel" of a feature map) is the region of the input image that can influence its value. Its size is determined by the kernel sizes, strides and dilations of the layer in question and of all the layers that precede it.

Neurons with small receptive fields can only learn local, fine-grained features, while neurons with larger receptive fields can learn more global features. Controlling how the receptive field grows from layer to layer is therefore part of fitting a CNN architecture to a specific image recognition task.

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

**Answer:**


1. Pooling: aggregates adjacent activations into a single value (e.g. their maximum or average) and downsamples the feature map. Because every following layer operates on the downsampled map, each of its kernel taps covers a larger region of the input, so the receptive field grows faster in the subsequent layers.

2. Convolution filter size: a convolution with kernel size $k$ enlarges the receptive field by $k-1$ times the accumulated stride. Smaller filters compute more local interactions between adjacent pixels and make the receptive field grow slowly; larger (or dilated) filters make it grow faster.

3. Strides: a convolution with stride $s>1$ skips input positions, which reduces the size of the feature map and multiplies the rate at which the receptive field grows in all the following layers by $s$.

In terms of how they combine input features, pooling combines neighbouring features with a fixed, parameter-free function (max or average), which discards their exact position. A strided convolution combines them with learned weights, but only at every $s$-th position. The filter size controls how many adjacent features are mixed by a single learned weighted sum: small filters emphasize local interactions, and long-range interactions emerge only after stacking several layers.

3. Imagine a CNN with three convolutional layers, defined as follows:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape
```

```
torch.Size([1, 32, 122, 122])
```

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

**Answer:**

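The receptive field size $r_{i}$ after layer $i\in\{1,2,3,4,5\}$ (counting only the convolution and pooling layers, since ReLU is element-wise and does not change it) is given by the recursion:

$$ r_{i} = r_{i-1} + (k_i-1)\cdot d_i\cdot S_{i-1} $$

$$ S_{i} = s_i \cdot S_{i-1} $$

$$ r_0 = 1 , \quad S_{0} = 1 $$

where $k_i$, $s_i$, $d_i$ are the kernel size, stride and dilation of layer $i$, respectively, and $S_i$ is the accumulated stride (the distance, in input pixels, between adjacent positions of the feature map after layer $i$).

Therefore we obtain the following receptive field sizes:

Layer 1 Conv2d ($k=3, s=1, d=1$):  $$r_{1} = 1 + (3-1)\cdot 1\cdot 1 = 3, \quad S_1 = 1$$

Layer 2 MaxPool2d ($k=2, s=2, d=1$):  $$r_{2} = 3 + (2-1)\cdot 1\cdot 1 = 4, \quad S_2 = 2$$

Layer 3 Conv2d ($k=5, s=2, d=1$):  $$r_{3} = 4 + (5-1)\cdot 1\cdot 2 = 12, \quad S_3 = 4$$

Layer 4 MaxPool2d ($k=2, s=2, d=1$):  $$r_{4} = 12 + (2-1)\cdot 1\cdot 4 = 16, \quad S_4 = 8$$

Layer 5 Conv2d ($k=7, s=1, d=2$): $$r_{5} = 16 + (7-1)\cdot 2\cdot 8 = 112, \quad S_5 = 8$$


So the receptive field of each pixel in the output tensor is $112\times 112 = 12{,}544$ input pixels (for output pixels near the border, part of this window falls on the zero padding).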
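
As a cross-check of the arithmetic, the recursion can be evaluated mechanically. The following is a minimal sketch in plain Python; the helper name `receptive_field` and the tuple encoding of the layers are assumptions made for this illustration and are not part of PyTorch.

```python
def receptive_field(layers):
    """Return the receptive field size (along one axis) of a stack of layers.

    Each layer is a (kernel_size, stride, dilation) tuple, listed from input to output.
    """
    r, S = 1, 1  # receptive field and accumulated stride at the input
    for k, s, d in layers:
        r += (k - 1) * d * S  # each extra kernel tap spans d * S input pixels
        S *= s                # the stride compounds for the layers that follow
    return r


# (kernel_size, stride, dilation) of the conv and pooling layers of `cnn`
layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
print(receptive_field(layers))  # 112
```

The ReLU layers are omitted because they act element-wise and leave the receptive field unchanged.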

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

**Answer:**

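Without the skip connection, each layer must learn the full mapping from its input to its output, $f_l(\vec{x};\vec{\theta}_l)\approx\vec{y}_l$. With the residual mapping, the identity is supplied by the shortcut, so the same filters only need to learn the residual $f_l(\vec{x};\vec{\theta}_l)\approx\vec{y}_l-\vec{x}$, which is a different function and therefore leads to different filters. For example, a layer that should pass its input through almost unchanged needs filters close to an identity (delta) kernel in the original network, but filters close to zero in the residual network. In addition, the shortcut gives the gradient an alternative route through the network, which changes the optimization dynamics and hence the solution that training converges to.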

### Dropout

1. Consider the following neural network:

```python
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)
```

```
Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)
```

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

**Answer:**

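An activation survives both layers only if it is kept by each of them independently, which happens with probability $(1-p_1)(1-p_2)$, so

$$ q = 1 - (1-p_1)(1-p_2) = p_1 + (1 - p_1)\, p_2 = 0.1 + (1 - 0.1)\cdot 0.2 $$

**q = 0.28**

The scaling is consistent as well: the two layers scale the surviving activations by $\frac{1}{1-p_1}\cdot\frac{1}{1-p_2}=\frac{1}{1-q}$.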
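
The value of `q` can be double-checked by enumerating the keep/drop outcomes of two independent dropout layers. This is a small sketch in plain Python under the assumption that the two masks are independent; the variable names are chosen for this illustration only.

```python
from itertools import product

p1, p2 = 0.1, 0.2
q = 1 - (1 - p1) * (1 - p2)

# Enumerate the four keep/drop outcomes of the two independent dropout layers
keep_prob = 0.0
for keep1, keep2 in product([True, False], repeat=2):
    prob = ((1 - p1) if keep1 else p1) * ((1 - p2) if keep2 else p2)
    if keep1 and keep2:  # the activation survives only if both layers keep it
        keep_prob += prob

scale_two_layers = 1 / ((1 - p1) * (1 - p2))  # combined scaling of the two layers
scale_single = 1 / (1 - q)                    # scaling of a single Dropout(p=q)

print(f"q = {q:.2f}, keep probability = {keep_prob:.2f}")
```

Both the probability of keeping an activation and the scaling applied to it agree between the two-layer and single-layer versions.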

2. **True or false**: dropout must be placed only after the activation function.

**Answer:**

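False. Dropout zeroes each activation (not the weights) with the given probability, and it can be placed either before or after the activation function.

For ReLU the order does not matter: ReLU(0)=0 and ReLU commutes with the positive scaling factor $1/(1-p)$, so applying dropout before or after the ReLU gives the same result. For activations that do not map zero to zero the two orders differ: for example, sigmoid(0)=0.5 and exp(0)=1, so a unit that is dropped before such an activation still emits a non-zero value. In that case dropout should be placed after the activation if the goal is to actually remove the unit's output.

In conclusion, the best place depends on the specific activation function.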

3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

**Answer:**

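Let $x$ be the value of an activation and let $m\sim\mathrm{Bernoulli}(1-p)$ be the dropout mask, which is independent of $x$: $m=0$ (dropped) with probability $p$ and $m=1$ (kept) with probability $1-p$. Without scaling, the output is $m\cdot x$ and its expectation is

$$ \mathbb{E}\left[m\cdot x\right] = p\cdot 0 + (1-p)\cdot x = (1-p)\,x , $$

i.e. only a fraction $(1-p)$ of the original value. If the kept activations are scaled by a constant $c$, then $\mathbb{E}\left[c\cdot m\cdot x\right] = c\,(1-p)\,x$, and requiring this to equal $x$ for every $x$ gives $c = 1/(1-p)$. Therefore scaling by $1/(1-p)$ is exactly what is needed to keep each activation unchanged in expectation.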
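
The effect of the scaling can also be observed empirically. The sketch below simulates inverted dropout on a single activation in plain Python; the helper `dropout` and the values of `x`, `p` and `n_trials` are assumptions made for this illustration, not the implementation of `nn.Dropout`.

```python
import random


def dropout(x, p, rng):
    """Inverted dropout for one activation: drop with probability p, else scale by 1/(1-p)."""
    return 0.0 if rng.random() < p else x / (1 - p)


rng = random.Random(0)  # fixed seed so that the run is reproducible
x, p, n_trials = 3.0, 0.3, 200_000

# Monte Carlo estimate of the expected output
mean_scaled = sum(dropout(x, p, rng) for _ in range(n_trials)) / n_trials

# Closed-form expectation: p * 0 + (1 - p) * x / (1 - p)
exact = p * 0.0 + (1 - p) * (x / (1 - p))

print(f"empirical mean: {mean_scaled:.3f}, closed form: {exact:.3f}, original x: {x}")
```

The empirical mean should be close to the original value of `x`, while the closed-form expectation equals it up to floating-point rounding.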

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**Answer:**

No. The L2 loss corresponds to maximum likelihood under Gaussian noise, but a binary label follows a Bernoulli distribution, not a Gaussian one.

The target labels will be encoded as one-hot vectors: [1;0] for dog and [0;1] for hotdog.

The last layer will be a softmax, which normalizes the output to a vector that sums to 1 and forms a distribution over the labels.

For example, if the true class is hotdog, [0;1], the squared L2 loss is:

Output of [0.5;0.5] - gives L2=0.5.

Output of [0.8;0.2] - gives L2=1.28.

Output of [1;0] - gives L2=2.

We would like the loss of [1;0] to be much higher, because it's the exact opposite of the desired probability, yet the L2 loss is bounded by 2 and its gradient through the softmax vanishes when the output saturates at the wrong answer.

A better loss function for this case is the **cross entropy loss function**, which penalizes confident mistakes much more strongly (due to its log component), but still gives small values for results close to the truth. For the same three outputs it gives $-\log 0.5\approx 0.69$, $-\log 0.2\approx 1.61$ and $-\log 0 = \infty$.
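
The two losses can be compared numerically for the hotdog example. This is a minimal sketch in plain Python; `l2_loss` and `cross_entropy` are helper names introduced here for illustration, and the predictions are assumed to already be softmax outputs.

```python
import math


def l2_loss(pred, target):
    """Squared L2 distance between the predicted and the one-hot target vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def cross_entropy(pred, target):
    """Cross entropy between a one-hot target and a predicted distribution."""
    total = 0.0
    for p, t in zip(pred, target):
        if t == 0:
            continue  # classes with zero target probability do not contribute
        if p == 0:
            return math.inf  # a confident mistake is penalized without bound
        total -= t * math.log(p)
    return total


target = [0.0, 1.0]  # true class: hotdog
for pred in ([0.5, 0.5], [0.8, 0.2], [1.0, 0.0]):
    print(pred, f"L2={l2_loss(pred, target):.2f}", f"CE={cross_entropy(pred, target):.2f}")
```

The L2 loss only grows from 0.5 to 2 as the prediction gets worse, whereas the cross entropy grows without bound.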

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/PiratesVsTemp%28en%29.svg/1200px-PiratesVsTemp%28en%29.svg.png?20110518040647" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

```python
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)
```

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**Answer:**

The most likely cause is the vanishing gradients problem. The sigmoid activation is bounded between 0 and 1, and its derivative is bounded between 0 and 0.25. The network is very deep (25 sigmoid layers), so the gradient that is backpropagated to the early layers is multiplied by a factor of at most 0.25 at every sigmoid, i.e. by at most $0.25^{25}\approx 10^{-15}$ overall (before accounting for the weights). The early layers therefore receive almost no gradient signal and effectively stop training, which shows up as a plateau in the loss.

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

**Answer:**

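He is incorrect. Tanh has the same problem as sigmoid: it saturates, so its derivative $1-\tanh^2(x)$ is close to zero for inputs of large magnitude, and it is smaller than 1 everywhere except at $x=0$. Its maximal derivative (1, compared to 0.25 for sigmoid) makes the problem less severe, but across 25 layers the gradients will still vanish.
However, non-saturating activation functions such as ReLU, whose derivative is exactly 1 for positive inputs, largely mitigate this problem.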

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
      1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
      2. The gradient of ReLU is linear with its input when the input is positive.
      3. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

**Answer:**

A. False - there might be vanishing gradients even with ReLU: for negative inputs the derivative is 0, so no gradient flows through those units. In such a case the gradient descent step does not change the corresponding weights, and they remain constant. In addition, the gradient is still multiplied by the weight matrices of every layer, which can shrink it even when all units are active.

B. False - when the input is positive ReLU(x) = x, so the gradient is a constant 1; it is ReLU itself, not its gradient, that is linear in the input.

C. True. ReLU outputs zero for negative inputs, and its gradient there is also zero. If the input to a neuron is negative for all the training samples, its activation stays at a constant value of zero and its incoming weights receive no gradient, so the neuron cannot recover: it is "dead".

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**Answer:**

Gradient descent updates the model parameters using the gradient of the loss function over the entire training dataset, making it computationally expensive and slow. 
Stochastic gradient descent updates the parameters using the gradient of the loss for a single training example, making it faster but more prone to unstable convergence.
Mini-batch SGD is a compromise between GD and SGD. It computes the gradient of the loss function over a small subset of the training examples at each iteration, which makes it both faster than GD and more stable than SGD.
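
The three variants differ only in how many samples are used per update. The following is a minimal sketch in plain Python on a toy one-parameter regression problem; the function `train`, the data and the hyperparameters are assumptions made for this illustration.

```python
import random


def train(xs, ys, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit y = w * x with a squared loss; only the batch size differs between variants."""
    rng = random.Random(seed)
    w, steps = 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            steps += 1
    return w, steps


xs = [i / 10 for i in range(1, 21)]  # toy inputs 0.1, 0.2, ..., 2.0
ys = [3.0 * x for x in xs]           # noise-free targets, the true weight is 3

n = len(xs)
w_gd, steps_gd = train(xs, ys, batch_size=n)    # GD: one step per epoch
w_sgd, steps_sgd = train(xs, ys, batch_size=1)  # SGD: one step per sample
w_mb, steps_mb = train(xs, ys, batch_size=5)    # mini-batch SGD: one step per batch
print(steps_gd, steps_sgd, steps_mb)  # 50 1000 200
```

For the same number of epochs, GD performs the fewest updates and single-sample SGD the most, with mini-batch SGD in between. Because the toy data is noise-free, this sketch shows the difference in the number of updates but not the difference in gradient noise.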

2. Regarding SGD and GD:
      1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
      2. In what cases can GD not be used at all?

**Answer:**

A. SGD is used more often in practice compared to GD for several reasons. Firstly, SGD is computationally less expensive, since it only calculates the gradient for a single training example at each iteration. This makes it easier to scale up to larger datasets and more complex models. Secondly, SGD is less likely to get stuck in local minima than GD, since the stochasticity in the update process can help the algorithm escape from local minima.

B. Gradient Descent cannot be used when the loss function is not differentiable, when the model is non-differentiable, or when it is not feasible to calculate the gradient over the entire dataset due to computational/memory limitations.

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

**Answer:**

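When increasing the batch size from B to 2B, the number of iterations required to converge to loss value $l_0$ is likely to decrease. A larger mini-batch averages the gradient over more samples, which halves the variance of the gradient estimate, gives more stable updates and allows each step (possibly with a larger learning rate) to make more reliable progress. The decrease is usually by less than a factor of two, though: fewer iterations are needed, but each iteration processes twice as many samples, so the total number of samples processed typically does not decrease.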

4. For each of the following statements, state whether they're **true or false** and explain why.
      1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
      1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
      1. SGD is less likely to get stuck in local minima, compared to GD.
      1. Training  with SGD requires more memory than with GD.
      1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
      1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

**Answer:**

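A. False when SGD refers, as is common in practice, to mini-batch SGD: we then perform one optimization step per mini-batch, not per sample. The statement holds only for pure single-sample SGD (a batch size of 1).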

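B. False. SGD gradients are estimated from a single sample or a small batch rather than from the entire dataset, so they are noisier, i.e. they have more variance than GD gradients, not less. SGD often reaches a good solution faster in terms of wall-clock time because each step is much cheaper, but this is not due to lower variance.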

C. True. SGD is less likely to get stuck in local minima, compared to GD because the random fluctuations introduced by the stochastic updates can help the algorithm escape from local minima and reach deeper ones.

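D. False. Training with SGD requires less memory than with GD because only a small mini-batch of samples (and its activations) has to be held in memory for each step, whereas GD has to process the entire dataset to compute a single gradient.

E. False. Neither method is guaranteed to reach the global minimum of a non-convex loss: with appropriate learning rates both converge to a stationary point, which may be a local minimum or a saddle point. GD would be guaranteed to reach the global minimum only for a convex loss.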

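F. False. Momentum helps first-order methods in a narrow ravine by accumulating velocity along the ravine floor while damping the oscillations across it, but it still needs many iterations when the curvature differs greatly between directions. Newton's method rescales the gradient by the inverse Hessian, which compensates for exactly this difference in curvature, so it does not need momentum; on a quadratic ravine it reaches the minimum in a single step. Newton's method is therefore expected to converge in fewer iterations, although each iteration is more expensive because it requires the Hessian.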
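
The comparison in statement F can be illustrated on a quadratic ravine. The sketch below uses plain Python and full (non-stochastic) gradients; the function $f(x,y)=\frac{1}{2}(x^2+\kappa y^2)$, the value of `KAPPA`, the learning rate and the momentum coefficient are all assumptions chosen for this illustration.

```python
import math

KAPPA = 100.0  # curvature ratio of the ravine (assumed for illustration)


def grad(p):
    """Gradient of f(x, y) = 0.5 * (x**2 + KAPPA * y**2)."""
    x, y = p
    return (x, KAPPA * y)


def norm(p):
    return math.hypot(p[0], p[1])


def momentum_gd(p, lr=0.01, beta=0.9, tol=1e-6, max_iter=10_000):
    """Gradient descent with (heavy-ball) momentum; returns the number of steps used."""
    v = (0.0, 0.0)
    for step in range(1, max_iter + 1):
        g = grad(p)
        v = (beta * v[0] - lr * g[0], beta * v[1] - lr * g[1])
        p = (p[0] + v[0], p[1] + v[1])
        if norm(p) < tol:  # the minimum is at the origin
            return step
    return max_iter


def newton(p, tol=1e-6, max_iter=10_000):
    """Newton's method; returns the number of steps used."""
    for step in range(1, max_iter + 1):
        g = grad(p)
        # The Hessian is diag(1, KAPPA), so its inverse rescales each coordinate
        p = (p[0] - g[0] / 1.0, p[1] - g[1] / KAPPA)
        if norm(p) < tol:
            return step
    return max_iter


start = (1.0, 1.0)
momentum_steps = momentum_gd(start)
newton_steps = newton(start)
print(f"momentum: {momentum_steps} steps, Newton: {newton_steps} step(s)")
```

On this quadratic, Newton's method lands on the minimum after one step, while momentum needs many more. For non-quadratic losses Newton's method needs more than one step, but the curvature correction still applies.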

5. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

**Answer:**

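False - just like in the tutorial, the inner optimization problem only needs to be solved in the forward pass, and any method can be used for that: a descent-based method, a general-purpose solver or a closed-form expression. The backward pass does not depend on how the solution was found, only on the fact that it satisfies the optimality condition.

Let $z$ be the input of the layer and $y = \arg\min_{y'} f(y',z)$ its output (the solution of the inner problem). At the optimum $\nabla_y f(y,z) = 0$, and differentiating this condition gives $K\,dy + R\,dz = 0$, so that $dy = -K^{-1}R\,dz$. The gradient with respect to the input is therefore obtained from the gradient with respect to the output by

$\delta z = -R^T K^{-1} \delta y$,

where:

$K = \nabla ^2 _{yy} f(y,z)$

$R = \nabla ^2 _{yz} f(y,z)$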

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
      1. Explain the concepts of "vanishing gradients", and "exploding gradients".
      2. How can each of these problems be caused by increased depth?
      3. Provide a numerical example demonstrating each.
      4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

**Answer:**
1. "Vanishing gradients" - the gradients of the loss function with respect to the network parameters (typically those of the early layers) become very small, so these parameters are barely updated and training stalls. "Exploding gradients" - the gradients become very large, leading to huge parameter updates and instability during training.

2. By the chain rule, the gradient that reaches an early layer is a product of one Jacobian per subsequent layer. If these factors are typically smaller than 1 in magnitude, the product decays exponentially with depth (vanishing gradients); if they are typically larger than 1, for example because the weights are initialized to large values, the product grows exponentially with depth (exploding gradients).

3. Vanishing gradients: in a network with logistic sigmoid activations, the derivative of each activation is $e^x/(1+e^x)^2 \le 0.25$. After $n$ layers the gradient is scaled by a product of $n$ such factors, which is at most $0.25^n$ and goes to 0 as $n$ increases (e.g. $0.25^{10}\approx 10^{-6}$). Exploding gradients: in a network with ReLU activations the derivative of each active unit is 1, but the weights may be large. If all the weights are 2, after $n$ layers the gradient is scaled by $2^n$, which goes to infinity as $n$ increases (e.g. $2^{10}=1024$).

4. One way to tell is to monitor the loss during training: if the loss does not improve or decreases very slowly, it's the vanishing gradients problem; if the loss fluctuates wildly, increases fast or becomes NaN/inf, it's the exploding gradients problem.
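
Both effects can be reproduced with a scalar "network" in which every layer has a single weight. This is a minimal sketch in plain Python; the function `chain_gradient`, the starting input of 1.0 and the weights used are assumptions made for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def chain_gradient(n_layers, weight, activation):
    """d(output)/d(input) of the scalar chain h <- act(weight * h), via the chain rule."""
    h, grad = 1.0, 1.0
    for _ in range(n_layers):
        z = weight * h
        if activation == "sigmoid":
            h = sigmoid(z)
            local = h * (1.0 - h)  # derivative of the sigmoid, at most 0.25
        else:  # "relu"
            h = max(z, 0.0)
            local = 1.0 if z > 0 else 0.0
        grad *= weight * local  # one chain-rule factor per layer
    return grad


for n in (5, 10, 20):
    vanishing = chain_gradient(n, weight=1.0, activation="sigmoid")
    exploding = chain_gradient(n, weight=2.0, activation="relu")
    print(f"n={n:2d}  sigmoid, w=1: {vanishing:.3e}   relu, w=2: {exploding:.3e}")
```

With sigmoid activations the gradient shrinks at least as fast as $0.25^n$, while with ReLU and weights of 2 it grows as $2^n$.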

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

**Answer:**

Denote, for a single sample, $\vec{z}=\mat{W}_1\vec{x}+\vec{b}_1$ and $\vec{h}=\varphi(\vec{z})$, so that $\hat{y}=\mat{W}_2\vec{h}+\vec{b}_2$ (a scalar, with $\mat{W}_2$ a row vector). First, let us calculate the partial derivatives of $\hat{y}$, using the shorthand $\vec{g}=\mat{W}_2^\top\odot\varphi'(\vec{z})$ for the derivative of $\hat{y}$ w.r.t. $\vec{z}$:

$\frac{\partial\hat{y}}{\partial\mat{W}_2} = \vec{h}^\top = \varphi(\mat{W}_1\vec{x}+\vec{b}_1)^\top$

$\frac{\partial\hat{y}}{\partial\vec{b}_2} = 1$

$\frac{\partial\hat{y}}{\partial\mat{W}_1} = \vec{g}\,\vec{x}^\top = \left(\mat{W}_2^\top\odot\varphi'(\mat{W}_1\vec{x}+\vec{b}_1)\right)\vec{x}^\top$

$\frac{\partial\hat{y}}{\partial\vec{b}_1} = \vec{g} = \mat{W}_2^\top\odot\varphi'(\mat{W}_1\vec{x}+\vec{b}_1)$

$\frac{\partial\hat{y}}{\partial\vec{x}} = \mat{W}_1^\top\vec{g} = \mat{W}_1^\top\left(\mat{W}_2^\top\odot\varphi'(\mat{W}_1\vec{x}+\vec{b}_1)\right)$

The derivative of the pointwise loss w.r.t. the prediction is

$\frac{\partial\ell}{\partial\hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$

Denote by $\delta^{(i)}$ this derivative evaluated at sample $i$, and by $\vec{h}^{(i)}$, $\vec{g}^{(i)}$ the corresponding values of $\vec{h}$, $\vec{g}$. Applying the chain rule and adding the derivative of the regularization term gives:

$\frac{\partial L_{\mathcal{S}}}{\partial\mat{W}_2} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\,{\vec{h}^{(i)}}^\top + \lambda\mat{W}_2$

$\frac{\partial L_{\mathcal{S}}}{\partial\mat{W}_1} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\,\vec{g}^{(i)}\,{\vec{x}^{(i)}}^\top + \lambda\mat{W}_1$

$\frac{\partial L_{\mathcal{S}}}{\partial\vec{b}_2} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}$

$\frac{\partial L_{\mathcal{S}}}{\partial\vec{b}_1} = \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\,\vec{g}^{(i)}$

$\frac{\partial L_{\mathcal{S}}}{\partial\vec{x}^{(i)}} = \frac{1}{N}\,\delta^{(i)}\,\mat{W}_1^\top\vec{g}^{(i)}$
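
The derivative of the binary cross-entropy w.r.t. the prediction, $-y/\hat{y} + (1-y)/(1-\hat{y})$, is the factor shared by all of these gradients, so it is worth a numerical sanity check. The sketch below compares it with a central finite difference in plain Python; `bce`, `bce_grad` and the test points are assumptions made for this illustration.

```python
import math


def bce(y, y_hat):
    """Binary cross-entropy for a single sample."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def bce_grad(y, y_hat):
    """Analytic derivative of the binary cross-entropy w.r.t. y_hat."""
    return -y / y_hat + (1 - y) / (1 - y_hat)


eps = 1e-6
max_rel_error = 0.0
for y in (0.0, 1.0):
    for y_hat in (0.1, 0.5, 0.9):
        # Central finite difference around y_hat
        numeric = (bce(y, y_hat + eps) - bce(y, y_hat - eps)) / (2 * eps)
        analytic = bce_grad(y, y_hat)
        max_rel_error = max(max_rel_error, abs(numeric - analytic) / abs(analytic))

print(f"largest relative error: {max_rel_error:.1e}")
```

A small relative error indicates that the analytic expression and the finite-difference estimate agree at the tested points.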


2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
      1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).

      2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

**Answer:**

A. Using this formula, we can compute the numerical gradient of the loss function with respect to each parameter, and use it to update the network parameters using one of the gradient descent family of optimization algorithms.

B. Two major drawbacks are:
- Computationally expensive: numerical computation of the gradients requires evaluating the function at two perturbed values for each parameter, which can be computationally expensive especially for large neural networks with many parameters. 
- Less accurate: the accuracy of the computed gradients can be less than that obtained using automatic differentiation, since choosing the appropriate perturbation value can be challenging, leading to numerical errors or inaccurate approximations of the gradient.


3. Given the following code snippet:
      1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
      2. Calculate the same derivatives with autograd.
      3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

```python
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# TODO: Calculate gradients numerically for W and b

# small epsilon value used as the perturbation
eps = 1e-6

# initialize tensors to store the gradients
grad_W = torch.zeros_like(W)
grad_b = torch.zeros_like(b)

# compute the gradients for each element of W and b with central differences
for i in range(d):
    for j in range(d):
        # W[i, j]
        W_plus = W.clone()
        W_plus[i, j] += eps
        loss_plus = foo(W_plus, b)
        W_minus = W.clone()
        W_minus[i, j] -= eps
        loss_minus = foo(W_minus, b)
        grad_W[i, j] = (loss_plus - loss_minus) / (2 * eps)

    # b[i]
    b_plus = b.clone()
    b_plus[i] += eps
    loss_plus = foo(W, b_plus)
    b_minus = b.clone()
    b_minus[i] -= eps
    loss_minus = foo(W, b_minus)
    grad_b[i] = (loss_plus - loss_minus) / (2 * eps)

# TODO: Compare with autograd using torch.allclose()
loss.backward()
autograd_W = W.grad
autograd_b = b.grad

assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)
```

```
loss=tensor(1.8527, dtype=torch.float64, grad_fn=<MeanBackward0>)
```


### Sequence models

1. Regarding word embeddings:
      1. Explain this term and why it's used in the context of a language model.
      1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

**Answer:**

A. Word embeddings are dense, real-valued vector representations of words (tokens), learned so that words with related meanings or usage are mapped to nearby vectors. They are used in language models because raw token indices carry no semantic information, whereas the embedding vectors give the model a continuous representation in which linguistic relationships between words can be captured, which improves accuracy.

B. It is possible, but the results will generally be worse than those obtained with embeddings. If the model is fed raw token indices, it is given an arbitrary numeric ordering of the words that has nothing to do with their meaning; if it is fed one-hot vectors, all words are equally distant from each other. In both cases there is no notion of similarity between words, so two words that are close in meaning are treated as unrelated and the model has to learn how to handle each word separately, which hurts generalization, especially for words that are rare in the training set.

2. Considering the following snippet, explain:
      1. What does `Y` contain? why this output shape?
      2. How you would implement `nn.Embedding` yourself using only torch tensors. 

```python
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")
```

```
Y.shape=torch.Size([5, 6, 7, 8, 42000])
```


**Answer:**

A. `Y` contains the embedding vectors of the integers (token indices) in `X`. Its shape is (5, 6, 7, 8, 42000) because `X` has shape (5, 6, 7, 8) and the `nn.Embedding` layer replaces each integer in `X` with an embedding vector of size 42000, which adds a trailing dimension.

B. We'd follow these steps:
- initialize a random tensor of shape (num_embeddings, embedding_dim), called embedding_weights_tensors, with requires_grad=True so that it can be trained
- implement the forward pass by:
     - flattening the input tensor x to a 1D tensor
     - selecting the corresponding rows of embedding_weights_tensors using torch.index_select with the flattened x as the index
     - reshaping the result back to the original shape of x, with an additional trailing dimension of size embedding_dim.
     
The result tensor has shape (*x.shape, embedding_dim), the same as the shape of the output tensor of the nn.Embedding layer.
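
The steps above can be sketched as follows. This is an illustrative implementation, not the actual code of `nn.Embedding`: the function name `embed` and the tensor name `weight` are hypothetical, and a small `embedding_dim` is assumed to keep the example light.

```python
import torch

num_embeddings, embedding_dim = 42, 16  # smaller embedding_dim than in the question

# Trainable lookup table: one row per token index
weight = torch.randn(num_embeddings, embedding_dim, requires_grad=True)


def embed(indices, weight):
    """Look up one row of `weight` per integer in `indices`, keeping the input shape."""
    flat = indices.reshape(-1)                  # flatten to a 1D index tensor
    rows = torch.index_select(weight, 0, flat)  # pick the corresponding rows
    return rows.reshape(*indices.shape, weight.shape[1])  # restore shape + new last dim


X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7, 8))
Y = embed(X, weight)
print(Y.shape)

# Compare against the functional form of PyTorch's own embedding lookup
reference = torch.nn.functional.embedding(X, weight)
print(torch.equal(Y, reference))
```

Because `index_select` is differentiable w.r.t. `weight`, gradients flow only into the rows that were actually looked up, which is the behaviour expected of an embedding layer.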

3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
      1. TBPTT uses a modified version of the backpropagation algorithm.
      2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
      3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

**Answer:**

A. True. TBPTT is a modification of backpropagation through time for training recurrent neural networks (RNNs): the long sequence is broken into shorter sub-sequences and the gradient flow is cut at their boundaries. Within each sub-sequence the gradients are computed with the standard chain rule.

B. False. In addition to splitting the input into sub-sequences of length S and performing the forward and backward passes on each of them, we also need to save the hidden state at the end of each sub-sequence and use it (detached from the computation graph) as the initial hidden state for the next sub-sequence.

C. False. TBPTT limits the gradient computation to S timesteps, but the hidden state is carried over from one sub-sequence to the next, so the forward pass still has access to information from inputs that are more than S timesteps in the past. Learning such long-range relations is harder, because no gradient flows directly across more than S timesteps, but it is not impossible.

### Attention

1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
      1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?

      2. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?


**Answer:**

A. The addition of the attention mechanism in a sequence-to-sequence model allows the decoder to selectively emphasize different parts of the input sequence when generating each output token, thanks to the attention weights that determine the importance of each input position for the current output token.
As a result, each encoder hidden state only needs to be informative about its own input position and its local context, and the decoder hidden states learn to act as queries that select the relevant source positions.
In contrast, in a model without attention, the encoder must compress the entire input sequence into a single fixed-length context vector.
Therefore, models with attention handle long input sequences better and capture the details that are important for the output sequence.

B. It will harm the results. The purpose of the attention here is to let the decoder focus on different parts of the source sequence at each output step. If the keys, queries and values all come from the encoder and nothing comes from the decoder, the decoder has no influence on the attention weights and cannot choose which source positions to focus on. The encoder hidden states will become more contextualized, since each of them mixes information from the whole source sequence, but the alignment between the source and target sequences is lost.

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

      1. Images reconstructed by the model during training ($x\to z \to x'$)?
      1. Images generated by the model ($z \to x'$)?

**Answer:**

A. The reconstructions during training will be at least as good, and probably sharper. Without the KL-divergence term the model is trained on the reconstruction term alone, so it behaves like a regular autoencoder: nothing regularizes the latent space, and the encoder is free to use well-separated latent codes with very small variance so that the reconstructed images are as close as possible to the original inputs.

B. The generated images will be poor. Without the KL term, nothing encourages the distribution of the latent codes to match the prior $\mathcal{N}(\vec{0},\vec{I})$ that we sample from. A sampled $z$ is therefore likely to fall in a region of the latent space that the decoder never saw during training, and Decoder($z$) is then not meaningful. The output may resemble the training data if $z$ happens to be close to the latent code of a training image, and look like noise otherwise.

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
      1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
      2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
      3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

**Answer:**

A. False - $\mathcal{N}(\vec{0},\vec{I})$ is the prior over the latent space. For a specific input image $x$, the encoder outputs the parameters of a different Gaussian, $\mathcal{N}(\vec{\mu}(x), \mathrm{diag}(\vec{\sigma}^2(x)))$, which depends on the image. The KL-divergence term only pulls these distributions towards the prior; they do not become equal to it, otherwise the latent code would carry no information about the image.

B. False. Each time the image is fed to the encoder, a latent vector $z$ is sampled from the distribution that the encoder outputs for it (using the reparametrization trick), so each pass generally produces a different $z$ and therefore a different reconstruction. The reconstructions will be similar to each other, but not identical.

C. True. The quantity we would really like to minimize is the negative log-likelihood of the data, $-\log p(x)$, which is intractable because it requires marginalizing over the latent variable. Instead we minimize a tractable upper bound on it (equivalently, we maximize the evidence lower bound, ELBO), which consists of the reconstruction term and the KL-divergence term. This is done in the hope that the bound is tight enough to result in good reconstructions and representative latent codes.

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
      1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
      2. It's crucial to backpropagate into the generator when training the discriminator.
      3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
      4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
      5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

**Answer:**

A. False - we would like the discriminator to be unable to tell authentic items from fake ones, but that does not mean a discriminator that is always wrong: its ideal accuracy is 50%, not 0%. The generator is trained on the discriminator's feedback, so it needs an informative discriminator in order to improve. Ideally, then, neither loss is driven to an extreme; both settle around the value that corresponds to a discriminator that outputs 50% for every image.

B. False. When training the discriminator, the generated images are treated as fixed inputs, so the discriminator's loss only needs to be backpropagated into the discriminator's own parameters. The generator output is typically detached from the computation graph in this step, which also saves computation. Gradients are propagated through the discriminator into the generator only in the generator's own update step.

C. True - In GANs, it's typical to sample a latent-space vector from a prior distribution, such as a normal distribution N(0, I), and use this vector as input to the generator to produce a new image.

D. True - Training the discriminator for a few epochs first can be beneficial. A discriminator with arbitrary output gives the generator a meaningless gradient, whereas a slightly trained one already provides some information about which image features distinguish real images from generated ones. It should not be trained for too long, though, since a discriminator that is too strong leaves the generator with vanishing gradients.

E. False - If the discriminator has reached a stable state where it has 50% accuracy, its output no longer carries information about how to make the generated images more realistic. Training the generator further against it will not improve the generated images, and may even degrade them.