$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 2: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

======================================================================
ANSWER:

In Convolutional Neural Networks (CNNs), the receptive field of a neuron is the region of the input image that can influence that neuron's value, i.e. the part of the input it is "looking at". Its size is determined by the kernel sizes, strides, dilations, and pooling of all the layers between the input and that neuron. The receptive field grows with depth, because each neuron aggregates a window of neurons from the previous layer, and each of those already covers its own region of the input.



2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

======================================================================
ANSWER:

There are several ways to control the rate at which the receptive field grows from layer to layer in CNNs.
The first approach is to use smaller convolutional kernels (such as 3x3) and increase the number of layers in the network. This approach is called "deepening" and has the advantage of increasing the non-linearity of the network, since each layer introduces a non-linear activation function. By using smaller kernels, the receptive field grows more slowly, but more layers are needed to cover the same region of the input image.

The second approach is to use pooling layers between the convolutional layers. Pooling layers reduce the spatial dimensionality of the input, typically by taking the maximum or average value over a small region (such as 2x2) of the previous layer. This has the effect of increasing the receptive field of each neuron in the next layer, since they are now looking at a larger region of the input. However, pooling layers also reduce the resolution of the input, which can result in loss of information.

The third approach is to use dilated convolutions, also known as atrous convolutions. Dilated convolutions insert gaps between the values in the convolutional kernel, effectively increasing the size of the kernel without increasing the number of parameters. This has the effect of increasing the receptive field of each neuron, while still maintaining a high spatial resolution of the input. However, dilated convolutions can result in a more sparse representation of the input, which may reduce the performance of the network.

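In terms of how they combine input features: stacking small kernels combines neighbouring features densely with learned weights, and the non-linearities between layers make the combination hierarchical. Pooling combines the features in each window with a fixed, non-learned operation (max or average) and discards the rest, so the receptive field grows at the cost of spatial resolution. Dilated convolutions combine features with learned weights like a regular convolution, but sample them sparsely: the inputs between the kernel taps are skipped, so a wide region is covered without being densely combined.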



3. Imagine a CNN with three convolutional layers, defined as follows:

======================================================================
ANSWER:

The CNN with three convolutional layers can be defined as follows:

- Layer 1: 32 filters with a 3x3 kernel, ReLU activation, and padding
- Layer 2: 64 filters with a 3x3 kernel, ReLU activation, and padding
- Layer 3: 128 filters with a 3x3 kernel, ReLU activation, and padding

In layer 1, each neuron has a receptive field of 3x3 pixels, meaning it is looking at a 3x3 patch of the input. In layer 2, each neuron has a receptive field of 5x5 pixels, since it receives input from a 3x3 patch of the previous layer. Finally, in layer 3, each neuron has a receptive field of 7x7 pixels, since it again receives input from a 3x3 patch of the previous layer.

To interpret the performance of the network, one would need to consider the dataset being used, the objective of the task, and the evaluation metrics being used. However, in general, deeper networks with larger receptive fields tend to perform better on tasks that require a high degree of spatial abstraction, such as object recognition or semantic segmentation.



```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape
```

```
torch.Size([1, 32, 122, 122])
```

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

======================================================================
ANSWER:

The size of the receptive field of each output "pixel" depends on the kernel size, stride, and dilation of every layer on the path from the input. Let $r_l$ be the receptive field after layer $l$ and $j_l$ the distance, in input pixels, between two adjacent outputs of layer $l$, with $r_0=j_0=1$. A layer with kernel size $k$, stride $s$ and dilation $d$ gives $r_l = r_{l-1} + d(k-1)\,j_{l-1}$ and $j_l = s\,j_{l-1}$. Padding changes the output size but not the receptive field, and ReLU is pointwise so it changes neither.

Applying this layer by layer:

- After the first convolutional layer (kernel size 3, padding 1), the output is 1024x1024 and each pixel has a receptive field of 3x3.
- After the first max pooling layer (kernel size 2, stride 2), the output is 512x512 and the receptive field is $3 + (2-1)\cdot 1 = 4$, i.e. 4x4. Adjacent outputs are now 2 input pixels apart.
- After the second convolutional layer (kernel size 5, stride 2, padding 2), the output is 256x256 and the receptive field is $4 + (5-1)\cdot 2 = 12$, i.e. 12x12. Adjacent outputs are now 4 input pixels apart.
- After the second max pooling layer (kernel size 2, stride 2), the output is 128x128 and the receptive field is $12 + (2-1)\cdot 4 = 16$, i.e. 16x16. Adjacent outputs are now 8 input pixels apart.
- After the third convolutional layer (kernel size 7, dilation 2, padding 3), the dilated kernel spans $2\cdot(7-1)+1 = 13$ positions, so the output is 122x122, matching the shape printed above, and the receptive field is $16 + (13-1)\cdot 8 = 112$.

Each "pixel" in the output tensor therefore has a receptive field of 112x112 input pixels.
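The layer-by-layer bookkeeping can also be sketched in a few lines of Python. This is a minimal sketch, assuming the standard recurrence in which each layer adds `(effective_kernel - 1) * jump` input pixels to the receptive field and multiplies the jump by its stride. The function name `receptive_field` and the `(kernel_size, stride, dilation)` tuple encoding are illustrative choices, not part of the assignment. For the network in the question it gives 112 for the last layer.

```python
def receptive_field(layers):
    """Return the receptive-field size (in input pixels) after each layer.

    Each layer is a (kernel_size, stride, dilation) tuple.
    """
    r, jump = 1, 1  # receptive field, and spacing between adjacent outputs, in input pixels
    sizes = []
    for kernel_size, stride, dilation in layers:
        effective_kernel = dilation * (kernel_size - 1) + 1
        r += (effective_kernel - 1) * jump
        jump *= stride
        sizes.append(r)
    return sizes


# The network from the question; ReLU layers are omitted because they are pointwise,
# and padding is omitted because it does not change the receptive field.
layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
rf_sizes = receptive_field(layers)
print(rf_sizes)  # [3, 4, 12, 16, 112]
```

Output pixels near the border see part of the zero padding, so their receptive field covers fewer real input pixels than the nominal size.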




4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead, $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.


ANSWER: 


The filters differ because the two networks ask each layer to learn a different function. In the original CNN, $f_l$ must produce the desired output $\vec{y}_l$ directly from its input, so the filters $\vec{\theta}_l$ encode the full mapping. In the residual CNN the input is added back through the shortcut, so $f_l$ only has to produce the residual $\vec{y}_l-\vec{x}$, i.e. the change that should be applied to the input. For example, if the best mapping for some layer is close to the identity, the original network needs filters that approximate an identity kernel, while the residual network needs filters close to zero. The shortcut also changes the optimization itself, because gradients reach earlier layers directly through the identity path. The two sets of filters therefore solve different objectives and are not directly comparable, even if both networks compute similar overall functions.



### Dropout

1. Consider the following neural network:

```python
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)
```

```
Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)
```

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

======================================================================
ANSWER:

In order to replace the two consecutive dropout layers with a single one, we need to find the equivalent drop probability q that would have the same effect as applying p1 and p2 consecutively. This can be computed as follows:

q = 1 - (1 - p1) * (1 - p2)

The idea behind this calculation is that the two dropout masks are sampled independently, so the probability that a unit is kept by both layers is the product of the individual keep probabilities, (1 - p1) * (1 - p2). The equivalent single layer must therefore drop a unit with probability q = 1 - keep_prob. The scaling also matches: the two layers scale surviving units by 1/(1 - p1) and then by 1/(1 - p2), which together equal 1/(1 - q).
======================================================================
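As a quick sanity check of the expression for `q`, the sketch below plugs in the values `p1 = 0.1` and `p2 = 0.2` from the question and compares the scale factor applied to a surviving unit by the two stacked layers with the one applied by a single layer with drop probability `q`. The variable names other than `p1`, `p2` and `q` are illustrative.

```python
p1, p2 = 0.1, 0.2

keep_both = (1 - p1) * (1 - p2)  # probability that a unit survives both layers
q = 1 - keep_both                # equivalent single drop probability

# Scale applied to a surviving unit: two stacked layers vs. one layer with p=q.
scale_two_layers = (1 / (1 - p1)) * (1 / (1 - p2))
scale_one_layer = 1 / (1 - q)

print(round(q, 4))                                    # 0.28
print(abs(scale_two_layers - scale_one_layer) < 1e-12)  # True
```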



2. **True or false**: dropout must be placed only after the activation function.

======================================================================
ANSWER:

False. Dropout can be placed either before or after the activation function. For ReLU the two placements are equivalent: dropout multiplies each unit by either zero or the positive constant $1/(1-p)$, ReLU maps zero to zero, and it commutes with multiplication by a positive constant, so applying the mask before or after ReLU gives the same output. For activations where $\varphi(0)\neq 0$, such as the sigmoid, the order does change the result, since a unit dropped before the activation still outputs $\varphi(0)$ rather than zero. In that case only dropout placed after the activation actually zeroes the dropped units.
======================================================================
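For ReLU specifically, the order of dropout and activation does not change the output, because ReLU maps zero to zero and commutes with scaling by a positive constant. The sketch below checks this on random values in plain Python. The hand-written `relu` helper, the fixed seed, and the drop probability `p = 0.3` are assumptions made for illustration only.

```python
import random


def relu(x):
    return max(0.0, x)


rng = random.Random(0)  # fixed seed so the illustration is reproducible
p = 0.3                 # hypothetical drop probability
xs = [rng.uniform(-2, 2) for _ in range(1000)]
mask = [1.0 if rng.random() >= p else 0.0 for _ in xs]  # 1 = keep, 0 = drop

# Inverted dropout (scale by 1/(1-p)) applied before vs. after the activation.
dropout_then_relu = [relu(m * x / (1 - p)) for m, x in zip(mask, xs)]
relu_then_dropout = [m * relu(x) / (1 - p) for m, x in zip(mask, xs)]

same = all(abs(a - b) < 1e-12 for a, b in zip(dropout_then_relu, relu_then_dropout))
print(same)  # True
```

The same check would fail for a sigmoid, since a unit dropped before the activation would output 0.5 instead of 0.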


3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

======================================================================
ANSWER:

Consider a single activation a and a binary mask m that equals 1 with probability 1-p (the unit is kept) and 0 with probability p (the unit is dropped). Without any scaling, the expected value of the activation after dropout is:

E[m * a] = (1-p) * a + p * 0 = (1-p) * a

so dropout alone shrinks every activation by a factor of (1-p) in expectation. To undo this, the output must be multiplied by a constant c such that c * (1-p) * a = a, which gives c = 1/(1-p). The activation after dropout is therefore:

a' = a * m / (1-p)

and its expected value is:

E[a'] = (1-p) * a / (1-p) + p * 0 = a

Multiplying a by m sets the dropped units to zero and leaves the kept units unchanged, while the factor 1/(1-p) is the only scaling for which the expected value of a' equals the value of a before dropout.
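The expectation argument can be checked empirically with a small simulation. This is a rough sketch in plain Python rather than PyTorch. The activation value `a = 2.0`, the drop probability `p = 0.3`, the sample count, and the seed are arbitrary illustrative choices.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility
a, p = 2.0, 0.3         # hypothetical activation value and drop probability
n = 200_000

total = 0.0
for _ in range(n):
    mask = 1.0 if rng.random() >= p else 0.0  # keep with probability 1 - p
    total += a * mask / (1 - p)               # inverted-dropout scaling

empirical_mean = total / n
print(empirical_mean)  # close to a = 2.0
```

Without the division by `(1 - p)` the same simulation would average to roughly `(1 - p) * a`.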




### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?


======================================================================
ANSWER:

No, an L2 loss is not a good choice for training a binary classifier like this one. The L2 loss, also known as the mean squared error (MSE) when averaged over samples, is a regression loss: it measures the squared difference between the predicted and target values and treats the targets as points on a continuous scale. Here the target is a discrete label with only two possible values, and the model output is interpreted as the probability of the positive class.
The squared error between a probability and a 0/1 label is bounded by 1, so even a confidently wrong prediction receives only a small penalty. In addition, when the probability comes from a sigmoid, the gradient of the L2 loss with respect to the logit contains the factor p(1-p), which goes to zero exactly when the model is confidently wrong, so learning stalls where it is needed most.

In binary classification, a common loss function to use is the binary cross-entropy loss, also known as log loss. This loss function is designed to measure the difference between two probability distributions, in this case, the predicted probability distribution and the true probability distribution. The binary cross-entropy loss works by taking the negative log likelihood of the predicted probability of the correct class.

Here's an example to illustrate this. Let's say the model outputs a probability distribution over the two classes, dog (output 0) and hotdog (output 1), and the true label of an image is hotdog, so the true distribution is [0, 1].

If the model outputs [0.5, 0.5], the L2 loss is (0.5-0)^2 + (0.5-1)^2 = 0.5. If the model is confidently wrong and outputs [0.99, 0.01], the L2 loss is (0.99-0)^2 + (0.01-1)^2 = 1.9602, which is close to its maximum of 2 and barely changes as the prediction gets even worse. In contrast, the binary cross-entropy loss is -log(0.5) = 0.693 in the first case and -log(0.01) = 4.605 in the second, and it keeps growing without bound as the predicted probability of the true class approaches zero.

Instead, we can use a binary cross-entropy loss, which is a commonly used loss function for binary classification problems. The binary cross-entropy loss measures the difference between the predicted probability and the target probability for a binary classification problem. It is defined as:

L = -[y*log(p) + (1-y)*log(1-p)]



where y is the ground-truth label (0 for dog and 1 for hotdog), p is the predicted probability of the positive class (hotdog), and log is the natural logarithm.

To illustrate the difference between L2 and binary cross-entropy losses, consider two training examples with the following true labels and model predictions:

| Example | True label | Model prediction |
|---|---|---|
| 1 | 0 | 0.8 |
| 2 | 1 | 0.2 |

Both predictions fall on the wrong side of 0.5, so the model misclassifies both examples. With an L2 loss, the summed squared error is:

L2 loss = (0 - 0.8)^2 + (1 - 0.2)^2 = 0.64 + 0.64 = 1.28

(a mean of 0.64 per example). Each term can never exceed 1, no matter how confident the wrong prediction is.

With a binary cross-entropy loss, the summed loss is:

Binary cross-entropy loss =
 -[0*log(0.8) + (1-0)*log(1-0.8)] - [1*log(0.2) + (1-1)*log(1-0.2)] = -2*log(0.2) = 3.219

(a mean of 1.609 per example). This loss penalizes confident mistakes much more heavily, and the penalty grows without bound as the predicted probability of the true class approaches zero, which makes it a more suitable loss function for binary classification.

In summary, an L2 loss treats the labels as continuous regression targets: its penalty for a wrong prediction is bounded, and with a sigmoid output its gradient vanishes for confidently wrong predictions. The binary cross-entropy loss is the negative log-likelihood of the true label under the predicted probability, so it penalizes confident mistakes heavily and keeps a useful gradient.

Therefore, we should use a binary cross-entropy loss to train a binary classifier like the one described in the problem statement.
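The comparison between the two losses can be made concrete with a few numbers. The sketch below assumes the predicted probability comes from a sigmoid, `p = sigmoid(z)`, which is an assumption not stated in the question. Under that assumption the gradient of the L2 loss with respect to the logit `z` is `2 * (p - y) * p * (1 - p)`, and the gradient of the binary cross-entropy is `p - y`. The helper names `l2` and `bce` are illustrative.

```python
import math


def l2(y, p):
    return (y - p) ** 2


def bce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


y = 1  # true label: hotdog
for p in (0.5, 0.2, 0.01):  # increasingly wrong predictions
    # Gradients w.r.t. the logit z, assuming p = sigmoid(z).
    grad_l2 = 2 * (p - y) * p * (1 - p)
    grad_bce = p - y
    print(f"p={p}: L2={l2(y, p):.4f} BCE={bce(y, p):.4f} "
          f"dL2/dz={grad_l2:.4f} dBCE/dz={grad_bce:.4f}")
```

For `p = 0.01` the L2 loss is about 0.98 with a gradient of about -0.02, while the cross-entropy is about 4.6 with a gradient of -0.99: the worse the prediction, the weaker the L2 learning signal becomes.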






2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/PiratesVsTemp%28en%29.svg/1200px-PiratesVsTemp%28en%29.svg.png?20110518040647" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [9]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

======================================================================
ANSWER:

The most likely cause for the plateau in loss after only a few iterations is the vanishing gradient problem. This problem arises when gradients in the backpropagation algorithm become too small to effectively update the weights in the earlier layers of the network. As a result, the weights in these layers remain largely unchanged, leading to a stagnant or plateauing training process.

In the given model, the repeated use of the Sigmoid activation function may be causing the vanishing gradient problem. The Sigmoid function has a maximum gradient of 0.25, which means that as backpropagation proceeds through the layers of the network, the gradients can become exponentially small. This makes it difficult to update the weights in the earlier layers of the network, and can lead to a plateau in training.

To address this issue, one potential solution is to use an activation function with a larger maximum gradient, such as the Rectified Linear Unit (ReLU). Another solution could be to use normalization techniques, such as Batch Normalization or Layer Normalization, which help stabilize the gradient flow through the network.

In addition, the given model is very deep: it has 25 hidden layers (the first one plus 24 repeated blocks), each followed by a Sigmoid. Deep networks are more prone to the vanishing gradient problem, especially with saturating activation functions. In this case, reducing the number of layers or using skip connections (such as in a ResNet architecture) could help alleviate the issue.

Overall, the vanishing gradient problem is a well-known challenge in deep learning, and addressing it requires careful consideration of the model architecture and the activation functions used.
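To get a feeling for the magnitudes involved, the sketch below computes an upper bound on the factor contributed by the Sigmoid derivatives alone when backpropagating through `mlpirate`, which contains 25 Sigmoid layers (1 + 24). It ignores the weight matrices, which also scale the gradient, so it is only a rough illustration and not an exact gradient computation. The helper names are illustrative.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)


max_grad = sigmoid_grad(0.0)  # 0.25, the maximum, attained at z = 0
num_sigmoids = 25             # 1 + 24 Sigmoid layers in mlpirate

# Best case: every pre-activation sits exactly at z = 0.
activation_factor_bound = max_grad ** num_sigmoids
print(max_grad, activation_factor_bound)  # 0.25 and roughly 8.9e-16
```

Even in this best case the activations alone shrink the gradient by about fifteen orders of magnitude before it reaches the first layer.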


3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

======================================================================
ANSWER:

He is only partially correct: replacing the sigmoid activations with tanh is likely to reduce the problem but not solve it. The tanh activation is also sigmoidal in shape, but it is centered around zero and ranges from -1 to 1 instead of 0 to 1. Its derivative reaches 1 at zero, compared to a maximum of 0.25 for the sigmoid, so gradients shrink more slowly as they are propagated backwards.

However, tanh still saturates: its derivative is strictly smaller than 1 everywhere except at zero and approaches zero for inputs of large magnitude. The mlpirate model has 25 hidden layers, so the backpropagated gradient is still a product of 25 such factors and can still become vanishingly small. A non-saturating activation such as ReLU, normalization layers, or skip connections would address the cause more directly.


4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
  1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
  1. The gradient of ReLU is linear with its input when the input is positive.
  1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

======================================================================
ANSWER:

a) False. ReLU activations alleviate the problem of vanishing gradients, but they do not eliminate it. When the input to a ReLU is negative its gradient is zero, so gradients passing through inactive units are blocked, and the weight matrices themselves can still shrink the gradient as it is propagated through many layers.

b) False. When the input to a ReLU is positive, its output is linear in the input, but its gradient is the constant 1, which does not depend on the input.

c) True. ReLU can cause "dead" neurons, which are neurons that always output zero, regardless of the input. This can happen if the weights and bias are such that the weighted input is always negative. In this case, the gradient of the neuron is always zero, so its weights are never updated and the neuron remains inactive. Dead neurons reduce the capacity of a neural network. One way to address this issue is to use variants of ReLU, such as leaky ReLU or ELU, which keep a non-zero gradient for negative inputs.


### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).
Answer:
- Stochastic gradient descent (SGD) updates the model parameters by computing the gradient using only one randomly selected data point at a time, making the optimization noisy.
- Regular gradient descent (GD) computes the gradient of the loss function with respect to the model parameters over the entire training set for each parameter update, making the optimization deterministic.
- Mini-batch SGD is a compromise between SGD and GD, in which the gradient is computed using a mini-batch of data points, rather than just one, striking a balance between the noise of SGD and the determinism of GD.
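The three variants differ only in how many samples are used for each update, which the sketch below makes explicit with a single `batch_size` argument. It is a toy illustration in plain Python, assuming a one-parameter model `y = w * x`, a squared-error loss, and noiseless synthetic data with true `w = 3`. The function name `train` and all hyperparameters are hypothetical.

```python
import random


def train(xs, ys, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y = w * x with squared error; batch_size selects the variant."""
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # new random order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(64)]
ys = [3.0 * x for x in xs]  # noiseless toy data, true w = 3

w_gd = train(xs, ys, batch_size=len(xs))  # GD: one update per epoch
w_minibatch = train(xs, ys, batch_size=8)  # mini-batch SGD: 8 updates per epoch
w_sgd = train(xs, ys, batch_size=1)        # SGD: 64 updates per epoch
print(w_gd, w_minibatch, w_sgd)
```

Because the toy data is noiseless, all three variants end up near `w = 3`. With noisy data the smaller batch sizes would fluctuate around the solution instead of settling on it.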


2. Regarding SGD and GD:
  1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
  2. In what cases can GD not be used at all?

Answer:
- SGD is used more often than GD in practice because each update is computationally cheaper and requires less memory, since only one sample or a small batch is processed at a time. It is therefore more scalable and handles large datasets more efficiently. Additionally, SGD often reaches a good solution with less total computation than GD, because it makes many cheap updates in the time GD makes one, provided the learning rate and momentum are appropriately chosen.
- GD cannot be used as-is when a full pass over the training set is not possible, for example when the data arrives as a stream (online learning), and it becomes impractical when the dataset is too large to process in memory at once.


3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering increasing the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

Answer:
We would expect the number of iterations required to converge to $l_0$ to decrease when using the new mini-batch size. A larger batch gives a more precise estimate of the gradient: averaging over $2B$ samples instead of $B$ halves the variance of the estimate, so each update direction is more informative and less noisy, and fewer updates are needed. The number of iterations does not necessarily halve, since doubling the batch only reduces the noise and does not change the underlying loss surface. Note that each iteration now processes twice as many samples; the additional GPU memory lets them be processed in parallel, so the time per iteration need not double.
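The effect of the batch size on gradient noise can be illustrated with a small simulation. The sketch below uses independent Gaussian draws as a stand-in for per-sample gradients, which is a simplifying assumption and not how real gradients are distributed. It compares the variance of the averaged "gradient" for batch sizes `B` and `2B`. The values of `B`, `trials`, and the seed are arbitrary.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility


def minibatch_gradient(batch_size):
    # Stand-in for per-sample gradients: i.i.d. draws with mean 1 and variance 1.
    return sum(rng.gauss(1.0, 1.0) for _ in range(batch_size)) / batch_size


B, trials = 8, 20_000
var_B = statistics.pvariance([minibatch_gradient(B) for _ in range(trials)])
var_2B = statistics.pvariance([minibatch_gradient(2 * B) for _ in range(trials)])

ratio = var_B / var_2B
print(var_B, var_2B, ratio)  # ratio is close to 2
```

Halving the variance makes each step more reliable, but it does not by itself imply that the iteration count halves.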

4. For each of the following statements, state whether they're **true or false** and explain why.

Answer:

- False. SGD updates the model parameters once per mini-batch, meaning that only a fraction of the dataset is used in each update; over a full epoch, every sample is still used once.
- True. Since the gradients computed with SGD are based on a small subset of the dataset, they have a higher variance than the gradients computed with GD, which are based on the entire dataset. This higher variance leads to faster convergence. 
- True. The higher variance of SGD gradients means that the optimization can escape local minima more easily than GD. 
- False. GD requires more memory than SGD because it stores the entire training set in memory to compute gradients efficiently. 
- False. Both GD and SGD aim to converge to a stationary point such as a local minimum, but neither is guaranteed to find the global minimum of a non-convex loss, even with appropriate learning rates.
- True. SGD with momentum helps the optimization step better navigate narrow ravines, as it can pass through areas of high curvature more easily.


5. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

  Answer:
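False. The gradient of the outer loss depends only on the solution of the inner problem, not on how that solution was found. Let the layer compute $\vec{y}^*(\vec{z})=\arg\min_{\vec{y}} g(\vec{y},\vec{z})$, where $\vec{z}$ is the layer input and $g$ is twice differentiable. At the optimum, $\grad_{\vec{y}}\, g(\vec{y}^*,\vec{z})=0$. Differentiating this condition with respect to $\vec{z}$ (the implicit function theorem) gives $\pderiv{\vec{y}^*}{\vec{z}}=-\left(\grad^2_{\vec{y}\vec{y}}\, g\right)^{-1}\grad^2_{\vec{y}\vec{z}}\, g$, evaluated at $(\vec{y}^*,\vec{z})$, assuming the Hessian $\grad^2_{\vec{y}\vec{y}}\, g$ is invertible. This expression involves only $\vec{y}^*$, so the inner problem can be solved with any method, including a closed-form or non-descent-based solver.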


6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  1. Explain the concepts of "vanishing gradients", and "exploding gradients".
  2. How can each of these problems be caused by increased depth?
  3. Provide a numerical example demonstrating each.
  4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?


  Answer: 
1. Vanishing gradients refer to the situation where gradients become very small as they flow backwards through the network, making earlier layers very difficult to train. Exploding gradients refer to the situation where gradients become very large as they flow backwards through the network, which can lead to instability during training.
2. Vanishing gradients can be caused by the repeated application of small weights in the traversal of several layers in deep models. Exploding gradients can be caused by large weights or activation values in the forward propagation of deep models. Both these problems can happen when the model's depth increases.
3. Consider a network with 10 layers where the backward pass multiplies the gradient by each layer's weight. If every weight is 0.1, the gradient reaching the first layer is scaled by $0.1^{10}=10^{-10}$, which is effectively zero: vanishing gradients. If every weight is 10, the gradient is scaled by $10^{10}$, which quickly overflows or destabilizes the updates: exploding gradients.
4. The two problems look different in the training loss and in the parameters. With vanishing gradients the loss stops improving and plateaus early, and the parameters of the early layers barely change between iterations. With exploding gradients the loss oscillates wildly or diverges, often ending in `inf` or `NaN` values in the loss or the parameters.

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.



  Answer:

The model has no output activation, so the derivatives below follow the definitions above as written. For each sample $i$, define
$$
\vec{z}^{(i)}=\mat{W}_1\vec{x}^{(i)}+\vec{b}_1,\qquad
\vec{h}^{(i)}=\varphi(\vec{z}^{(i)}),\qquad
\hat{y}^{(i)}=\mat{W}_2\vec{h}^{(i)}+\vec{b}_2,\qquad
\delta^{(i)}=\pderiv{\ell}{\hat{y}^{(i)}}=\frac{\hat{y}^{(i)}-y^{(i)}}{\hat{y}^{(i)}\left(1-\hat{y}^{(i)}\right)}.
$$

The derivative of the final loss with respect to each tensor is:
$$
\pderiv{L_{\mathcal{S}}}{\mat{W}_2}=\frac{1}{N}\sum_{i=1}^{N} \delta^{(i)}\left(\vec{h}^{(i)}\right)^\top + \lambda \mat{W}_2
$$

$$
\pderiv{L_{\mathcal{S}}}{\mat{W}_1}=\frac{1}{N}\sum_{i=1}^{N} \left(\delta^{(i)}\mat{W}_2^\top \odot \varphi'(\vec{z}^{(i)})\right) \left(\vec{x}^{(i)}\right)^\top +\lambda \mat{W}_1
$$

$$
\pderiv{L_{\mathcal{S}}}{\vec{b}_2}=\frac{1}{N}\sum_{i=1}^{N} \delta^{(i)}
$$

$$
\pderiv{L_{\mathcal{S}}}{\vec{b}_1}= \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\mat{W}_2^\top\odot\varphi'(\vec{z}^{(i)})
$$

$$
\pderiv{L_{\mathcal{S}}}{\vec{x}^{(i)}}=\frac{1}{N}\mat{W}_1^\top\left(\delta^{(i)}\mat{W}_2^\top\odot\varphi'(\vec{z}^{(i)})\right)
$$



2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.





Answer: 


1. This formula can be used to compute gradients of neural network parameters numerically by evaluating the above quotient with a small perturbation $\Delta\vec{x}$ around each parameter value. Specifically, the gradient of a scalar function $f$ with respect to a parameter $\theta$ can be approximated numerically using:
$$
\frac{df(\theta)}{d\theta} \approx \frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon},
$$
where $\epsilon$ is a small scalar representing the perturbation size. This approximation can be computed for each parameter of a neural network, to obtain its numerical gradient.

2. The main drawbacks of this approach compared to AD are:
- Computationally expensive: each parameter needs its own pair of forward passes, so a model with $P$ parameters needs $2P$ forward passes per gradient evaluation, whereas AD obtains the gradient w.r.t. all parameters from a single forward and backward pass.
- Approximation errors: finite differences only approximate the derivative. If $\epsilon$ is too large the truncation error dominates, and if it is too small floating-point round-off dominates, so the numerical gradients may be inaccurate or unstable unless the perturbation size is chosen carefully. In contrast, AD applies the chain rule exactly and is accurate up to machine precision.
- Generality: AD works on arbitrary computation graphs built from differentiable operations without any tuning, whereas a single perturbation size rarely suits every parameter of a large model when their scales differ.
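
The sensitivity to the perturbation size can be seen on a toy example. The sketch below is an illustration, not part of the course material: it applies the central-difference formula to $f(\theta)=\sin(\theta)$ at $\theta=1$, where the exact derivative $\cos(1)$ is known, and prints the absolute error for a large, a moderate, and a very small $\epsilon$. The helper name `central_difference` is hypothetical.

```python
import math

def central_difference(f, theta, eps):
    """Approximate df/dtheta with a symmetric finite difference."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta = 1.0
exact = math.cos(theta)  # known derivative of sin at theta

errors = {}
for eps in (1e-1, 1e-5, 1e-13):
    approx = central_difference(math.sin, theta, eps)
    errors[eps] = abs(approx - exact)
    print(f"eps={eps:.0e}  approx={approx:.12f}  abs_error={errors[eps]:.2e}")
```

The large step suffers from truncation error, while the very small step typically suffers from round-off, which is why the perturbation size has to be tuned.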

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [None]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# 1. Numerical gradients with central differences: perturb one entry at a time.
eps = 1e-6
grad_W = torch.zeros_like(W)
grad_b = torch.zeros_like(b)

with torch.no_grad():  # the perturbed evaluations should not be tracked by autograd
    for i in range(d):
        for j in range(d):
            # gradient w.r.t. W[i, j]; detached copies can be modified in place
            W_plus = W.detach().clone()
            W_plus[i, j] += eps
            W_minus = W.detach().clone()
            W_minus[i, j] -= eps
            grad_W[i, j] = (foo(W_plus, b) - foo(W_minus, b)) / (2 * eps)

        # gradient w.r.t. b[i]
        b_plus = b.detach().clone()
        b_plus[i] += eps
        b_minus = b.detach().clone()
        b_minus[i] -= eps
        grad_b[i] = (foo(W, b_plus) - foo(W, b_minus)) / (2 * eps)

# 2. The same gradients with autograd.
loss.backward()
autograd_W = W.grad
autograd_b = b.grad

# 3. The numerical and autograd gradients should agree.
assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)

loss=tensor(1.9567, dtype=torch.float64, grad_fn=<MeanBackward0>)


### Sequence models

1. Regarding word embeddings:
  1. Explain this term and why it's used in the context of a language model.
  2. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

2. Considering the following snippet, explain:
  1. What does `Y` contain? why this output shape?
  2. How you would implement `nn.Embedding` yourself using only torch tensors. 

Answer :
  1. `Y` contains the embedding vector of every index in `X`. `nn.Embedding` is a lookup table: it replaces each integer index with the corresponding row of its weight matrix, so the output keeps the shape of `X` and appends one dimension of size `embedding_dim`. Here `X` has shape `(5, 6, 7, 8)` and `embedding_dim=42000`, so `Y` has shape `(5, 6, 7, 8, 42000)`. In the typical case, `X` has shape `(batch_size, seq_length)` and `Y` has shape `(batch_size, seq_length, embedding_dim)`.

2. To implement `nn.Embedding` using only torch tensors, we can create a new tensor `embedding_weight` with shape `(num_embeddings, embedding_dim)` to serve as the trainable parameters for the embedding layer. We can then index into this tensor to obtain the embeddings for a given input sequence. The implementation of `nn.Embedding` can be written as follows:

```python
import torch

class Embedding(torch.nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()

        # Initialize the embedding weights
        self.embedding_weight = torch.nn.Parameter(torch.randn(num_embeddings, embedding_dim))

    def forward(self, input_tensor):
        # Index into the embedding weight tensor to get the embeddings
        embeddings = self.embedding_weight[input_tensor]

        return embeddings
```

In this implementation, `num_embeddings` is the size of the vocabulary, and `embedding_dim` is the number of embedding dimensions. The `forward` method takes as input an integer tensor `input_tensor` of any shape, for example `(batch_size, seq_length)` for sequences of word indices, and returns a float tensor with one extra trailing dimension, for example `(batch_size, seq_length, embedding_dim)`, containing the corresponding word embeddings.
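
As a quick check of the lookup idea, the sketch below compares plain tensor indexing with `nn.Embedding` when both use the same weight matrix. It also shows the equivalent, but much more wasteful, formulation as a one-hot vector multiplied by the weight matrix. The sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_embeddings, embedding_dim = 10, 4   # arbitrary illustrative sizes

reference = nn.Embedding(num_embeddings, embedding_dim)
weight = reference.weight.detach().clone()   # plain tensor holding the same table

X = torch.randint(low=0, high=num_embeddings, size=(2, 3, 5))

# Lookup by indexing: each index selects one row of the table.
Y_lookup = weight[X]

# Equivalent view: one-hot encode the indices and multiply by the table.
one_hot = nn.functional.one_hot(X, num_embeddings).to(weight.dtype)
Y_onehot = one_hot @ weight

Y_ref = reference(X).detach()
print(Y_lookup.shape)
print(torch.allclose(Y_lookup, Y_ref), torch.allclose(Y_onehot, Y_ref))
```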

In [11]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])


3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
  1. TBPTT uses a modified version of the backpropagation algorithm.
  2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
  3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.


  Answer:
1. True. Truncated Backpropagation Through Time (TBPTT) is a modification of the backpropagation algorithm used for training recurrent neural networks (RNNs) on sequence data. It works by breaking the sequence into smaller sub-sequences and performing backpropagation on each sub-sequence separately. The gradients computed during each sub-sequence are then accumulated and used to update the model parameters. 

2. False. Limiting the length of the sequence provided to the model is not enough. The long sequence is processed in consecutive segments of length S, and the final hidden state of each segment is passed on as the initial hidden state of the next segment, so information still flows forward across segments. To truncate the backward pass, this hidden state must be detached from the computation graph between segments, so that the gradients are computed only through the last S time-steps.

3. True. TBPTT allows the model to learn relations between inputs that are at most S timesteps apart. During training, the gradients only flow backward S timesteps, which means that the model can only learn dependencies between timesteps that are within this range. This is a limitation of TBPTT but can be addressed by choosing an appropriate sequence length that balances the ability of the model to capture long-term dependencies, and the computational cost of training the model.
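
A common way to implement TBPTT in PyTorch is to carry the hidden state from one segment to the next and detach it between segments. The sketch below is a minimal illustration under assumptions: the toy `nn.RNN`, the linear `head`, the random data and the MSE loss are all placeholders, not the model from the tutorials.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S = 4                                   # truncation length
rnn = nn.RNN(input_size=3, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(2, 12, 3)               # (batch, full sequence length, features)
y = torch.randn(2, 12, 1)

h = None                                # initial hidden state (zeros by default)
num_updates = 0
for t in range(0, x.shape[1], S):
    x_seg, y_seg = x[:, t:t + S], y[:, t:t + S]
    out, h = rnn(x_seg, h)              # forward pass continues from the carried state
    loss = nn.functional.mse_loss(head(out), y_seg)
    opt.zero_grad()
    loss.backward()                     # standard backprop, but only within this segment
    opt.step()
    h = h.detach()                      # cut the graph so gradients stop at the segment boundary
    num_updates += 1

print(num_updates, h.requires_grad)
```

Without the `detach()` call, the second call to `backward()` would try to propagate through the graph of the previous segment, which has already been freed.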

### Attention

1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
2. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
3. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?

Answer:

1. In machine translation, the addition of attention mechanism between the encoder and decoder provides the decoder with a mechanism to selectively focus on specific parts of the encoded source sequence, conditioned on the target sequence being generated. This is done by computing attention scores between each target hidden state and each of the source hidden states, producing a weighted sum (or attention weighted average) of the source hidden states based on these scores to generate an alignment context vector, which in turn is used to adjust the decoder hidden state. This allows the model to pay more attention to relevant parts of the source sequence at different points in the decoding process, which can significantly improve the quality of the generated translations, particularly for long sequences. The hidden states with attention are different from the model without attention in that they explicitly incorporate the context of the source sequence, which allows the model to generate target sequence tokens based on a weighted combination of the input sequence at each step instead of a summary from the entire input sequence encoded in the final hidden state.

2. If we change the queries, keys, and values to self-attention in the decoder, it would enable the decoder to recur over its own previously generated sequence while computing the context vector. This means, at each decoding step, the decoder can focus on different parts of the sequence generated so far to produce the next sequence element. This would allow the decoder to learn more abstract and generalized representations of the source sequence during decoding, and capture finer-grained dependencies in the target sequence with the input source sequence. However, it could also create a risk of attending to inconsistent parts of the decoder output, leading to instability in the decoding process. Therefore, to mitigate this, the self-attention mechanism is typically employed in addition to the source-derived attention mechanism in most sequence-to-sequence models, such as the Transformer architecture. Finally, by using self-attention, we could expect hidden states that are more naturally adaptive to the sequence data given that the weight applied to each hidden state is computed based on its relationship to other hidden states in the temporal context.

3. If the keys, queries, and values all come from the encoder's hidden states (with learned projections), the attention weights no longer depend on the decoder state. Each source position attends to the other source positions, so the encoder's hidden states become contextualized representations of the source sequence rather than being shaped to align with the target sequence being generated. The decoder loses the step-by-step alignment signal and has to rely on a summary of these states that does not change from one decoding step to the next.

On the other hand, self-attention gives each source position direct access to every other source position, so the encoder can capture long-distance dependencies within the input sequence without passing them through many recurrent steps. This allows for a better capture of the relationships between different elements in the input sequence.

The attention map also becomes harder to interpret as an alignment, because it now relates source tokens to each other rather than target tokens to source tokens. Another disadvantage is computational: self-attention compares every source position with every other one, so its cost grows quadratically with the source length.
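
The difference between the two setups shows up in the shape of the attention weights. The sketch below is an illustration under assumptions: it uses scaled dot-product scoring without learned projections, which may differ from the scoring function used in the tutorial, and random tensors stand in for the encoder and decoder hidden states.

```python
import math
import torch

torch.manual_seed(0)

def attention(queries, keys, values):
    """Scaled dot-product attention (assumed scoring function)."""
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(keys.shape[-1])
    weights = torch.softmax(scores, dim=-1)   # each query's weights sum to 1
    return weights @ values, weights

batch, src_len, tgt_len, dim = 2, 5, 3, 8
enc_h = torch.randn(batch, src_len, dim)      # stand-in encoder hidden states
dec_h = torch.randn(batch, tgt_len, dim)      # stand-in decoder hidden states

# Encoder-decoder attention: queries come from the decoder, so there is
# one distribution over source positions per target step.
ctx_cross, w_cross = attention(dec_h, enc_h, enc_h)

# Self-attention: queries, keys and values all come from the encoder, so the
# weights relate source positions to each other and ignore the decoder.
ctx_self, w_self = attention(enc_h, enc_h, enc_h)

print(w_cross.shape, w_self.shape)
```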

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

2. Images reconstructed by the model during training ($x\to z \to x'$)?
3. Images generated by the model ($z \to x'$)?


Answer


1. If the KL-divergence term is not included in the loss function during training of a VAE, nothing pulls the encoder's latent distribution towards the prior, which prevents the model from learning a latent space that is useful for sampling.

2. Without the KL-divergence term, the VAE is trained on reconstruction alone and essentially becomes a standard autoencoder. The encoder can shrink its variance towards zero and behave like a deterministic mapping ($x \to z$), placing the codes wherever reconstruction is easiest. Reconstructions during training are therefore likely to be good, but there is no significant benefit of using the VAE architecture over a conventional autoencoder.

3. Generation ($z \to x'$) suffers. New images are generated by decoding a $z$ sampled from the prior $\mathcal{N}(\vec{0},\vec{I})$, and the KL-divergence term is what keeps the encoded latent distribution close to that prior during training. Without it, the codes of the training images may occupy arbitrary, disconnected regions of the latent space, so samples from the prior tend to fall where the decoder was never trained and decode to implausible images. Plausible outputs are then limited to latent points near the codes of training images, which makes the model much less useful for generative applications like sample generation or data augmentation.





2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
  1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
  2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
  3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.


Answer: 
1. False. For a specific input image $\vec{x}$, the encoder outputs the parameters of a normal distribution, typically $\mathcal{N}(\vec{\mu}(\vec{x}), \diag(\vec{\sigma}^2(\vec{x})))$, which in general is not $\mathcal{N}(\vec{0},\vec{I})$. The standard normal is the prior over the latent space. The KL-divergence term only pulls each per-image distribution towards it, while the reconstruction term keeps the distributions of different images distinct.

2. False. The encoder does not output a single latent vector but the parameters of a distribution, and the latent vector $\vec{z}$ is sampled from it (via the reparameterization trick) on every forward pass. Feeding the same image several times therefore gives different samples $\vec{z}$ and, after decoding, slightly different reconstructions.

3. True. The quantity we would like to minimize is the negative log-evidence $-\log p(\vec{x})$, which is intractable because it requires marginalizing over the latent variable. Instead we minimize an upper bound on it, the negative of the evidence lower bound (ELBO), which consists of the reconstruction term and the KL-divergence term. The gap between the bound and the true value is the KL divergence between the approximate and the true posterior, so the bound is tight when the encoder approximates the true posterior well.
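
The sampling step and the KL-divergence term can be sketched in a few lines. This is a minimal illustration, assuming a diagonal Gaussian posterior parameterized by `mu` and `log_var`. The helper names and the numeric values are hypothetical stand-ins for an encoder's output on one image.

```python
import torch

torch.manual_seed(0)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(mu.pow(2) + log_var.exp() - 1.0 - log_var, dim=-1)

# Hypothetical encoder output for a single image.
mu = torch.tensor([[0.5, -1.0]])
log_var = torch.tensor([[-0.2, 0.3]])

z1 = reparameterize(mu, log_var)   # two encodings of the same image ...
z2 = reparameterize(mu, log_var)   # ... give two different latent samples
kl = kl_to_standard_normal(mu, log_var)
kl_prior = kl_to_standard_normal(torch.zeros(1, 2), torch.zeros(1, 2))

print(z1, z2)
print(kl, kl_prior)
```

The KL term is zero only when the per-image distribution coincides with the prior, which is why a trained encoder's output for a specific image is generally not $\mathcal{N}(\vec{0},\vec{I})$.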

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
  1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
  2. It's crucial to backpropagate into the generator when training the discriminator.
  3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
  4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
  5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.




  Answer:

  1. True. In GANs, the generator's goal is to generate images that are similar to the real images, and the discriminator's goal is to differentiate between real images and generated images. So, a low generator loss implies that the generator is generating realistic images, and a high discriminator loss indicates that it's challenging for the discriminator to differentiate between real and generated images, which is desirable for the generator.

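2. False. When training the discriminator, the generated images are treated as fixed inputs: they are detached from the generator's computation graph, so no gradients flow into the generator and its parameters are not updated. The generator is updated in its own step, in which the loss is backpropagated through the discriminator into the generator while only the generator's parameters are changed.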

3. True. In GANs, the generator learns to generate images from a noise vector sampled from a prior distribution. Typically, this prior is chosen as $\mathcal{N}(\vec{0},\vec{I})$.

4. True. Initializing the discriminator weights to arbitrary values can cause the discriminator to output random scores at the start of training, which can cause the generator to update its weights to generate low-quality images. Pre-training the discriminator for a few epochs helps it to settle into a reasonable solution, which can stabilize the training process and lead to better generator performance.

5. False. If the discriminator reaches a stable state with 50% accuracy, it implies that it's no longer able to distinguish between real and generated images, and the generator has reached its performance limit. In such a scenario, further training of the generator won't improve the image quality, and it would be better to stop the training process.
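
The separation between the discriminator step and the generator step can be sketched as follows. This is an illustration under assumptions: the single-layer `G` and `D`, the random "real" batch and the BCE-with-logits loss are placeholders, and the optimizer updates are omitted to keep the focus on where gradients flow.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
z_dim, x_dim, batch = 4, 6, 8            # arbitrary illustrative sizes
G = nn.Sequential(nn.Linear(z_dim, x_dim))   # toy generator
D = nn.Sequential(nn.Linear(x_dim, 1))       # toy discriminator (outputs logits)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(batch, x_dim)         # stand-in for a batch of real samples
z = torch.randn(batch, z_dim)            # latent vectors sampled from N(0, I)
fake = G(z)

# Discriminator step: fake samples are detached, so nothing reaches G.
d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
d_loss.backward()
g_grads_after_d_step = [p.grad for p in G.parameters()]

# Generator step: the loss is backpropagated through D into G.
D.zero_grad()
g_loss = bce(D(fake), torch.ones(batch, 1))
g_loss.backward()
g_grads_after_g_step = [p.grad for p in G.parameters()]

print(all(g is None for g in g_grads_after_d_step))
print(all(g is not None for g in g_grads_after_g_step))
```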