

**1. After each stride-2 conv, why do we double the number of filters?**

A stride-2 convolution halves the height and width of the activation map, so the number of activations per filter drops by a factor of 4. If the number of filters stayed the same, the capacity of the layer (and the amount of computation it performs) would fall by that same factor in a single step. Doubling the number of filters means the total number of activations only halves, so capacity shrinks gradually while the later layers get more channels for the richer features they need to represent.

For example, a 28x28 input passed through a stride-2 convolution with a 3x3 kernel and a padding of 1 produces a 14x14 output. With the number of filters unchanged, each channel holds 14x14 = 196 activations instead of 28x28 = 784, a quarter of what came before. Doubling the number of filters gives 2*196 = 392 activations for every 784 in the previous layer, which is a half.
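
The arithmetic for this example can be checked in a few lines of Python. This is a minimal sketch: `conv_out_size` is a helper name introduced here, and the padding of 1 and the starting width of 8 filters are assumptions made for illustration.

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial size of a conv output for an n x n input (standard formula)."""
    return (n + 2 * padding - kernel) // stride + 1

# Assumed setup: 28x28 input, 3x3 kernel, stride 2, padding 1, 8 filters.
size_in, filters_in = 28, 8
size_out = conv_out_size(size_in, kernel=3, stride=2, padding=1)

acts_in = size_in * size_in * filters_in               # activations before the conv
acts_same = size_out * size_out * filters_in           # filters left unchanged
acts_doubled = size_out * size_out * filters_in * 2    # filters doubled

print(size_out)                  # spatial size after the stride-2 conv
print(acts_same / acts_in)       # fraction of activations kept, same filters
print(acts_doubled / acts_in)    # fraction of activations kept, doubled filters
```

Keeping the filter count fixed leaves a quarter of the activations, while doubling it leaves half.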

**2. Why do we use a larger kernel in the first conv with MNIST (with a simple CNN)?**

A larger kernel in the first layer makes sure that the layer has to do some real work. With a 3x3 kernel on a single-channel MNIST image, each kernel application reads only 9 pixels. If the first layer has, say, 8 filters, it produces 8 numbers from those 9 inputs, so it can get away with roughly copying its input instead of learning anything useful. A network only builds useful features when an operation has noticeably fewer outputs than inputs.

With a 5x5 kernel, each application reads 25 pixels. Producing 8 outputs from 25 inputs forces the layer to summarize what it sees, which is what pushes it toward basic features such as edges and corners.
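
One way to see why the kernel size matters is to compare how many input values each kernel application reads with how many outputs it produces. The sketch below assumes a single input channel (greyscale MNIST) and 8 output filters; `inputs_per_output` is a helper name introduced here for illustration.

```python
def inputs_per_output(kernel_size, in_channels, out_filters):
    """Inputs read per kernel application, and the ratio of inputs to outputs."""
    n_inputs = kernel_size * kernel_size * in_channels
    return n_inputs, n_inputs / out_filters

# Assumed first layer: 1 input channel, 8 output filters.
for k in (3, 5):
    n_inputs, ratio = inputs_per_output(k, in_channels=1, out_filters=8)
    print(f"{k}x{k} kernel: {n_inputs} inputs -> 8 outputs (ratio {ratio:.3f})")
```

A ratio close to 1 means the layer barely has to compress its input; a ratio of about 3 forces it to pick out what matters.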

**3. What data is saved by ActivationStats for each layer?**

ActivationStats records the following for each layer it hooks (by default, every layer with trainable parameters):

* The mean of the activations
* The standard deviation of the activations
* The fraction of activations that are close to zero
* A histogram of the activations, if the callback is created with `with_hist=True`

These statistics are collected as training proceeds, so they show how each layer behaves over time. A mean and standard deviation that change smoothly suggest stable training, while a large fraction of near-zero activations, or values that repeatedly spike and collapse, points to a problem such as a poor initialization or a learning rate that is too high.
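
To make the per-layer statistics concrete, here is a rough pure-Python sketch of the kind of summary such a callback might compute for one batch. It is not the library's code: `layer_stats` is a hypothetical helper, the activation values are made up, and the 0.05 cut-off for "near zero" is an assumption.

```python
import statistics

def layer_stats(activations, near_zero_threshold=0.05):
    """Summarise one layer's activations for a single batch."""
    return {
        "mean": statistics.fmean(activations),
        "std": statistics.stdev(activations),  # sample standard deviation
        "near_zero": sum(a <= near_zero_threshold for a in activations)
                     / len(activations),
    }

# Made-up activations after a ReLU: many are exactly zero.
acts = [0.0, 0.0, 0.0, 0.02, 0.4, 1.3, 0.0, 2.1]
stats = layer_stats(acts)
print(stats)
```

Here five of the eight values fall at or below the threshold, so the near-zero fraction is high even though the mean looks unremarkable.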

**4. How do we access a learner's callback after training has completed?**

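fastai makes each callback available as an attribute of the learner, named after the callback class in snake_case. Callbacks passed to the `Learner` stay attached after training, so no extra hook is needed: for example, `ActivationStats` can be read from `learn.activation_stats` and the built-in `Recorder` from `learn.recorder`.

The following sketch shows the pattern. It assumes that `dls` (a `DataLoaders` object) and `model` have already been defined:

```
from fastai.vision.all import *

# Assumes `dls` and `model` were created earlier.
learn = Learner(dls, model, loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ActivationStats(with_hist=True))
learn.fit(1, 0.06)

# Callbacks stay attached to the learner under their snake_case name.
stats = learn.activation_stats
stats.plot_layer_stats(0)
```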

**5. Why are activations near zero problematic?**

Activations near zero are a problem because they mean that part of the model is doing no useful work: multiplying by zero gives zero, so those units contribute nothing to the output. The zeros also tend to carry over, since a layer that receives mostly zeros usually produces mostly zeros itself, and the effect compounds through the network. A high proportion of near-zero activations is therefore a sign that training is not going well.
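
One concrete reason zero activations are wasteful is that the weights they feed receive no gradient. The sketch below uses a single linear unit with made-up numbers; `weight_gradients` is a hypothetical helper that applies the chain rule by hand.

```python
def weight_gradients(inputs, upstream_grad):
    """Gradient of the loss w.r.t. each weight of a unit y = sum(w_i * x_i).

    By the chain rule, dL/dw_i = upstream_grad * x_i.
    """
    return [upstream_grad * x for x in inputs]

# Made-up activations from the previous layer; two of them are zero.
inputs = [0.0, 0.7, 0.0, 1.2]
grads = weight_gradients(inputs, upstream_grad=0.5)
print(grads)  # weights attached to zero inputs get zero gradient
```

Weights that receive a zero gradient are not updated on that step, so a layer fed mostly zeros learns very little.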

**6. What are the benefits and drawbacks of training with larger batches?**

Training in larger batches can have both benefits and drawbacks.

**Benefits:**

* Gradients are more accurate and less noisy, because each one is averaged over more examples.
* Hardware is used more efficiently, because a GPU can process the examples in a batch in parallel.

**Drawbacks:**

* There are fewer batches per epoch, so the model gets fewer weight updates for each pass over the data.
* More memory is needed, because the batch and the activations computed for it must fit on the device at once.
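
The trade-off can be put into rough numbers. The sketch below assumes a dataset of 60,000 examples (the size of the MNIST training set) and treats examples as independent, in which case the noise in an averaged gradient shrinks roughly as 1/sqrt(batch size). `updates_per_epoch` is a helper name introduced here.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Optimizer steps in one pass over the data (last batch may be partial)."""
    return math.ceil(n_samples / batch_size)

n_samples = 60_000  # assumed dataset size
for bs in (64, 512):
    steps = updates_per_epoch(n_samples, bs)
    # Standard error of a mean over bs independent samples scales as 1/sqrt(bs).
    relative_noise = 1 / math.sqrt(bs)
    print(f"bs={bs}: {steps} updates/epoch, relative noise ~{relative_noise:.3f}")
```

Going from a batch size of 64 to 512 cuts the number of updates per epoch by a factor of about 8, while the gradient noise falls by less than a factor of 3.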



**7. Why should we avoid starting training with a high learning rate?**

We should avoid starting training with a high learning rate because the initial weights are random and poorly suited to the task. Large steps taken from such a starting point can overshoot badly, and the loss may diverge almost immediately.

Starting low and increasing the learning rate gradually (a warm-up) lets the weights move to a more sensible region first, after which larger steps are safer.
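
Divergence is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, which is an assumption chosen for simplicity and not a model of a real network. Each step multiplies x by (1 - 2*lr), so any learning rate above 1 makes the iterates grow.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimise f(x) = x**2 with plain gradient descent; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x        # derivative of x**2
        x = x - lr * grad   # each step multiplies x by (1 - 2*lr)
    return x

small = gradient_descent(lr=0.1)   # factor 0.8 per step: shrinks toward 0
large = gradient_descent(lr=1.1)   # factor -1.2 per step: keeps growing
print(abs(small), abs(large))
```

With the small learning rate the iterate approaches the minimum at 0; with the large one it oscillates in sign and moves further away on every step.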

**8. What are the pros of training with a high learning rate?**

The pros of training with a high learning rate include:

* Training is faster, because each update moves the weights further.
* The model tends to overfit less, because large steps skip over sharp, narrow minima and settle in flatter regions of the loss that generalize better.

**9. Why do we want to end the training with a low learning rate?**

We want to end training with a low learning rate so that the model can settle into a minimum instead of stepping over it. Once the weights are in a good region of the loss surface, large updates keep bouncing around the best point, while small updates allow the model to reach it.

This is why learning-rate schedules commonly start low, rise to a high value, and then decay toward the end of training.
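
A schedule that starts low, peaks, and ends low can be sketched in a few lines. The code below is an illustrative one-cycle-style schedule with cosine interpolation, not a copy of any library's implementation; the function names and the default values (`lr_max`, `pct_start`, `div`, `div_final`) are assumptions chosen for the example.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def one_cycle_lr(pct, lr_max=0.06, pct_start=0.25, div=25.0, div_final=1e5):
    """Learning rate at fraction pct (0..1) of training: warm up, then anneal."""
    if pct < pct_start:
        # Warm-up phase: rise from lr_max/div to lr_max.
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: fall from lr_max to lr_max/div_final.
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.25, 0.5, 1.0):
    print(f"{pct:.2f}: {one_cycle_lr(pct):.6f}")
```

The rate begins at a fraction of its peak, reaches the peak a quarter of the way through, and finishes far below where it started.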

