# The Deep Learning Book (Simplified)
## Part II - Modern Practical Deep Networks
*This is a series of blog posts on the [Deep Learning book](http://deeplearningbook.org)
where we attempt to provide a summary of each chapter, highlighting the concepts that we found to be the most important so that other people can use it as a starting point for reading the chapters, while adding further explanations on a few areas that we found difficult to grasp. Please refer to [this](http://www.deeplearningbook.org/contents/notation.html) for more clarity on
notation.*


## Chapter 11: Practical Methodology

We are excited to say that this is going to be the last chapter that we cover before entering the Deep Learning Research section of the book which is, for the most part, unfamiliar terrain for us. A lot of what we've been talking about until now covered the theoretical aspects of Deep Learning. However, there's a large gap between theory and what works in practice. This chapter is specifically dedicated to practitioners and people who are looking to apply Deep Learning for building cool applications and solving real-world problems.

The choices that one might need to make include which type of data to gather, where to find that data, and whether to gather more data, change the model's complexity, add or remove regularization, improve optimization, or debug the software implementation. The recommended practical design process is as follows:

- Decide on a single-number metric to evaluate your model. This represents the final goal, and you need to set a specific target that you want to achieve. Coming from Andrew Ng's Machine Learning Yearning and also from personal experience, most teams forget to decide upon this, only to realize very late in the process that setting it up would have given them a clear guide on what they wanted to improve.

- Get an end-to-end pipeline working as soon as possible, including the evaluation of the required metrics. This will, more often than not, require that you use a very simple model that can accept the inputs correctly and produce the outputs in the right format for training / evaluation / analysis. The major benefit here is that you can now focus solely on improving the model: after making any specific change, you can instantly get the final results and check whether that change improved the model or not.

![pipeline](images/workflow_final.png)

- Instrument the system well to determine bottlenecks in performance. This requires diagnosing which components perform worse than expected and understanding the reason behind the poor performance: overfitting, underfitting, modelling issues, problems in the data, software implementation errors, etc.

- Based on the diagnosis above, keep improving the algorithm iteratively either by adding more data, increasing the capacity of the model, tuning hyperparameters or improving the quality of data by better annotation, etc.
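
As a rough illustration of the first two steps, here is a minimal sketch of an end-to-end pipeline built around a single-number metric. Everything in it is a hypothetical stand-in: `load_data`, `MajorityClassBaseline`, `accuracy`, `run_pipeline`, the toy data and the `TARGET_ACCURACY` value are made up for this example and do not come from the book.

```python
import random

TARGET_ACCURACY = 0.90  # hypothetical target for the single-number metric


def load_data(n=200, seed=0):
    """Hypothetical stand-in for a real data loader: 1-D inputs, binary labels."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1 if x > 0.2 else 0 for x in xs]
    split = int(0.8 * n)
    return (xs[:split], ys[:split]), (xs[split:], ys[split:])


class MajorityClassBaseline:
    """Deliberately trivial model: always predicts the most frequent training label."""

    def fit(self, xs, ys):
        self.label = 1 if sum(ys) * 2 >= len(ys) else 0
        return self

    def predict(self, xs):
        return [self.label for _ in xs]


def accuracy(y_true, y_pred):
    """The single-number metric used to judge every later change."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def run_pipeline(model):
    """Data in, metric out: the whole loop in one call."""
    (x_train, y_train), (x_val, y_val) = load_data()
    model.fit(x_train, y_train)
    return accuracy(y_val, model.predict(x_val))


score = run_pipeline(MajorityClassBaseline())
print(f"validation accuracy = {score:.3f}, target = {TARGET_ACCURACY}")
```

Once something like `run_pipeline` exists, swapping in a better model is a one-line change and the effect on the metric is visible immediately.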

The chapter is organized as follows:

**1. Performance Metrics** <br>
**2. Default Baseline Models** <br>
**3. Determining Whether to Gather More Data** <br>
**4. Selecting Hyperparameters** <br>
**5. Debugging Strategies**

## 1. Performance Metrics

## 2. Default Baseline models

## 3. Determining Whether to Gather more data

A rookie mistake that a lot of people make is that they keep trying different algorithms to improve the performance of their models, whereas simply improving the data they have or gathering more data can be the best source of improvement. We touched upon the topic of how to decide when to get more data, but since data is the most integral part of getting an AI solution working, we'll explore this in a bit more detail now.
So, how do you decide when to get more data? Firstly, if the performance of your model on your training set is poor, it is not making full use of the information present in your data. In this case, you need to increase the complexity of your model by adding more layers or increasing the number of hidden units in each layer. Also, hyperparameter tuning is an important step to perform. You'd be surprised how large an effect choosing the right hyperparameters can have in getting your model working. For example, learning rate is THE [most important](https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-i-20ae75984cb2#7da2) hyperparameter that you need to tune. Setting the right value of the learning rate for your problem can save you hours of wasted effort. However, if your model is reasonably complex and the optimization is carefully tuned but the performance is still not up to the desired level, the problem might be the quality of the data instead, in which case you have to go back to square one and start collecting cleaner data.
If the training error is low but the validation error is much higher, then you can safely assume that your best bet would be to say:

![data meme](images/data_meme.jpg)

The specific situation mentioned above, where training error is low but validation (or test) error is high, is called overfitting. It is one of the most commonly occurring problems in training deep models, and regularization might help when more data is not available. To reinforce the importance of data in modern deep networks, for those who might not be aware, the reason that Deep Learning started gaining attention was the 2012 ImageNet competition, where a deep learning model (AlexNet) outperformed the previous best model by a large margin. ImageNet consists of millions of annotated images, and the creation of similar large labelled datasets is a major reason that extremely complex problems like object detection have become far more tractable today.
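
The decision procedure described in this section can be sketched as a small function. This is only an illustrative reading of the text: the name `diagnose`, the `already_tuned` flag and the returned messages are assumptions made for this example, not something the book defines.

```python
def diagnose(train_error, val_error, target_error, already_tuned=False):
    """Suggest a next step from training and validation error (hypothetical helper)."""
    if val_error <= target_error:
        return "target met"
    if train_error > target_error:
        # Poor fit even on the training set: the model is not using the data it has.
        if already_tuned:
            # Model is large enough and optimization is tuned, so suspect the data.
            return "check data quality; consider collecting cleaner data"
        return "increase model capacity and tune hyperparameters (learning rate first)"
    # Training error is fine but validation error is not: overfitting.
    return "gather more data or add regularization"


print(diagnose(train_error=0.30, val_error=0.32, target_error=0.10))
print(diagnose(train_error=0.30, val_error=0.32, target_error=0.10, already_tuned=True))
print(diagnose(train_error=0.02, val_error=0.20, target_error=0.10))
```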
Finally, it's generally observed that adding a small fraction of the total number of examples won't have a noticeable effect on performance. Thus, we need to monitor how much the performance of a model improves as the dataset size increases, and the dataset size should be varied on a logarithmic scale, for example by doubling the number of examples between consecutive experiments.

![train dev](images/train_dev.png)

As can be seen from the plot above, the training error will generally increase as you increase the dataset size, because the model finds it harder to fit all the datapoints exactly. At the same time, the validation (dev) error will decrease, as the model learns to generalize better.
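
To make the idea of growing the dataset on a logarithmic scale concrete, here is a minimal sketch that doubles the training set size at each step and records both errors. The task (a noisy line fitted by least squares) and the helper names `make_data`, `fit_line` and `mse` are assumptions chosen to keep the example dependency-free; a real experiment would retrain the actual network at each size.

```python
import random


def make_data(n, rng):
    """Hypothetical task: y = 2x + 1 plus Gaussian noise with std 0.5."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def mse(params, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
x_val, y_val = make_data(2000, rng)    # fixed validation set
x_pool, y_pool = make_data(1024, rng)  # pool that training subsets are drawn from

# Double the training set size at every step (a logarithmic grid).
sizes = [4 * 2 ** k for k in range(9)]  # 4, 8, ..., 1024
curve = []
for n in sizes:
    params = fit_line(x_pool[:n], y_pool[:n])
    train_err = mse(params, x_pool[:n], y_pool[:n])
    val_err = mse(params, x_val, y_val)
    curve.append((n, train_err, val_err))

for n, train_err, val_err in curve:
    print(f"n={n:5d}  train MSE={train_err:.3f}  val MSE={val_err:.3f}")
```

In this toy setup both errors should settle near the noise variance (0.25) as `n` grows, with the training error typically approaching it from below and the validation error from above.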

## 4. Selecting Hyperparameters

## 5. Debugging Strategies

- *Visualize the model in action*: This is one of the best ways to verify that training is going correctly and to understand which areas might need improvement. Once the training starts, visualize the output of your model after a few epochs. If you're working on a semantic segmentation problem, look at the segmentation output. If you're training a generative model of speech, listen to a few samples of the speech that it produces. Also, it's common to have bugs in the evaluation metric, as it might need corner-case handling which you might not have taken care of. Evaluation bugs are among the hardest to catch, and they can fool you into believing that your model is performing well when it isn't, or vice versa.

![model output](images/model_output.png)
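
As an example of the kind of corner case an evaluation metric can hide, here is a minimal sketch of IoU for binary segmentation masks. The function name `iou`, the flat-list mask format and the choice to score two empty masks as 1.0 are assumptions made for this illustration; other conventions exist.

```python
def iou(pred_mask, true_mask):
    """Intersection over union for two binary masks given as flat lists of 0/1."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    if union == 0:
        # Corner case: both masks are empty. Scoring this as a perfect match is
        # a convention assumed here; without this branch the division below
        # would raise ZeroDivisionError.
        return 1.0
    return intersection / union


print(iou([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
print(iou([0, 0, 0, 0], [0, 0, 0, 0]))  # 1.0 by the convention above
```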

- *Visualize the worst mistakes*: Going back to the semantic segmentation problem above, suppose we run the model on our test set. Based on the IoU scores, we can sort the samples to identify where our model performed the worst. Visualizing the examples where the model fails terribly is a great way to identify errors in data processing/annotation. If you infer that the problem lies in the annotation of the data, the best way to improve performance is to actually correct the annotations, even manually if required, as the payoff of having correct data is very high.

![google mistake](images/google_mistake.jpg)

Google misclassified a photo of humans as that of gorillas. It came under some scrutiny for having this bias in its algorithms.
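
Finding the worst mistakes is mostly a matter of sorting per-example scores. The sketch below assumes the scores have already been computed; the helper `worst_examples`, the file names and the IoU values are hypothetical.

```python
def worst_examples(scores, k=3):
    """Return the k (example_id, score) pairs with the lowest score."""
    return sorted(scores.items(), key=lambda item: item[1])[:k]


# Hypothetical per-image IoU scores from a test-set run.
iou_by_image = {
    "img_001.png": 0.91,
    "img_002.png": 0.12,
    "img_003.png": 0.78,
    "img_004.png": 0.05,
    "img_005.png": 0.66,
}

# These are the examples to look at by eye first.
for image_id, score in worst_examples(iou_by_image, k=2):
    print(f"{image_id}: IoU={score:.2f}")
```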

- *Fit a tiny dataset*: Before starting to train on your entire training set, always fit your model to a small subset of the dataset. Even very simple models will overfit to a handful of examples. Taking the extreme case of a single example, it's very easy to fit it correctly by setting the weights to zero and the biases appropriately. From my practical experience too, if you're making a modification or trying something different, first make sure that the model can overfit a small enough dataset. If it can't, then there's a high probability that there's a software bug in the training setup.

- *Monitor histograms of activations and gradients*: It can be useful to monitor the pre-activation values of hidden units in case there is a problem in training. What to monitor depends on the type of activation function used. For example, in the case of ReLU (commonly used between layers), we can check how often the unit is off (which happens when the pre-activation value is < 0). In the case of sigmoidal units, it can be useful to check how often they stay in the saturated regions, i.e. either too positive or too negative. Also, if the gradients grow or vanish too quickly, it can be a problem during training. The book advises that the magnitude of the parameter update over a minibatch should be approximately 1% of the magnitude of the parameter, neither too high (50%) nor too low (0.001%). Thus, comparing the two magnitudes can be a good approach for debugging too.
Finally, it can be shown (covered in later chapters) that some optimization algorithms provide certain guarantees, like the objective function not increasing after each epoch, all the gradients being zero at convergence, etc., and we can check that these guarantees are met.
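
For the *Fit a tiny dataset* strategy, a minimal sketch of the check looks like this. It trains a logistic regression with plain gradient descent on four hand-made points; the data, the step count and the learning rate are assumptions picked for the example, and in practice you would run your real model on a handful of real examples.

```python
import math

# A tiny, linearly separable dataset: the label equals the first feature.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y = [0, 0, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def mean_loss(w, b):
    """Average binary cross-entropy over the tiny dataset."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(X, Y):
        p = predict_proba(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(X)


def train(steps=2000, lr=1.0):
    """Full-batch gradient descent starting from all-zero parameters."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(X, Y):
            err = predict_proba(w, b, x) - y  # dLoss/dLogit for cross-entropy
            grad_w[0] += err * x[0] / len(X)
            grad_w[1] += err * x[1] / len(X)
            grad_b += err / len(X)
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


w, b = train()
# The starting loss is ln(2) because every prediction is 0.5.
print(f"loss before: {mean_loss([0.0, 0.0], 0.0):.3f}, after: {mean_loss(w, b):.3f}")
```

If the loss on such a tiny set does not drop close to zero, the training loop itself is the first suspect.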
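
For the monitoring strategy, the quantities mentioned above can be computed with a few small helpers. This is a sketch under stated assumptions: the function names, the saturation cutoff of 4.0 and the recorded values are all hypothetical, and the update is taken to be a plain SGD step (learning rate times gradient), which is how the book's 1% rule of thumb about parameter updates is interpreted here.

```python
import math


def fraction_relu_off(pre_activations):
    """Share of pre-activation values for which a ReLU unit outputs zero."""
    return sum(1 for z in pre_activations if z <= 0.0) / len(pre_activations)


def fraction_sigmoid_saturated(pre_activations, threshold=4.0):
    """Share of pre-activations with |z| above a cutoff (the cutoff is an assumption)."""
    return sum(1 for z in pre_activations if abs(z) > threshold) / len(pre_activations)


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def update_to_param_ratio(params, grads, learning_rate):
    """Norm of a plain SGD update divided by the norm of the parameters."""
    update = [learning_rate * g for g in grads]
    return l2_norm(update) / l2_norm(params)


# Hypothetical values recorded from one layer on one minibatch.
pre_acts = [-2.0, -0.5, 0.3, 1.2, 5.0, -6.0, 0.0, 2.5]
params = [0.5, -1.0, 2.0, 0.1]
grads = [0.2, -0.4, 0.1, 0.05]

ratio = update_to_param_ratio(params, grads, learning_rate=0.05)
print(f"ReLU off: {fraction_relu_off(pre_acts):.2f}")
print(f"sigmoid saturated: {fraction_sigmoid_saturated(pre_acts):.2f}")
print(f"update/param ratio: {ratio:.4f}")  # roughly 0.01, i.e. about 1%
```

In a real training loop these numbers would be logged per layer every few hundred steps, so that a layer whose units are mostly off or whose ratio drifts far from 1% stands out early.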