# Structuring Machine Learning Projects

## Introduction to ML Strategy

### Orthogonalization

**Orthogonalization process**: understanding what to tune to achieve which effect.

In orthogonalization, you have some controls, but each control does a specific task and does not affect other controls.

* Fit training set well on cost function: bigger network, Adam optimizer, ...
* Fit dev set well on cost function: regularization, bigger train set, ...
* Fit test set well on cost function: bigger dev set, ...
* Perform well in real world: change dev set, change cost function, ...

Some controls cannot be orthogonalized. Early stopping is one example: it affects both how well you fit the training set and how well you do on the dev/test sets.

## Setting up your goals

### Single number evaluation metric

It is better and faster to set a single number evaluation metric for your project before you start it.

For example:

|                | Predicted cat | Predicted non-cat |
|----------------|---------------|-------------------|
| Actual cat     | 3 (True Positive)         | 2 (False Negative)       |
| Actual non-cat | 1  (False Positive)           | 4 (True Negative)    |

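**precision**: of the examples predicted as cats, the fraction that really are cats: P = 3/(3 + 1)

**recall**: of the examples that really are cats, the fraction that were predicted as cats: R = 3/(3 + 2)

**accuracy**: the fraction of all examples that were classified correctly: (3 + 4)/10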

Using precision and recall for evaluation is good in a lot of cases, but taken separately they don't tell you which algorithm is better, since there is generally a trade-off between them: one model may have better precision but worse recall.

It is better to combine precision and recall into a single (real) number evaluation metric. The **$F_1$ score** combines them as a harmonic mean:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2}{\frac{1}{\text{precision}}+\frac{1}{\text{recall}}}$$

The highest possible value of $F_1$ is $1$, indicating perfect precision and recall, and the lowest possible value is $0$, if either the precision or the recall is zero.
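
As a quick check of the formulas above, the following sketch computes precision, recall, accuracy and $F_1$ from the counts in the confusion matrix of the cat example. The function names are only illustrative.

```python
# Confusion-matrix counts from the cat example above.
tp, fn = 3, 2   # actual cats: predicted cat / predicted non-cat
fp, tn = 1, 4   # actual non-cats: predicted cat / predicted non-cat


def precision(tp, fp):
    """Fraction of predicted cats that really are cats."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual cats that were predicted as cats."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall (taken as 0 when both are 0)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


p = precision(tp, fp)                    # 0.75
r = recall(tp, fn)                       # 0.6
acc = (tp + tn) / (tp + fn + fp + tn)    # 0.7
f1 = f1_score(p, r)                      # about 0.667
```

Note that $F_1$ lies between the two values but closer to the smaller one, which is what makes the harmonic mean penalize a model that is good on only one of the two metrics.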

### Satisficing and Optimizing metric

It is sometimes hard to get a single number evaluation metric. In that case you can choose a single optimizing metric and treat the other metrics as satisficing: they only have to meet a threshold. For example, as long as the running time is lower than $T$, maximize the $F_1$ score.
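
A minimal sketch of this selection rule, assuming each candidate model is described by a name, an $F_1$ score and a running time. The candidates and the threshold are made-up values for illustration.

```python
# Hypothetical candidates: (name, F1 score, running time in ms).
candidates = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),
]


def select_model(candidates, max_runtime_ms):
    """Best F1 (optimizing metric) among models that meet the
    running-time bound (satisficing metric); None if none qualifies."""
    feasible = [c for c in candidates if c[2] <= max_runtime_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c[1])


best = select_model(candidates, max_runtime_ms=100)   # model "B"
```

Model C has the best $F_1$ score but is discarded because it fails the running-time constraint.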

### Train/dev/test distributions

* Dev and test sets have to come from the same distribution.

* Choose dev set and test set to reflect data you expect to get in the future and consider important to do well on.

* Setting up the dev set and the evaluation metric defines the target you want to aim at.

### Size of the dev and test sets

An old way of splitting the data was 70% training, 30% test, or 60% training, 20% dev, 20% test. This was reasonable for data sets with fewer than 100,000 examples.

In modern deep learning, if you have a million or more examples, a reasonable split would be 98% training, 1% dev, 1% test.
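
To make the proportions concrete, this small sketch turns a data set size and the dev/test fractions into absolute set sizes. The helper name `split_sizes` is hypothetical.

```python
def split_sizes(m, dev_frac, test_frac):
    """Return (train, dev, test) sizes for m examples."""
    dev = round(m * dev_frac)
    test = round(m * test_frac)
    return m - dev - test, dev, test


modern = split_sizes(1_000_000, 0.01, 0.01)   # (980000, 10000, 10000)
classic = split_sizes(10_000, 0.2, 0.2)       # (6000, 2000, 2000)
```

With a million examples, 1% still leaves 10,000 examples each for the dev and test sets.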

### When to change dev/test sets and metrics

Suppose your evaluation metric is the simple error:

$$\text{Error} = \frac{1}{m_{dev}} \sum_{i=1}^{m_{dev}} \mathcal{I}\{y_{\text{pred}}^{(i)} \neq y^{(i)}\}$$

If you don't want to misclassify a particular group of inputs (e.g. porn images), you can increase the weight of the error for that group of observations:

$$\text{Error} = \frac{1}{\sum_i w^{(i)}} \sum_{i=1}^{m_{dev}} w^{(i)} \mathcal{I}\{y_{\text{pred}}^{(i)} \neq y^{(i)}\}$$

where $w^{(i)} = 1$ if $x^{(i)}$ is non-porn and $w^{(i)} = 10$ if $x^{(i)}$ is porn.
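
The weighted error can be sketched in a few lines. The labels, predictions and weights below are invented for illustration; only the formula comes from the text above.

```python
def weighted_error(y_pred, y_true, weights):
    """Weighted misclassification rate, normalised by the sum of weights."""
    if not (len(y_pred) == len(y_true) == len(weights)):
        raise ValueError("inputs must have the same length")
    mistakes = sum(w for p, t, w in zip(y_pred, y_true, weights) if p != t)
    return mistakes / sum(weights)


# Hypothetical dev set of five images. The last one belongs to the
# group we must not misclassify, so it gets weight 10.
y_true = [1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1]
weights = [1, 1, 1, 1, 10]

plain = weighted_error(y_pred, y_true, [1] * 5)      # 2/5
weighted = weighted_error(y_pred, y_true, weights)   # 11/14
```

With equal weights the two mistakes count the same; with the weights above, the mistake on the sensitive image dominates the metric.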

Changing the metric in this way is an example of orthogonalization: you take a machine learning problem and break it into distinct steps:

* Figure out how to define a metric that captures what you want to do - place the target.
* Worry about how to actually do well on this metric - how to aim/shoot accurately at the target.

Conclusion: if doing well on your metric and on the dev/test set doesn't correspond to doing well in your application, change your metric and/or dev/test set.

## Comparing to human-level performance

We compare to human-level performance for two main reasons:
* Because of advances in deep learning, machine learning algorithms are suddenly working much better and so it has become much more feasible in a lot of application areas for machine learning algorithms to actually become competitive with human-level performance.
* It turns out that the workflow of designing and building a machine learning system is much more efficient when you're trying to do something that humans can also do.

Progress is usually fast until an algorithm reaches human-level performance, and slows down after that.

No algorithm can do better than the *Bayes optimal error*, which is the lowest error achievable by any function mapping $X$ to $Y$.

For many tasks there isn't much room between human-level error and Bayes optimal error.

Humans are quite good at a lot of tasks. So as long as machine learning is worse than humans, you can:
* Get labeled data from humans.
* Gain insight from manual error analysis: why did a person get it right?
* Do a better analysis of bias/variance: for some problems, like computer vision, it is enough for the training error to get close to human-level error; past that point you can focus on reducing the difference between train and dev error.

Sometimes the human-level error can be considered a proxy for the Bayes optimal error, and the difference between the human error and the training error can be considered **avoidable bias**.

### Improving your model performance

The two fundamental assumptions of supervised learning:
* You can fit the training set pretty well. This is roughly saying that you can achieve low avoidable bias.
* The training set performance generalizes pretty well to the dev/test set. This is roughly saying that variance is not too bad.

To improve your deep learning supervised system follow these guidelines:
* Look at the difference between human-level error and the training error - avoidable bias.
* Look at the difference between the dev/test set error and the training set error - variance.

If avoidable bias is large you have these options:
* Train a bigger model.
* Train longer or use a better optimization algorithm (like Momentum, RMSprop, Adam).
* Search for a better NN architecture or better hyperparameters.

If variance is large you have these options:
* Get more training data.
* Use regularization (L2, dropout, data augmentation).
* Search for a better NN architecture or better hyperparameters.
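
The guidelines above amount to comparing two gaps. A minimal sketch, assuming human-level error is used as a proxy for Bayes error; the error rates are hypothetical and given in percent.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus), where focus names
    the larger of the two gaps."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


case_1 = diagnose(1.0, 8.0, 10.0)   # (7.0, 2.0, "bias")
case_2 = diagnose(7.5, 8.0, 10.0)   # (0.5, 2.0, "variance")
```

The training and dev errors are the same in both cases; only the human-level error changes, and with it the recommended focus.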

## Error Analysis

**Error analysis**: the process of manually examining the mistakes that your algorithm is making. It can give you insights into what to do next. For example:

In the cat classification example, suppose you have 10% error on your dev set and you want to decrease it. You discover that some of the misclassified examples are dog pictures that look like cats. Should you try to make your cat classifier do better on dogs (this could take some weeks)?

Error analysis approach:
* Get 100 misclassified dev set examples at random.
* Count up how many are dogs.
    * If 5 of 100 are dogs, then doing better on dogs can reduce your error from 10% to 9.5% at best (this best case is called the ceiling), which may be too little.
    * If 50 of 100 are dogs, then you could reduce your error to as low as 5%, which is reasonable and worth working on.
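
The ceiling in the two cases above follows from a one-line calculation, sketched here with a hypothetical helper `error_ceiling`. Errors are in percent.

```python
def error_ceiling(dev_error, n_category, n_sampled):
    """Best-case dev error if every mistake in one category were fixed,
    assuming the sample is representative of all dev set mistakes."""
    share = n_category / n_sampled
    return dev_error * (1 - share)


few_dogs = error_ceiling(10.0, 5, 100)     # about 9.5
many_dogs = error_ceiling(10.0, 50, 100)   # 5.0
```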

As this example shows, error analysis helps you estimate the payoff of an idea before you spend a lot of time on it for little gain.

You can also evaluate several error analysis ideas in parallel and choose the best one: get 100 misclassified dev set examples, group the mistakes into categories, and count which categories account for the largest share of the mistakes.
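
A sketch of this tally, assuming each misclassified example has been tagged by hand with zero or more categories. The tags and the categories ("dog", "blurry", "great cat") are invented for illustration.

```python
from collections import Counter

# Hypothetical hand-assigned tags for eight misclassified dev examples.
# An example can carry more than one tag, or none.
tagged = [
    {"dog"}, {"blurry"}, {"great cat"}, {"blurry", "dog"},
    {"blurry"}, {"great cat"}, {"blurry"}, set(),
]


def category_shares(tagged):
    """Share of the sampled mistakes in each category, largest first."""
    counts = Counter(tag for tags in tagged for tag in tags)
    n = len(tagged)
    return {tag: count / n for tag, count in counts.most_common()}


shares = category_shares(tagged)
```

Because one example can belong to several categories, the shares do not have to sum to 1. Here "blurry" accounts for half of the sampled mistakes, so it would be the most promising category to work on.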

### Cleaning up incorrectly labeled data

DL algorithms are quite robust to random label errors in the training set, but less robust to systematic ones. It is still fine to go and fix incorrect labels if you can.

## Mismatched training and dev/test set

### Training and testing on different distributions

There are some strategies to follow when the training set distribution differs from the dev/test set distribution. A typical case: you have a relatively limited amount of the data you really care about, and you add a huge amount of similar, but not identical, data to have more observations.

* Option one (not recommended): shuffle all the data together and randomly extract the training and dev/test sets.
  * Advantages: all the sets now come from the same distribution.
  * Disadvantages: the real-world distribution you care about will make up only a small part of the new dev/test sets, which is probably not what you want.
      * $\rightarrow$ the dev set should be made of what you aim at!

* Option two: take some of the dev/test set examples and add them to the training set, **but keep the dev/test sets made up mostly of the data you really care about**.
  * Advantages: the distribution you care about is your target now.
  * Disadvantage: the distributions of the training and dev/test sets are now different. In the long run, however, this split gives better performance.
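
One way to sketch option two, under the assumption that the large data set is called `web_data` and the data you care about is called `target_data` (both names, and the sizes, are hypothetical). Here the dev and test sets contain target data only.

```python
import random


def build_splits(web_data, target_data, n_target_in_train, seed=0):
    """Train on all web data plus part of the target data;
    dev and test are made of the remaining target data only."""
    rng = random.Random(seed)
    target = list(target_data)
    rng.shuffle(target)
    train = list(web_data) + target[:n_target_in_train]
    rest = target[n_target_in_train:]
    half = len(rest) // 2
    dev, test = rest[:half], rest[half:]
    rng.shuffle(train)
    return train, dev, test


# Scaled-down example: 2000 web images, 100 images from the target app.
web = [("web", i) for i in range(2000)]
app = [("app", i) for i in range(100)]
train, dev, test = build_splits(web, app, n_target_in_train=50)
```

The training set is dominated by web data, while every dev and test example comes from the distribution you are aiming at.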

### Bias and Variance with mismatched data distributions

Suppose you have 1% error on the training set and 10% error on the dev set. This is not necessarily a variance problem: the two sets come from different distributions, so the images in the training set may simply have been easier to classify than those in the dev set.

One check is to set aside a small sample of the training set that is not used for training, called the train-dev set, and compute the error on it:
* If the error on the train-dev set is 9% (similar to the dev/test set), then it is a variance problem.
* If the error on the train-dev set is 1.5% (similar to the training set), then it is a data mismatch problem between the training and dev/test sets.

$$\text{human-level error} \underbrace{<}_{\text{avoidable bias}} \text{training set error} \underbrace{<}_{\text{variance}} \text{train-dev set error} \underbrace{<}_{\text{data mismatch}} \text{dev error}$$

To which one can add

$$\text{dev error} \underbrace{<}_{\text{degree of overfit to dev set}} \text{test error}$$

If this last difference is large, then maybe you need a bigger dev set: the dev set and test set come from the same distribution, so the only way to do much better on the dev set than on the test set is to have somehow overfit the dev set.
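
The chain of gaps above can be computed directly. This sketch reuses the 1%, 9%, 1.5% and 10% figures from the example; the human-level and test errors are made-up values added only to complete the chain.

```python
def error_gaps(human, train, train_dev, dev, test):
    """Differences between consecutive error levels, in the same units."""
    return {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
        "overfit to dev": test - dev,
    }


# Train-dev error close to the dev error: a variance problem.
gaps_a = error_gaps(human=0.0, train=1.0, train_dev=9.0, dev=10.0, test=10.5)
# Train-dev error close to the training error: a data mismatch problem.
gaps_b = error_gaps(human=0.0, train=1.0, train_dev=1.5, dev=10.0, test=10.5)

largest_a = max(gaps_a, key=gaps_a.get)   # "variance"
largest_b = max(gaps_b, key=gaps_b.get)   # "data mismatch"
```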

### Addressing data mismatch

There aren't completely systematic solutions to this, but there are some things you could try:
* Carry out manual error analysis to try to understand the difference between the training and dev/test sets.
* Make the training data more similar to the dev/test sets, or collect more data similar to them.

If your goal is to make the training data more similar to your dev set, one technique you can use is **artificial data synthesis**, which can help you create more training data $\rightarrow$ very useful in speech recognition systems.

The idea is to combine some of your training data with something that can convert it to the dev/test set distribution.

Examples:
* Combine clean audio with car noise to get examples of audio recorded in a car.
* Generate cars using 3D graphics in a car classification example.

Be cautious: you might accidentally be simulating data from only a tiny subset of the space of all possible examples, and your NN might overfit the generated data (for example one particular car noise recording, or one particular design of 3D graphics cars).
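
A toy sketch of the audio example, with clips represented as plain lists of samples. The function name, the gain and the signals are all hypothetical. Drawing a different window of noise for every synthetic example is one simple way to avoid reusing the exact same noise segment, although it does not help if the noise recording itself is short or unrepresentative.

```python
import random


def mix_with_noise(clean, noise, noise_gain=0.5, rng=random):
    """Overlay a randomly chosen window of `noise` onto `clean`."""
    if len(noise) < len(clean):
        raise ValueError("noise clip must be at least as long as the clean clip")
    start = rng.randrange(len(noise) - len(clean) + 1)
    window = noise[start:start + len(clean)]
    return [c + noise_gain * n for c, n in zip(clean, window)]


rng = random.Random(0)
clean = [0.0, 0.5, 1.0, 0.5]                          # stand-in for clean speech
noise = [rng.uniform(-1, 1) for _ in range(1000)]     # stand-in for car noise
synthetic = [mix_with_noise(clean, noise, rng=rng) for _ in range(3)]
```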

## Learning from multiple tasks


### Transfer learning

Transfer learning consists of taking the knowledge learned on a task A and applying it to another task B.

For example, if you have trained a cat classifier on a lot of data, you can reuse part of the trained NN to solve an X-ray classification problem.

To do transfer learning, delete the last layer of the NN and its weights, then:
* Option 1: if you have a small data set, keep all the other weights fixed. Add one or more new last layers, initialize their weights, feed the new data $(X,y)$ to the NN and learn the new weights $\rightarrow$ you only optimize the weights of the new layers.
* Option 2: if you have enough data, you can retrain all the weights.

Training on task A is called pretraining, and updating the weights on task B (options 1 and 2) is called fine-tuning.

Transfer learning makes sense if tasks A and B have the same type of input $X$ (e.g. image, audio), and you have a lot of data for task A, which you are transferring from, and relatively little data for task B, which you are transferring to. Low-level features learned on task A can then be helpful for learning task B.
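
To illustrate option 1, here is a deliberately tiny pure-Python sketch: one "pretrained" hidden layer is kept frozen and only a new logistic output layer is trained with gradient descent. The weights, the data and the hyperparameters are all made up, and a real system would use a deep learning framework instead.

```python
import math

# "Pretrained" layer from task A (hypothetical weights): 2 inputs -> 2 units.
W_frozen = [[1.0, -1.0], [0.5, 0.5]]


def features(x):
    """Frozen layer ReLU(W_frozen x); never updated during fine-tuning."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_frozen]


def predict(x, w_new, b_new):
    """New output layer: logistic unit on top of the frozen features."""
    z = sum(w * h for w, h in zip(w_new, features(x))) + b_new
    return 1.0 / (1.0 + math.exp(-z))


def loss(data, w_new, b_new):
    """Average logistic loss on the task B data."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(x, w_new, b_new)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def fine_tune_last_layer(data, steps=200, lr=0.1):
    """Gradient descent on the new output layer only."""
    w_new, b_new = [0.0, 0.0], 0.0   # freshly initialised last layer
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            h = features(x)
            err = predict(x, w_new, b_new) - y   # d(loss)/dz for logistic loss
            grad_w = [g + err * hi for g, hi in zip(grad_w, h)]
            grad_b += err
        w_new = [w - lr * g / len(data) for w, g in zip(w_new, grad_w)]
        b_new -= lr * grad_b / len(data)
    return w_new, b_new


# Small task B data set (hypothetical).
data_b = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
w_new, b_new = fine_tune_last_layer(data_b)
```

Option 2 would differ only in that the gradient step would also update `W_frozen`.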

### Multi-task learning

In multi-task learning, a single model predicts several targets at once, for example whether an image contains a car, a traffic light and a pedestrian.

Unlike *softmax*, each example can have multiple labels. In multi-task learning each $y^{(i)}$ has dimension $J$, equal to the number of items that can be recognized in every image, for example $y^{(i)} = [0\ 1\ 0\ 0\ 1\ 1]$, and the cost function sums over all observations and all items in $y^{(i)}$:

$$\text{Cost} = \frac{1}{m} \sum_{i=1}^m \sum_{j=1}^J L(\hat{y}^{(i)}_j, y^{(i)}_j)$$



If some of the earlier features in the neural network can be shared between the different types of objects, then training one neural network to do $J$ things results in better performance than training $J$ completely separate neural networks.

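It also works if $y$ isn't complete, i.e. some labels are missing (marked with $?$). In that case the sum over $j$ only includes the entries that have a label:

$$Y = \begin{bmatrix}
1 & ? & 1 & \dots \\
0 & 0 & 1 & \dots \\
? & 1 & ? & \dots
\end{bmatrix}$$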
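
A minimal sketch of the cost with missing labels, assuming $L$ is the logistic loss, each row holds the labels of one example, and the `?` entries are stored as `None`. The predictions are made-up probabilities strictly between 0 and 1.

```python
import math


def multitask_cost(y_hat, y):
    """Sum of per-label logistic losses, averaged over examples.
    Labels equal to None (the "?" entries) are skipped."""
    total = 0.0
    for preds, labels in zip(y_hat, y):
        for p, t in zip(preds, labels):
            if t is None:
                continue
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y)


y = [[1, None, 1], [0, 0, 1], [None, 1, None]]
y_hat = [[0.9, 0.2, 0.8], [0.1, 0.3, 0.7], [0.5, 0.6, 0.4]]
cost = multitask_cost(y_hat, y)
```

Predictions at positions without a label do not contribute to the cost, so they do not affect training.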
    

Multi-task learning makes sense if:
* You are training on a set of tasks that could benefit from having shared lower-level features.
* Usually, the amount of data you have for each task is quite similar.
* You can train a big enough network to do well on all the tasks.

If you can train a big enough NN, multi-task learning performs better than training a separate network for each task.

However, today transfer learning is used more often than multi-task learning.

## End-to-end deep learning

Some systems have multiple stages to implement. An end-to-end deep learning system implements all these stages with a single NN.

Example 1:
* Speech recognition system:
    * Audio ---> Features --> Phonemes --> Words --> Transcript    # non-end-to-end system
    * Audio ------------------------------------------------------> Transcript    # end-to-end deep learning system

End-to-end deep learning gives the data more freedom: the network might not use phonemes at all when training!

To build an end-to-end deep learning system that works well, we need a big dataset (more data than for a non-end-to-end system). If we have a small dataset, the traditional pipeline could work just fine.

Example 2:

* Face recognition system:
    * Image ---------------------> Face recognition    # end-to-end deep learning system
    * Image --> Face detection --> Face recognition    # two-step deep learning system - best approach for now

In practice, the second approach is the best one for now.

The second implementation is a two-step approach in which both steps are implemented using deep learning. It works well because it is harder to get a lot of pictures of people in front of the camera labeled with their identity than to get pictures of faces and compare them.

To train the last step of the second implementation, the NN takes two faces as input and outputs whether they belong to the same person or not.

Example 3:

* Machine translation system:
    * English --> Text analysis --> ... --> French    # non-end-to-end system
    * English ---------------------------------> French    # end-to-end deep learning system - best approach

Here the end-to-end deep learning system works better because we have enough data to build it.

Example 4:

* Estimating a child's age from an X-ray picture of a hand:
    * Image --> Bones --> Age    # non-end-to-end system - best approach for now
    * Image ----------------> Age    # end-to-end system

In this example the non-end-to-end system works better because we don't have enough data to train an end-to-end system.

### Whether to use end-to-end deep learning

Pros of end-to-end deep learning:
* Let the data speak. With a pure machine learning approach, a NN learning the mapping from $X$ to $Y$ may be better able to capture whatever statistics are in the data, rather than being forced to reflect human preconceptions. For example, in speech recognition the *phonemes* are just a human construction.
* Less hand-designing of components needed.

Cons of end-to-end deep learning:
* May need a large amount of data.
* Excludes potentially useful hand-designed components (these help most on smaller datasets).

Applying end-to-end deep learning:

* Key question: do you have sufficient data to learn a function of the complexity needed to map $X$ to $Y$?
* Use ML/DL to learn some individual components.
* When applying supervised learning, carefully choose which $X$ to $Y$ mappings you want to learn depending on what tasks you can get data for $\rightarrow$ sometimes it is better to use DL to solve only part of a bigger problem.