# ML Strategy

Based on Deeplearning.ai's Coursera course. 

# Week 1

## Orthogonalization

The **chain of assumptions** in ML is:

    1. fit training set well on cost function
    2. fit dev set well
    3. fit test set well
    4. perform well in real world

We need **different/orthogonal** tuning techniques at the different stages above.

Andrew Ng: typically **don't** use early stopping, because it affects both 1 and 2 above (it is not an orthogonal control).

## Goal

Define a **single** number evaluation metric.

F1 score is the **harmonic mean** of precision and recall.
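
As a minimal sketch of the harmonic mean above (the classifier names and their precision/recall values are made up for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifiers: name -> (precision, recall)
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)  # the single number makes ranking trivial
```

With a single number per classifier, picking the better one is a plain `max`, which is the point of having one evaluation metric.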

When you have N metrics, choose one as the **optimization objective**; the others become **constraints** that only need to meet a threshold.
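
One way to picture this, as a rough sketch: treat accuracy as the optimization objective and latency as a constraint. The candidate models, their numbers, and the 100 ms threshold are all assumptions made up for this example.

```python
# Hypothetical candidates; accuracy is optimized, latency is a constraint.
candidates = [
    {"name": "A", "accuracy": 0.90, "latency_ms": 80},
    {"name": "B", "accuracy": 0.92, "latency_ms": 95},
    {"name": "C", "accuracy": 0.95, "latency_ms": 1500},
]
MAX_LATENCY_MS = 100  # assumed threshold for the constraint


def pick_model(models, max_latency_ms):
    """Return the most accurate model that satisfies the latency constraint."""
    feasible = [m for m in models if m["latency_ms"] <= max_latency_ms]
    if not feasible:
        return None  # no model meets the constraint
    return max(feasible, key=lambda m: m["accuracy"])


chosen = pick_model(candidates, MAX_LATENCY_MS)
```

Model C has the highest accuracy but violates the constraint, so it is never considered.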

## Train/Dev/Test Distributions

They must come from the **same distribution**! Otherwise you'd be moving the goalposts, i.e. training to hit one target but testing against another.

## When to change the metrics

When the metric is not ranking the algos in the desired order, i.e. when it fails to account for some factors that matter.

## Human Level Performance

**Avoidable bias** - difference between human-level error (used as a proxy for Bayes error) and training error. When this is large, focus on reducing bias.

Human-level performance is sometimes hard to define: one person, a professional, or a team of professionals?

**Bayes optimal error rate** is the lowest theoretically possible error rate. Human-level performance is often close to, but not as good as, the Bayes error rate.

When the **training error rate** is far from human-level performance, but the difference between the training and dev error rates is not as large, the focus should be to **reduce bias**, for example by:

* using a larger model
* using a better optimization algorithm, or training longer
* using a different NN architecture/hyperparameters, etc.

When the **difference between training and dev error rates** is large relative to the difference between training and Bayes error rates, the focus should be to **reduce variance**, for example by:

* using more regularization
* getting more training data
* using a different NN architecture/hyperparameters, etc.
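
The comparison between the two gaps can be sketched as a small helper. The function name `diagnose`, the tie-breaking rule, and the error rates in the usage lines are assumptions for illustration; human-level error stands in for Bayes error.

```python
def diagnose(human_error: float, train_error: float, dev_error: float) -> str:
    """Suggest a focus by comparing avoidable bias with variance.

    Human-level error is used as a proxy for Bayes error.
    Ties go to "reduce variance" (an arbitrary choice for this sketch).
    """
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    return "reduce bias" if avoidable_bias > variance else "reduce variance"


# Hypothetical error rates, as fractions
focus_1 = diagnose(human_error=0.01, train_error=0.08, dev_error=0.10)
focus_2 = diagnose(human_error=0.075, train_error=0.08, dev_error=0.10)
```

The same training and dev errors lead to different priorities depending on where human-level performance sits.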

# Week 2

## Error Analysis

Count the dev set examples the model got wrong, by category, to get an idea of the upper bound on the improvement potential of fixing each category.

Use a spreadsheet to score each potential issue/idea.
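
The spreadsheet tally can also be sketched in a few lines of Python. The tags, the ten tagged examples, and the 10% dev error below are invented for illustration; each misclassified example may carry several tags or none.

```python
from collections import Counter


def improvement_ceilings(tagged_errors, dev_error):
    """Map each tag to the most dev error that fully fixing it could remove."""
    counts = Counter(tag for tags in tagged_errors for tag in tags)
    n_errors = len(tagged_errors)
    return {tag: dev_error * count / n_errors for tag, count in counts.items()}


# Hypothetical tags assigned while reviewing 10 misclassified dev examples
tagged_errors = [
    {"dog"}, {"blurry"}, {"blurry", "great_cat"}, {"great_cat"}, {"blurry"},
    set(), {"dog"}, {"blurry"}, {"great_cat"}, {"blurry"},
]

ceilings = improvement_ceilings(tagged_errors, dev_error=0.10)
for tag, ceiling in sorted(ceilings.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: at most {ceiling:.1%} of dev error")
```

A category that accounts for half of the reviewed errors can remove at most half of the dev error, which helps decide where effort is worth spending.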

### Incorrect Labels

DL algorithms are quite robust to **random errors** in the training set. They are not robust to **systematic errors**.

For **dev & test sets**, try to identify incorrect labels during error analysis and work out their percentage impact. If significant, fix the data.

Break down the overall dev set error into:

* error due to incorrect labels, %
* errors due to other issues, %

When correcting labels:

* apply the same process to both dev and test sets to ensure they continue to come from the same distribution
* consider examples your algo got right as well as ones it got wrong
* training and dev/test data may now come from slightly different distributions (often this is ok)

## Mismatched Training and Dev/Test set

Problem: 2 sets of data coming from different distributions, e.g. high and low resolution images.

### Option 1

Combine the data and shuffle. **Disadvantage**: the dev and test sets may contain only a small portion of the data that you actually care about.

### Option 2

Split the data coming from the distribution you **want to target** into two parts. One part is mixed with the other data set to form the training set. The remaining part is **further split** into dev and test sets.

This method gives better performance because even though most of the training data is from a different distribution, the dev/test sets are from your target distribution.
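
A minimal sketch of Option 2, assuming the data fits in Python lists. The function name, the "web"/"mobile" sources, and the sizes are hypothetical.

```python
import random


def split_option_2(other_data, target_data, n_target_in_train, seed=0):
    """Build train from other data plus some target data; dev/test are target-only."""
    rng = random.Random(seed)
    target = list(target_data)
    rng.shuffle(target)

    to_train = target[:n_target_in_train]
    held_out = target[n_target_in_train:]

    train = list(other_data) + to_train
    rng.shuffle(train)

    half = len(held_out) // 2
    dev, test = held_out[:half], held_out[half:]
    return train, dev, test


# Hypothetical data: 200 "web" examples and 20 "mobile" (target) examples
web = [("web", i) for i in range(200)]
mobile = [("mobile", i) for i in range(20)]

train, dev, test = split_option_2(web, mobile, n_target_in_train=10)
```

The dev and test sets contain only target-distribution examples, so the metric tracks the data you care about even though the training set is dominated by the other source.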

###  Bias / Variance 

With option 2 above, the bias/variance assessment changes too. The dev set error may be higher than the training set error simply because the two sets come from different distributions, so we can **no longer** conclude that we have a high variance problem.

**Solution**: shuffle the original training set, then split off a small portion as the **training-dev** set; the rest remains as the training set. Train the model on the smaller training set, and use the training-dev set for error analysis. If the training-dev set error is still much higher than the training error, then we can conclude that there is a variance problem.

If training-dev set error is close to training error, but dev/test set error rates are high, then you have a **data mismatch** problem.


|       | Error Rate | Comment |
|:------|-----------:|:--------|
| Human Level | 4% | (proxy for Bayes Error) |
| Training set | 7% | Avoidable Bias = 7% - 4% = 3% |
| Training-dev set | 10% | Variance = 10% - 7% = 3% | 
| Dev set | 12% | Data mismatch error = 12% - 10% = 2% | 
| Test set | 12% | Degree of overfitting to dev set = 12% - 12% = 0% |
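
The gaps in the table can be computed mechanically. This is a small sketch using the table's own numbers (in percent); the function and key names are made up for this example.

```python
def decompose_errors(human, train, train_dev, dev, test):
    """Split the error chain into the four gaps from the table (same units as inputs)."""
    return {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
        "dev_overfitting": test - dev,
    }


# Error rates from the table above, in percent
gaps = decompose_errors(human=4, train=7, train_dev=10, dev=12, test=12)
for name, gap in gaps.items():
    print(f"{name}: {gap}%")
```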

More general formulation

|                | General Data          | Specific Target Data | 
|:---------------|:---------------------|:--------------------|
| Human Level    | "Human level", 4%     | 6% (Ask human to label)  |
| Error on data trained on | "Training error", 7% | 12% (use target data in training)   |
| Error on data not trained on | "Training-dev error", 10% | "Dev/Test error", 12% | 

### Data Mismatch

* Carry out manual error analysis to understand the differences between training and dev/test sets
* Make training data more similar; or collect more data similar to dev/test sets

**Artificial Data Synthesis**

Example: Combine clean speech data + car noise = **synthesized** in-car audio

**Caution**: if the amount of car noise is too small here, e.g. 1 hour of noise vs 10k hours of clean speech, the same noise gets repeated many times and it's possible that the network would **overfit** the noise/synthesized data.

Synthesized data may only represent a small subset of the full data set/distribution.
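
A toy sketch of the synthesis step, assuming audio is represented as plain lists of samples; the function name, the gain parameter, and the sample values are invented. It also makes the caution visible: a short noise clip has to be looped to cover the speech, so the network sees the same noise over and over.

```python
import itertools


def mix_with_noise(speech, noise, noise_gain=0.5):
    """Add looped, scaled noise to speech, sample by sample."""
    if not noise:
        raise ValueError("noise clip must not be empty")
    looped_noise = itertools.cycle(noise)  # short clip repeats to cover the speech
    return [s + noise_gain * n for s, n in zip(speech, looped_noise)]


# Hypothetical samples: 5 speech samples, but only 2 noise samples
clean_speech = [1.0, 1.0, 1.0, 1.0, 1.0]
car_noise = [0.2, -0.2]

in_car_audio = mix_with_noise(clean_speech, car_noise, noise_gain=0.5)
```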

## Transfer Learning

Example: Train a network on images for an image recognition task, then remove the weights for the last layer $L$ (typically FC), replace it with another layer with randomly initialized weights, and train the network again on radiology images for a radiology diagnosis task. Weights for the layers before the last layer are **frozen** in the new training.

Sometimes we pre-compute the outputs of the frozen layers for each input and save them to disk, so that the transfer learning training only involves the new layer.

Depending on how much radiology image data you have, you can:

* small data: freeze weights for previous layers and only retrain the weights for the last layer
* large data: retrain all weights (in this case, the first training is known as **pre-training**, and the second training is known as **fine-tuning**)

When does it make sense for transfer learning from A to B?

* Task A and B have the same input x
* You have a lot more data for task A than B

## Multi-task Learning

Autonomous driving needs to identify a lot of different objects. With output label $\hat{y} \in \mathbb{R}^N$, the loss function is (summing over all dimensions of $\hat{y}$):

$$ \mathcal{L} = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{N} \mathcal{L}\big(\hat{y}_j^{(i)}, y_j^{(i)}\big) $$

In the labels $y$, there could be `nan` values because some objects may not be labelled. The summation above over $j$ should ignore these `nan` fields.
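
A rough sketch of the summation with `nan` labels skipped. It assumes the per-label loss is the logistic (binary cross-entropy) loss and that predictions lie strictly between 0 and 1; the function name and the example values are hypothetical.

```python
import math


def masked_multitask_loss(y_hat, y):
    """Average over examples of the summed per-label logistic loss, skipping NaN labels.

    Assumes each prediction is strictly between 0 and 1.
    """
    total = 0.0
    for preds, labels in zip(y_hat, y):
        for p, t in zip(preds, labels):
            if math.isnan(t):
                continue  # unlabelled entry: contributes nothing to the sum
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y)


# Hypothetical batch: 2 examples, 2 object types; one label is missing
predictions = [[0.5, 0.9], [0.5, 0.5]]
labels = [[1.0, float("nan")], [0.0, 1.0]]

loss = masked_multitask_loss(predictions, labels)
```

The missing label is dropped from the inner sum, while the outer average still divides by the number of examples $m$, matching the formula above.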

When does multi-task learning make sense?

* Train on a set of tasks that could benefit from having shared lower-level features
* Usually: the amount of data you have for each task is quite similar
* Can train a big enough network to do well on all the tasks

## End-to-End Learning

A traditional pipeline works well with small datasets, such as 3000 hours of data.

But end-to-end learning works better with 10k+ hours of data.

Sometimes you don't have enough data to solve a problem end-to-end. Then you can break it down into steps, where for each step you have a lot more data.

End-to-end works quite well for machine translation. It works less well for estimating a child's age; the solution there is to break the problem down, e.g. measure each bone's length first.

### When to use End-to-End?

Pros:
* Let the data speak
* Less hand-designing of components needed

Cons:
* May need a large amount of data
* Excludes potentially useful hand-designed components

Key question: **Do you have sufficient data to learn a function of the complexity needed to map x to y?**