# Introduction to ML Strategy
(module-level)

## Part 1. Why ML Strategy
(video-level)


The goal: get your models working much more quickly and efficiently.


Motivating example: Imagine your cat classifier has 90% accuracy and you want to improve it.

You might have many ideas:
- more data
- more diverse set
- training algorithm
- ...


There are so many things to try, only to realize that, for example, more data barely changes the results. We need quick and effective ways to identify which ideas are worth pursuing for a given problem.

This course teaches strategies and lessons learned from shipping a large number of products: things that are not really taught at school.

- DL strategies are different from traditional ML strategies and are constantly evolving.



## Part 2. Orthogonalization

One major challenge is that there are so many things to try.

Good ML engineers are very good at knowing what to change in order to get a desired change in the model performance.

TV example:
- Various knobs controlling various adjustments
- Each knob makes one distinct adjustment: vertical position, horizontal position, width, etc.
- Now imagine that each knob changed many things at once: the picture would be very hard to tune.
- Orthogonalization: each knob adjusts one distinct feature.


Car example:
- Steering 
- Accelerator
- Braking

Imagine these were controlled by the same knob (e.g., one control that applies 0.1 steering and 0.5 acceleration at once); it would be very hard to drive the car.

You don't want your controls to affect multiple things at once. In the car example, you don't want the accelerator to change the direction of the car, and vice versa.


Going back to ML: 

#### Chain of assumptions in ML 
In order of dependence (steps 1 to 4), you assume the following:

- Fit the training set well on the cost function,
- (which you hope leads to) fitting the dev set well on the cost function,
- (which you hope leads to) fitting the test set well on the cost function,
- (which you hope leads to) performing well in the real world.



For each step, you want an orthogonal (separate) set of things to try in order to fix that step.


For example, if you are not achieving step 1 (fitting the training set), you would try:

- Bigger network
- A better optimization algorithm (e.g., Adam)

For step 2 (fitting the dev set) you would try:

- Regularization
- Bigger training set

For step 3 (not fitting the test set well) you would try:

- Get a larger dev set: doing well on the dev set but not on the test set suggests you over-tuned to the dev set.

For step 4 (not performing well in the real world):

- Change the dev/test set distribution
- Change the cost function: if the model performs well on the test set but not in the real world, the dev/test set or the cost function is not measuring the right thing.
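
The chain of assumptions can be sketched as a small lookup: find the first step that fails and return only the knobs that belong to that step. This is a minimal illustration, not a definitive procedure; the names `KNOBS`, `first_failing_step` and `knobs_to_try` are hypothetical, and deciding whether a step is "ok" is left to the caller.

```python
# Knobs listed in the notes, keyed by the first step of the chain that fails.
KNOBS = {
    1: ["bigger network", "better optimizer (e.g. Adam)"],
    2: ["regularization", "bigger training set"],
    3: ["larger dev set"],
    4: ["change dev/test set distribution", "change cost function"],
}


def first_failing_step(train_ok, dev_ok, test_ok, real_world_ok):
    """Return the number (1-4) of the first step that fails, or None."""
    checks = [train_ok, dev_ok, test_ok, real_world_ok]
    for step, ok in enumerate(checks, start=1):
        if not ok:
            return step
    return None


def knobs_to_try(train_ok, dev_ok, test_ok, real_world_ok):
    """Return only the knobs that belong to the first failing step."""
    step = first_failing_step(train_ok, dev_ok, test_ok, real_world_ok)
    return [] if step is None else KNOBS[step]


# Training fit is fine, but the dev set is not: only step-2 knobs come back.
print(knobs_to_try(True, False, False, False))
```

Keeping one list of knobs per step is the point of orthogonalization: a fix for one step should not be the go-to fix for another.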


Andrew personally doesn't like early stopping because it affects the first two steps at the same time: it makes you fit the training set less well while also trying to improve dev set performance. But of course it is not too horrible to sometimes have knobs that affect two things.



# Setting up Your Goal

## Part 1. Single Number Evaluation Metric

You will find that with one single-number evaluation metric it is much faster to iterate.

Scenario one: 

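    Precision and recall. There is often a trade-off between the two, so with two numbers it is hard to tell which classifier is better. This motivates using the F1 score (the harmonic mean of precision and recall) instead, so that we can iterate more quickly. Simple, but always good to keep in mind.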
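
As a small sketch of how the F1 score turns two numbers into one, the snippet below ranks two classifiers. The precision/recall values and the names `classifiers` and `best` are made up for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs; neither classifier wins on both.
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

# A single number makes the ranking unambiguous.
best = max(classifiers, key=lambda name: f1_score(*classifiers[name]))
for name, (p, r) in classifiers.items():
    print(name, round(f1_score(p, r), 4))
print("best:", best)
```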


Scenario two: 

    Imagine you evaluate on four countries:

```
# error
model1 = us: 3%, china: 7%, india: 5%, other: 5% 
model2 = us: 5%, china: 6%, india: 3%, other: 10% 

```

With this many numbers it is hard to rank the models. It is better to look at the average or a weighted average, so that you can quickly compare and iterate over different models.
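
A minimal sketch of both options, using the error rates from the table above. The per-region weights are hypothetical (for instance, they could reflect traffic share); only the error percentages come from the notes.

```python
# Error rates in percent, copied from the table above.
errors = {
    "model1": {"us": 3, "china": 7, "india": 5, "other": 5},
    "model2": {"us": 5, "china": 6, "india": 3, "other": 10},
}

# Hypothetical importance weights per region (an assumption, not from the notes).
region_weights = {"us": 4, "china": 3, "india": 2, "other": 1}


def mean_error(per_region):
    """Plain average over regions."""
    return sum(per_region.values()) / len(per_region)


def weighted_mean_error(per_region, weights):
    """Average over regions, weighting each region by its importance."""
    total_weight = sum(weights[region] for region in per_region)
    weighted_sum = sum(weights[region] * err for region, err in per_region.items())
    return weighted_sum / total_weight


for name, per_region in errors.items():
    print(name, mean_error(per_region), weighted_mean_error(per_region, region_weights))
```

With the plain average, model1 (5.0%) ranks ahead of model2 (6.0%); a different weighting could change the ranking, which is why the weights should reflect what you actually care about.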

## Part 2. Satisficing and Optimizing Metric

It is not always meaningful to combine metrics into a single number, though (as we did with precision and recall).

Example 1: Accuracy and running time: it doesn't really make sense to combine these metrics into a single evaluation metric.
    
    Solution: Satisficing/optimizing metrics: you choose accuracy as the optimizing metric, subject to running time < 100ms (the satisficing metric).
    
    
Example 2: Wake word detection, e.g. the accuracy of Alexa detecting the word "Alexa".

    Solution: The optimizing metric is accuracy. However, we also want to limit false positives, so we can set a satisficing metric: < 2 false positives every 24 hours of operation.
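
Model selection with one optimizing metric and one satisficing metric can be sketched as "filter, then maximize". The candidate models and their numbers below are hypothetical; only the 100 ms threshold comes from Example 1.

```python
def pick_model(candidates, max_runtime_ms=100):
    """Keep models that satisfy the runtime constraint, then maximize accuracy."""
    feasible = [m for m in candidates if m["runtime_ms"] < max_runtime_ms]
    if not feasible:
        return None  # nothing meets the satisficing threshold
    return max(feasible, key=lambda m: m["accuracy"])


# Hypothetical candidates: C is the most accurate but far too slow.
candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]

print(pick_model(candidates)["name"])
```

The satisficing metric only has to be "good enough"; beyond the threshold, a faster model earns no extra credit.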
    
    


## Part 3. Train/Dev/Test Set Distributions

Example 1: Let's say your dev/test data comes from various regions:

    - US
    - Other Europe
    - Other Asia
    - UK
    - China
    - India
    - Australia
    
One way would be to use the first 4 regions as the dev set and the remaining 3 regions as the test set. But this is actually a horrible idea, because you would be iterating on a dev set whose distribution differs from that of the data you want to perform well on (which is what the test set stands for).

    Solution: Randomly shuffle the data from all regions and then split it into dev and test sets. This ensures that dev and test have the same distribution.
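
A minimal sketch of the shuffle-then-split idea, using only the standard library. The synthetic `examples` list (100 placeholder items per region), the 50/50 split, and the function name `shuffled_dev_test_split` are assumptions made for illustration.

```python
import random


def shuffled_dev_test_split(examples, seed=0):
    """Shuffle all examples together, then cut the list in half."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(examples)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


regions = ["US", "Other Europe", "Other Asia", "UK", "China", "India", "Australia"]
# Placeholder data: (region, example_id) pairs standing in for real examples.
examples = [(region, i) for region in regions for i in range(100)]

dev, test = shuffled_dev_test_split(examples)
print(len(dev), len(test))
print(sorted({region for region, _ in dev}))  # regions represented in dev
```

Because the shuffle happens before the split, each region contributes to both sets in roughly equal proportion, instead of whole regions landing on one side.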


Real Scenario 1: Optimizing on a dev set of loan approvals for medium-income zip codes, then testing on low-income zip codes.

    This is a typical problem, where the data you want to perform well on (test set) has a different distribution from the data you iterated on.
    
To avoid this, we have to make sure:

- The test set is representative of the data you want to apply the model to and consider important to do well on.
- The dev set comes from the same distribution as the test set.
    
    
    

## Part 4. Size of the Dev and Test Sets

How large should they be? The answer is changing in the deep learning era (as was discussed before).

In the earlier era of smaller datasets, a 70-30% train/test split or a 60-20-20% train/dev/test split was typical.

Now we work with very large datasets, so it is reasonable to use something like a 98-1-1% split instead.



**Size of your test set**: Big enough to give high confidence in the overall performance of your system. If you set it to only 100 examples, the measured performance could be misleadingly high or misleadingly low.
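
To get a feel for why 100 examples is too few, here is a rough sketch based on the binomial standard error of a measured accuracy. It assumes test examples are independent, and the 90% accuracy is just an illustrative value; neither assumption comes from the notes.

```python
import math


def accuracy_standard_error(accuracy, n_examples):
    """Binomial standard error of an accuracy estimated on n_examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_examples)


# Assumed true accuracy of 90%; compare test sets of different sizes.
for n in (100, 10_000, 1_000_000):
    se = accuracy_standard_error(0.9, n)
    print(f"n={n}: standard error is about {se:.4f}")
```

With 100 examples the standard error is about 3 percentage points, so a measured 90% could plausibly be anywhere from the mid-80s to the mid-90s; with 10,000 examples it shrinks to about 0.3 points.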


In some applications you don't need a train/dev/test split; a train/dev split is enough (this was also discussed before). This is unusual, but sometimes it is fine.



## Part 5. When to Change Dev/Test Sets and Metrics?

Sometimes, partway through a project, we need to move the target and update the splits and metrics. When and how should we do that?


Change them when the rank order from your metric and dev set disagrees with what actually matters: your metric + dev set prefer A, whereas the company and users prefer B.


**Example: cat classifier and violent images**

The cat classifier with the lower error misclassifies a lot of violent images. This is a situation where the dev set and evaluation metric (accuracy) are not aligned with user preference: the users and the company absolutely want to avoid violent images.

Solution:

    Instead of using the plain average error over all samples, use a weighted average of the per-example losses:
    
$$
Error = \dfrac{1}{\sum_i w^{(i)}} \sum_i w^{(i)} L\{y_{pred}^{(i)}, y^{(i)}\}\\
w^{(i)} = 1 \text{ if } x^{(i)} \text{ is non-violent else } 10
$$
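
A minimal sketch of this weighted error, assuming `L` is the 0/1 misclassification loss (the notes do not pin down the loss). The toy labels and the name `weighted_error` are hypothetical.

```python
def weighted_error(y_pred, y_true, is_violent, violent_weight=10):
    """Weighted 0/1 error: mistakes on violent images count 10x by default."""
    weights = [violent_weight if violent else 1 for violent in is_violent]
    losses = [int(pred != true) for pred, true in zip(y_pred, y_true)]
    weighted_sum = sum(w * loss for w, loss in zip(weights, losses))
    return weighted_sum / sum(weights)


# Toy data: two mistakes, one of them on a violent image (the last example).
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 1]
is_violent = [False, False, False, True]

plain_error = sum(int(p != t) for p, t in zip(y_pred, y_true)) / len(y_true)
print(plain_error)                                  # unweighted error
print(weighted_error(y_pred, y_true, is_violent))   # penalizes the violent mistake
```

Here the plain error is 2/4, while the weighted error is 11/13, so a classifier that lets violent images through is ranked much worse under the new metric.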



This is actually an example of orthogonalization (separation of concerns):

1- Focus on how to define a metric to evaluate the classifiers (setting the target).

2- Then, worry separately about how to do well on this metric (training the model to hit the target).




# Comparing to Human-level Performance

## Part 1. Why human-level Performance?

Recently there has been a big trend of comparing ML models to humans.

Usually, progress is fast until the model reaches human-level performance. After surpassing human level, improvement usually slows down significantly and performance converges towards a theoretical limit given by the Bayes optimal error: the lowest error any function could achieve. It is basically impossible to do better than this.

Bayes optimal performance doesn't have to be a perfect score (e.g. 100% accuracy). If the images are very blurry, even the Bayes optimal cat classifier has less than 100% accuracy.


Why progress slows down after human-level?

Two main reasons:
- Human level is usually close to Bayes optimal, so there is little room left to improve.
- As long as the model is below human level, there are many tools we can use to improve the model's performance.


Some example tools that work as long as humans outperform the ML model:

- Get labeled data from humans.
- Manual error analysis to gain insight: why did a person get it right when the model failed?
- Better analysis of bias/variance.


## Part 2. Avoidable Bias

We usually want to do well on the training set. But a human-level benchmark tells us when we would be doing "too well" on the training set (i.e. overfitting), and when to stop trying to improve training performance and focus on other aspects.

Suppose human error is 1% on a task, training error is 8%, and dev error is 10%. Then you would focus on bias.

    So if your model is far from human level on the training set, the gap can be considered "avoidable bias"!

Now suppose human error is 7.5% on a task, training error is 8%, and dev error is 10%. Then you would not focus on bias; you would focus on reducing variance instead.

Human-level error is a nice proxy for Bayes error for most tasks such as object detection.

    So if your model is close to human level on the training set, the remaining bias can be considered "unavoidable bias"!
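
The two cases above come down to simple arithmetic. The sketch below computes avoidable bias and variance (in percentage points) and applies the heuristic of focusing on whichever gap is larger; the function name `diagnose` is hypothetical, and ties default to "variance" purely as an arbitrary choice.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus), all errors in percent."""
    avoidable_bias = train_error - human_error  # gap to the Bayes-error proxy
    variance = dev_error - train_error          # gap between training and dev
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


print(diagnose(1, 8, 10))    # human error 1%: large avoidable bias
print(diagnose(7.5, 8, 10))  # human error 7.5%: variance is the bigger gap
```

The training and dev errors are identical in both cases; only the human-level benchmark changes, and with it the recommended focus.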

## Part 3. Understanding Human-level Performance

What do we mean by human level: a "random human", an "expert human", or a "team of experts"?


Medical image classification:

- Typical human: 3% error
- Typical doctor: 1% error
- Experienced doctor: 0.7% error
- Team of experienced doctors: 0.5% error

Considering that human-level error is used as a proxy for the "Bayes optimal" error, Andrew suggests using 0.5% as the "human-level" error, because the last option shows that such a low error is achievable.


For deploying a system or for a publication, it depends on your context. If you have surpassed the typical doctor level, you might still want to deploy the model or publish your result.


As long as your avoidable bias (the difference between human-level error and model error on the training set) is larger than your variance (the difference between training error and dev error), you should focus on bias, and vice versa.


As models perform better and get close to human level, it becomes more critical to set the correct "Bayes" error estimate in order to figure out which one (bias or variance) to focus on next.


Remember: don't just measure bias as the difference from 0% error. Instead, use human-level error as a proxy for Bayes error whenever possible.
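
To illustrate how much the choice of "human level" matters near the top, the sketch below tries each benchmark from the medical example as the Bayes-error proxy. The training and dev errors (0.7% and 0.8%) are hypothetical values chosen for illustration; errors are stored in basis points (hundredths of a percent) to keep the arithmetic exact.

```python
# Human-level errors from the medical example, in basis points (1 bp = 0.01%).
HUMAN_LEVEL_BP = {
    "typical human": 300,
    "typical doctor": 100,
    "experienced doctor": 70,
    "team of experienced doctors": 50,
}

# Hypothetical model errors (assumptions, not from the notes): 0.7% and 0.8%.
TRAIN_BP, DEV_BP = 70, 80


def focus_for(bayes_proxy_bp, train_bp, dev_bp):
    """Pick 'bias' or 'variance' given a proxy for Bayes error."""
    avoidable_bias = train_bp - bayes_proxy_bp
    variance = dev_bp - train_bp
    return "bias" if avoidable_bias > variance else "variance"


for name, proxy_bp in HUMAN_LEVEL_BP.items():
    print(f"{name}: focus on {focus_for(proxy_bp, TRAIN_BP, DEV_BP)}")
```

Only the team-of-doctors benchmark (0.5%) reveals that there is still avoidable bias left; with any of the weaker benchmarks the same model looks like a pure variance problem.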


## Part 4. Surpassing Human-level Performance

Example 1:

- Team of humans: 0.5%
- One human: 1%
- Training error: 0.6%
- Dev error : 0.8%

So your avoidable bias is 0.1% (0.6% - 0.5%) and your variance is 0.2% (0.8% - 0.6%).


Example 2: 


- Team of humans: 0.5%
- One human: 1%
- Training error: 0.3%
- Dev error : 0.4%

Now it is harder to decide whether we should focus on bias or variance, because the model is already better than a team of humans: we no longer have a good estimate of Bayes error, so we are not sure whether the bias can be reduced further.
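
The difference between the two examples shows up directly in the arithmetic. In this sketch (errors in basis points, i.e. hundredths of a percent, to keep the numbers exact) a negative "avoidable bias" signals that the human benchmark has stopped being a usable proxy for Bayes error; the function name `estimate_gaps` is hypothetical.

```python
def estimate_gaps(best_human_bp, train_bp, dev_bp):
    """Return (avoidable_bias, variance, proxy_still_useful) in basis points."""
    avoidable_bias = train_bp - best_human_bp
    variance = dev_bp - train_bp
    # Once the model beats the best humans, the proxy no longer bounds Bayes error.
    proxy_still_useful = avoidable_bias >= 0
    return avoidable_bias, variance, proxy_still_useful


# First case: team 0.5%, training 0.6%, dev 0.8%.
print(estimate_gaps(50, 60, 80))
# Second case: team 0.5%, training 0.3%, dev 0.4%.
print(estimate_gaps(50, 30, 40))
```

In the second case the computed bias is -0.2%, which says nothing about how far the model still is from the true Bayes error.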


Examples where ML surpasses human-level performance:

- Online advertising
- Product recommendations
- Logistics (predicting transit time)
- Loan approvals

All four examples have the following characteristics:

- Learning from structured data.
- Not natural perception problems such as vision (which humans are good at).
- A lot more data than a human can look at.


Besides these, there are some speech recognition, image recognition, and narrow radiology tasks on which ML can surpass humans.

## Part 5. Improving your Model Performance

Guidelines bringing everything together.

Two fundamental assumptions:

- You can fit the training set pretty well -> low avoidable bias
- The training set performance generalizes well to dev/test -> low variance


Reducing (avoidable) bias and variance:

Avoidable bias:

- Train bigger model
- Train longer / use better optimization algorithms
- NN architecture/hyperparam search


Variance:

- More training data 
- Regularization: L2, dropout, data augmentation
- NN architecture/hyperparam search

