# Error Analysis

## Part 1. Carrying out Error Analysis

If performance is still below human level, you can carry out manual error analysis to figure out where the errors come from.


Example: Cat classifier fails on certain dog pictures.

Opportunity sizing: before investing heavily in labeling more dog pictures, do the following:

- Get ~100 mislabeled dev set examples.
- Count up how many are dogs.

If only 5% of them are dogs, then fixing every dog mistake would at best bring a 10% error rate down to 9.5%, which might not be very impactful.

This approach gives you a "ceiling" on the size of the opportunity.


If 50 of them are dogs, then your error might go all the way down to 5%!!
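The ceiling arithmetic above can be sketched as follows. This is a minimal illustration that assumes the 10% overall dev error from the example; `error_ceiling` is a hypothetical helper name, not something from the course.

```python
def error_ceiling(overall_error, category_fraction):
    """Best-case overall error if every mistake in one category were fixed.

    overall_error: current dev error as a fraction (0.10 means 10%).
    category_fraction: share of the sampled mistakes that fall in the category.
    """
    return overall_error * (1.0 - category_fraction)


# 5 of 100 sampled mistakes are dogs: the best case is a small improvement.
few_dogs = error_ceiling(0.10, 5 / 100)
# 50 of 100 sampled mistakes are dogs: the best case halves the error.
many_dogs = error_ceiling(0.10, 50 / 100)
print(f"{few_dogs:.3f} {many_dogs:.3f}")
```

The result is only an upper bound on the gain: it assumes every mistake in the category gets fixed, which rarely happens in practice.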


If you are building applied systems, this kind of manual work is very valuable. It is totally fine to do it!!


**Evaluate multiple ideas in parallel**

Have a spreadsheet where each row is a misclassified sample and each column is one idea:

- Fix pictures of dogs
- Fix great cats (lions, panthers) being misrecognized
- Blurry images 
- ...


While going over each sample, simply annotate which categories the error belongs to (sometimes multiple categories, for example a blurry dog image or a rainy dog image).
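A spreadsheet like this boils down to counting tags per category. Here is a small sketch with made-up rows; the data and the `category_shares` name are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical error-analysis sheet: one row per misclassified dev example,
# tagged with every category that applies (a row can carry several tags).
rows = [
    {"image": 1, "tags": ["dog"]},
    {"image": 2, "tags": ["blurry"]},
    {"image": 3, "tags": ["dog", "blurry"]},
    {"image": 4, "tags": ["great cat"]},
    {"image": 5, "tags": []},  # none of the listed categories fits
]


def category_shares(rows):
    """Return the fraction of rows tagged with each category."""
    counts = Counter(tag for row in rows for tag in row["tags"])
    return {tag: n / len(rows) for tag, n in counts.items()}


shares = category_shares(rows)
# Print categories from largest to smallest share.
for tag, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: {share:.0%}")
```

Because one sample can belong to several categories, the shares can add up to more than 100%.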


Based on this analysis, you have a better idea of what to improve in the next iteration. If you have multiple big opportunities, you can even have multiple teams working in parallel: one fixes the dog issue, the other fixes the blurry image issue (this reminded me of how the Siri team worked on many initiatives before launch to bring the error rate down).

## Part 2. Cleaning up Incorrectly labeled data

Is it worthwhile to put a lot of effort into fixing incorrectly labeled data?


Random errors: Annotators made some random mistakes.


Systematic errors: Annotators always labeled "white dogs" as cats. 

Training set: DL algorithms are quite robust to _random_ errors in the training set. It is very common for famous models to be trained on datasets that contain some incorrectly labeled examples, BUT systematic errors would cause issues!

**How about dev/test?** 

- Add an "incorrectly labeled?" column in your error analysis to estimate the % of occurrence for these. 
    
- If, say, only 6% of a 10% overall error (0.6%) comes from incorrect labels, it might not be worth the effort. Focus on the remaining 9.4%.

- Suppose your overall error went down to 2% while incorrect labels still cause 0.6%. Now incorrect labels account for 30% of your overall error, so it might be a better time to fix the labeling issues.
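The two scenarios above are the same division with different denominators. A minimal sketch, assuming the label-induced error stays at 0.6% of all dev examples in both cases; `label_error_share` is a hypothetical name.

```python
def label_error_share(overall_error, label_error):
    """Fraction of the overall dev error that is caused by incorrect labels."""
    return label_error / overall_error


# Incorrect labels cause 0.6% error in both scenarios.
early = label_error_share(0.10, 0.006)  # overall error is 10%
late = label_error_share(0.02, 0.006)   # overall error is 2%
print(f"{early:.0%} {late:.0%}")
```

The absolute number of bad labels did not change; they just became a much larger part of what is left to fix.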

Some guidelines around this:

- Apply the same process to your dev and test sets to ensure they still have the same distribution.
- Consider examining examples your algorithm got RIGHT as well. If you only fix the samples it got wrong, your algorithm gets a bit of an unfair advantage. This is not easy: if the model has 98% accuracy, there are far more correctly predicted samples to check.
- Remember that train and dev/test data may now come from slightly different distributions (welcome to data mismatch issues). Training is fairly robust to this, so it could be fine.


**Some other wisdom**:

- In the DL era, there is a tendency to simply train the model, get the aggregated results, and assume you are done if performance is satisfactory.
- However, in applied deep learning there is more hand engineering and manual error analysis than DL practitioners would like to acknowledge.
- Some people hesitate to do manual error analysis and read the data (it is less exciting than coding etc.). But this is actually super important: put in some hours and look at the errors manually to understand the next steps. This is critical and Andrew still does it in his projects.

## Part 3. Build your first system quickly, then iterate

Speech recognition example. There are many directions you could work on to improve accuracy:

- Stuttering
- Noisy background
- Accented speech
- Far from microphone


For any ML application, there could be 50 different directions. Which one should you prioritize for your specific use case?

Solution: Set up dev/test set and metric, build initial system quickly, use bias/variance analysis + error analysis to prioritize next steps.

So the important guideline: build your first system quickly, then iterate!!! Do not overcomplicate your initial solution.

# Mismatched Training and Dev/Test Set

## Part 1. Training and Testing on Different Distributions

DL models are hungry for MORE data. Some teams add data that is not from the same distribution as dev/test to increase their training set size.

Examples

- Data from webpages: 200k
- Data from mobile app: 10k (dev/test set)

On the internet you might have a huge amount of data, but only a little data from the mobile app (which is what you ultimately care about).

So what can you do?


(bad option) Option 1: Random shuffling

- Combine both datasets, randomly shuffle them, and split into train/dev/test. Your training set will be mostly web data with a little mobile app data.
- The main disadvantage is that there will be very few mobile app samples in dev and test.
- With 5k dev + test examples in total, only about 238 would be expected to come from the mobile app pictures....
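The 238 figure follows from the mobile share of the combined pool. A quick check, assuming dev and test together hold 5,000 examples (2.5k each); the function name is hypothetical.

```python
def expected_mobile_in_eval(n_web, n_mobile, n_eval):
    """Expected number of mobile examples in a random slice of size n_eval."""
    return n_eval * n_mobile / (n_web + n_mobile)


# 200k web + 10k mobile, shuffled together, with 5k held out for dev + test.
dev_plus_test = expected_mobile_in_eval(200_000, 10_000, 5_000)
print(round(dev_plus_test))
```

In other words, fewer than 5% of the evaluation examples would reflect the distribution you actually care about.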


(good option) Option 2:
- Training set: all web data (200k) + 5k from the mobile app
- Test: 2.5k mobile app
- Dev: 2.5k mobile app
- This way we evaluate our model only on relevant data!!
- The main disadvantage is that your training data comes from a different distribution than your dev/test data.
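Option 2 can be sketched as a simple slicing step. This is an illustration only: the `split_option_2` name is hypothetical, the datasets are stand-in tuples rather than real images, and the mobile list is assumed to be shuffled already.

```python
def split_option_2(web, mobile, n_mobile_train=5_000, n_dev=2_500):
    """Train on all web data plus some mobile data; evaluate on mobile only."""
    train = list(web) + list(mobile[:n_mobile_train])
    dev = list(mobile[n_mobile_train:n_mobile_train + n_dev])
    test = list(mobile[n_mobile_train + n_dev:])
    return train, dev, test


# Stand-in datasets: (source, index) tuples instead of real images.
web = [("web", i) for i in range(200_000)]
mobile = [("mobile", i) for i in range(10_000)]

train, dev, test = split_option_2(web, mobile)
print(len(train), len(dev), len(test))
```

Dev and test never see web data, so the metric you optimize tracks the mobile app use case.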


## Part 2. Bias and Variance with Mismatched Data Distributions 

The way you analyze bias and variance changes when your training and dev/test sets come from DIFFERENT distributions.

Assume humans get 0% error and train and dev are from different distributions.

- Training error 1%.
- Dev error 10%.
- In the above case we cannot be sure this is a variance issue, because the training data might simply have been easier (contained more easy examples). Two things changed at once:
    - The algorithm saw the training data but not the dev data.
    - The distributions of training and dev data are different.
    
We don't know which of the above two things caused the issue.


**Solution:**

Create a new set: a training-dev set, which has the same distribution as the training set but is NOT used for training. The critical thing here is that it comes from the same distribution AND was not seen during training.

Now if train error is 1%, train-dev error is 9%, and dev error is 10%, this points to a variance issue: the error jumps on unseen data even though train-dev comes from the same distribution as the training set.


If we had train error 1%, train-dev 1.5%, and dev 10%, this is a data mismatch problem and requires a different solution.



**General principles** 

Main things to look at:

- Human-level error (a proxy for Bayes error)
- Training set error
- Train-dev error
- Dev error
- Test error

These are enough to determine all of the below:
- Avoidable bias
- Variance
- Data Mismatch
- Degree of overfitting to the dev set
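Each of these four quantities is the gap between two consecutive error numbers. A minimal sketch, assuming errors are given as fractions; `diagnose` is a hypothetical helper and the 10% test error is made up for the example.

```python
def diagnose(human, train, train_dev, dev, test):
    """Break the error numbers into the four gaps between consecutive sets."""
    return {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
        "dev_overfitting": test - dev,
    }


# Train 1%, train-dev 1.5%, dev 10%; the test error is an assumed value.
gaps = diagnose(human=0.0, train=0.01, train_dev=0.015, dev=0.10, test=0.10)
largest = max(gaps, key=gaps.get)
print(largest)
```

The largest gap suggests where to focus next, though in practice several gaps can be worth attacking at once.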


Note: If you overfit to the dev set, it might be a good idea to increase the dev set size.


Sometimes the train error is actually HIGHER than the dev error. It could be that the training data is harder to classify than your downstream task (which dev/test comes from).


## Part 3. Addressing Data Mismatch

(Honestly, there are no super systematic ways to deal with the data mismatch problem.)


- Carry out manual error analysis to understand the difference between training and dev/test (look at the dev set only; don't do error analysis on the test set, of course).

For example, you may see that dev contains many examples with car noise compared to training.

- See if you can make the training data more similar to dev/test, or collect more data similar to dev/test.

**Synthetic Data**

One way to achieve this is synthetic data: take a clean recording of a normal sentence and a recording of car noise, and simply add them together to get synthesized in-car audio...


Some nuance about synthetic data generation:

- Imagine you have 10k hours of clean speech audio (someone reading a normal sentence).
- And only 1 hour of car noise.
- If you simply repeat the car noise 10k times to cover all the speech, your model might actually overfit to that 1 hour of car noise.
- And that 1 hour of car noise might cover only a small portion of all possible in-car noise.
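The "add them together" step can be sketched on raw sample lists. This is a toy illustration under stated assumptions: both clips share a sample rate, the signals are short made-up lists, and `add_noise` and `noise_gain` are hypothetical names.

```python
import random


def add_noise(speech, noise, noise_gain=0.5, rng=random):
    """Overlay a randomly positioned window of `noise` onto `speech`.

    Both inputs are lists of float samples at the same sample rate.
    The noise clip is looped when it is shorter than the speech clip.
    """
    start = rng.randrange(len(noise))
    return [
        s + noise_gain * noise[(start + i) % len(noise)]
        for i, s in enumerate(speech)
    ]


# Toy signals standing in for real audio.
speech = [0.0, 0.2, 0.4, 0.2, 0.0, -0.2]
noise = [0.1, -0.1, 0.05]

mixed = add_noise(speech, noise, rng=random.Random(0))
print(len(mixed))
```

A random offset changes which part of the noise clip each sentence gets, but it adds no new noise: with only 1 hour of car noise, the model can still overfit to that hour.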


Challenges:

- It is difficult for the human ear to tell different in-car noises apart, so the lack of variety is easy to miss...
- Your synthetic data might only represent a small set of all possible cases.


**Example: Car detection**


If we use synthetically generated car images, these artificial images might represent only a tiny portion of all possible car appearances. To humans they all look like different cars, but from the model's perspective they might not be representative of all cars.



# Learn from Multiple Tasks

## Part 1. Transfer Learning

Example: reuse a cat classifier model for radiology diagnosis.

- Remove the last layer of your classifier.
- Add a new last layer and train it on the radiology diagnosis dataset.

Sometimes you might want to freeze all hidden layers (for example, if your new dataset is small). If you have a lot of data for the new task, you might retrain the whole model.

It is useful because low-level features are transferable across tasks.

Sometimes you might also want to add multiple layers for the new task...
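The freeze-and-retrain recipe can be sketched in plain Python. This is a toy stand-in, not a real image model: `FROZEN_W` plays the role of pretrained hidden-layer weights, the four data points stand in for a small radiology dataset, and every name here is an assumption for illustration.

```python
import math

# Hidden-layer weights standing in for a pretrained (cat) model; kept frozen.
FROZEN_W = [[0.5, -0.3], [-0.2, 0.8], [0.1, 0.4]]


def features(x):
    """Frozen hidden layer: tanh of a linear map of the input."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def predict(head, x):
    """New output layer: logistic regression on top of the frozen features."""
    h = features(x)
    z = head["b"] + sum(w * hi for w, hi in zip(head["w"], h))
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(head, data):
    """Average binary cross-entropy over (x, y) pairs."""
    total = 0.0
    for x, y in data:
        p = predict(head, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def train_head(data, steps=200, lr=0.5):
    """Full-batch gradient descent that updates only the new head."""
    head = {"w": [0.0] * len(FROZEN_W), "b": 0.0}
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(FROZEN_W), 0.0
        for x, y in data:
            err = predict(head, x) - y  # gradient of the loss w.r.t. the logit
            h = features(x)
            grad_w = [g + err * hi for g, hi in zip(grad_w, h)]
            grad_b += err
        head["w"] = [w - lr * g / len(data) for w, g in zip(head["w"], grad_w)]
        head["b"] -= lr * grad_b / len(data)
    return head


# Tiny stand-in for the small target-task dataset.
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
untrained = {"w": [0.0, 0.0, 0.0], "b": 0.0}
head = train_head(data)
print(mean_loss(untrained, data) > mean_loss(head, data))
```

With plenty of target data, the same loop could also update `FROZEN_W`, which corresponds to retraining the whole model.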



When does transfer learning make sense?

- When the source task (A) has a lot of data.
- Target task (B) data is limited.
- Input is of the same type (both audio, both images, etc.).
- Low-level features from A could be helpful for learning B (this is kind of a human assumption).

If you had the opposite case (radiology has 100k samples and the cat classifier has 100), it doesn't make sense to transfer from the cat classifier.


## Part 2. Multi-task Learning

Learn from multiple tasks at the same time!! Transfer learning was sequential, this is simultaneous.


Typical example is autonomous driving. You have so many tasks:

- Pedestrian detection
- Car detection
- Stop signs
- Traffic lights
- ...

Some images will contain several of these, so the label (y) might look like this: $[0,1,1,0]$. This is multi-hot encoding.

Model for predicting all at once:

$$
Cost = \dfrac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{4} Loss\left(\hat{y}_j^{(i)}, y_j^{(i)}\right)
$$

- Sum the loss over the 4 components.

- The main difference from the softmax loss is that we don't have a single label per sample. So the probabilities in this (4,1) vector shouldn't sum up to 1!!

- Such a model is a multi-task learning model: one network does what could otherwise be 4 separate models.


**Interesting Note.**

- Multi-task learning would still work even if you have some missing labels.
- Assume some samples have annotations for pedestrian and car only (those annotators did not have the other two classes), and others have the first two missing.
    - Example annotation: $[1,0,?,?]$ (pedestrian, no car, and missing annotation for the last two) or $[?,?,1,1]$.
- You can use this data for multi-task learning even though all samples have missing annotations.
- The only thing you change is to skip the loss term for $j$ if $y_j$ (the annotation for that class) is missing.
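Skipping missing labels can be sketched as a masked sum. This is a minimal illustration that assumes binary cross-entropy as the per-task loss and uses `None` for a missing annotation; the function names and the predictions are made up.

```python
import math


def masked_multitask_loss(y_pred, y_true):
    """Sum binary cross-entropy over labeled tasks; skip missing labels."""
    total = 0.0
    for p, y in zip(y_pred, y_true):
        if y is None:  # annotation missing for this task
            continue
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def cost(batch_pred, batch_true):
    """Average the per-example masked loss over the m examples."""
    m = len(batch_true)
    per_example = (masked_multitask_loss(p, y) for p, y in zip(batch_pred, batch_true))
    return sum(per_example) / m


# Task order: pedestrian, car, stop sign, traffic light.
batch_true = [[1, 0, None, None], [None, None, 1, 1]]
batch_pred = [[0.9, 0.2, 0.5, 0.5], [0.3, 0.7, 0.8, 0.6]]

value = cost(batch_pred, batch_true)
print(f"{value:.4f}")
```

Predictions for unlabeled tasks contribute nothing to the cost, so they produce no gradient for those outputs on that example.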



#### When does multi-task learning make sense?

- Low level features are shared across the tasks.
- Usually the amount of data you have for each task is quite similar.
- If you have 100 tasks with 1k samples each, then for any single task, training on all tasks adds the 99k samples from the other 99 tasks, which is roughly a 100x increase in data!!
- You can train a big enough neural network to do well on all the tasks. It will probably take longer and your model needs to be larger.

It is used much less often than sequential transfer learning. Computer vision is one of the main applications (object detection of all object types).



# End-to-end Deep Learning

## Part 1. What is e2e deep learning?

Multiple stages of machine learning -> Only one DNN to handle everything



Traditional Speech Recognition

- Convert audio to some low-level features (using MFCC)
- ML model to extract phonemes
- Another tool to get words
- Finally get the transcript as the prediction (Y)


E2E approach

- Raw audio to transcript (Y)

The traditional approach makes sense when we lack data. With a lot of raw data, the E2E approach might be more suitable.



**Example 2.** Face Recognition

The best approach is actually a multi-step approach.

- Raw image -> face detection model extracts the face only
- Cropped face -> identity model (gets only the cropped face as input, not the full camera image)

This makes sense because a one-step approach would be very complicated, and the data covering all camera angles etc. is sparse. For the two-step approach:

- Each task has a lot of data available
- Each task is relatively easy

In contrast, for the one-step approach:

- Very limited data that maps the raw camera image directly to the identity
- Much harder to get identity from a far-away camera shot.



**Example 3.** Machine Translation

- We have a lot of pairs of texts (English-> French)
- For this task, E2E makes sense and performs well!!



## Part 2. Whether to use E2E Learning

Pros: 

- Let the data speak. A pure ML approach might capture patterns that outperform human-defined concepts. E.g., forcing models to learn phonemes is restrictive.
- Less hand-designing of components needed 


Cons: 

- May need a large amount of data.
- Potentially useful hand-designed insights are excluded.


When you have a ton of data, hand-designed knowledge is less critical. For some tasks, such hand-designed features might actually be super helpful.


So key question: do you have sufficient data to learn a function of the complexity needed to map X->Y? 

Here sufficient and complexity are hard to define in a mathematical way of course.


Andrew finished by illustrating that autonomous driving consists of many distinct tasks:

- Object detection
- Route planning 
- Steering


Each step is a different system. E.g., route planning is conventional software, not a machine learning model. Likewise, steering is also simply software.


Sometimes E2E is exciting but doesn't really make sense in practical applications.

E.g., given an input image -> a DNN directly outputs the steering command hehe