# Data Definition and Baseline
This week is all about working with different data types and ensuring label consistency for classification problems. This leads to establishing a performance baseline for your model and discussing strategies to improve it given your time and resource constraints.

### Why is data definition hard?
   - 1. Define data and establish baseline
   - 2. Label and organize data
   
# 1. Define Data and Establish Baseline

## Experiment Tracking
- Rather than worrying too much about exactly which experiment tracking framework to use, the main takeaway from this section is to have some system for keeping track of your experiments, even if it's just a text file or a spreadsheet, and to include as much information as is convenient to include.
- **What to track:**
    - Algorithm/code versioning
    - Dataset used
    - Hyperparameters
    - Results
- **Tracking tools:**
    - Text files (does not scale well)
    - Spreadsheets (scale much further, especially shared spreadsheets)
    - Experiment tracking system
- **Desirable features:**
    - Information needed to replicate results
        - Does your learning algorithm pull data off the internet? This can make experiments less reproducible.
    - Experiment results, ideally with summary metrics/analysis
    - Perhaps also: Resource monitoring, visualization, model error analysis
- The space of experiment tracking systems is still evolving rapidly, so there is a growing set of tools out there. Some examples include:
    - Weights & Biases (W&B)
    - Comet
    - MLflow
    - SageMaker Studio
    - Landing.AI $\rightarrow$ focuses on computer vision and manufacturing applications
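
The "what to track" list above can be covered by something as simple as a shared CSV file. The following is a minimal sketch of that idea, not a recommended tool: the function name `log_experiment`, the column names, and the file layout are all assumptions made for illustration.

```python
import csv
import json
import os
from datetime import datetime, timezone

# Columns mirror the "what to track" list; the names are illustrative assumptions.
FIELDS = ["timestamp", "code_version", "dataset", "hyperparameters", "results"]


def log_experiment(path, code_version, dataset, hyperparameters, results):
    """Append one experiment record to a CSV file, writing a header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "code_version": code_version,  # e.g. a git commit hash
            "dataset": dataset,            # e.g. a dataset name or snapshot ID
            # Dicts are stored as JSON strings so each run fits on one row.
            "hyperparameters": json.dumps(hyperparameters, sort_keys=True),
            "results": json.dumps(results, sort_keys=True),
        })
```

A call might look like `log_experiment("experiments.csv", "abc123", "train_v1", {"lr": 0.01}, {"accuracy": 0.9})`. Recording the dataset version alongside the code version is what makes a run replicable if the underlying data later changes.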

### From big data to good data
- Try to ensure consistently high-quality data in all phases of the ML project lifecycle.
- Good data:
    - Covers important cases (good coverage of inputs `x`)
    - Is defined consistently (definition of labels `y` is unambiguous)
    - Has timely feedback from production data (distribution covers data drift and concept drift)
    - Is sized appropriately

### More label ambiguity examples
- A common application in many large companies is user ID merge
    - **User ID merge:** When you have multiple data records that you think correspond to the same person and you want to merge these user records together.
    - One scenario where this commonly occurs is when one company purchases or merges with another company and a user has accounts with each (often with non-identical information in each).
    - One approach to the User ID merge problem is to use a supervised ML algorithm that takes two user data records as input and outputs either one or zero based on whether it thinks the two records belong to the same user.
- Other examples with ambiguous ground truths: 
    - Is an account a spam/fake/bot account?
    - Is an online purchase fraudulent?
    - A job/resume website trying to predict whether a user is actively looking for a job or not
    - Structuring text transcription
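
To make the User ID merge framing concrete, here is a rough sketch of how a pair of records could be turned into an input `x` for a binary classifier. The record fields (`name`, `email`, `zip_code`) and the choice of string similarity are assumptions for illustration only; the label `y` (same user or not) would still come from human judgment, which is where the ambiguity arises.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def pair_features(record_a, record_b):
    """Turn two user records into a feature vector for a same-user classifier."""
    return [
        similarity(record_a["name"], record_b["name"]),
        similarity(record_a["email"], record_b["email"]),
        1.0 if record_a["zip_code"] == record_b["zip_code"] else 0.0,
    ]


# Hypothetical records from two merged companies.
a = {"name": "Jane Doe", "email": "jane.doe@example.com", "zip_code": "94000"}
b = {"name": "jane doe ", "email": "jane.doe@example.com", "zip_code": "94000"}
x = pair_features(a, b)  # input x; the target y is 1 (same user) or 0 (different)
```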


#### Data definition questions
- When defining the data for your learning algorithm, here are some important questions:
    - What is the input, `x`?
        - Lighting? Contrast? Resolution?
        - What features need to be included?
    - What is the target label, `y`?
        - How can we ensure labelers give consistent labels?

### Major Types of Data Problems
- The best practices for organizing data for one type of machine learning project can be quite different from the best practices for a totally different type.
- Below, the definition of "small" and "large" datasets is a bit arbitrary and may vary for your own projects:

<img src='img/1.png' width="600" height="300" align="center"/>

- **For a lot of unstructured learning problems:**
     - **Humans** can help with labeling
     - Data augmentation, such as synthesizing your images or synthesizing your audio, can help.
- **For structured learning problems:**
    - Harder to obtain more data
    - Harder to use data augmentation
- **For relatively small datasets:**
    - Having clean and consistent labels is critical
- **For relatively large datasets:**
    - Emphasis on data process
    
### Unstructured vs. structured data
#### Unstructured data
   - May or may not have huge collection of unlabeled examples `x`.
   - Humans can label more data.
   - Data augmentation more likely to be helpful.
#### Structured data
   - May be more difficult to obtain more data.
   - Human labeling may not be possible (with some exceptions). 
   
### Small data vs. big data
- The split below uses a slightly arbitrary threshold of whether you have more or fewer than 10,000 training examples.
#### Small data
   - Clean labels are critical.
   - Can manually look through dataset and fix labels.
   - Can get all the labelers to talk to each other. 
#### Big data
   - Emphasis on data process.


- This classification (unstructured vs. structured and small vs. big data) can help predict not just whether data processes generalize from one problem to another, but also whether other machine learning ideas do.
- Tip: If you are working on a problem from one of the four quadrants above, advice from someone who has worked on projects in the same quadrant will usually be more applicable than advice from someone who has worked in a different quadrant.

## Small data and label consistency
- When you have a small dataset, it can be hard to fit a function confidently to the data
- Because many AI practices were originally established by large companies with massive datasets and data warehouses, practices for handling small datasets have received less emphasis than is needed for problems where you have only a thousand examples (or fewer) rather than a hundred million.
- However, if you have a small dataset and also clean and consistent labels, you can much more confidently fit a function to your (small) dataset. 

#### Big data problems can have small data challenges too
- Problems with a large dataset but a **long tail of rare events** in the input will have small data challenges too. Examples:
    - Web search
    - Self-driving cars
    - Product recommendation systems
    

### Improving label consistency
- Have multiple labelers label the same example
- Have the same labeler label the same example multiple times
- When there is disagreement, have the machine learning engineer (MLE), subject matter expert (SME), and/or labelers discuss the definition of `y` to reach agreement.
- If labelers believe that `x` doesn't contain enough information, consider changing `x`.
- Iterate until it is hard to significantly increase agreement.
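
One way to tell whether agreement is still increasing between iterations is to measure it. The sketch below assumes every labeler has labeled the same examples in the same order, and uses plain pairwise agreement (not a chance-corrected statistic); the function names and the data are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels_by_labeler):
    """Fraction of (example, labeler pair) comparisons where both labels match."""
    agree = total = 0
    for a, b in combinations(list(labels_by_labeler), 2):
        for ya, yb in zip(labels_by_labeler[a], labels_by_labeler[b]):
            agree += ya == yb
            total += 1
    return agree / total


def disputed_examples(labels_by_labeler):
    """Indices of examples that received more than one distinct label."""
    columns = zip(*labels_by_labeler.values())
    return [i for i, col in enumerate(columns) if len(set(col)) > 1]


# Hypothetical labels: three labelers, four examples each.
labels = {
    "labeler_1": [1, 0, 1, 1],
    "labeler_2": [1, 0, 0, 1],
    "labeler_3": [1, 1, 0, 1],
}
agreement = pairwise_agreement(labels)  # 8 of 12 comparisons agree
to_discuss = disputed_examples(labels)  # examples at index 1 and 2
```

The disputed examples are the natural agenda for the discussion about the definition of `y`; re-running the measurement after each round shows when agreement has plateaued.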

#### Examples
- Standardize labels
- Merge classes 
    - "deep scratch" + "shallow scratch" = "scratches"
- Have a class/label to capture uncertainty
    - Defect: 0 or 1
    - Alternative: `0`, `Borderline`, `1`
    - Unintelligible audio: `[unintelligible]`
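
Standardizing labels and merging classes can often be applied mechanically once the team has agreed on a definition. A minimal sketch, assuming a hypothetical mapping table `CANONICAL` that the team maintains:

```python
# Hypothetical mapping from raw labeler output to the agreed label set.
CANONICAL = {
    "deep scratch": "scratches",
    "shallow scratch": "scratches",
    "scratches": "scratches",
}


def standardize(raw_label):
    """Normalize case and whitespace, then merge fine-grained classes."""
    key = " ".join(raw_label.lower().split())
    # Labels with no entry in the mapping pass through in normalized form.
    return CANONICAL.get(key, key)
```

For instance, `standardize("Deep  Scratch ")` and `standardize("shallow scratch")` both map to `"scratches"`.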
    
#### Small data vs. big data (unstructured data)
   - **Small data:**
       - Usually small number of labelers
       - Can ask labelers to discuss specific labels
   - **Big data:**
       - Get to consistent definition with a small group
       - Then send labeling instructions to labelers
       - Can consider having multiple labelers label every example and using voting or consensus labels to increase accuracy. $\Leftarrow$ This can be overused and should be a last resort.
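
Voting over multiple labels per example can be sketched as follows. The function name and the `min_fraction` parameter are assumptions for illustration; returning `None` on a tie is one possible design choice, flagging the example for review instead of picking a label arbitrarily.

```python
from collections import Counter


def consensus_label(votes, min_fraction=0.5):
    """Return the majority label, or None if no label exceeds min_fraction of votes."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_fraction else None
```

With three labelers voting `[1, 1, 0]` this returns `1`; with a split vote `[1, 0]` it returns `None`.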
    

### Human Level Performance (HLP)
- HLP can provide a helpful baseline performance (in terms of what might be possible, especially with unstructured data), but it is also sometimes misused. 
#### Why measure HLP?
- Estimate Bayes error/irreducible error to help with error analysis and prioritization. 
- One question that is not often asked is: What exactly is this "Ground Truth Label"? $\Rightarrow$ Are we really measuring what is possible, or are we measuring how well two people agree with each other (since ground truth is often also determined by a human)?
- In academia, HLP is often used as a respectable benchmark, but:
    - This is partly because it has proven to be a tried-and-true formula for establishing the academic significance of a piece of work.
- However, businesses need systems that do more than just do well on average test set accuracy.
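
The point about measuring agreement between two people can be illustrated with a small calculation. This sketch assumes a hypothetical situation where two labeling conventions are equally valid and each labeler independently picks one of them with fixed probabilities; the 70/30 split is made up for illustration.

```python
def expected_human_agreement(label_probs):
    """Chance that two labelers, choosing independently with these probabilities, agree."""
    return sum(p * p for p in label_probs)


def always_majority_agreement(label_probs):
    """Chance that always picking the most common label matches a single labeler."""
    return max(label_probs)


# Hypothetical: 70% of labelers use one convention, 30% use the other.
probs = [0.7, 0.3]
human = expected_human_agreement(probs)    # 0.7**2 + 0.3**2
model = always_majority_agreement(probs)   # 0.7
```

Under these assumptions the measured "human level" reflects inconsistency between labelers, and a model can exceed it simply by always choosing the more common convention, without being better at anything a user would care about.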

#### The problem with HLP
- The problem with using 