# Selecting and Training a Model
- Some best practices for building a machine learning model that is worthy of a production deployment
- Model-centric AI development vs Data-centric AI development

### Key Challenges
AI system = Code (algorithm/model) + Data
- It is often more efficient to spend more of your time improving the data than the code, because the data usually has to be much more customized to your problem
- When building a model, there are three key milestones that most models should aspire to accomplish:
    - 1. Doing well on training set (usually measured by average training error).
    - 2. Doing well on dev/test sets.
    - 3. Doing well on business metrics/project goals.
    
### Why low average error isn't good enough
#### Performance on disproportionately important examples
- Web search example:
    - **Informational and transactional queries:**
        - "Apple pie," "Wireless data plan," "Latest movies," "Diwali festival."
        - For informational and transactional queries, a web search engine wants to return the most relevant results, but users are willing to forgive (maybe) ranking the best result as number 2 or 3
    - **Navigational queries:**
        - "Stanford," "Reddit," "YouTube."
        - Here the user has a very clear desire to go to a particular place
        - When a user has a very clear navigational intent, they will tend to be very unforgiving if the web search engine does anything other than return the exact result
        - A web search engine that doesn't return the best results will very quickly lose the trust of its users.
        - So, navigational queries in this context are a disproportionately important set of examples
- The problem is that average test set accuracy tends to weight all examples equally, whereas in many cases some scenarios are disproportionately important. 
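
One way to make this concrete is to compare plain accuracy with an importance-weighted accuracy. The sketch below is illustrative only: the evaluation outcomes and the weight of 5 for navigational queries are assumptions chosen for the example, not values from the course.

```python
def accuracy(correct):
    """Plain accuracy: every example counts equally."""
    return sum(correct) / len(correct)


def weighted_accuracy(correct, weights):
    """Accuracy where each example contributes in proportion to its weight."""
    total = sum(weights)
    return sum(w for c, w in zip(correct, weights) if c) / total


# Hypothetical evaluation set: 8 informational queries (all answered well)
# and 2 navigational queries (both answered badly).
query_types = ["informational"] * 8 + ["navigational"] * 2
correct = [True] * 8 + [False] * 2

# Assumed importance weights: a navigational query counts 5x.
weights = [5.0 if t == "navigational" else 1.0 for t in query_types]

plain = accuracy(correct)
weighted = weighted_accuracy(correct, weights)
print(f"plain accuracy:    {plain:.3f}")
print(f"weighted accuracy: {weighted:.3f}")
```

The plain average looks acceptable, while the weighted figure exposes the failure on the queries users care most about. How to choose the weights is a product decision, not something the code can settle.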

#### Performance on key slices of the dataset
- **Example: ML for loan approval**
    - Make sure not to discriminate by ethnicity, gender, location, language, or other protected attributes
    - **Even if a learning algorithm for loan approval achieves high average test set accuracy, it would not be acceptable for production deployment if it exhibits an unacceptable level of bias or discrimination.**
- **Example: Product recommendations from retailers**
    - Be careful to treat fairly all major user-, retailer-, and product- categories/groups.
    
#### Rare classes
- Skewed data distributions
- Accuracy in rare classes

### Establish a baseline
- HLP (Human-Level Performance) is often a good point of comparison or a baseline that helps you decide where to focus your efforts

<img src='img/1.png' width="600" height="300" align="center"/>

#### Unstructured and structured data
- It turns out the best practices for establishing a baseline are quite different depending on whether you are using unstructured or structured data. 
- Because humans are so good at interpreting **unstructured data**, using HLP as a comparison is often a good way of establishing a baseline
- In contrast, because humans are not as good at looking at **structured (tabular) data** and making predictions, HLP is generally a less useful baseline.
- In general, ML best practices are typically very different depending on whether you're working with structured or unstructured data.

<img src='img/2.png' width="600" height="300" align="center"/>

#### Ways to establish a baseline
- Human level performance (HLP) $\rightarrow$ particularly for unstructured data problems. 
- Literature search for state-of-the-art/open source (to see what others are able to accomplish).
- Quick and dirty implementation 
- Performance of an older system (for example, if you already have a machine learning system running and are looking to replace it).

**Baseline helps to indicate what might be possible. In some cases (such as HLP) it also gives a sense of what is irreducible error/Bayes' error.**

#### Tips for getting started on modeling
- Search the literature to see what's possible (courses, blogs, open-source projects).
- Find open-source implementations if possible.
- **A reasonable algorithm with good data will often outperform a great algorithm with not so good data.**

#### Deployment constraints when picking a model
- Should you take into account deployment constraints (such as compute constraints) when picking a model?
    - **Yes**, if baseline is already established and goal is to build and deploy.
    - **No** (or not necessarily), if the purpose is to establish a baseline and determine what is possible and might be worth pursuing.

#### Sanity-check for code and algorithm
- Try to overfit a small training dataset before training on a large one (especially if the output is a complex output).
    - Example \#1: Speech recognition
    - Example \#2: Image segmentation
    - Example \#3: Image classification
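
The sanity check above can be sketched with a toy model. This is a minimal illustration, assuming a one-variable linear model trained by gradient descent stands in for the real network; the function name `train_linear` and the four data points are made up for the example.

```python
def train_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = 2 * sum(errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss


# Tiny training set that the model can fit exactly (y = 2x + 1).
small_x = [0.0, 1.0, 2.0, 3.0]
small_y = [1.0, 3.0, 5.0, 7.0]

w, b, final_loss = train_linear(small_x, small_y)

# If the training loss does not approach zero on a handful of examples,
# suspect a bug in the code before spending compute on the full dataset.
print(f"w={w:.4f}, b={b:.4f}, loss={final_loss:.2e}")
```

The same idea carries over to larger models: pick a handful of examples (or even one) and confirm the training loss can be driven close to zero.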

## Error Analysis and Performance Auditing
- Error analysis within a Jupyter notebook or Excel spreadsheet is common, but there are also emerging MLOps tools to make this process much more accurate, efficient, and in some cases, automated.

<img src='img/3.png' width="600" height="300" align="center"/>

- Error analysis is an iterative process

<img src='img/4.png' width="600" height="300" align="center"/>

#### Useful metrics for each tag
- What fraction of errors has that tag?
- Of all the data with that tag, what fraction is misclassified?
- What fraction of all the data has that tag?
- How much room for improvement is there on data with that tag?
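
The first three metrics can be computed directly from tagged examples. The sketch below assumes each example is a dict with a set of `tags` and a `correct` flag; that record layout and the sample counts are hypothetical. The fourth metric (room for improvement) needs a baseline such as HLP and is not computed here.

```python
def tag_metrics(examples, tag):
    """Summarize how one tag relates to the model's errors."""
    tagged = [e for e in examples if tag in e["tags"]]
    errors = [e for e in examples if not e["correct"]]
    tagged_errors = [e for e in tagged if not e["correct"]]
    return {
        # What fraction of errors has that tag?
        "fraction_of_errors_with_tag": len(tagged_errors) / len(errors) if errors else 0.0,
        # Of all the data with that tag, what fraction is misclassified?
        "error_rate_within_tag": len(tagged_errors) / len(tagged) if tagged else 0.0,
        # What fraction of all the data has that tag?
        "fraction_of_data_with_tag": len(tagged) / len(examples),
    }


# Hypothetical tagged dev set: 20 examples, 4 of them misclassified.
examples = (
    [{"tags": {"car_noise"}, "correct": False}] * 3
    + [{"tags": {"car_noise"}, "correct": True}] * 2
    + [{"tags": {"people_noise"}, "correct": False}] * 1
    + [{"tags": set(), "correct": True}] * 14
)

car = tag_metrics(examples, "car_noise")
print(car)
```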

### Prioritizing what to work on
- How can we use the above-described tags to prioritize what we work on?
- Besides "Gap to HLP," one other useful column to look at is **"% of data"**
- Using these two columns together, we see that it may be best to prioritize either/both **Clean Speech** and/or **People Noise**.

<img src='img/5.png' width="600" height="300" align="center"/>

#### Decide on most important categories to work on based on:
   - How much room for improvement there is
   - How frequently that category appears
   - How easy (or not) it is to improve accuracy in a particular category
   - How important it is to improve in that category
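
The first two factors can be combined into a rough estimate of potential gain: the gap to HLP multiplied by the share of data in that category. The numbers below are illustrative assumptions for this sketch, and the other two factors (ease and importance) still require judgment.

```python
# Hypothetical per-category statistics, all expressed as fractions.
categories = {
    "Clean Speech":  {"accuracy": 0.94, "hlp": 0.95, "share": 0.60},
    "Car Noise":     {"accuracy": 0.89, "hlp": 0.93, "share": 0.04},
    "People Noise":  {"accuracy": 0.87, "hlp": 0.89, "share": 0.30},
    "Low Bandwidth": {"accuracy": 0.70, "hlp": 0.70, "share": 0.06},
}


def potential_gain(stats):
    """Overall accuracy gained if this category were raised to HLP."""
    gap = max(stats["hlp"] - stats["accuracy"], 0.0)
    return gap * stats["share"]


ranked = sorted(categories, key=lambda name: potential_gain(categories[name]), reverse=True)

for name in ranked:
    print(f"{name:14s} potential gain: {potential_gain(categories[name]):.4f}")
```

With these assumed numbers, a category with a large gap but a tiny share of data (Car Noise) ranks below categories with smaller gaps and much more data.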
   

- There is no mathematical formula that will tell you what to work on, but by looking at these factors you'll hopefully be able to make some fruitful decisions. 

#### Adding/improving data for specific categories
- For categories you want to prioritize:
    - Collect more data
    - Use data augmentation to get more data
    - Improve label accuracy/data quality
    

### Skewed Datasets 
- Datasets where the ratio of positive to negative examples is very far from 50-50 are called **skewed data sets.**
- **Confusion Matrix:** Actual vs. Predicted
- **Precision:** $$\frac{TP}{TP + FP}$$
- **Recall:** $$\frac{TP}{TP+FN}$$

- The metrics of precision and recall are more useful than raw accuracy when it comes to evaluating the performance of learning algorithms on very skewed datasets
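
A small worked example shows why. The sketch below assumes a hypothetical dataset with 1% positives and a degenerate model that always predicts the negative class. Precision is undefined when nothing is predicted positive; returning 0.0 in that case is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # convention when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical skewed dataset: 990 negatives, 10 positives.
y_true = [0] * 990 + [1] * 10
always_negative = [0] * 1000

acc = sum(1 for t, p in zip(y_true, always_negative) if t == p) / len(y_true)
precision, recall = precision_recall(y_true, always_negative)
print(f"accuracy={acc:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

The model reaches 99% accuracy while finding none of the positive examples, which recall makes obvious.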

#### Combining precision and recall: $F_1$ score
- **The $F_1$ score is a common way of combining precision and recall that emphasizes whichever (of P or R) is worse.**
- One intuition behind the $F_1$ score is that you want an algorithm to do well on both precision and recall, and if it does worse on either precision or recall, that's pretty bad.
- The $F_1$ score is a **harmonic mean** (which is like taking an average, but putting an emphasis on whichever is the lower number).
- **$F_1$ score:** $$\frac{2}{\frac{1}{P}+\frac{1}{R}}$$
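
To see how the harmonic mean penalizes the weaker of the two numbers, the sketch below compares $F_1$ with a plain arithmetic mean. The two precision/recall pairs are hypothetical values picked for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, matching 2 / (1/P + 1/R)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


# Hypothetical models: A is balanced, B has high precision but poor recall.
p_a, r_a = 0.90, 0.80
p_b, r_b = 0.98, 0.10

f1_a = f1_score(p_a, r_a)
f1_b = f1_score(p_b, r_b)
mean_b = (p_b + r_b) / 2

print(f"model A: F1={f1_a:.3f}")
print(f"model B: F1={f1_b:.3f}, arithmetic mean={mean_b:.3f}")
```

The arithmetic mean makes model B look mediocre but usable, whereas its $F_1$ score stays close to its poor recall.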

<img src='img/6.png' width="600" height="300" align="center"/>

- **Note that for your application, you may have a different weighting between precision and recall, so the $F_1$ score isn't the only way to combine precision and recall**, it's just one metric that's commonly used for many applications.
- Precision and recall are not just useful for binary classification problems, but also for multi-class classification problems 
- You'll find in manufacturing that many factories will want high recall because you really don't want, for example, to let a phone go out that is defective. But if an algorithm has slightly lower precision that's okay because through a human re-examining the phone, they will hopefully figure out that the phone is actually okay. 
- By combining precision and recall with the $F_1$ score, this gives you a single evaluation metric for how well your algorithm is doing. 
- In the example below, if each of the four types of defects were very rare, then the accuracy would be extremely high for each category, even if the model was very bad at detecting any defects:

<img src='img/7.png' width="600" height="300" align="center"/>
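
When recall matters more than precision, as in the manufacturing example, one common alternative to $F_1$ is the $F_\beta$ score, which weights recall $\beta$ times as heavily as precision. This is a general-purpose metric offered here as a sketch, not something these notes prescribe; the choice $\beta = 2$ is an assumption for illustration.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    With beta = 1 this reduces to the F1 score.
    """
    if precision == 0 or recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical inspectors with mirrored strengths.
high_recall = f_beta(0.5, 1.0, beta=2)     # catches every defect, many false alarms
high_precision = f_beta(1.0, 0.5, beta=2)  # no false alarms, misses half the defects

print(f"high-recall model:    F2={high_recall:.3f}")
print(f"high-precision model: F2={high_precision:.3f}")
```

Under $F_1$ the two models would tie; with $\beta = 2$ the high-recall model scores higher, which matches the preference of a factory that does not want defective phones to ship.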

### Performance Auditing
- Even when your learning algorithm is doing well on accuracy or F1 score or some appropriate metric, it's often worth one last performance audit before you push it to production.
#### Auditing framework 
- Check for accuracy, fairness/bias, and other problems 
    - 1. Brainstorm the ways the system might go wrong.
        - Performance on subsets of data (e.g., ethnicity, gender).
        - How common are certain errors (e.g., FP, FN).
        - Performance on rare classes
    - 2. Establish metrics to assess performance against these issues on appropriate **slices of data.**
        - After establishing appropriate metrics, MLOps tools can also help trigger an automatic evaluation for each model to audit performance (e.g., TensorFlow Model Analysis, TFMA)
    - 3. Get business/product owner buy-in.

#### Speech recognition example
- 1. Brainstorm the ways the system might go wrong:
    - Accuracy on different genders and ethnicities.
    - Accuracy on different devices.
    - Prevalence of rude mis-transcriptions.
- 2. Establish metrics to assess performance against these issues on appropriate slices of data:
    - Mean accuracy for different genders and major accents.
    - Mean accuracy on different devices.
    - Check for prevalence of offensive words in the output
    - *The ways that a system might go wrong turn out to be very problem dependent: different industries and different tasks will have very different standards.*
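
Step 2 of the audit can be sketched as a per-slice accuracy report with a minimum threshold. The record layout, the device names, and the 0.8 threshold are assumptions for this example; dedicated tools such as TFMA offer this kind of sliced evaluation at scale.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Mean accuracy for each distinct value of `slice_key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r[slice_key]] += 1
        hits[r[slice_key]] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}


def flag_slices(slice_accuracy, min_accuracy):
    """Names of slices whose accuracy falls below the agreed threshold."""
    return sorted(k for k, acc in slice_accuracy.items() if acc < min_accuracy)


# Hypothetical evaluation records for a speech system.
records = (
    [{"device": "phone", "correct": True}] * 9
    + [{"device": "phone", "correct": False}] * 1
    + [{"device": "smart_speaker", "correct": True}] * 6
    + [{"device": "smart_speaker", "correct": False}] * 4
)

by_device = accuracy_by_slice(records, "device")
flagged = flag_slices(by_device, min_accuracy=0.8)
print(by_device, flagged)
```

The overall accuracy here is 75%, which hides the fact that one device type performs far worse than the other.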

## Data iteration
### Data-centric AI development

- **Model-centric view:** Take the data you have, and develop a model that does as well as possible on it
    - Most academic research in AI is model-centric because the benchmarked dataset is a fixed quantity
    - **Hold the data fixed and iteratively improve the code/model.**
- **Data-centric view:** The quality of the data is paramount. Use tools to improve the data quality; this will allow multiple models to do well.
    - Tools include error analysis and data augmentation
    - **Hold the code fixed and iteratively improve the data.** 
    

- There's a role for model-centric development and there's a role for data-centric development.

### A useful picture of data augmentation

<img src='img/8.png' width="600" height="300" align="center"/>

<img src='img/9.png' width="600" height="300" align="center"/>

- It turns out that for unstructured data problems, pulling up one piece of the "rubber sheet" in the previous example is unlikely to cause a different part of the sheet to dip down far below. 
- Instead pulling up one point causes nearby points to be pulled up quite a lot, and far away points to maybe be pulled up a little bit (or if you're lucky, maybe more than a little bit). 
- And when you pull up part of the rubber sheet, the location of the biggest gap may shift to somewhere else and error analysis will tell you what is the location of this new biggest gap

<img src='img/10.png' width="600" height="300" align="center"/>

### Data Augmentation

<img src='img/11.png' width="600" height="300" align="center"/>

- **Goal:** Create examples that your learning algorithm can learn from.
    - Create **realistic examples** that (i) **the algorithm does poorly on**, but (ii) **humans (or another baseline) do well on.**
- **Checklist:** 
    - 1. Does it sound realistic?
    - 2. Is the `x` $\rightarrow$ `y` mapping clear? (e.g., can humans recognize speech?)
    - 3. Is the algorithm currently doing poorly on it?
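
For speech, a typical augmentation is to mix a clean clip with background noise. The sketch below assumes waveforms are plain lists of floats of equal length and that the noise level is set through a signal-to-noise ratio in dB; the toy sine waves stand in for real audio. Whether the result sounds realistic (item 1 of the checklist) still has to be judged by listening.

```python
import math


def mean_power(signal):
    """Average squared amplitude of a waveform."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB."""
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]


# Toy signals standing in for a speech clip and a cafe-noise clip.
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [math.sin(2 * math.pi * 37 * t / 200 + 0.3) for t in range(200)]

augmented = mix_at_snr(speech, noise, snr_db=10)
```

Lower SNR values produce harder examples; past some point humans can no longer recognize the speech either, and the example stops being useful training data.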
    

- **Taking a data-centric approach to AI development, sometimes it's useful to use a *data iteration loop* (rather than a model iteration loop).**

### Can adding data hurt?
- For a lot of ML problems, the training, dev, and test set distributions start out being reasonably similar. But if you're using data augmentation, you're adding to specific parts of the training set, such as adding lots of data with cafe noise. So now your training set may come from a very different distribution than the dev set and test set.
- Is this going to hurt your learning algorithm? For unstructured data, the answer is usually no (with some caveats).
- For unstructured data problems, if:
    - The model is large (low bias)
    - The mapping `x` $\rightarrow$ `y` is clear (e.g., given only the input `x`, humans can make accurate predictions).
    - Then, **adding (accurately labeled) data rarely hurts accuracy.**
    
#### Photo OCR counterexample
- Some images are truly ambiguous
- Adding a lot of new "I"s (***especially ambiguous examples***) may skew the dataset and hurt performance
- Because we know there are a lot more 1s than Is on house numbers, if the learning algorithm sees a picture like the one on the right, it would be much safer to guess that it is a 1.
- This is an example of when the mapping of `x` $\rightarrow$ `y` is not clear.
- Just to be clear, this example is a pretty rare, almost corner case. It is quite unusual for data augmentation or adding data to hurt the performance of an ML algorithm.
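
The effect can be shown with simple counting. When an input is truly ambiguous, the best a model can do is predict whichever label is most common for that input, so shifting the label counts shifts the prediction. The counts below are hypothetical.

```python
from collections import Counter


def best_guess(labels_for_ambiguous_glyph):
    """For an ambiguous input, predict the most frequent label seen in training."""
    return Counter(labels_for_ambiguous_glyph).most_common(1)[0][0]


# Hypothetical training labels for one ambiguous vertical-stroke glyph.
original = ["1"] * 90 + ["I"] * 10

# Augmentation adds many new "I" examples that look the same.
augmented = original + ["I"] * 200

before = best_guess(original)
after = best_guess(augmented)
print(f"before augmentation: {before!r}, after: {after!r}")
```

On house numbers, where 1s are far more common than Is, the post-augmentation guess would be wrong more often than the original one.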

<img src='img/11.png' width="600" height="300" align="center"/>

### Adding features
- For many structured data problems, it turns out that creating brand new training examples is difficult, but what you can do is take existing training examples and figure out if there are additional useful features you can add to it. 
- For structured data problems, you usually have a fixed number of observations, and it's difficult if not impossible to add more. Instead, make new features from existing observations.

<img src='img/12.png' width="600" height="300" align="center"/>
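
Deriving features from existing observations can be sketched as follows. The record layout (a user with a list of past orders) and the feature names are hypothetical, chosen to fit a restaurant recommendation setting.

```python
def add_user_features(user):
    """Return a copy of the user record with features derived from order history."""
    orders = user["orders"]
    n = len(orders)
    vegetarian_orders = sum(1 for o in orders if o["vegetarian"])
    enriched = dict(user)
    enriched["order_count"] = n
    enriched["vegetarian_fraction"] = vegetarian_orders / n if n else 0.0
    enriched["mean_order_value"] = sum(o["price"] for o in orders) / n if n else 0.0
    return enriched


# Hypothetical existing observation: no new rows are collected,
# only new columns are computed from what is already stored.
user = {
    "user_id": "u1",
    "orders": [
        {"price": 12.0, "vegetarian": True},
        {"price": 8.0, "vegetarian": True},
        {"price": 10.0, "vegetarian": False},
        {"price": 10.0, "vegetarian": True},
    ],
}

features = add_user_features(user)
print(features["vegetarian_fraction"], features["mean_order_value"])
```

Error analysis can suggest which features to try: if the model keeps recommending meat-heavy restaurants to users who mostly order vegetarian dishes, a feature like `vegetarian_fraction` is a natural candidate.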

#### Recommender Engines
- Over the last several years, there has been a trend in product recommendations away from collaborative filtering approaches and toward content-based filtering approaches
- Collaborative filtering $\rightarrow$ tries to find users similar to you and then recommends items that those users liked
- Content-based filtering $\rightarrow$ looks at you as a person and at the description or menu of the restaurant to see if that restaurant is a good match.
	- The advantage of content-based filtering is that, even if there is a new restaurant or a new product that hardly anyone else has liked, you can more quickly make good recommendations
	- "The cold start problem": how do you recommend a brand new product? Make sure that you capture good features
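
A minimal content-based sketch scores each item by the similarity between a user's preference vector and the item's feature vector. Cosine similarity is one common choice, used here as an assumption; the feature order and all values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Assumed feature order: [vegetarian_friendly, spicy, budget]
user_profile = [1.0, 0.2, 0.8]

restaurants = {
    "new_veggie_cafe": [1.0, 0.1, 0.9],  # brand new, no ratings yet
    "steakhouse": [0.0, 0.3, 0.1],
}

ranked = sorted(
    restaurants,
    key=lambda name: cosine(user_profile, restaurants[name]),
    reverse=True,
)
print(ranked)
```

Because the score depends only on features, the new restaurant can be ranked without any ratings from other users, which is how this approach sidesteps the cold start problem.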

## Experiment Tracking

- Rather than worrying too much about exactly which experiment tracking framework to use, the number one takeaway from this section is to have some system for keeping track of your experiments, even if it's just a text file or a spreadsheet, and to include as much information as is convenient to include
- **What to track:**
    - Algorithm/code versioning
    - Dataset used
    - Hyperparameters
    - Results
- **Tracking tools:**
    - Text files (does not scale well)
    - Spreadsheets (scale much further, especially shared spreadsheets)
    - Experiment tracking system
- **Desirable features:**
    - Information needed to replicate results
        - Does your learning algorithm pull data off the internet? This can make experiments less reproducible
    - Experiment results, ideally with summary metrics/analysis
    - Perhaps also: Resource monitoring, visualization, model error analysis
- The space of experiment tracking systems is still evolving rapidly and so there's a growing set of tools out there. But some examples include:
    - Weights & Biases (W&B)
    - Comet
    - MLflow
    - SageMaker Studio
    - Landing.AI $\rightarrow$ focuses on computer vision and manufacturing applications
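
Even without a dedicated tool, a few lines of code are enough to record the four items under "What to track". The sketch below appends one JSON record per experiment to a text file; the file name, field names, and values are all hypothetical.

```python
import json
import os
import tempfile


def log_experiment(path, record):
    """Append one experiment record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_experiments(path):
    """Read all experiment records back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A temporary directory keeps the example self-contained.
log_path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")

log_experiment(log_path, {
    "code_version": "abc1234",  # hypothetical git commit hash
    "dataset": "speech_v3",     # hypothetical dataset identifier
    "hyperparameters": {"lr": 0.001, "batch_size": 32},
    "results": {"dev_accuracy": 0.91},
})

runs = load_experiments(log_path)
print(runs[0]["results"])
```

A flat file like this stops scaling once several people run experiments in parallel, at which point a shared spreadsheet or a tracking system becomes the better option.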

### From big data to good data
- Try to ensure consistently high-quality data in all phases of the ML project lifecycle.
- Good data:
    - Covers important cases (good coverage of inputs `x`)
    - Is defined consistently (definition of labels `y` is unambiguous)
    - Has timely feedback from production data (distribution covers data drift and concept drift)
    - Is sized appropriately

### More label ambiguity examples
- A common application in many large companies is user ID merge
    - **User ID merge:** When you have multiple data records that you think correspond to the same person and you want to merge these user records together.
    - One scenario where this commonly occurs is when one company purchases or merges with another company and a user has accounts with each (often with non-identical information in each)
    - One approach to the User ID merge problem is to use a supervised ML algorithm that takes two user data records as input and outputs either one or zero based on whether it thinks the two records belong to the same user
- Other examples with ambiguous ground truths: 
    - Is an account a spam/fake/bot account?
    - Is an online purchase fraudulent?
    - A job/resume website trying to predict whether a user is actively looking for a job or not
    - Structuring text transcription
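
For the user ID merge problem, the input to the supervised model is a pair of records. One plausible way to represent a pair is as a small vector of similarity features, sketched below; the field names and the choice of features are assumptions, and the classifier that would consume them is not shown.

```python
from difflib import SequenceMatcher


def pair_features(a, b):
    """Similarity features for two user records (inputs to a 0/1 classifier)."""
    name_similarity = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_email = float(a["email"].lower() == b["email"].lower())
    same_zip = float(a["zip"] == b["zip"])
    return [name_similarity, same_email, same_zip]


# Hypothetical records from two merged companies.
rec_a = {"name": "Jane A. Doe", "email": "jane@example.com", "zip": "94040"}
rec_b = {"name": "Jane Doe", "email": "JANE@example.com", "zip": "94040"}

features = pair_features(rec_a, rec_b)
print(features)
```

The label ambiguity remains even with good features: two records can look alike and still belong to different people, so labelers need clear instructions on when a pair counts as a match.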


#### Data definition questions
- When defining the data for your learning algorithm, here are some important questions:
    - What is the input, `x`?
        - Lighting? Contrast? Resolution?
        - What features need to be included?
    - What is the target label, `y`?
        - How can we ensure labelers give consistent labels?
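
A simple first check on label consistency is to have several labelers label the same examples and measure how often they agree. The sketch below uses unanimous agreement as the measure, which is one simple choice among several; the labeler names and labels are hypothetical.

```python
def agreement_rate(labels_by_labeler):
    """Fraction of examples on which every labeler gave the same label."""
    per_example = list(zip(*labels_by_labeler.values()))
    unanimous = sum(1 for labels in per_example if len(set(labels)) == 1)
    return unanimous / len(per_example)


def disputed_examples(labels_by_labeler):
    """Indices of examples where at least two labelers disagree."""
    per_example = zip(*labels_by_labeler.values())
    return [i for i, labels in enumerate(per_example) if len(set(labels)) > 1]


# Hypothetical labels from three labelers on the same five examples.
labels = {
    "labeler_1": ["defect", "ok", "ok", "defect", "ok"],
    "labeler_2": ["defect", "ok", "defect", "defect", "ok"],
    "labeler_3": ["defect", "ok", "ok", "ok", "ok"],
}

rate = agreement_rate(labels)
disputed = disputed_examples(labels)
print(f"agreement rate: {rate:.2f}, disputed examples: {disputed}")
```

The disputed examples are a good starting point for refining the labeling instructions.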

<img src='img/x.png' width="600" height="300" align="center"/>