# Selecting and Training a Model
- Some best practices for building a machine learning model that is worthy of a production deployment
- Model-centric AI development vs Data-centric AI development

### Key Challenges
AI system = Code (algorithm/model) + Data
- It is often more efficient to spend more of your time improving the data, because the data usually has to be much more customized to your problem
- When building a model, there are three key milestones that most models should aspire to accomplish:
    - 1. Doing well on training set (usually measured by average training error).
    - 2. Doing well on dev/test sets.
    - 3. Doing well on business metrics/project goals.
    
### Why low average error isn't good enough
#### Performance on disproportionately important examples
- Web search example:
    - **Informational and transactional queries:**
        - "Apple pie," "Wireless data plan," "Latest movies," "Diwali festival."
        - For informational and transactional queries, a web search engine wants to return the most relevant results, but users are usually willing to forgive ranking the best result at number 2 or 3
    - **Navigational queries:**
        - "Stanford," "Reddit," "YouTube."
        - Here the user has a very clear desire to go to a particular place
        - When a user has a very clear navigational intent, they tend to be very unforgiving if the web search engine does anything other than return the exact result
        - A web search engine that doesn't return the best results will very quickly lose the trust of its users.
        - So, navigational queries in this context are a disproportionately important set of examples
- The problem is that average test set accuracy tends to weight all examples equally, whereas in many cases some scenarios are disproportionately important. 
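
One way to make this concrete is to compare plain accuracy with an importance-weighted accuracy. The sketch below is illustrative only: the query counts, the `importance` weights, and the `accuracy` helper are assumptions made up for this example, not values from the course.

```python
def accuracy(correct, weights=None):
    """Return the (optionally weighted) fraction of correct predictions."""
    if weights is None:
        weights = [1.0] * len(correct)
    total = sum(weights)
    return sum(w for c, w in zip(correct, weights) if c) / total


# Hypothetical evaluation set: 8 informational queries, 2 navigational ones.
query_types = ["informational"] * 8 + ["navigational"] * 2
# Assume the model handles every informational query but misses both navigational ones.
correct = [True] * 8 + [False] * 2

# Assumed importance weights: a navigational miss counts 5x as much.
importance = {"informational": 1.0, "navigational": 5.0}
weights = [importance[t] for t in query_types]

plain = accuracy(correct)
weighted = accuracy(correct, weights)
print(f"plain accuracy:    {plain:.2f}")
print(f"weighted accuracy: {weighted:.2f}")
```

Under these assumed weights the plain average looks acceptable while the weighted figure drops sharply, which is the gap that average test set accuracy hides.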

#### Performance on key slices of the dataset
- **Example: ML for loan approval**
    - Make sure not to discriminate by ethnicity, gender, location, language, or other protected attributes
    - **Even if a learning algorithm for loan approval achieves high average test set accuracy, it would not be acceptable for production deployment if it exhibits an unacceptable level of bias or discrimination.**
- **Example: Product recommendations from retailers**
    - Be careful to treat all major user, retailer, and product categories fairly.
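
A simple way to audit key slices is to report accuracy per group next to the overall number. This is a minimal sketch: the `records` data, the `region` attribute, and the `accuracy_by_slice` helper are hypothetical stand-ins for whatever slices matter in your application.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Group records by `slice_key` and return the accuracy of each group."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for r in records:
        group = r[slice_key]
        counts[group] += 1
        hits[group] += int(r["label"] == r["prediction"])
    return {g: hits[g] / counts[g] for g in counts}


# Hypothetical records; "region" stands in for any attribute you need to audit.
records = (
    [{"region": "A", "label": 1, "prediction": 1}] * 9
    + [{"region": "A", "label": 0, "prediction": 1}] * 1
    + [{"region": "B", "label": 1, "prediction": 1}] * 1
    + [{"region": "B", "label": 1, "prediction": 0}] * 1
)

overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_slice = accuracy_by_slice(records, "region")
print(f"overall: {overall:.2f}")
for group, acc in sorted(per_slice.items()):
    print(f"slice {group}: {acc:.2f}")
```

In this made-up data the small slice performs much worse than the overall average suggests, so the overall number alone would not reveal the problem.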
    
#### Rare classes
- Skewed data distributions
- Accuracy in rare classes

### Establish a baseline
- HLP (Human-Level Performance) is often a good point of comparison or a baseline that helps you decide where to focus your efforts

<img src='img/1.png' width="600" height="300" align="center"/>

#### Unstructured and structured data
- It turns out the best practices for establishing a baseline are quite different depending on whether you are using unstructured or structured data. 
- Because humans are so good at interpreting **unstructured data**, using HLP as a comparison is often a good way of establishing a baseline
- In contrast, because humans are not as good at looking at **structured (tabular) data** and making predictions, HLP is generally a less useful baseline.
- In general, ML best practices are typically very different depending on whether you're working with structured or unstructured data.

<img src='img/2.png' width="600" height="300" align="center"/>

#### Ways to establish a baseline
- Human level performance (HLP) $\rightarrow$ particularly for unstructured data problems. 
- Literature search for state-of-the-art/open source (to see what others are able to accomplish).
- Quick and dirty implementation 
- Performance of an older system (for example, if you already have a machine learning system running and are looking to replace it).

**Baseline helps to indicate what might be possible. In some cases (such as HLP) it also gives a sense of what is irreducible error/Bayes' error.**

#### Tips for getting started on modeling
- Literature search to see what's possible (courses, blogs, open-source projects).
- Find open-source implementations if possible.
- **A reasonable algorithm with good data will often outperform a great algorithm with not-so-good data.**

#### Deployment constraints when picking a model
- Should you take into account deployment constraints (such as compute constraints) when picking a model?
    - **Yes**, if baseline is already established and goal is to build and deploy.
    - **No** (or not necessarily), if the purpose is to establish a baseline and determine what is possible and might be worth pursuing.

#### Sanity-check for code and algorithm
- Try to overfit a small training dataset before training on a large one (especially if the model's output is complex).
    - Example \#1: Speech recognition
    - Example \#2: Image segmentation
    - Example \#3: Image classification
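
The same check can be sketched on a toy problem. The example below uses a tiny hand-written logistic regression rather than a speech or image model, and the four-point dataset, learning rate, and step count are all assumptions chosen for illustration. The idea is only that a correct training loop should drive the training loss close to zero on a handful of examples.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, steps=1000):
    """Fit logistic regression with full-batch gradient descent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def evaluate(xs, ys, w, b):
    """Return (mean log loss, accuracy) on the given examples."""
    loss, hits = 0.0, 0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        hits += int((p >= 0.5) == bool(y))
    return loss / len(xs), hits / len(xs)


# Tiny, linearly separable training set: the label equals the first feature.
tiny_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
tiny_y = [0, 0, 1, 1]

w, b = train_logistic(tiny_x, tiny_y)
loss, acc = evaluate(tiny_x, tiny_y, w, b)
print(f"training loss={loss:.4f}, training accuracy={acc:.2f}")
```

If a model cannot fit even a few examples like this, the likely cause is a bug in the code or the algorithm, and it is cheaper to find that before launching a long training run.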

## Error Analysis and Performance Auditing
- Error analysis within a Jupyter notebook or Excel spreadsheet is common, but there are also emerging MLOps tools to make this process much more accurate, efficient, and in some cases, automated.

<img src='img/3.png' width="600" height="300" align="center"/>

- Error analysis is an iterative process

<img src='img/4.png' width="600" height="300" align="center"/>

#### Useful metrics for each tag
- What fraction of errors has that tag?
- Of all the data with that tag, what fraction is misclassified?
- What fraction of all the data has that tag?
- How much room for improvement is there on data with that tag?
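
The first three fractions can be computed directly from tagged examples. This is a rough sketch with a hypothetical `examples` list and tag names; the fourth question (room for improvement) is left out because it needs a baseline such as HLP for the tagged data.

```python
def tag_metrics(examples, tag):
    """Compute the error-analysis fractions for one tag."""
    tagged = [e for e in examples if tag in e["tags"]]
    errors = [e for e in examples if not e["correct"]]
    tagged_errors = [e for e in tagged if not e["correct"]]
    return {
        # What fraction of errors has that tag?
        "fraction_of_errors_with_tag": len(tagged_errors) / len(errors) if errors else 0.0,
        # Of all the data with that tag, what fraction is misclassified?
        "error_rate_within_tag": len(tagged_errors) / len(tagged) if tagged else 0.0,
        # What fraction of all the data has that tag?
        "fraction_of_data_with_tag": len(tagged) / len(examples),
    }


# Hypothetical tagged examples; an example may carry several tags or none.
examples = [
    {"tags": {"car_noise"}, "correct": False},
    {"tags": {"car_noise"}, "correct": True},
    {"tags": {"people_noise"}, "correct": False},
    {"tags": {"people_noise", "car_noise"}, "correct": False},
    {"tags": set(), "correct": True},
    {"tags": set(), "correct": True},
    {"tags": set(), "correct": False},
    {"tags": set(), "correct": True},
]

car = tag_metrics(examples, "car_noise")
for name, value in car.items():
    print(f"{name}: {value:.3f}")
```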

### Prioritizing what to work on
- How can we use the above-described tags to prioritize what we work on?
- Besides "Gap to HLP," one other useful metric to look at is **"% of data"**
- Using these two columns together, we see that it may be best to prioritize **Clean Speech**, **People Noise**, or both.

<img src='img/5.png' width="600" height="300" align="center"/>
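
One way to combine the two columns is to multiply the gap to HLP by the share of data in each category, which estimates how much overall accuracy could rise if that category reached HLP. The numbers below are hypothetical placeholders chosen for illustration, so substitute the values from your own error analysis.

```python
# Hypothetical per-category numbers (all in percent); replace with your own.
categories = {
    "clean_speech":  {"accuracy": 94.0, "hlp": 95.0, "pct_of_data": 60.0},
    "car_noise":     {"accuracy": 89.0, "hlp": 93.0, "pct_of_data": 4.0},
    "people_noise":  {"accuracy": 87.0, "hlp": 89.0, "pct_of_data": 30.0},
    "low_bandwidth": {"accuracy": 70.0, "hlp": 70.0, "pct_of_data": 6.0},
}


def potential_gain(stats):
    """Overall accuracy gain (in points) if this category reached HLP."""
    gap = stats["hlp"] - stats["accuracy"]
    return gap * stats["pct_of_data"] / 100.0


gains = {name: potential_gain(stats) for name, stats in categories.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
for name in ranked:
    print(f"{name:14s} potential gain: {gains[name]:.2f} points")
```

With these assumed numbers, a category with a large gap but very little data ranks below categories with a small gap and a lot of data. The score covers only the first two factors in the list that follows; ease of improvement and importance still require judgment.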

#### Decide on most important categories to work on based on:
   - How much room for improvement there is
   - How frequently that category appears
   - How easy (or not) it is to improve accuracy in a particular category
   - How important it is to improve in that category
   

* There is no mathematical formula that will tell you what to work on, but by looking at these factors you'll hopefully be able to make some fruitful decisions. 

#### Adding/improving data for specific categories
- For categories you want to prioritize
    - Collect more data
    - Use data augmentation to get more data
    - Improve label accuracy/data quality
    

### Skewed Datasets 
- Datasets where the ratio of positive to negative examples is very far from 50-50 are called **skewed datasets.**
- **Confusion Matrix:** Actual vs. Predicted
- **Precision:** $$\frac{TP}{TP + FP}$$
- **Recall:** $$\frac{TP}{TP+FN}$$

- The metrics of precision and recall are more useful than raw accuracy when it comes to evaluating the performance of learning algorithms on very skewed datasets
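
The difference shows up clearly on a small synthetic example. The sketch below assumes binary labels with 1 as the rare positive class, and it assumes the convention of reporting 0.0 when a denominator is zero; the label counts are made up for illustration.

```python
def confusion_counts(labels, predictions):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_accuracy(labels, predictions):
    tp, fp, fn, tn = confusion_counts(labels, predictions)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(labels)
    return precision, recall, accuracy


# Hypothetical skewed dataset: 10 positives out of 1000 examples.
labels = [1] * 10 + [0] * 990

# A "model" that always predicts negative.
always_negative = [0] * 1000
# A model that finds 6 of the 10 positives and raises 4 false alarms.
some_detection = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 986

for name, preds in [("always negative", always_negative), ("some detection", some_detection)]:
    p, r, a = precision_recall_accuracy(labels, preds)
    print(f"{name:16s} precision={p:.2f} recall={r:.2f} accuracy={a:.3f}")
```

Both predictors score about 99% accuracy, yet only precision and recall reveal that the first one never detects a positive example.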

#### Combining precision and recall: $F_1$ score
- **The $F_1$ score is a common way of combining precision and recall that emphasizes whichever (of P or R) is worse.**
- One intuition behind the $F_1$ score is that you want an algorithm to do well on both precision and recall, and if it does worse on either precision or recall, that's pretty bad.
- The $F_1$ score is a **harmonic mean** (which is like taking an average, but putting an emphasis on whichever is the lower number).
- **$F_1$ score:** $$\frac{2}{\frac{1}{P}+\frac{1}{R}}$$
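
A short sketch of the formula, compared against a plain arithmetic mean. The two precision/recall pairs are hypothetical, and returning 0.0 when either value is zero is an assumed convention to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if either is zero)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


# Hypothetical models: A is balanced, B has high precision but very low recall.
models = {"A": (0.90, 0.80), "B": (0.98, 0.10)}

scores = {}
for name, (p, r) in models.items():
    scores[name] = f1_score(p, r)
    mean = (p + r) / 2
    print(f"model {name}: arithmetic mean={mean:.3f}, F1={scores[name]:.3f}")
```

The arithmetic mean still gives model B a middling score, while $F_1$ pulls it down toward its weak recall.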

<img src='img/6.png' width="600" height="300" align="center"/>

- **Note that for your application, you may have a different weighting between precision and recall, so the $F_1$ score isn't the only way to combine precision and recall**; it's just one metric that's commonly used for many applications.
- Precision and recall are not just useful for binary classification problems, but also for multi-class classification problems 
- In manufacturing, many factories want high recall because you really don't want, for example, to let a defective phone go out. Slightly lower precision is acceptable, because a human who re-examines a flagged phone will hopefully determine that it is actually okay.
- Combining precision and recall into the $F_1$ score gives you a single evaluation metric for how well your algorithm is doing.
- In the example below, if each of the four types of defects were very rare, then the accuracy would be extremely high for each category, even if the model was very bad at detecting any defects:

<img src='img/7.png' width="600" height="300" align="center"/>
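
The effect can be reproduced with a small one-vs-rest report per class. This is a sketch under assumptions: the class names, the counts, and the `per_class_report` helper are hypothetical, and each class is treated as its own binary problem.

```python
def per_class_report(labels, predictions, classes):
    """One-vs-rest accuracy, precision, recall, and F1 for each class."""
    report = {}
    n = len(labels)
    for c in classes:
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, predictions) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, predictions) if y == c and p != c)
        tn = n - tp - fp - fn
        # Assumed convention: report 0.0 when a denominator is zero.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        report[c] = {
            "accuracy": (tp + tn) / n,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return report


# Hypothetical inspection data: 980 good parts, 10 scratches, 10 dents.
labels = ["ok"] * 980 + ["scratch"] * 10 + ["dent"] * 10
# Assume the model finds only 2 scratches and calls everything else "ok".
predictions = ["ok"] * 980 + ["scratch"] * 2 + ["ok"] * 8 + ["ok"] * 10

report = per_class_report(labels, predictions, ["scratch", "dent"])
for name, m in report.items():
    print(
        f"{name:8s} accuracy={m['accuracy']:.3f} precision={m['precision']:.2f} "
        f"recall={m['recall']:.2f} f1={m['f1']:.2f}"
    )
```

Per-class accuracy stays at 99% or above for both defect types even though the model misses most scratches and every dent, which is why recall and $F_1$ are more informative for rare classes.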

<img src='img/x.png' width="600" height="300" align="center"/>