# Advice for Applying Machine Learning Algorithms

### Deciding What to Try Next

- What if the learned hypothesis makes unacceptably large errors when applied to new data? Options include:
    - Get more training data
    - Try a smaller set of features
    - Try getting additional features
    - Add polynomial features to best fit the training data
    - Increase (or decrease) the regularization parameter $\lambda$
- Machine Learning Diagnostic: A test that can be run to gain insight into what is and isn't working with a learning algorithm, and to gain guidance on how best to improve its performance

### Evaluating a Hypothesis

- Evaluating the hypothesis begins by splitting the data into training and testing sets
- Training/Testing Procedure for Linear Regression:
    1. Learn the parameters $\theta$ from the training data (minimizing the training error $J(\theta)$)
    2. Compute test set error:
        - $J_{test}(\theta) = \frac{1}{2m_{test}}\sum \limits_{i=1}^{m_{test}}(h_\theta(x_{test}^{(i)})-y_{test}^{(i)})^2$
        - The procedure for logistic regression is similar
        - For classification, an alternative test metric is the **misclassification error (0/1 misclassification error)**:
             ![image.png](attachment:image.png)
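
The two test metrics above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, each row of `X_test` is assumed to already include the bias term $x_0 = 1$, and the misclassification error follows the standard 0/1 definition (predict 1 when $h_\theta(x) \geq 0.5$, then report the fraction of test examples predicted incorrectly).

```python
import math

def linear_test_error(theta, X_test, y_test):
    """J_test(theta) = 1/(2*m_test) * sum((h_theta(x) - y)^2) for linear regression."""
    m_test = len(y_test)
    total = 0.0
    for x, y in zip(X_test, y_test):
        h = sum(t * xj for t, xj in zip(theta, x))  # h_theta(x) = theta^T x
        total += (h - y) ** 2
    return total / (2 * m_test)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def misclassification_error(theta, X_test, y_test, threshold=0.5):
    """Fraction of test examples whose thresholded prediction differs from y."""
    errors = 0
    for x, y in zip(X_test, y_test):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        prediction = 1 if h >= threshold else 0
        errors += int(prediction != y)
    return errors / len(y_test)

# Toy usage with made-up numbers (first column is the bias term)
print(linear_test_error([1.0, 2.0], [[1, 1], [1, 2]], [3, 4]))
print(misclassification_error([0.0, 1.0], [[1, 2], [1, -2], [1, 3]], [1, 1, 1]))
```
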
             
### Model Selection and Training/Validation/Test Sets

- How do you choose the polynomial degree that gives the best model performance? In addition to the parameters $\theta$, the **degree of polynomial $d$** is another value that must be chosen
    - Split the dataset into three sets: fit $\theta$ for each candidate $d$ on the training set, choose $d$ using the cross-validation set, and estimate the generalization error on the test set:
        1. Training Set
        2. Cross-Validation Set
        3. Test Set
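
A rough sketch of this selection procedure using NumPy is shown below. The synthetic data, the 60/20/20 split, the candidate degrees 1 to 8, and the helper names `half_mse` and `select_degree` are all assumptions made for illustration.

```python
import numpy as np

def half_mse(coeffs, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    residuals = np.polyval(coeffs, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(y)))

def select_degree(x_train, y_train, x_cv, y_cv, degrees):
    """Fit one polynomial per degree on the training set; keep the lowest CV error."""
    best = None
    for d in degrees:
        coeffs = np.polyfit(x_train, y_train, d)  # learn theta for this degree
        cv_error = half_mse(coeffs, x_cv, y_cv)
        if best is None or cv_error < best[1]:
            best = (d, cv_error, coeffs)
    return best

# Synthetic data (assumption): a noisy quadratic
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(scale=0.3, size=x.size)

# Shuffle, then split 60% / 20% / 20% into training / cross-validation / test
idx = rng.permutation(x.size)
train, cv, test = idx[:120], idx[120:160], idx[160:]

d, cv_error, coeffs = select_degree(x[train], y[train], x[cv], y[cv], range(1, 9))

# The test set was not used to choose theta or d, so this estimate is not
# biased by the selection step.
test_error = half_mse(coeffs, x[test], y[test])
print(f"chosen degree: {d}, CV error: {cv_error:.4f}, test error: {test_error:.4f}")
```
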
        
### Diagnosing Bias vs. Variance

![image-2.png](attachment:image-2.png)

### Regularization and Bias/Variance

![image-3.png](attachment:image-3.png)

### Learning Curves

![image-4.png](attachment:image-4.png)

### Deciding What to Do Next Revisited

- Get more training data **-->** fixes high variance
- Try a smaller set of features **-->** fixes high variance
- Try getting additional features **-->** usually fixes high bias problems
- Add polynomial features to best fit the training data **-->** fixes high bias
- Increase (or decrease) the regularization parameter $\lambda$ **-->** decreasing $\lambda$ fixes high bias, while increasing $\lambda$ fixes high variance

- For neural networks, regularization is often applied to larger networks to address overfitting


### Training and Cross Validation Addendum

- Particularly for small datasets or datasets with a wide range of feature values, a good practice is to use randomly chosen sets of examples when calculating the training and cross validation errors:
    1. Randomly select i examples from the training set and i examples from the cross validation set
    2. Calculate $\theta$ using the selected training set examples
    3. Evaluate the error using the parameters $\theta$ on the randomly chosen training set and cross validation set
    4. Repeat multiple times and collect the errors for each iteration
    5. Use the average of the collected errors as the training error and cross validation error for i examples.
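
The five steps above could look roughly like this for linear regression. The least-squares fit, the default of 50 repetitions, and the helper names are assumptions for the sketch; `X` is assumed to already contain a bias column.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares theta for linear regression (X includes a bias column)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2), without regularization."""
    residuals = X @ theta - y
    return float(residuals @ residuals / (2 * len(y)))

def averaged_errors(X_train, y_train, X_cv, y_cv, i, repeats=50, seed=0):
    """Average training and CV error over random subsets of i examples each."""
    rng = np.random.default_rng(seed)
    train_errors, cv_errors = [], []
    for _ in range(repeats):
        # Step 1: draw i examples from each set, without replacement
        tr = rng.choice(len(y_train), size=i, replace=False)
        cv = rng.choice(len(y_cv), size=i, replace=False)
        # Step 2: learn theta from the selected training examples
        theta = fit_linear(X_train[tr], y_train[tr])
        # Steps 3-4: evaluate on both random subsets and collect the errors
        train_errors.append(cost(theta, X_train[tr], y_train[tr]))
        cv_errors.append(cost(theta, X_cv[cv], y_cv[cv]))
    # Step 5: the averages give one point on each learning curve
    return float(np.mean(train_errors)), float(np.mean(cv_errors))

# Toy usage on synthetic data (assumption): y = 2 + 3x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=60)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x + rng.normal(scale=0.1, size=x.size)
print(averaged_errors(X[:40], y[:40], X[40:], y[40:], i=10))
```

Calling `averaged_errors` for increasing values of `i` gives the points of a learning curve; `i` must not exceed the size of either set.
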
    
### Machine Learning System Design: Prioritizing What to Work On

- Example: Building a Spam Classifier
    - Supervised Learning
        - x = features of email
        - y = spam(1) or y = not spam(0)
        - features x: Choose 100 words indicative of spam/not spam
            - Note: In practice, take the most frequently occurring words (perhaps 10,000-50,000) in the training set rather than choosing 100 words manually
    - How should you spend your time to give the model a low error?
        - Collect lots of data
            - e.g. "honeypot" project
        - Develop sophisticated features based on email routing information (from the email header)
        - Develop sophisticated features for message body, e.g. should "discount" and "discounts" be treated as the same word? Or "deal" and "Dealer"? Features regarding punctuation?
        - Create an algorithm to detect misspellings in the email
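
As an illustration of the word-based features described above, here is a small sketch that builds a vocabulary from the most frequent training-set words and turns an email into a binary feature vector. The tokenizer (lowercase alphabetic runs only) and the function names are assumptions; a real system would make its own choices about punctuation, stemming, and header fields.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphabetic tokens (a deliberately simple choice)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(emails, size):
    """Return the `size` most frequent words across the training emails."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return [word for word, _ in counts.most_common(size)]

def email_features(email, vocabulary):
    """Binary feature vector x: x[j] = 1 if vocabulary[j] appears in the email."""
    words = set(tokenize(email))
    return [1 if word in words else 0 for word in vocabulary]

# Toy usage with made-up emails; a real vocabulary would hold thousands of words
training_emails = ["Buy now, buy cheap", "Cheap deal now", "Meeting at noon"]
vocabulary = build_vocabulary(training_emails, size=3)
print(vocabulary)
print(email_features("Cheap deal now", vocabulary))
```
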
        
### Machine Learning System Design: Error Analysis

- Recommended Approach:
    1. Start with a simple algorithm that can be implemented very quickly. Then test the algorithm/model using the validation data.
    2. Plot learning curves
    3. Error analysis: manually examine the cross validation examples that the algorithm got wrong, and look for systematic trends in the types of examples it misclassifies
- Error analysis example, Spam Classifier:
    - $m_{cv}$ = 500 examples in cross validation set
    - The algorithm misclassifies 100 examples. Manually examine the 100 misclassified examples and categorize them based on:
        1. What type of email it is
        2. What features might have helped the algorithm classify them correctly
    - Numerical evaluation is also very helpful:
        - It returns a single real number (e.g. accuracy or error) that reflects the overall performance of the model
        - Example: should stemming software be used? Error analysis is not always helpful for deciding this; the best option is to try it and see whether it improves the algorithm's results
        - A numerical evaluation (e.g. cross validation error) gives a measure of the algorithm's performance with and without stemming
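
Once the misclassified examples have been labeled by hand, tallying the labels shows which category is worth working on first. The snippet below is only a sketch: the category names and the handful of labels are made up for illustration and do not come from a real dataset.

```python
from collections import Counter

def tally_categories(labels):
    """Count manually assigned categories, most frequent first."""
    return Counter(labels).most_common()

# Hypothetical labels assigned by hand to a few misclassified emails
labels = ["phishing", "pharma", "phishing", "other", "phishing", "pharma"]

for category, count in tally_categories(labels):
    print(f"{category}: {count}")
```
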
        
### Handling Skewed Data

- Error metrics such as accuracy are sometimes not very helpful, particularly in the case of skewed classes.
- Precision/Recall is often a better metric for evaluating an algorithm's performance
![image-5.png](attachment:image-5.png)
    - Precision = $\frac{t}{p}$ where t = **# of true positives** and p = **# of predicted positives (true positives + false positives)**
    - Recall = $\frac{t}{a}$ where t = **# of true positives** and a = **# of actual positives (true positives + false negatives)**
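
A minimal sketch of both metrics for binary labels follows. Returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions require. The usage lines show why accuracy can mislead on skewed classes: with made-up labels where only 1 example in 100 is positive, a classifier that always predicts 0 is 99% accurate but has zero recall.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention (assumption): report 0.0 when the ratio is undefined
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Skewed toy data: 1 positive among 100 examples, classifier always predicts 0
y_true = [1] + [0] * 99
y_pred = [0] * 100
accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision_recall(y_true, y_pred))
```
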
    
### Trading off Precision and Recall

- If you raise the threshold on $h_\theta(x)$ from 0.5 to 0.7 or 0.9 (i.e. predict 1 only if $h_\theta(x)$ is more than 0.7 or 0.9, and predict 0 otherwise), precision will generally increase but recall will decrease, because a smaller portion of the examples receive a prediction of 1
- Conversely, if you want to avoid missing actual positives (false negatives), you can lower the threshold on $h_\theta(x)$, which results in higher recall but lower precision
- $F_1$ Score (F Score) can be used to compare the precision and recall of different classification algorithms that have been trained using different thresholds
    - $F_1\ \text{Score} = 2\frac{PR}{P+R}$, where $P$ is precision and $R$ is recall
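
The trade-off and the $F_1$ comparison can be sketched as follows. The probabilities and labels are made-up toy values, the helper names are hypothetical, and evaluating the candidate thresholds on a held-out (cross validation) set is an assumption of this sketch.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (0.0 when undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined as 0.0 here when P + R is zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(probabilities, y_true, thresholds):
    """Return the (threshold, F1) pair with the highest F1 on the given set."""
    scored = []
    for threshold in thresholds:
        y_pred = [1 if h > threshold else 0 for h in probabilities]
        scored.append((threshold, f1_score(*precision_recall(y_true, y_pred))))
    return max(scored, key=lambda pair: pair[1])

# Toy values standing in for h_theta(x) on a held-out set
probabilities = [0.95, 0.8, 0.6, 0.4, 0.2]
y_true = [1, 1, 0, 1, 0]
threshold, score = best_threshold(probabilities, y_true, [0.5, 0.7, 0.9])
print(threshold, score)
```

Because $F_1$ is low whenever either precision or recall is low, it penalizes degenerate classifiers (for example, one that always predicts 1) more than a simple average of the two would.
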