# Advice for Applying Machine Learning

## Evaluating a Learning Algorithm

### Deciding What to Try Next

What should you do if your model (e.g. regularised linear regression) is making unacceptably large errors in its predictions?

Options include:

+ Collect more training examples.
+ Try smaller sets of features.
+ Try getting additional features.
+ Try adding polynomial features.
+ Try decreasing $\lambda$.
+ Try increasing $\lambda$.

How do you know which to try first?

Machine learning diagnostic:  
A diagnostic is a test that you can run to gain insight into what is or isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.

Diagnostics can take time to implement, but are worth it in the end.

### Evaluating a Hypothesis

A really low value of training error might indicate overfitting rather than accuracy. For this reason you should always divide your (randomly ordered) data into a training set and a test set - a 70:30 split is typical - and check the error on the test set after the model has been trained with the training set data. A low value for training error and a high value for test error suggests overfitting.

Procedure:  

+ Learn parameter $\theta$ from training data (minimising training error $J(\theta)$).
+ Compute test error (for e.g. linear regression): $$J_{test}(\theta) = \frac{1}{2m_{test}} \sum^{m_{test}}_{i = 1} (h_\theta(x^{(i)}_{test}) - y^{(i)}_{test})^2$$
+ Compare test error with training error.
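
The procedure above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic, the fit is unregularised least squares, and the helper names `split_train_test` and `squared_error_cost` are made up for this example.

```python
import numpy as np

def split_train_test(X, y, train_fraction=0.7, seed=0):
    """Shuffle the examples, then split them into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def squared_error_cost(theta, X, y):
    """Half the mean squared error of the linear hypothesis X @ theta."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (an assumption for illustration): y = 1 + 2x plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
X = np.column_stack([np.ones_like(x), x])  # prepend the intercept feature x_0 = 1
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=100)

X_train, y_train, X_test, y_test = split_train_test(X, y)

# Learn theta from the training data only.
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Compare training error with test error.
j_train = squared_error_cost(theta, X_train, y_train)
j_test = squared_error_cost(theta, X_test, y_test)
print(f"J_train = {j_train:.3f}, J_test = {j_test:.3f}")
```

A test error that is much larger than the training error would be the overfitting signal described above.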

For logistic regression, you can use test error as follows:

$$J_{test}(\theta) = -\frac{1}{m_{test}} \sum^{m_{test}}_{i = 1} \left[ y^{(i)}_{test} \log h_\theta(x^{(i)}_{test}) + (1 - y^{(i)}_{test})\log (1 - h_\theta(x^{(i)}_{test})) \right] $$

Or an alternative error metric, the misclassification error:

$$ err(h_\theta(x), y) =
\begin{cases}
 1 & \text{if } h_\theta(x) \geq 0.5, y = 0 \text{ or } h_\theta(x) < 0.5, y = 1 \\
 0 & \text{otherwise}
\end{cases} $$

That is, 1 if the sample was misclassified, 0 otherwise. Then average these values to give you an idea of how many samples your hypothesis is misclassifying:

$$ \text{test error } = \frac{1}{m_{test}} \sum^{m_{test}}_{i = 1} err(h_\theta(x^{(i)}_{test}), y^{(i)}_{test}) $$
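
As a rough sketch, the misclassification error can be computed by thresholding the hypothesis outputs and averaging the mismatches. The function name and the five example values below are invented for illustration.

```python
import numpy as np

def misclassification_error(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction disagrees with the label."""
    predictions = (np.asarray(probabilities) >= threshold).astype(int)
    return float(np.mean(predictions != np.asarray(labels)))

# Hypothetical hypothesis outputs h_theta(x) and true labels for five test examples.
h_test = np.array([0.9, 0.2, 0.6, 0.4, 0.8])
y_test = np.array([1, 0, 0, 1, 1])

test_error = misclassification_error(h_test, y_test)
print(test_error)  # the third and fourth examples are misclassified: 2 / 5 = 0.4
```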

### Model Selection and Train/Validation/Test Sets

How do you decide what degree of polynomial to fit to a data set? Remember that fitting well to a training set doesn't tell you much about how well a model will generalise.

Options:

1\. $h_\theta(x) = \theta_0 + \theta_1x; d = 1$<br>
2\. $h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2; d = 2$<br>
3\. $h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3; d = 3 \\ \vdots \\$<br>
10\. $h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \dots + \theta_{10}x^{10}; d = 10$<br>

$d$ is the degree of polynomial, which you can think of as an extra parameter. If you choose which $d$ to use based on $J(\theta)$ values for the test set, then the parameter $d$ will have been fitted to the test set and you can't tell if it will generalise.

So, split your data into a training set, cross validation set, and test set (a typical ratio is 60:20:20). Then use the cross validation set to fit the parameter $d$ and check how well it generalises with the test set.
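
A minimal sketch of this selection procedure, assuming synthetic one-dimensional data generated from a quadratic and using `np.polyfit` as a stand-in for whatever fitting routine you actually use. The helper `half_mse` is hypothetical.

```python
import numpy as np

def half_mse(coeffs, x, y):
    """Half the mean squared error of a fitted polynomial on (x, y)."""
    residuals = np.polyval(coeffs, x) - y
    return float(np.mean(residuals ** 2)) / 2

# Synthetic data (an assumption for illustration): a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(0, 0.1, size=200)

# 60:20:20 split; the samples were drawn at random, so slicing is enough here.
x_train, x_cv, x_test = x[:120], x[120:160], x[160:]
y_train, y_cv, y_test = y[:120], y[120:160], y[160:]

fits, cv_errors = {}, {}
for d in range(1, 11):
    fits[d] = np.polyfit(x_train, y_train, deg=d)    # fit theta on the training set
    cv_errors[d] = half_mse(fits[d], x_cv, y_cv)     # score each d on the CV set

best_d = min(cv_errors, key=cv_errors.get)           # choose d using the CV error
test_error = half_mse(fits[best_d], x_test, y_test)  # report generalisation error
print(f"best d = {best_d}, J_cv = {cv_errors[best_d]:.4f}, J_test = {test_error:.4f}")
```

The test set is touched only once, after $d$ has been chosen, so the reported test error is not biased by the selection step.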

## Bias vs. Variance

TODO

## Building a Spam Classifier

### Prioritizing What to Work On

When building a spam classifier you need to decide how to represent the input features. You could look through a set of labelled training data and choose a list of the most common words present in spam and not-spam emails, and represent their presence or absence as 1 or 0 in a feature vector.  What would be the best way to spend your time to improve the accuracy of the classifier? Several options present themselves:

+ Collect lots of data.
+ Develop sophisticated features based on email routing information (from the email header).
+ Develop sophisticated features for the message body, e.g. should 'deal' and 'dealer' be considered the same word?
+ Develop sophisticated ways of processing your input data, like detecting deliberate misspellings designed by spammers to get around classifiers, e.g. 'w4tches'.

It's not obvious which of these is the most promising option.

### Error Analysis

Recommended approach:  

+ Start with a simple, quick-and-dirty algorithm that you can implement in about a day (~24 hrs). Implement it and test it on your cross-validation data.
+ Plot learning curves to decide if more data, more features, etc. are likely to help. This is better than spending ages on the initial implementation, because at that stage it's not clear what the most valuable use of your time will be (avoid premature optimisation). Get something working quickly, then let evidence guide where you spend your time.
+ Error analysis: Manually examine the examples (in the cross validation set) that your algorithm misclassified. See if you can spot any systematic trend in the type of examples it is making errors on. For example, a spam classifier might misclassify 100 emails, and when you look at them you can categorise them as:
    - Pharma: 12
    - Replica/fake: 4
    - Phishing: 53
    - Other: 31  
So you know you need to think about what features could help your algorithm classify phishing emails correctly.
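
Once the misclassified examples have been labelled by hand, tallying them is straightforward. The sketch below assumes the hand-assigned categories are stored as a plain list of strings and rebuilds the counts from the example above.

```python
from collections import Counter

# Hypothetical hand-assigned categories for the 100 misclassified emails.
categories = ["pharma"] * 12 + ["replica/fake"] * 4 + ["phishing"] * 53 + ["other"] * 31

counts = Counter(categories)
for category, count in counts.most_common():
    print(f"{category}: {count} ({count / len(categories):.0%})")

# The largest category is where new features are most likely to pay off.
top_category, top_count = counts.most_common(1)[0]
```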

It is very important to get error results as a single, numerical value (e.g. cross validation error). This is so you can compare the algorithm's performance under certain conditions with its performance under others. For example, you might not be sure whether or not stemming software would improve your algorithm's performance, so you need to try it with and without and compare the performance of the two.

## Handling Skewed Data

### Error Metrics for Skewed Classes

Say we had an algorithm for predicting whether or not a person has cancer, and this algorithm has 1% cross validation error. However, only 0.5% of patients actually have cancer, so on this metric an algorithm that never predicted cancer would outperform it with only 0.5% error. In cases like this, where we have skewed classes such that the number of positive cases is much smaller than the number of negative cases, we need to turn to alternative error metrics to evaluate our algorithms.

<table style="border:none;">
    <tr style="border:none;">
        <td style="border:none;"></td>
        <td style="border:none;"></td>
        <th colspan="2" style="text-align:center;">Actual class</th>
    </tr>
    <tr style="border:none;">
        <td style="border:none;"></td>
        <td style="border:none;"></td>
        <th style="text-align:center;">1</th>
        <th style="text-align:center;">0</th>
    </tr>
    <tr>
        <th rowspan="2" style="text-align:center;">Predicted<br>class</th>
        <th>1</th>
        <td>True positive</td>
        <td>False positive</td>
    </tr>
    <tr>
        <th>0</th>
        <td>False negative</td>
        <td>True negative</td>
    </tr>
</table>

By convention, $y = 1$ denotes the presence of the rare class that we want to detect.

**Precision**  
"Of all patients where we predicted $y = 1$, what fraction actually has cancer?"

$$
\frac{\text{# True positives}}{\text{# Predicted positives}} = \frac{\text{# True positives}}{\text{# True positives} + \text{# False positives}}
$$

**Recall**  
"Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?"

$$
\frac{\text{# True positives}}{\text{# Actual positives}} = \frac{\text{# True positives}}{\text{# True positives} + \text{# False negatives}}
$$

An algorithm that never predicted cancer would have a recall of 0, so we can see that it's not a good classifier.
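
To make this concrete, here is a small sketch that computes both metrics for a classifier that always predicts $y = 0$ on a hypothetical set of 1000 patients, 5 of whom have cancer. Returning 0 when a denominator is zero is a convention assumed for this example, not something the definitions above specify.

```python
import numpy as np

def precision_recall(predictions, labels):
    """Precision and recall for binary predictions, with y = 1 as the rare class."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    true_pos = int(np.sum((predictions == 1) & (labels == 1)))
    false_pos = int(np.sum((predictions == 1) & (labels == 0)))
    false_neg = int(np.sum((predictions == 0) & (labels == 1)))
    # Assumed convention: report 0 when the ratio would be 0 / 0.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical skewed data set: 5 positives out of 1000 (0.5%).
labels = np.zeros(1000, dtype=int)
labels[:5] = 1

never_predict = np.zeros(1000, dtype=int)  # always predicts y = 0
accuracy = float(np.mean(never_predict == labels))
precision, recall = precision_recall(never_predict, labels)
print(f"accuracy = {accuracy:.3f}, precision = {precision}, recall = {recall}")
```

The accuracy looks excellent at 99.5%, while the recall of 0 exposes that no cancer case is ever detected.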

### Trading Off Precision and Recall

Say we had a logistic regression classifier such that:

$$
\text{Predict } 1 \text{ if } h_\theta(x) \geq 0.5\\
\text{Predict } 0 \text{ if } h_\theta(x) < 0.5
$$

But then we decided that we wanted to only predict a positive case if we were extra confident, so we changed the threshold:

$$
\text{Predict } 1 \text{ if } h_\theta(x) \geq 0.7\\
\text{Predict } 0 \text{ if } h_\theta(x) < 0.7
$$

This would result in higher precision and lower recall. Now say we wanted to do the opposite:

$$
\text{Predict } 1 \text{ if } h_\theta(x) \geq 0.3\\
\text{Predict } 0 \text{ if } h_\theta(x) < 0.3
$$

This would result in higher recall and lower precision.

In general you can plot precision against recall as the threshold varies and the result will be a downwards curve (although its exact shape will depend on the details of the classifier).
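
The trade-off can be seen on a toy example. The ten hypothesis outputs and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper.

```python
import numpy as np

def precision_recall_at(probabilities, labels, threshold):
    """Precision and recall when predicting 1 iff h_theta(x) >= threshold."""
    predictions = np.asarray(probabilities) >= threshold
    positives = np.asarray(labels) == 1
    true_pos = int(np.sum(predictions & positives))
    predicted_pos = int(np.sum(predictions))
    actual_pos = int(np.sum(positives))
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical classifier outputs, sorted from most to least confident.
h = np.array([0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05])
y = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])

results = {t: precision_recall_at(h, y, t) for t in (0.3, 0.5, 0.7)}
for threshold, (precision, recall) in results.items():
    print(f"threshold {threshold}: precision = {precision:.3f}, recall = {recall:.3f}")
```

On this data, raising the threshold from 0.3 to 0.7 moves precision from about 0.71 up to 1.0 while recall drops from 1.0 to 0.6.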

Is there a way to automatically choose the threshold to make sure you're using the best algorithm?

The problem with using precision and recall as error metrics is that you no longer have a single numerical value you can use to compare algorithms. For example, it's not obvious which algorithm here performs the best:

<table>
    <tr>
        <td></td>
        <th>Precision (P)</th>
        <th>Recall (R)</th>
    </tr>
    <tr>
        <th>Algorithm 1</th>
        <td>0.5</td>
        <td>0.4</td>
    </tr>
    <tr>
        <th>Algorithm 2</th>
        <td>0.7</td>
        <td>0.1</td>
    </tr>
    <tr>
        <th>Algorithm 3</th>
        <td>0.02</td>
        <td>1.0</td>
    </tr>
</table>

How can we combine precision and recall into a single useful value? We could try averaging them, but that doesn't work well: a classifier that predicts $y = 1$ all the time gets a very high recall but a very low precision, and its average can still look respectable. In the table above, Algorithm 3 has the highest average ($0.51$) despite its precision of only $0.02$.

So instead we use the $F_1$ Score:

$$
F_1 \text{ Score } = 2 \frac{PR}{P + R}
$$

This is similar to taking the average, but it gives the lower value a higher weight (it is the harmonic mean of $P$ and $R$). In general:

$$
P = 0 \text{ OR } R = 0 \Rightarrow F_1 \text{ Score } = 0 \\
P = 1 \text{ AND } R = 1 \Rightarrow F_1 \text{ Score } = 1
$$

If we calculate the $F_1$ Score for the algorithms:

<table>
    <tr>
        <td></td>
        <th>Precision (P)</th>
        <th>Recall (R)</th>
        <th>$F_1$ Score</th>
    </tr>
    <tr>
        <th>Algorithm 1</th>
        <td>0.5</td>
        <td>0.4</td>
        <td>0.444</td>
    </tr>
    <tr>
        <th>Algorithm 2</th>
        <td>0.7</td>
        <td>0.1</td>
        <td>0.175</td>
    </tr>
    <tr>
        <th>Algorithm 3</th>
        <td>0.02</td>
        <td>1.0</td>
        <td>0.0392</td>
    </tr>
</table>

Then we can see that algorithm 1 is the one that we should use.
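
The table values can be reproduced with a short script. Treating the score as 0 when $P + R = 0$ is an assumption made here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """F1 score of precision and recall; defined as 0 when both are 0 (assumed)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs from the table above.
algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}

for name, (p, r) in algorithms.items():
    print(f"{name}: average = {(p + r) / 2:.3f}, F1 = {f1_score(p, r):.4f}")

best_by_f1 = max(algorithms, key=lambda name: f1_score(*algorithms[name]))
best_by_average = max(algorithms, key=lambda name: sum(algorithms[name]) / 2)
print(best_by_f1, best_by_average)
```

Ranking by the plain average would pick Algorithm 3, whereas ranking by $F_1$ picks Algorithm 1.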

## Using Large Data Sets

### Data for Machine Learning

Under certain conditions the following saying holds true:

> *It's not who has the best algorithm that wins. It's who has the most data.*

What conditions are these?

Assume feature $x \in \mathbb{R}^{n + 1}$ has sufficient information to predict $y$ accurately. (A useful test: Given the input $x$, can a human expert confidently predict $y$?)

If you use a learning algorithm with many parameters (e.g. logistic regression / linear regression with many features; a neural network with many hidden units) then this algorithm will have low bias, and so $J_{\text{train}}(\theta)$ will be low.

If you also have a large training set, then your algorithm will be unlikely to overfit, i.e. it will have low variance, and so $J_{\text{train}}(\theta) \approx J_{\text{test}}(\theta)$.

Combined, this should give you an effective algorithm with low bias and low variance.