# 1. Introduction

Week 6 is about systematically designing and improving learning algorithms. Specifically Week 6 covers techniques and tips to:

1. identify when a learning algorithm is doing poorly, and describe the 'best practices' for how to 'debug' the learning algorithm.


2. optimize a machine learning algorithm, including understanding where the biggest improvements can be made and in particular how to understand the performance of a machine learning system with multiple parts, and also how to deal with skewed data; and


3. design effective machine learning systems.

# 2. Debugging a Learning Algorithm

There are various options if a machine learning algorithm returns unacceptably large errors.  These include:

1. Acquiring more training samples.


2. Trying a smaller set of features.


3. Trying a larger set of additional features.


4. Adding polynomial features, e.g. $x_1^2$, $x_2^2$, $x_1 x_2$ etc.


5. Increasing $\lambda$.


6. Decreasing $\lambda$

Rather than randomly trying one, more or all of the above, there are diagnostic techniques to guide you to the appropriate selection of tweaks necessary to improve performance.


# 3. Evaluating a Learning Algorithm

Simply minimising the cost function for the training set may not mean the algorithm is optimised.  In fact, it may mean the algorithm is optimised <b>only</b> for <i>the training set</i>.  

In other words, it may be <b>overfitted</b> to the training set.

To deploy ML in real life we need to evaluate <b>whether the algorithm generalizes from the training set to other data</b>.  

To do so we typically split the initial dataset into different categories of subsets and apply techniques to adjust the algorithm in line with how well it performs over these different subsets.

## 3.1. Splitting into (a) Training Set + (b) Test Set

A simple way to begin evaluating a learning algorithm is to split the dataset into:

1. A <b>Training Set</b>, $(x^{(i)}_{train}, y^{(i)}_{train})$; and


2. A <b>Test Set</b>, $(x^{(i)}_{test}, y^{(i)}_{test})$.

The former is the dataset subset on which the algorithm is <i>trained</i>.  The latter is the dataset subset on which the algorithm is <i>tested</i> to assess its performance.

### 3.1.1. Applying a Train vs. Test split: Linear Regression

Simply do the following:

1. Split the dataset into a <b>Training</b> dataset and a Test dataset.


2. Learn the parameter(s) $\theta$ from the training data.


3. Compute the error for the <b>Test</b> dataset.

For instance, if the algorithm is overfitting the training set we would expect a low error for the training set but a high error for the test set.
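The three steps above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions: the data is synthetic, the helper names (`train_test_split`, `fit_linear_regression`, `squared_error_cost`) are made up for this example, and the parameters are learned with a least-squares solver rather than gradient descent.

```python
import numpy as np

def train_test_split(X, y, train_fraction=0.7, seed=0):
    """Shuffle the examples, then split them into training and test subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    cut = int(train_fraction * len(y))
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def add_intercept(X):
    """Prepend a column of ones so theta[0] acts as the intercept."""
    return np.column_stack([np.ones(len(X)), X])

def fit_linear_regression(X, y):
    """Learn theta by least squares (an assumption; gradient descent also works)."""
    theta, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return theta

def squared_error_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared residuals over the given subset."""
    residuals = add_intercept(X) @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (hypothetical): y = 2x + 1 plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=100)

X_train, y_train, X_test, y_test = train_test_split(X, y)   # step 1
theta = fit_linear_regression(X_train, y_train)             # step 2
j_train = squared_error_cost(theta, X_train, y_train)
j_test = squared_error_cost(theta, X_test, y_test)          # step 3
print(f"J_train = {j_train:.4f}, J_test = {j_test:.4f}")
```

A test error far above the training error would be the overfitting signal described above.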

### 3.1.2. Applying a Train vs. Test Split: Logistic Regression

Simply do the following:

1. Split the dataset into a <b>Training</b> dataset and a Test dataset.


2. Learn the parameter(s) $\theta$ from the training data.


3. Compute the error for the <b>Test</b> dataset.


4. Apply misclassification error (AKA 0/1 classification error):

\begin{align}
err(h_\theta(x), y) = \begin{cases} 1 & \text{if } (h_\theta(x) \geq 0.5 \text{ and } y = 0) \text{ or } (h_\theta(x) < 0.5 \text{ and } y = 1) \\ 0 & \text{otherwise} \end{cases}
\end{align}

5. The misclassification error:

    (a) returns $1$ if the prediction is wrong, i.e. if $h_{\theta}(x) \geq 0.5$ when $y = 0$, or if $h_{\theta}(x) < 0.5$ when $y = 1$; and
   
    (b) returns $0$ in all other cases. 
    
    
6.  The average misclassification error can be found as follows, which returns the proportion of the test data that was misclassified:

\begin{align}
\text{Test Error} = \frac{1}{m_{test}}\sum_{i = 1}^{m_{test}}err(h_\theta(x^{(i)}_{test}), y^{(i)}_{test}) 
\end{align}
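As a rough sketch, the average misclassification error can be computed like this. The function names, the toy parameter vector and the four test examples are all hypothetical, and the first column of `X_test` is assumed to be the intercept term.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, giving h_theta(x) for logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def misclassification_error(theta, X, y, threshold=0.5):
    """Proportion of examples whose thresholded prediction differs from y."""
    probabilities = sigmoid(X @ theta)
    predictions = (probabilities >= threshold).astype(int)
    return float(np.mean(predictions != y))

# Toy example (hypothetical values).
theta = np.array([0.0, 1.0])
X_test = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_test = np.array([0, 1, 1, 1])

test_error = misclassification_error(theta, X_test, y_test)
print(test_error)  # 0.25: only the second example is misclassified
```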


### 3.1.3. What is the ideal Train vs. Test split?

There are no hard and fast rules.  As a best practice, a $70:30$ split is usually adopted, whereby $70\%$ of the dataset forms the Training Set and the remaining $30\%$ forms the Test Set.

## 3.2. Model Selection & Train / Validation / Test Sets

### 3.2.1. The Context

When evaluating the performance of <b>several different</b> models, e.g. to identify which degree ($d$) of polynomial to fit to the data, you will do the following:

1. Define each polynomial function from $d = 1$ to $d = n$, i.e. your list of potential models.


2. For each polynomial function learn the parameter(s) $\Theta$, e.g. $\Theta^{(1)}, \Theta^{(2)}$ and so on.


3. For each set of $\Theta$ values, evaluate the cost function on the test set, e.g. $J_{test}(\Theta^{(1)})$ and so on.


4. Identify which polynomial degree best fits the data, i.e. has the lowest test error.

### 3.2.2. The Problem

However, the above does not tell you whether the chosen polynomial model generalises well, or at all.  Because the degree $d$ was itself chosen using the test set, the test error is an optimistic estimate: it only tells you which model is best for these particular train / test sets. 

### 3.2.3. The Solution

To better assess how well the chosen polynomial generalises we must split the dataset into <b>three</b> subsets rather than two.  Concretely this means we:

1. Split $60\%$ of the dataset into a <b>Training Set</b>;


2. Split $20\%$ of the dataset into a <b>Cross Validation Set</b> ("$cv$"); and


3. Split $20\%$ of the dataset into a <b>Test Set</b>.

Pulling that all together, we end up evaluating the error over three separate subsets of the total dataset, containing $m$, $m_{cv}$ and $m_{test}$ examples respectively:

* $J_{train}(\theta) = \frac{1}{2m} \sum_{i = 1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$


* $J_{cv}(\theta) = \frac{1}{2m_{cv}} \sum_{i = 1}^{m_{cv}}(h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)})^2$


* $J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i = 1}^{m_{test}}(h_\theta(x_{test}^{(i)}) - y_{test}^{(i)})^2$

Using this <b>three-way</b> split we repeat steps $1 - 4$ described under "The Context" above.  However, this time we evaluate the models using the <b>cross validation</b> set instead of the test set.  

Once the model has been selected after evaluating performance on the cross validation set, we then evaluate the error function of the selected model against the <b>test set</b>.

We usually expect $J_{cv}(\theta)$ to be <b>lower</b> than $J_{test}(\theta)$ because an extra parameter $d$ (i.e. the degree of polynomial) has been fit to the cross validation set. 
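The selection procedure above might look like the following in NumPy. Treat it as a sketch under assumptions: the data is synthetic (a noisy quadratic), the helper names are invented for this example, the candidate degrees $1$ to $8$ are arbitrary, and each model is fit by least squares.

```python
import numpy as np

def polynomial_features(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit(x, y, degree):
    """Learn theta for one polynomial degree by least squares."""
    theta, *_ = np.linalg.lstsq(polynomial_features(x, degree), y, rcond=None)
    return theta

def cost(theta, x, y):
    """Squared-error cost of a fitted polynomial on the given subset."""
    residuals = polynomial_features(x, len(theta) - 1) @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (hypothetical): a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.3, size=200)

# 60 / 20 / 20 split; the samples are already in random order.
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:160], y[120:160]
x_test, y_test = x[160:], y[160:]

degrees = range(1, 9)
thetas = {d: fit(x_train, y_train, d) for d in degrees}        # learn on train
cv_errors = {d: cost(thetas[d], x_cv, y_cv) for d in degrees}  # compare on cv
best_degree = min(cv_errors, key=cv_errors.get)                # select d
j_test = cost(thetas[best_degree], x_test, y_test)             # report on test
print(f"best degree = {best_degree}, J_test = {j_test:.4f}")
```

The test set is touched only once, after the degree has been chosen, so `j_test` is not biased by the selection step.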

# 4. Bias vs. Variance

Most ML problems are due to having one or both of:

1. <b>high variance</b>, i.e. <b>overfitting</b>; and / or


2. <b>high bias</b>, i.e. <b>underfitting</b>.

<img src="../Images/fitting.png" width=90%/>

## 4.1. What are Bias and Variance?

### 4.1.1. Bias

* Bias = difference between: (a) predicted values; and (b) actual values.  


* If the average predicted values are <b>far off</b> the actual values then there is <b>high bias</b>.


* If the average predicted values are <b>close</b> to the actual values then there is <b>low bias</b>.


* High bias causes the algorithm to miss relevant relationships between the input and output variables, implying the model is <b>too simple</b> and thus <b>underfits</b> the data.

### 4.1.2. Variance

* Variance = model performs well on trained dataset but not well on a dataset on which it has not been trained, e.g. a test or validation dataset.


* If the predicted values are <b>widely scattered</b>, i.e. they change a lot when the model is trained on a different sample of the data, then there is <b>high variance</b>.


* If the predicted values are <b>tightly clustered</b>, i.e. they change little from one training sample to another, then there is <b>low variance</b>.


* High variance causes <b>overfitting</b>: the algorithm ends up modelling the random noise present in the training data.

### 4.1.3. Bias / Variance Combinations

1. <b>High Bias / Low Variance:</b> models consistent but inaccurate on average.


2. <b>High Bias / High Variance:</b> models inconsistent and inaccurate on average.


3. <b>Low Bias / Low Variance:</b> models accurate and consistent on average.  This is what we should aim to achieve.


4. <b>Low Bias / High Variance:</b> models somewhat accurate but inconsistent on average.  A small change in the data can cause a large error.

## 4.2. Diagnosing Bias vs. Variance

Given that bias and variance are the most common causes of machine learning problems we need to diagnose whether our algorithm suffers from one or both.

To do so we can use <b>learning curves</b> to diagnose our algorithm:

<img src="../Images/VarianceBias.png" width=80%/>

In other words:

1. If $J_{train}(\theta)$ is <b>high</b> and $J_{cv}(\theta)$ is approximately equal to $J_{train}(\theta)$ then there is <b>high bias</b>.  


2. Conversely, if $J_{train}(\theta)$ is <b>low</b> and $J_{cv}(\theta)$ is <b>much greater than</b> $J_{train}(\theta)$ then there is <b>high variance</b>.


3. Therefore, we want something roughly in the middle of the two graphs.

## 4.3. Regularization + Bias / Variance

Recall that regularisation is a way to correct <b>overfitting</b> by adding a penalty on the size of the parameters, weighted by $\lambda$, thereby:

* reducing the model's likelihood of fitting the noise in the training data; and 


* in turn improving the algorithm's generalisation abilities.

We can use the techniques above plus intuitions about the effect of $\lambda$ on bias and variance to determine whether we need to increase or decrease $\lambda$ in order to optimise the algorithm.

### 4.3.1. Automating the Choice of $\lambda$

1. Split the dataset into three subsets as follows:

    (a) $60\%$ of the dataset into a <b>Training Set</b>;
    
    (b) $20\%$ of the dataset into a <b>Cross Validation Set</b> ("$cv$"); and

    (c) $20\%$ of the dataset into a <b>Test Set</b>.


2. Arbitrarily select a number of $\lambda$ values, e.g. $0, 0.01, 0.02, 0.04, 0.08 ... 10.24$.  

    <b>Note:</b> Andrew Ng usually doubles each $\lambda$ value when setting potential initial $\lambda$ values.


3. Create a set of models with different degrees of polynomials or any other variants.


4. Iterate through each model + $\lambda$ combination to learn the resulting $\theta$ values.


5. For each resulting $\theta$, compute $J_{cv}(\theta)$ using the <b>non-regularised</b> version of the cost function (i.e. with $\lambda = 0$) to obtain the error of each model + $\lambda$ combination on the cross-validation set.  


6. Take the $\theta$ values learned by the best model + $\lambda$ combination from step (5) (i.e. the one with the lowest error on the cross validation set).


7. Compute $J_{test}(\theta)$ using the <b>non-regularised</b> version of the cost function to check whether $\theta$ generalises well to the problem.
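A compact sketch of this procedure is shown below. It makes several simplifying assumptions: the data is synthetic, only one model (a degree-8 polynomial) is tried rather than several, and $\theta$ is learned with the closed-form regularised normal equation (leaving the intercept unpenalised) instead of an iterative optimiser. All function and variable names are illustrative.

```python
import numpy as np

def polynomial_features(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularised(X, y, lam):
    """Regularised least squares; the intercept (column 0) is not penalised."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def cost(theta, X, y):
    """Non-regularised squared-error cost, used for the cv and test errors."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (hypothetical): a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(0, 0.2, size=100)

# Step 1: 60 / 20 / 20 split (the samples are already in random order).
X = polynomial_features(x, degree=8)
X_train, y_train = X[:60], y[:60]
X_cv, y_cv = X[60:80], y[60:80]
X_test, y_test = X[80:], y[80:]

# Step 2: candidate lambdas 0, 0.01, 0.02, ..., 10.24 (doubling each time).
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]

# Steps 4-6: learn theta per lambda, then pick the lowest cv error.
thetas = {lam: fit_regularised(X_train, y_train, lam) for lam in lambdas}
cv_errors = {lam: cost(thetas[lam], X_cv, y_cv) for lam in lambdas}
best_lambda = min(cv_errors, key=cv_errors.get)

# Step 7: report the error of the chosen theta on the test set.
j_test = cost(thetas[best_lambda], X_test, y_test)
print(f"best lambda = {best_lambda}, J_test = {j_test:.4f}")
```

Note that $\lambda$ only appears when learning $\theta$; both `cv_errors` and `j_test` use the plain, non-regularised cost.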

### 4.3.2. Bias / Variance as a function of the Regularisation Parameter

1. If $J_{train}(\theta)$ is <b>high</b> and $J_{cv}(\theta)$ is approximately equal to $J_{train}(\theta)$ then:

   (a) there is <b>high bias</b> / <b>underfitting</b>; and 
    
   (b) $\lambda$ is <b>too large</b>.  


2. Conversely, if $J_{train}(\theta)$ is <b>low</b> and $J_{cv}(\theta)$ is <b>much greater than</b> $J_{train}(\theta)$ then:

    (a) there is <b>high variance</b> / <b>overfitting</b>; and 
    
    (b) $\lambda$ is <b>too small</b>.

<img src="../Images/modelLambdaSelection.png" width=80%/>

# 5. Learning Curves

Training an algorithm on a very small number of data points (such as 1, 2 or 3) will easily produce zero error because we can always find a quadratic curve that passes exactly through that many points. 

Hence:

* as the training set gets larger, the training error for a quadratic function increases (i.e. because it becomes harder and harder to fit a single curve to all the training samples in the dataset); and


* the error value will plateau after a certain training set size $m$, while the additional samples help the model generalise better.

Learning curves (as demonstrated above) are a tool to diagnose if a learning algorithm is suffering from bias, variance or a bit of both!
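To make this concrete, the errors behind a learning curve can be computed by refitting the model on the first $m$ training examples for increasing $m$. The sketch below assumes synthetic data, a quadratic model fit by least squares, and illustrative function names; plotting `train_errors` and `cv_errors` against `sizes` would give the curve itself.

```python
import numpy as np

def fit(X, y):
    """Least-squares fit (minimum-norm solution when there are few examples)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def cost(theta, X, y):
    """Squared-error cost on the given subset."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

def learning_curve(X_train, y_train, X_cv, y_cv, sizes):
    """Training and cross validation error for each training set size m."""
    train_errors, cv_errors = [], []
    for m in sizes:
        theta = fit(X_train[:m], y_train[:m])
        # Training error is measured only on the m examples used for fitting.
        train_errors.append(cost(theta, X_train[:m], y_train[:m]))
        # Cross validation error is always measured on the full cv set.
        cv_errors.append(cost(theta, X_cv, y_cv))
    return train_errors, cv_errors

# Synthetic data (hypothetical): a noisy quadratic, fit with a quadratic model.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.3, size=100)
X = np.vander(x, 3, increasing=True)  # columns 1, x, x^2

X_train, y_train = X[:60], y[:60]
X_cv, y_cv = X[60:], y[60:]

sizes = range(1, 61)
train_errors, cv_errors = learning_curve(X_train, y_train, X_cv, y_cv, sizes)
print(train_errors[:3])   # essentially zero for m = 1, 2, 3
print(train_errors[-1], cv_errors[-1])
```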

## 5.1. High Bias / Underfitting

### 5.1.1. Context

Occurs when:

1. <b>Low training set size:</b> causes $J_{train}(\Theta)$ to be <b>low</b> and $J_{CV}(\Theta)$ to be <b>high</b>.


2. <b>Large training set size:</b> causes both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ to be <b>high</b> with $J_{train}(\Theta) \approx J_{CV}(\Theta)$.

### 5.1.2. Solution

<img src="../Images/highBias2.png" width=80%/>

## 5.2. High Variance / Overfitting

### 5.2.1. Context

Occurs when:

1. <b>Low training set size:</b> $J_{train}(\Theta)$ will be <b>low</b> and $J_{CV}(\Theta)$ will be <b>high</b>.


2. <b>Large training set size:</b> $J_{train}(\Theta)$ increases with training set size and $J_{CV}(\Theta)$ continues to decrease without levelling off.  However, the difference between them remains significant.


### 5.2.2. Solution

<img src="../Images/highVariance2.png" width=80%/>

# 6. Summary (of the above)

When assessing machine learning algorithms our decision process can be broken down as follows:

## 6.1. Generally

<img src="../Images/summaryBiasVariance.png" width=80%/>

## 6.2. Neural Networks

<b>A neural network with fewer parameters:</b>

* Prone to underfitting. 

* Computationally cheaper.


<b>A large neural network with more parameters:</b> 

* Prone to overfitting. 

* Computationally expensive. 

* In this case you can use regularization (increase λ) to address the overfitting.


<b>Number of Layers?</b>

* Using a single hidden layer is a good starting default. 

* You can train neural networks with different numbers of hidden layers and evaluate each one on your cross validation set. 

* You can then select the one that performs best.

## 6.3. Model Complexity Effects

* <b>Lower-order polynomials (low model complexity):</b> have high bias and low variance. In this case, the model fits poorly consistently.


* <b>Higher-order polynomials (high model complexity):</b> fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.


* In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.

## 6.4. The Power of Large Data

High Variance and High Bias can in theory be avoided if you:

1. Use a learning algorithm with <b>many parameters</b> (e.g. logistic regression / linear regression with many features, or a neural network with many hidden units).  This:

    (a) creates low bias algorithms, reducing the chance of underfitting; and
    
    (b) means $J_{train}(\theta)$ will be small.


2. Use a very large training set.  This:

    (a) creates low variance algorithms, reducing the chance of overfitting; and
    
    (b) means $J_{train}(\theta) \approx J_{test}(\theta)$ and therefore $J_{test}(\theta)$ will also be small.
   

# 7. Error Analysis

## 7.1. Manual Error Analysis

The recommended approach to solving machine learning problems is to:

1. Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.


2. Plot learning curves to decide if more data, more features, etc. are likely to help.


3. Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

For example:

* Assume that we have 500 emails and our algorithm misclassifies 100 of them. 


* We could manually analyze the 100 emails and categorize them based on what type of emails they are. 


* We could then try to come up with new cues and features that would help us classify these 100 emails correctly. 

* Hence, if most of our misclassified emails are those which try to steal passwords, then we could find some features that are particular to those emails and add them to our model. 

## 7.2. "Accuracy" is not enough

For instance, a system makes $1,000$ predictions like so to determine whether a plane passenger is a terrorist (<b>positive</b>) or a non-terrorist (<b>negative</b>):

<img src="../Images/confusionMatrix2.png" width=80%/>

On the one hand we can say that the system correctly predicted 998 / 1000 results and is therefore 99.8% accurate.  Instinctively most people assume this is incredible, and the only measure of success their algorithm needs.

On the other hand, the one terrorist the system missed represents 50% of the total terrorists, i.e. the individual predicted as "Negative" who was in fact "Positive" (see lower left corner of the above).  Much less impressive, and much more alarming!

This neatly illustrates why the better, and indeed first, question is not whether the system is "accurate" but specifically whether there is a need for:

1. <b>High Recall:</b> i.e. there is a <b>high</b> cost for missing an actual positive (e.g. a terrorist, cancer diagnosis or someone infected with a zombie virus!) and a <b>low</b> cost of mislabelling an actual negative as a positive (e.g. saying someone is a terrorist who is in fact not a terrorist);

    or


2. <b>High Precision:</b> i.e. there is a <b>low</b> cost for missing an actual positive (e.g. a criminal in relation to a capital offence) and a <b>high</b> cost of mislabelling an actual negative as a positive (e.g. saying an innocent person is guilty of a capital offence).



These issues become readily apparent in situations where the number of observations belonging to one class is significantly lower than the number belonging to the other classes (as in the above diagram), also known as <b>imbalanced classification problems</b>.  

Real life examples include:

* Fraud detection


* Terrorism detection


* Rare disease detection


* Zombie infection

## 7.3. Precision & Recall

### 7.3.1. Precision

* <b>Intuition:</b> Of all the examples we <i>predicted</i> as positive, what fraction are actually positive?  


* <b>Mathematically:</b> precision = $\frac{\text{true positives}}{\text{true positives + false positives}}$

### 7.3.2. Recall (AKA "Sensitivity")

* <b>Intuition:</b> Of all the <i>actual</i> positives, what fraction did we correctly predict as positive?  


* <b>Mathematically:</b> recall = $\frac{\text{true positives}}{\text{true positives + false negatives}}$
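Both definitions can be computed directly from the predicted and actual labels. The following is a small sketch with made-up labels; the function names are illustrative, and returning $0$ when a denominator is zero is a convention chosen for this example rather than part of the definitions.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn) if tp + fn else 0.0

# Made-up labels: 3 actual positives out of 10 examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)      # 2 1 1 6
print(precision(tp, fp))   # 2 / 3
print(recall(tp, fn))      # 2 / 3
```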

### 7.3.3. Confusion Matrix

The resulting counts of true positives, false positives, true negatives and false negatives are typically represented in a confusion matrix like so:

<img src="../Images/confusionMatrix1.png" width=90%/>

## 7.4. Trading off precision and recall

### 7.4.1. Increasing Precision

* <b>Why?</b>: where a false positive has a high cost, e.g. predicting someone guilty of a capital offence when in fact they are innocent.


* <b>How?</b>: raising the decision threshold, e.g. changing "predict 1 if $h_\theta(x) \geq 0.5$" to "predict 1 if $h_\theta(x) \geq 0.7$".

However, increasing precision reduces recall.

### 7.4.2. Increasing Recall

* <b>Why?</b>: where a false negative has a high cost, e.g. predicting someone does not have cancer when in fact they do have cancer.


* <b>How?</b>: lowering the decision threshold, e.g. changing "predict 1 if $h_\theta(x) \geq 0.7$" to "predict 1 if $h_\theta(x) \geq 0.5$". 

However, increasing recall reduces precision.
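The trade-off can be seen on a toy example. The predicted probabilities and labels below are invented purely for illustration, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
import numpy as np

def precision_recall_at(threshold, probabilities, y_true):
    """Precision and recall when predicting 1 if h(x) >= threshold."""
    predictions = (probabilities >= threshold).astype(int)
    tp = int(np.sum((predictions == 1) & (y_true == 1)))
    fp = int(np.sum((predictions == 1) & (y_true == 0)))
    fn = int(np.sum((predictions == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented model outputs h(x) and the corresponding actual labels.
probabilities = np.array([0.95, 0.85, 0.75, 0.65, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])

p_low, r_low = precision_recall_at(0.5, probabilities, y_true)
p_high, r_high = precision_recall_at(0.7, probabilities, y_true)
print(p_low, r_low)    # precision 4/6, recall 4/5
print(p_high, r_high)  # precision 3/3, recall 3/5
```

Raising the threshold from $0.5$ to $0.7$ removes both false positives but also drops one true positive, so precision rises while recall falls.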

## 7.5. Measuring Precision and Recall

* Precision and Recall should be measured on the <b>cross validation</b> set.  


* The value of the threshold should be that which maximises the $F_1$ score, i.e. results in the highest value.

## 7.6. Choosing the Appropriate Precision / Recall Ratio

<b>Problem:</b> If we are evaluating precision and recall scores for different algorithms we need to then evaluate which algorithm produces the optimal ratio of precision and recall.

<b>Solution:</b> the $F_1$ Score.  Mathematically, where $P$ is precision and $R$ is recall, this is:

\begin{align}
F_1 = 2 \frac{PR}{P + R}
\end{align}

<b>Difference between $F_1$ score and Accuracy:</b> as described in 7.2 above, a high "accuracy" score can be driven largely by a large number of True Negatives, which in most business circumstances we do not focus on, whereas False Negatives and False Positives usually have business consequences.  Therefore the $F_1$ score is potentially a better measure if we need to balance Precision and Recall and there is an uneven class distribution, e.g. a large number of actual negatives.
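The sketch below illustrates both points with invented numbers: first, a classifier that always predicts "negative" on skewed data scores a high accuracy but an $F_1$ of zero; second, a decision threshold can be chosen by maximising $F_1$ on the cross validation set. The helper names, the data and the candidate thresholds are all assumptions made for this example.

```python
import numpy as np

def f1_score(precision, recall):
    """F1 = 2PR / (P + R), defined as 0 when both P and R are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def scores_at(threshold, probabilities, y_true):
    """Return (precision, recall, accuracy) at the given decision threshold."""
    predictions = (probabilities >= threshold).astype(int)
    tp = int(np.sum((predictions == 1) & (y_true == 1)))
    fp = int(np.sum((predictions == 1) & (y_true == 0)))
    fn = int(np.sum((predictions == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = float(np.mean(predictions == y_true))
    return precision, recall, accuracy

# (1) Skewed classes: 2 positives in 1,000 and a model that never predicts 1.
y_skewed = np.zeros(1000, dtype=int)
y_skewed[:2] = 1
always_negative = np.zeros(1000)
p, r, accuracy = scores_at(0.5, always_negative, y_skewed)
f1_always_negative = f1_score(p, r)
print(accuracy, f1_always_negative)  # 0.998 accuracy, but F1 = 0.0

# (2) Choose the threshold with the highest F1 on an invented cv set.
cv_probabilities = np.array([0.95, 0.85, 0.75, 0.65, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])
cv_labels = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]

f1_by_threshold = {
    t: f1_score(*scores_at(t, cv_probabilities, cv_labels)[:2]) for t in thresholds
}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)
print(best_threshold, f1_by_threshold[best_threshold])  # 0.4, F1 = 5/6
```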

# 8. Useful Resources

- http://scott.fortmann-roe.com/docs/BiasVariance.html


- https://medium.com/datadriveninvestor/bias-and-variance-in-machine-learning-51fdd38d1f86


- https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c


- https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9