# The best way to spend time

- Get more training data
- Smaller set of features (prevent overfitting)
- Additional features 
- Polynomial features
- Modifying the regularization parameter

__MACHINE LEARNING DIAGNOSTIC__

A test that, when run on a learning algorithm, gives insight into what is and is not working and how best to improve its performance.

Diagnostics can take time to implement, but they are a very good use of time.

# Evaluating a hypothesis

A hypothesis with a low training error is not necessarily a good one: it may be overfitting.

With a high number of features, plotting the hypothesis to inspect it is no longer a trivial task.

Split the dataset into training & testing data, typically 70% training and 30% testing. If the data is not randomly ordered, shuffle it before splitting.
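A minimal sketch of the shuffle-then-split step, using only the Python standard library. The function name `train_test_split`, the `seed` argument and the toy data are assumptions made for illustration; they are not part of the notes.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `examples`, then split it into (train, test)."""
    shuffled = list(examples)              # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # shuffle BEFORE splitting
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: 10 (x, y) pairs sorted by x, so the shuffle matters.
data = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(data)
```

Fixing the seed keeps the split reproducible, so errors measured on different runs stay comparable.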

## Train/testing procedure

__(lin-reg)__

1. Learn $\theta$ from training $(\min J(\theta))$
2. Compute test error: $J_{test} = \frac{1}{2m_{test}} \sum\limits_{i=1}^{m_{test}}\left( h_{\theta}(x^{(i)}_{test}) - y^{(i)}_{test}\right)^2$
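As a rough illustration of step 2, the squared-error test cost can be computed as below. The single-feature hypothesis, the parameter values and the test points are all hypothetical; in practice $\theta$ would come from minimizing $J(\theta)$ on the training set.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x) = theta[0] + theta[1] * x (one feature assumed)."""
    return theta[0] + theta[1] * x

def squared_error_cost(theta, xs, ys):
    """J = 1/(2m) * sum((h_theta(x) - y)^2) over the given examples."""
    m = len(xs)
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Hypothetical parameters, assumed to have been learned on the training set.
theta = (1.0, 2.0)
x_test = [0.0, 1.0, 2.0]
y_test = [1.0, 3.5, 4.0]
j_test = squared_error_cost(theta, x_test, y_test)
```

The same function gives the training error when it is called with the training examples instead.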

__(logist-reg)__

1. Learn $\theta$ from training $(\min J(\theta))$
2. Compute test error: $J_{test} = -\frac{1}{m_{test}} \sum\limits_{i=1}^{m_{test}}\left(y^{(i)}_{test} \text{log}\:h_{\theta}(x^{(i)}_{test}) + (1 - y^{(i)}_{test})\text{log}\left(1 - h_{\theta}(x^{(i)}_{test})\right) \right)$
3. Misclassification error (0/1 misclassification error):

$ err(h_{\theta}(x), y) = \begin{cases} 
1 & \mbox{if } h_{\theta}(x) \geqslant 0.5 \mbox{ and } y=0 \mbox{ (hypothesis $\rightarrow$ 1, target = 0)}\\ 
  & \mbox{or if } h_{\theta}(x) < 0.5 \mbox{ and } y=1 \mbox{ (hypothesis $\rightarrow$ 0, target = 1)}\\
0 & \mbox{otherwise}\\  
\end{cases} $

$\mbox{Test error} = \frac{1}{m_{test}} \sum\limits^{m_{test}}_{i=1} err(h_{\theta}(x^{(i)}_{test}), y^{(i)}_{test})$
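Both logistic-regression test metrics can be sketched in a few lines. This is an illustrative version only: the parameter vector, the test examples and the convention that `x[0]` is the constant bias feature 1 are assumptions, not taken from the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta . x); x[0] is assumed to be the bias feature 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def logistic_cost(theta, xs, ys):
    """Cross-entropy cost: -1/m * sum(y*log(h) + (1 - y)*log(1 - h))."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = hypothesis(theta, x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(xs)

def misclassification_error(theta, xs, ys):
    """Fraction of examples whose thresholded prediction differs from the target."""
    wrong = 0
    for x, y in zip(xs, ys):
        predicted = 1 if hypothesis(theta, x) >= 0.5 else 0
        wrong += int(predicted != y)
    return wrong / len(xs)

# Hypothetical parameters and test set.
theta = (-1.0, 2.0)
x_test = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 0.2)]
y_test = [0, 1, 1, 1]
j_test = logistic_cost(theta, x_test, y_test)
test_error = misclassification_error(theta, x_test, y_test)
```

The misclassification error is often easier to interpret than the cost, since it is simply the fraction of test examples labelled incorrectly.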

# Model Selection

Degree of polynomial:

1. $h_{\theta}(x) = \theta_0 + \theta_1x$
2. $h_{\theta}(x) = \theta_0 + \theta_1x + \theta_2x^2$
3. $h_{\theta}(x) = \theta_0 + \theta_1x + \dots +\theta_3x^3$
4. $\dots$
5. $h_{\theta}(x) = \theta_0 + \theta_1x + \dots +\theta_{10}x^{10}$

The degree can be treated as one additional parameter $\rightarrow$ $d = \mbox{degree of polynomial}$

Choosing $d$ by how well it fits the training set is not suitable: the goal is a hypothesis that generalizes well, not the one with the optimal fit to the training set.

__Method__

1. Split the data into training (60%), cross-validation (cv) (20%) and testing (20%) sets.
2. For each degree $d$, learn $\theta$ on the training set and compute the errors:
    - $J_{train} = \frac{1}{2m} \sum\limits_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)}\right)^2$
    - $J_{cv} = \frac{1}{2m_{cv}} \sum\limits_{i=1}^{m_{cv}}\left( h_{\theta}(x^{(i)}_{cv}) - y^{(i)}_{cv}\right)^2$
    - $J_{test} = \frac{1}{2m_{test}} \sum\limits_{i=1}^{m_{test}}\left( h_{\theta}(x^{(i)}_{test}) - y^{(i)}_{test}\right)^2$
3. Select the hypothesis with the smallest cross-validation error, then use $J_{test}$ to estimate how well it generalizes.
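The method above can be sketched with NumPy. Everything concrete here is an assumption made for illustration: the synthetic quadratic data, the use of `np.polyfit` as the learner, and the range of degrees 1 to 10.

```python
import numpy as np

def cost(coeffs, x, y):
    """Squared-error cost 1/(2m) * sum((h(x) - y)^2) for a polynomial hypothesis."""
    residual = np.polyval(coeffs, x) - y
    return float(residual @ residual) / (2 * len(x))

# Synthetic data (hypothetical): a noisy quadratic, already in random order.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=100)

# 60% training, 20% cross-validation, 20% testing.
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]
x_test, y_test = x[80:], y[80:]

# Fit one hypothesis per degree d on the training set only.
results = {}
for d in range(1, 11):
    coeffs = np.polyfit(x_train, y_train, deg=d)
    results[d] = (coeffs, cost(coeffs, x_train, y_train), cost(coeffs, x_cv, y_cv))

# Pick d by cross-validation error; the test set is touched only once, at the end.
best_d = min(results, key=lambda d: results[d][2])
j_test = cost(results[best_d][0], x_test, y_test)
```

Because $d$ is chosen using the cross-validation set, $J_{cv}$ for the selected model is an optimistic estimate; $J_{test}$ is the one to report.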


# Bias vs Variance

Most of the time, when a learning algorithm underperforms, the cause is one of:
- High Bias $\rightarrow$ Underfitting
- High Variance $\rightarrow$ Overfitting

Knowing which of the two applies is important for deciding how to improve the learning algorithm.

## Diagnosing bias/variance

Plot the training and cross-validation errors against the polynomial degree of the hypothesis:

<img src="bias-variance.png">

- High Bias (Underfit):
    - $J_{train}(\theta) \rightarrow$ high
    - $J_{cv}(\theta) \approx J_{train}(\theta)$
- High Variance (Overfit):
    - $J_{train}(\theta) \rightarrow$ low
    - $J_{cv}(\theta) \gg J_{train}(\theta)$
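The two signatures above can be turned into a rough rule of thumb. The notes do not define what counts as "high" or "much greater", so the `acceptable_error` and `gap_ratio` thresholds below are purely hypothetical knobs that would need tuning for a real problem.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_ratio=2.0):
    """Rough heuristic; `acceptable_error` and `gap_ratio` are hypothetical thresholds."""
    if j_train > acceptable_error:
        # Even the training set is fit poorly.
        return "high bias"
    if j_cv > acceptable_error and j_cv > gap_ratio * j_train:
        # Training error is low, but cross-validation error is much larger.
        return "high variance"
    return "ok"

print(diagnose(j_train=0.50, j_cv=0.55, acceptable_error=0.1))
print(diagnose(j_train=0.01, j_cv=0.40, acceptable_error=0.1))
```

A model can suffer from both problems at once; this sketch reports only the first one it finds.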

## Regularization bias/variance

Using a high order hypothesis:

- High values of $\lambda \rightarrow$ Underfitting
- Low values of $\lambda \rightarrow$ Overfitting

__Method__

1. Try a range of $\lambda$ values in the regularized cost function $J(\theta)$ and learn the parameters $\theta$ for each one.
2. Use these parameters $\theta$ and the non-regularized cost function on the cross-validation dataset to choose the best value of $\lambda$ ($\min J_{cv}(\theta)$)
3. Confirm the generalization with the test dataset.

The $\lambda$ parameter is chosen from the cross validation dataset. The test dataset is used to confirm this.
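A minimal NumPy sketch of this procedure follows. It assumes regularized linear regression solved in closed form, a degree-8 polynomial, synthetic data and an arbitrary list of candidate $\lambda$ values; none of these specifics come from the notes.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Closed-form minimizer of 1/(2m) * (||X theta - y||^2 + lam * sum(theta[1:]^2))."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # theta_0 (the bias term) is not regularized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def cost(theta, X, y):
    """Non-regularized squared-error cost, used for every reported error."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic data (hypothetical), already in random order.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=100)
X = poly_features(x, degree=8)
X_train, y_train = X[:60], y[:60]
X_cv, y_cv = X[60:80], y[60:80]
X_test, y_test = X[80:], y[80:]

lambdas = [0.0, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]  # hypothetical candidates
results = {}
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)  # regularization only when fitting
    results[lam] = (theta, cost(theta, X_train, y_train), cost(theta, X_cv, y_cv))

best_lam = min(results, key=lambda lam: results[lam][2])
j_test = cost(results[best_lam][0], X_test, y_test)
```

The regularization term appears only when fitting $\theta$; the training, cross-validation and test errors are all measured without it so that they remain comparable across values of $\lambda$.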

<img src="bias-variance_reg.png">

# Learning curves

$J_{train}$ & $J_{cv}$ vs $m$ (training set size)
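One way the two curves could be computed is sketched below with NumPy. The helper name `learning_curve`, the synthetic data and the choice of a degree-1 hypothesis (a deliberately underfitting model) are assumptions for illustration.

```python
import numpy as np

def cost(coeffs, x, y):
    """Squared-error cost 1/(2m) * sum((h(x) - y)^2) for a polynomial hypothesis."""
    residual = np.polyval(coeffs, x) - y
    return float(residual @ residual) / (2 * len(x))

def learning_curve(x_train, y_train, x_cv, y_cv, degree, sizes):
    """Return (J_train, J_cv) lists with one entry per training-set size m."""
    j_train, j_cv = [], []
    for m in sizes:
        coeffs = np.polyfit(x_train[:m], y_train[:m], deg=degree)
        # Training error is measured on the m examples actually used for fitting.
        j_train.append(cost(coeffs, x_train[:m], y_train[:m]))
        # Cross-validation error is always measured on the full cv set.
        j_cv.append(cost(coeffs, x_cv, y_cv))
    return j_train, j_cv

# Synthetic data (hypothetical): a noisy quadratic fitted with a straight line.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=80)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=80)
sizes = list(range(5, 61, 5))
j_train, j_cv = learning_curve(x[:60], y[:60], x[60:], y[60:], degree=1, sizes=sizes)
```

Plotting `j_train` and `j_cv` against `sizes` gives the learning curves; raising `degree` in this sketch moves the model from the high-bias regime towards the high-variance one.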


<img src="learning-curves.png">

## High bias (low-order hypothesis)

$J_{cv}$ & $J_{train}$ will be very similar. The hypothesis has very few parameters, so it ends up with roughly the same error no matter how much data it is given. However, this error will be high.

For high bias learning algorithms, getting more data will __NOT__ help improve the learning algorithm.

<img src="learning-curves_highBias.png">

## High variance (high-order hypothesis)

$J_{train}$ will increase with the size of the dataset but remain low. $J_{cv}$ will decrease slightly with the size of the dataset but remain high. There will be a __GAP__ between the training & cross-validation errors.

For high variance learning algorithms, getting more data is __likely__ to help improve the learning algorithm.

<img src="learning-curves_highVariance.png">

# Best way to spend time II


| Problem | Fix |
|:------- |:--- |
| High variance | Get more training data |
| "" | Smaller set of features (prevent overfitting) |
| "" | Increasing $\lambda$ |
| High bias | Additional features |
| "" | Polynomial features |
| "" | Decreasing $\lambda$ |

