# Week 6

## Debugging a Learning Algorithm

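If a trained hypothesis makes unacceptably large errors on new data, the options to try next are:

- Get more training examples.
- Try a smaller set of features.
- Try getting additional features.
- Try adding polynomial features.
- Try increasing $\lambda$.
- Try decreasing $\lambda$.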

## Evaluating your Hypothesis

Split your examples between a **Training Set (70%)** and a **Test Set (30%)**.

If the data is not randomly ordered, it is better to randomly shuffle it before picking these sets.
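
The shuffle-then-split step can be sketched as follows. This is a minimal illustration, assuming the examples fit in a Python list; the helper name `shuffle_and_split` and the toy data are hypothetical and not part of the course material.

```python
import random

def shuffle_and_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `examples`, then split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy (x, y) pairs, deliberately sorted by x to show why shuffling matters.
data = [(x, 2 * x + 1) for x in range(10)]
train, test = shuffle_and_split(data)
print(len(train), len(test))
```

Without the shuffle, the test set here would contain only the three largest values of `x`, so it would not be representative of the training data.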

## Training/Testing procedure for Linear Regression

- Learn parameter $\theta$ from training data (minimizing training error $J(\theta)$).
- Compute test set error:
> $\displaystyle J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h_{\theta}(x_{test}^{(i)}) - y_{test}^{(i)})^2$
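
As a rough sketch, the test set error above can be computed like this. The parameter vector `theta` and the tiny test set are made-up values for illustration, and $h_{\theta}(x) = \theta^{T}x$ is assumed, with $x_0 = 1$ as the intercept feature.

```python
def squared_error_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h(x) - y)^2), with h(x) = theta . x."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(t * f for t, f in zip(theta, x_i))
        total += (prediction - y_i) ** 2
    return total / (2 * m)

# Hypothetical learned parameters and test examples (first feature is x_0 = 1).
theta = [1.0, 2.0]
X_test = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y_test = [1.0, 3.5, 4.0]

j_test = squared_error_cost(theta, X_test, y_test)
print(j_test)
```

The predictions are 1, 3 and 5, so the squared errors are 0, 0.25 and 1, giving $J_{test} = 1.25 / 6 \approx 0.208$.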

## Training/Testing procedure for Logistic Regression

- Learn parameter $\theta$ from training data.
- Compute test set error:
> $\displaystyle J_{test}(\theta) = -\frac{1}{m_{test}}\sum_{i=1}^{m_{test}} \left[ y_{test}^{(i)}\log(h_{\theta}(x_{test}^{(i)})) + (1 -  y_{test}^{(i)})\log(1 - h_{\theta}(x_{test}^{(i)})) \right]$
- or the alternative **Misclassification error** (0/1 misclassification error):
> $\displaystyle err(h_{\theta}(x),y) = \begin{cases}
    1 & \text{ if } & h_{\theta}(x) \ge 0.5, y = 0 \\
     & \text{ or if } & h_{\theta}(x) \lt 0.5, y = 1 \\
    0 & \text{ otherwise} &
\end{cases}$
>
> $\displaystyle \text{Test error} = \frac{1}{m_{test}}\sum_{i=1}^{m_{test}}err(h_{\theta}(x_{test}^{(i)}),y_{test}^{(i)})$
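
Both error measures can be sketched side by side. This assumes the usual sigmoid hypothesis $h_{\theta}(x) = g(\theta^{T}x)$; the values of `theta`, `X_test` and `y_test` are invented for illustration.

```python
import math

def hypothesis(theta, x):
    """Sigmoid of theta . x."""
    z = sum(t * f for t, f in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(theta, X, y):
    """Average negative log-likelihood over the given examples."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = hypothesis(theta, x_i)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / len(y)

def misclassification_error(theta, X, y):
    """Fraction of examples whose thresholded prediction differs from y."""
    errors = 0
    for x_i, y_i in zip(X, y):
        predicted = 1 if hypothesis(theta, x_i) >= 0.5 else 0
        errors += int(predicted != y_i)
    return errors / len(y)

# Hypothetical parameters and test examples (first feature is x_0 = 1).
theta = [0.0, 1.0]
X_test = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y_test = [0, 1, 1, 1]

print(logistic_cost(theta, X_test, y_test))
print(misclassification_error(theta, X_test, y_test))
```

Only the second example is misclassified, so the 0/1 test error is 0.25, while the logistic cost is roughly 0.47.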

## Model Selection and Training/Validation/Test sets

To choose the **degree $d$** of the polynomial hypothesis:

1. Split your examples between a **Training Set (60%)**, a **Cross Validation Set (20%)** and a **Test Set (20%)**.
2. Fit $\theta^{(d)}$ for different values of $d$ on the **Training Set**.
> $\displaystyle J_{train}(\theta) = \frac{1}{2m_{train}}\sum_{i=1}^{m_{train}}(h_{\theta}(x_{train}^{(i)}) - y_{train}^{(i)})^2$
3. Pick the $\theta^{(d)}$ with the lowest cost on the **Cross Validation Set**.
> $\displaystyle J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x_{cv}^{(i)}) - y_{cv}^{(i)})^2$
4. Estimate generalization error on **Test Set** with selected $\theta^{(d)}$.
> $\displaystyle J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h_{\theta}(x_{test}^{(i)}) - y_{test}^{(i)})^2$
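
The four steps above can be sketched with NumPy. This is an illustrative example only: the synthetic noisy quadratic, the split sizes and the range of degrees are assumptions, and `np.polyfit` stands in for whatever training routine is actually used.

```python
import numpy as np

def cost(theta, x, y):
    """Un-regularized squared-error cost; theta holds polynomial coefficients."""
    residuals = np.polyval(theta, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(y)))

# Synthetic data (an assumption for illustration): a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(scale=0.5, size=x.size)

# Step 1: 60/20/20 split (the data is already in random order).
x_train, x_cv, x_test = x[:120], x[120:160], x[160:]
y_train, y_cv, y_test = y[:120], y[120:160], y[160:]

# Step 2: fit one theta per degree on the training set.
degrees = range(1, 9)
thetas = {d: np.polyfit(x_train, y_train, deg=d) for d in degrees}

# Step 3: pick the degree with the lowest cross-validation cost.
cv_costs = {d: cost(thetas[d], x_cv, y_cv) for d in degrees}
best_d = min(cv_costs, key=cv_costs.get)

# Step 4: estimate the generalization error on the untouched test set.
test_cost = cost(thetas[best_d], x_test, y_test)
print(best_d, round(test_cost, 3))
```

The test set is used only once, after $d$ has been chosen. Reporting $J_{cv}$ for the selected degree would give an optimistic estimate, because $d$ was itself fitted to the cross-validation set.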

## Diagnosing Bias vs Variance in Model Selection

If we have a **Bias (underfit)** problem:
> $J_{train}(\theta)$ and $J_{cv}(\theta)$ will be high.
>
> $J_{cv}(\theta) \approx J_{train}(\theta)$

If we have a **Variance (overfit)** problem:
> $J_{train}(\theta)$ will be low.
>
> $J_{cv}(\theta) \gg J_{train}(\theta)$

## Choosing the Regularization Parameter $\lambda$

To choose the **regularization parameter $\lambda$**:

1. Split your examples between a **Training Set (60%)**, a **Cross Validation Set (20%)** and a **Test Set (20%)**.
2. Fit $\theta^{(\lambda)}$ for different values of $\lambda$ on the **Training Set** using **regularized cost function**.
> $\displaystyle J_{train}(\theta) = \frac{1}{2m_{train}}\sum_{i=1}^{m_{train}}(h_{\theta}(x_{train}^{(i)}) - y_{train}^{(i)})^2 + \frac{\lambda}{2m_{train}}\sum_{j=1}^{n}\theta_{j}^{2}$
>
> $\lambda \in \{0, 0.01, 0.02, 0.04, 0.08,\dots, 10.24\}$
3. Pick the $\theta^{(\lambda)}$ with the lowest cost on the **Cross Validation Set**, using the **un-regularized cost function**.
> $\displaystyle J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x_{cv}^{(i)}) - y_{cv}^{(i)})^2$
4. Estimate generalization error on **Test Set** with selected $\theta^{(\lambda)}$, using **un-regularized cost function**.
> $\displaystyle J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h_{\theta}(x_{test}^{(i)}) - y_{test}^{(i)})^2$
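
A possible sketch of this procedure for regularized linear regression follows. The data, the degree-8 feature map and the helper names are assumptions made for illustration, and the closed-form solve is one convenient way to minimize the regularized cost (gradient descent would work equally well).

```python
import numpy as np

def poly_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree (column 0 is the intercept feature)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize the regularized cost in closed form; theta_0 is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def cost(theta, X, y):
    """Un-regularized squared-error cost."""
    return float(np.sum((X @ theta - y) ** 2) / (2 * len(y)))

# Synthetic data (an assumption for illustration), split 60/20/20.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)
X = poly_features(x, degree=8)
X_train, X_cv, X_test = X[:36], X[36:48], X[48:]
y_train, y_cv, y_test = y[:36], y[36:48], y[48:]

# Candidate values: 0, then 0.01 doubled up to 10.24.
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]

cv_costs = {}
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)  # penalty used for fitting only
    cv_costs[lam] = cost(theta, X_cv, y_cv)         # no penalty when evaluating

best_lam = min(cv_costs, key=cv_costs.get)
best_theta = fit_regularized(X_train, y_train, best_lam)
test_cost = cost(best_theta, X_test, y_test)
print(best_lam, round(test_cost, 4))
```

The penalty term appears only inside `fit_regularized`. The validation and test costs leave it out so that models trained with different $\lambda$ are compared on prediction error alone.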

## Learning Curves

To plot learning curves:

1. Split your examples between a **Training Set (60%)**, a **Cross Validation Set (20%)** and a **Test Set (20%)**.
2. Fit $\theta^{(m)}$ on a subset of $m$ examples of the **Training Set**, for increasing values of $m$, using the **regularized or un-regularized cost function**.
> $\displaystyle J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_{train}^{(i)}) - y_{train}^{(i)})^2$
>
> or
>
> $\displaystyle J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_{train}^{(i)}) - y_{train}^{(i)})^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}$
3. Plot $J_{train}(\theta^{(m)})$ for each $\theta^{(m)}$, computed on the same $m$ training examples that were used to fit it, using the **un-regularized cost function**.
> $\displaystyle J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_{train}^{(i)}) - y_{train}^{(i)})^2$
4. Plot $J_{cv}(\theta^{(m)})$ for each $\theta^{(m)}$, computed on the full **Cross Validation Set**, using the **un-regularized cost function**.
> $\displaystyle J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x_{cv}^{(i)}) - y_{cv}^{(i)})^2$

If a learning algorithm is suffering from **high bias**, getting more data will not help much.

If a learning algorithm is suffering from **high variance**, getting more data is likely to help.
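
The numbers behind a learning curve can be generated as in the sketch below. It assumes synthetic data with a quadratic target and a straight-line model (a deliberately high-bias setup) and uses `np.linalg.lstsq` as the fitting routine. Plotting is left out; `train_costs` and `cv_costs` are the two curves to plot against `sizes`.

```python
import numpy as np

def cost(theta, X, y):
    """Un-regularized squared-error cost."""
    return float(np.sum((X @ theta - y) ** 2) / (2 * len(y)))

# Synthetic data (an assumption): quadratic target, straight-line model.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=100)
y = x ** 2 + rng.normal(scale=0.3, size=x.size)
X = np.column_stack([np.ones_like(x), x])

# 60% training, 20% cross validation; the last 20% would be the test set.
X_train, X_cv = X[:60], X[60:80]
y_train, y_cv = y[:60], y[60:80]

sizes = range(2, len(y_train) + 1)
train_costs, cv_costs = [], []
for m in sizes:
    # Fit on the first m training examples only.
    theta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
    # Training cost uses those same m examples; CV cost uses the whole CV set.
    train_costs.append(cost(theta, X_train[:m], y_train[:m]))
    cv_costs.append(cost(theta, X_cv, y_cv))

print(round(train_costs[-1], 3), round(cv_costs[-1], 3))
```

With a model this simple, both costs stay high even at the largest $m$, which is the high-bias pattern: adding more examples does not bring the error down.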


## Deciding what to do next

- Get more training examples $\rightarrow$ **fixes high variance**.
- Try a smaller set of features $\rightarrow$ **fixes high variance**.
- Try getting additional features $\rightarrow$ **fixes high bias**.
- Try adding polynomial features $\rightarrow$ **fixes high bias**.
- Try increasing $\lambda$ $\rightarrow$ **fixes high variance**.
- Try decreasing $\lambda$ $\rightarrow$ **fixes high bias**.

## Neural Networks and Overfitting

*Small* neural networks:

- Are more prone to underfitting.
- Are computationally cheaper.

*Large* neural networks:

- Are more prone to overfitting, which can be addressed by using regularization ($\lambda$).
- Are computationally more expensive.
- The number of hidden layers can be chosen by fitting $\theta^{(L)}$ for an increasing number of hidden layers ($L$) on a **Training Set**, then picking the network with the lowest cost on a **Validation Set**.

## Recommended Approach

1. Start with a **simple algorithm** that you can implement quickly.
Implement it and test it on your cross-validation data.
2. Plot **learning curves** to decide if more data, or features, etc. are likely to help.
3. **Error analysis**: Manually examine the examples (in the cross-validation set) that your algorithm made errors on.
Look for systematic trends in the types of examples it gets wrong, and compare model accuracies with and without a candidate fix to check whether the fix actually helps.

## Error Metrics for Skewed Classes

We have a case of **skewed classes** when we have far more examples of one class than the other classes.

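With **skewed classes**, classification accuracy becomes a misleading metric: a classifier that always predicts the majority class can reach very high accuracy without ever detecting the rare class.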

## Precision/Recall

By convention, $y = 1$ denotes the rare class that we want to detect.

<table>
  <tr>
    <td colspan="2" rowspan="2" align="left" valign="top">
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    </td>
    <td colspan="2" align="left" valign="top">
      &nbsp;&nbsp;Actual&nbsp;Class&nbsp;&nbsp;&nbsp;<br />
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    </td>
  </tr>
  <tr>
    <td align="left" valign="top">
      &nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;&nbsp;
    </td>
    <td align="left" valign="top">
      &nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;&nbsp;
    </td>
  </tr>
  <tr>
    <td rowspan="2" align="left" valign="top">
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
      Predicted<br />
      &nbsp;&nbsp;Class&nbsp;&nbsp;<br />
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    </td>
    <td align="left" valign="top">
      &nbsp;&nbsp;1&nbsp;&nbsp;<br />
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    </td>
    <td align="left" valign="top">
      &nbsp;&nbsp;True&nbsp;&nbsp;<br />
      positive
    </td>
    <td align="left" valign="top">
      &nbsp;False&nbsp;&nbsp;<br />
      positive
    </td>
  </tr>
  <tr>
    <td align="left" valign="top">
      &nbsp;&nbsp;0&nbsp;&nbsp;<br />
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    </td>
    <td align="left" valign="top">
      &nbsp;False&nbsp;&nbsp;<br />
      negative
    </td>
    <td align="left" valign="top">
      &nbsp;&nbsp;True&nbsp;&nbsp;<br />
      negative
    </td>
  </tr>
</table>




**Precision**:

> $\displaystyle \frac{\text{True Positives}}{\text{Predicted Positives}} = \frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$
>
> Of all the examples predicted as $y = 1$, the fraction that actually have $y = 1$. High precision is a good thing.
   
**Recall**:

> $\displaystyle \frac{\text{True Positives}}{\text{Actual Positives}} = \frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$
>
> Of all the examples that actually have $y = 1$, the fraction that the classifier detects. High recall is a good thing.

A classifier with **high precision** and **high recall** is a good classifier.

**Precision/Recall** is often a better way to evaluate an algorithm in the presence of **skewed classes** than looking at classification error or classification accuracy.
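
A small sketch makes the contrast with accuracy concrete. The labels below are invented (5 positives out of 100 examples), and returning 0 when a denominator is zero is a convention chosen here, not something fixed by the definitions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Convention (an assumption): report 0 when the denominator is 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Skewed toy labels: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# A "classifier" that ignores its input and always predicts 0.
always_negative = [0] * 100

# A detector that finds 3 of the 5 positives and raises 2 false alarms.
detector = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

print(accuracy(y_true, always_negative), precision_recall(y_true, always_negative))
print(accuracy(y_true, detector), precision_recall(y_true, detector))
```

The always-negative classifier reaches 95% accuracy with precision and recall both 0, while the detector scores a similar 96% accuracy but has precision and recall of 0.6 each. Accuracy barely separates the two; precision and recall do.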

## F$_{1}$ Score (F Score)

To compare precision/recall pairs using a single number, where $P$ is precision and $R$ is recall:

> $\displaystyle F_{1} = 2\frac{PR}{P+R}$

- If $P = 0$ or $R = 0$ then F-Score $= 0$.
- If $P = 1$ and $R = 1$ then F-Score $= 1$.
- $0 \le$ F-Score $\le 1$
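
The sketch below compares three classifiers by F Score. The precision/recall pairs are illustrative values, and treating the $P = R = 0$ case as a score of 0 is a convention adopted here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic-mean style combination of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention: avoid 0/0 when both are zero
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for three classifiers.
candidates = {"A": (0.5, 0.4), "B": (0.7, 0.1), "C": (0.02, 1.0)}

scores = {name: f1_score(p, r) for name, (p, r) in candidates.items()}
averages = {name: (p + r) / 2 for name, (p, r) in candidates.items()}

best_by_f1 = max(scores, key=scores.get)
best_by_average = max(averages, key=averages.get)
print(best_by_f1, best_by_average)
```

The F Scores are about 0.444, 0.175 and 0.039, so classifier A is preferred. A plain average would instead pick C (0.51), which predicts $y = 1$ almost everywhere; the F Score penalizes a pair in which either value is very low.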


## Threshold

In Logistic Regression:

> $0 \le h_{\theta}(x) \le 1$
>
> Predict $1$ if $h_{\theta}(x) \ge \text{threshold}$
>
> Predict $0$ if $h_{\theta}(x) \lt \text{threshold}$

By varying the threshold we can control the trade-off between precision and recall.

- With a high threshold, we get higher precision and lower recall.
- With a low threshold, we get lower precision and higher recall.

To find the optimal **threshold** for your model:
- Evaluate different values of **threshold** on the **Cross Validation Set** and pick the one that gives you the maximum value of **F Score**.
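
The threshold search can be sketched as follows. The cross-validation labels `y_cv`, the hypothesis outputs `h_cv` and the grid of candidate thresholds are all made up for illustration.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall and F Score when predicting 1 iff score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Convention (an assumption): report 0 when a denominator is 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical cross-validation labels and hypothesis outputs h_theta(x).
y_cv = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
h_cv = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]

# Evaluate thresholds 0.1, 0.2, ..., 0.9 on the cross-validation set.
thresholds = [k / 10 for k in range(1, 10)]
results = {t: precision_recall_f1(y_cv, h_cv, t) for t in thresholds}

best_threshold = max(results, key=lambda t: results[t][2])
best_f1 = results[best_threshold][2]
print(best_threshold, round(best_f1, 3))
```

On this toy data the best threshold is 0.5, with an F Score of about 0.857. Raising the threshold from 0.1 to 0.9 moves precision from 1/3 up to 1 and recall from 1 down to 1/3, which is the trade-off described above.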