# Logistic regression

Logistic regression is a classification algorithm. Two kinds of problem are considered:

1. Binary Classification

    $ y \in \{0, 1\}$

    * 0: 'Negative Class'
    * 1: 'Positive Class'

2. Multi-Class Classification 

    $ y \in \{0, 1, 2, ..., n\}$

    * 0: 'Class  0'
    * 1: 'Class  1'
    * 2: 'Class  2'
    * ...
    * n: 'Class  n'

## Logistic regression model

It is required that $ 0 \leq h_\theta(x) \leq 1 $

Our linear regression hypothesis was $ h_\theta(x) = \theta^T \underline{X} $

The logistic regression hypothesis is $ h_\theta(x) = g(\theta^T \underline{X}) $

Where $g(z) = \frac{1}{1+e^{-z}}$ is the sigmoid (or logistic) function, which maps any real $z$ into the interval $(0, 1)$.

The hypothesis is:

$$ h_\theta(x) = \frac{1}{1+e^{-\theta^T \underline{X}}} $$
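The hypothesis above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the names `sigmoid` and `hypothesis`, and the example values of `theta` and `X`, are assumptions made for the example, and `X` is assumed to already contain the bias column $x_0 = 1$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x) for every row x of X (bias column included)."""
    return sigmoid(X @ theta)

# Hypothetical parameters and inputs: one bias term plus two features.
theta = np.array([0.0, 1.0, -1.0])
X = np.array([[1.0, 2.0, 2.0],   # theta^T x = 0, so h = 0.5
              [1.0, 5.0, 0.0]])  # theta^T x = 5, so h is close to 1
h = hypothesis(theta, X)
print(h)
```

Every output lies strictly between 0 and 1, which is the requirement stated at the start of this section.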



### Interpretation of the results

$h_\theta(x) = $ estimated probability that $y=1$ on input $x$

$$ h_\theta(x) = P(y=1|x;\theta)$$

Since

$$ P(y=0|x;\theta) + P(y=1|x;\theta) = 1$$

The probability that $y=0$ on input $x$ is:

$$ P(y=0|x;\theta) = 1 - P(y=1|x;\theta)$$

# Multiclass classification

Classify examples into more than two classes by applying the one vs all methodology.

## One vs all

Train one binary classifier per class, distinguishing that class from all the rest.

Class $i$ vs rest $\rightarrow$ $h_\theta^{(i)}(x) = P(y=i|x;\theta) \;\;\; (i=1,2, ...)$

To classify a new input $x$, pick the class $i$ whose classifier gives the largest output:

$\max\limits_i\;h_\theta^{(i)}(x)$
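The prediction step of one vs all could look like the sketch below. It assumes the per-class parameters have already been trained and stacked as rows of a matrix; `all_theta`, `predict_one_vs_all` and the numeric values are hypothetical. Note that NumPy indexes classes from 0, whereas the notes count from $i = 1$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(all_theta, X):
    """Return, for each row of X, the index of the classifier with the largest output.

    all_theta: (K, n+1) array, row i holds the parameters of "class i vs rest".
    X:         (m, n+1) array with the bias column x_0 = 1 included.
    """
    probs = sigmoid(X @ all_theta.T)  # shape (m, K): h^(i)(x) for every class i
    return np.argmax(probs, axis=1)

# Hypothetical, hand-picked parameters for three classes and two features.
all_theta = np.array([[ 1.0, -2.0,  0.0],
                      [-1.0,  2.0, -1.0],
                      [-1.0,  0.0,  2.0]])
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [1.0, 0.0, 3.0]])
labels = predict_one_vs_all(all_theta, X)
print(labels)  # [0 1 2]
```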

# The problem of overfitting

To avoid underfitting, the order of the hypothesis can be increased (for example, by adding higher-order polynomial features). This can be overdone and cause the model to __overfit__, or to have __high variance__.

With too many features, the hypothesis fits the training set very well (the cost function is close to zero) but fails to generalize to new examples.

This concept is applicable to linear as well as logistic regression.

## Addressing overfitting

With many features or higher-order models, the fit is not trivial to visualise, so overfitting cannot simply be spotted by plotting the hypothesis.

__Options__:

1. Reduce number of features
    - Manually select
    - Model selection algorithm
    - (reducing the amount of information is not a desirable effect)
2. Regularization
    - Keep all features, but reduce the magnitude of the parameters $\theta_j$
    - Works well for problems with many features, when removing features is not appropriate.

# Regularization

Smaller values for parameters
- Simpler hypothesis (smoother curves)
- Less prone to overfitting

Modify the cost function to include a regularization term:
$$ J(\theta) = \frac{1}{2m} \left[ \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum\limits_{j=1}^n \theta^2_j \right]$$

Where $\lambda$ is the regularization parameter.

Note that by convention $\theta_0$ is not regularized: the regularization sum starts at $\theta_1$.
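As a rough sketch, the regularized cost for linear regression can be computed as follows. The code assumes the penalty is scaled by $\frac{\lambda}{2m}$, which is the scaling that matches the $\frac{\lambda}{m}\theta_j$ term in the gradient descent update used later in these notes; the function name `regularized_cost` and the toy data are made up for illustration.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1 / 2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    m = y.size
    residuals = X @ theta - y                # h_theta(x^(i)) - y^(i) for every example
    penalty = lam * np.sum(theta[1:] ** 2)   # theta_0 is excluded from the penalty
    return (residuals @ residuals + penalty) / (2 * m)

# Toy data (hypothetical): y = x, so theta = [0, 1] fits the data exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])

cost_unreg = regularized_cost(theta, X, y, lam=0.0)  # 0.0: perfect fit, no penalty
cost_reg = regularized_cost(theta, X, y, lam=3.0)    # 0.5: (0 + 3 * 1^2) / (2 * 3)
print(cost_unreg, cost_reg)
```

Even with a perfect fit, the regularized cost is non-zero because the non-zero $\theta_1$ is penalized.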

## Regularization parameter, $\lambda$

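It controls the trade-off between:
- Fitting the training set well
- Keeping the parameters small to avoid overfitting.

If $\lambda$ is too large, all $\theta_j$ ($j \geq 1$) are driven towards zero, so $h_\theta(x) \approx \theta_0$: the fit becomes a flat straight line, resulting in underfitting.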

# Regularized linear regression

__Gradient descent__

Repeat {

$ \theta_0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0} J(\theta)$

$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J_{regularized}(\theta) \;\; (j = 1, 2, ..., n)$

}

Substituting the partial derivatives gives:

Repeat {

$ \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}$

$ \theta_j := \theta_j - \alpha \left[ \frac{1}{m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j \right] \;\; (j = 1, 2, ..., n)$

}

Grouping the $\theta_j$ terms, the update can be rewritten as:

Repeat {

$ \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}$

$ \theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) -\alpha \frac{1}{m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}   \;\; (j = 1, 2, ..., n)$

}

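Since $\alpha$, $\lambda$ and $m$ are positive, $(1 - \alpha\frac{\lambda}{m}) < 1$: every iteration shrinks $\theta_j$ slightly before applying the usual (unregularized) gradient step.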
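The shrink-then-step form of the update can be sketched as below. This is only an illustration under assumed settings: the function name, the learning rate, the iteration count and the toy data are not from the notes, and no convergence check is performed.

```python
import numpy as np

def gradient_descent_regularized(X, y, alpha, lam, num_iters):
    """Batch gradient descent for regularized linear regression."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    # Shrink factor (1 - alpha * lam / m) for theta_1..theta_n; theta_0 is not shrunk.
    shrink = np.full(n_plus_1, 1.0 - alpha * lam / m)
    shrink[0] = 1.0
    for _ in range(num_iters):
        error = X @ theta - y                   # h_theta(x^(i)) - y^(i)
        grad = (X.T @ error) / m                # unregularized gradient, all j at once
        theta = theta * shrink - alpha * grad   # simultaneous update of every theta_j
    return theta

# Toy data (hypothetical): y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_plain = gradient_descent_regularized(X, y, alpha=0.1, lam=0.0, num_iters=5000)
theta_reg = gradient_descent_regularized(X, y, alpha=0.1, lam=10.0, num_iters=5000)
print(theta_plain)  # close to [1, 2]
print(theta_reg)    # slope pulled towards zero, intercept compensates
```

With $\lambda = 0$ the update reduces to ordinary gradient descent; with $\lambda > 0$ the slope is smaller in magnitude.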

# Regularized normal equation

__Normal equation__ 

$$ X = \begin{bmatrix}
    (x^{(1)})^T\\
    \vdots\\
    (x^{(m)})^T
\end{bmatrix}
\;\;\; 
y = \begin{bmatrix}
    y^{(1)}\\
    \vdots\\
    y^{(m)}
\end{bmatrix}$$

Where $X$ is a $(m\times(n+1))$ matrix and $y$ is a $(m\times 1)$ vector

Solving $\min\limits_\theta J(\theta)$ analytically for the regularized cost gives:

$$ \theta = \left( X^TX + \lambda \underline{R} \right)^{-1} X^Ty $$

Where the regularization matrix $\underline{R}$ is a $((n+1)\times(n+1))$ matrix of the form:

$$ R = \begin{bmatrix}
    0 & 0 & \dots & 0\\
    0 & 1 & \dots & 0\\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \dots & 1
\end{bmatrix}$$

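Beware of non-invertible matrices: when $m\leq n$, $X^TX$ is not invertible. With $\lambda > 0$, however, $X^TX + \lambda \underline{R}$ is invertible.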
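A minimal NumPy sketch of the regularized normal equation follows. The function name and the toy data are assumptions for illustration; the code solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice rather than something prescribed by the notes.

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """Solve (X^T X + lam * R) theta = X^T y, where R is the identity with R[0, 0] = 0."""
    n_plus_1 = X.shape[1]
    R = np.eye(n_plus_1)
    R[0, 0] = 0.0  # theta_0 is not regularized
    return np.linalg.solve(X.T @ X + lam * R, X.T @ y)

# Toy data (hypothetical): y = 1 + 2x, with m = 4 examples and n = 1 feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta_plain = normal_equation_regularized(X, y, lam=0.0)  # close to [1, 2]
theta_reg = normal_equation_regularized(X, y, lam=10.0)   # close to [3, 2/3]

# A case with m <= n (one example, two features): X^T X alone is singular,
# but adding lam * R with lam > 0 makes the system solvable.
X_small = np.array([[1.0, 1.0, 2.0]])
y_small = np.array([3.0])
theta_small = normal_equation_regularized(X_small, y_small, lam=1.0)
print(theta_plain, theta_reg, theta_small)
```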

# Regularized logistic regression

__Cost function for logistic regression__

$ J(\theta) = -\frac{1}{m} \sum\limits^m_{i=1} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right] $

__Regularized cost function for logistic regression__

$ J(\theta) = -\frac{1}{m} \sum\limits^m_{i=1} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum\limits^n_{j=1}\theta_j^2 $

__Gradient descent__


Repeat {

$ \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}$

$ \theta_j := \theta_j - \alpha \left[ \frac{1}{m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j \right] \;\; (j = 1, 2, ..., n)$

}

The update looks identical to the one for regularized linear regression; however, for logistic regression $h_\theta(x) = \frac{1}{1+e^{-\theta^T\underline{X} }}$.
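Putting the regularized cost and the gradient descent update together, a sketch for logistic regression might look like this. The function names, hyperparameters and toy data are assumptions chosen for illustration, and the code does not guard against $\log(0)$ for extreme parameter values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y, lam):
    """Regularized logistic regression cost J(theta) and its gradient."""
    m = y.size
    h = sigmoid(X @ theta)
    reg_theta = np.concatenate(([0.0], theta[1:]))  # theta_0 is not penalized
    cost = (-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
            + lam / (2 * m) * np.sum(reg_theta ** 2))
    grad = X.T @ (h - y) / m + (lam / m) * reg_theta
    return cost, grad

def fit(X, y, alpha, lam, num_iters):
    """Plain batch gradient descent starting from theta = 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        _, grad = cost_and_gradient(theta, X, y, lam)
        theta = theta - alpha * grad  # simultaneous update of every theta_j
    return theta

# Toy data (hypothetical): negative x belongs to class 0, positive x to class 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

cost_start, _ = cost_and_gradient(np.zeros(2), X, y, lam=1.0)  # log(2) at theta = 0
theta = fit(X, y, alpha=0.1, lam=1.0, num_iters=200)
cost_end, _ = cost_and_gradient(theta, X, y, lam=1.0)
predictions = (sigmoid(X @ theta) >= 0.5).astype(float)
print(cost_start, cost_end, predictions)
```

The only difference from the linear regression code is that the hypothesis passes $\theta^T x$ through the sigmoid before the error is computed.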