# Week 3 - Logistic Regression

## Hypothesis

The hypothesis passes $\theta^{\top}\mathbf{x}$ through the logistic function (also called the sigmoid function) $g$:

> $\displaystyle h_{\theta}(\mathbf{x}) = g(\theta^{\top}\mathbf{x})$
>
> $\displaystyle g(z) = \frac{1}{1 + e^{-z}}$

or

> $\displaystyle h_{\theta}(\mathbf{x}) = \frac{1}{1 + e^{-\theta^{\top}\mathbf{x}}}$
>
> $0 \lt h_{\theta}(\mathbf{x}) \lt 1$

$h_{\theta}(\mathbf{x})$ = estimated probability that $y = 1$ on input $\mathbf{x}$.

> $h_{\theta}(\mathbf{x}) = P(y = 1|\mathbf{x};\theta)$
>
> "probability that $y = 1$, given $\mathbf{x}$, parametrized by $\theta$".
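
As a minimal sketch, the hypothesis can be written in Octave as an element-wise sigmoid applied to $\theta^{\top}\mathbf{x}$. The values of `theta` and `x` below are made up for illustration; `x(1) = 1` plays the role of the intercept feature $x_{0}$.

```octave
% Element-wise sigmoid g(z); works on scalars, vectors and matrices.
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Illustrative (assumed) values, not taken from any dataset.
theta = [-1; 2];
x = [1; 0.5];               % x(1) = 1 is the intercept term x_0

% h_theta(x) = g(theta' * x); here theta' * x = 0, so h = 0.5.
h = sigmoid(theta' * x);
```

Using `./` rather than `/` is what lets the same `sigmoid` handle a whole vector of inputs at once.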

## Decision Boundary

Predict "$y = 1$" if $h_{\theta}(\mathbf{x}) \ge 0.5$

> $g(\theta^{\top}\mathbf{x}) \ge 0.5$ when $\theta^{\top}\mathbf{x} \ge 0$ 

Predict "$y = 0$" if $h_{\theta}(\mathbf{x}) \lt 0.5$

> $g(\theta^{\top}\mathbf{x}) \lt 0.5$ when $\theta^{\top}\mathbf{x} \lt 0$ 
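
The rule above can be sketched in Octave as follows. The parameters are assumed values chosen so that the decision boundary is the line $x_{1} + x_{2} = 3$; the sketch also checks that thresholding $h_{\theta}(\mathbf{x})$ at $0.5$ gives the same answer as thresholding $\theta^{\top}\mathbf{x}$ at $0$.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Assumed parameters: boundary is -3 + x1 + x2 = 0.
theta = [-3; 1; 1];

% Each row is one example, with a leading 1 for the intercept.
X = [1 1 1;     % x1 + x2 = 2, below the boundary
     1 2 2;     % x1 + x2 = 4, above the boundary
     1 3 0];    % x1 + x2 = 3, exactly on the boundary

p = sigmoid(X * theta) >= 0.5;   % predict y = 1 when h >= 0.5
pLinear = (X * theta) >= 0;      % equivalent test on theta' * x
```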

## Cost Function

The mean squared error cost used for linear regression becomes **non-convex** when combined with the sigmoid hypothesis, so gradient descent is no longer guaranteed to reach the global minimum.

Instead, we use the **cross-entropy** loss (or log loss), which is convex in $\theta$.

> $\displaystyle \mathrm{Cost}(h_{\theta}(\mathbf{x}), y) =
    \begin{cases}
        -\log(h_{\theta}(\mathbf{x})) & \text{if } y = 1 \\
        -\log(1 - h_{\theta}(\mathbf{x})) & \text{if } y = 0
    \end{cases}$
>
> $\displaystyle \mathrm{Cost}(h_{\theta}(\mathbf{x}), y) = -y \log(h_{\theta}(\mathbf{x})) - (1 - y) \log(1 - h_{\theta}(\mathbf{x}))$

so

> $\displaystyle J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\bigg[y^{(i)} \log(h_{\theta}(\mathbf{x}^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(\mathbf{x}^{(i)}))\bigg]$
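
A vectorized sketch of $J(\theta)$ in Octave, assuming `X` is the $m \times (n+1)$ design matrix (first column all ones) and `y` is an $m \times 1$ vector of 0/1 labels. The tiny dataset is invented for illustration.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Cross-entropy cost averaged over the m rows of X.
costJ = @(theta, X, y) -mean(y .* log(sigmoid(X * theta)) ...
                             + (1 - y) .* log(1 - sigmoid(X * theta)));

% Made-up data: one feature plus the intercept column.
X = [1 0; 1 1; 1 2; 1 3];
y = [0; 0; 1; 1];

% With theta = 0 every prediction is 0.5, so the cost is log(2).
J0 = costJ(zeros(2, 1), X, y);
```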

## Gradient Descent

Repeat until convergence:

> $\displaystyle \theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial \theta_{j}}J(\theta_{0}, \theta_{1},\dots,\theta_{n})$ for $j \in \{0,1,\dots,n\}$ (update all $\theta_{j}$ simultaneously)
>
> $\alpha$ is the **learning rate**.

or

> $\displaystyle \theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)})\mathbf{x}_{j}^{(i)}$

The update rule looks identical to the one for linear regression, but $h_{\theta}(\mathbf{x})$ is now the sigmoid of $\theta^{\top}\mathbf{x}$ rather than $\theta^{\top}\mathbf{x}$ itself.
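
The update can be sketched as a vectorized loop in Octave. The dataset, the learning rate `alpha` and the iteration count `numIters` are all assumed values for illustration; `X' * (h - y)` computes the sum over examples for every $j$ at once, which makes the update simultaneous.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Made-up data: intercept column plus one feature.
X = [1 0; 1 1; 1 2; 1 3];
y = [0; 0; 1; 1];
m = size(X, 1);

% Cross-entropy cost, used here only to monitor progress.
costJ = @(t) -mean(y .* log(sigmoid(X * t)) + (1 - y) .* log(1 - sigmoid(X * t)));

alpha = 0.1;        % learning rate (assumed)
numIters = 1000;    % fixed iteration budget instead of a convergence test
theta = zeros(2, 1);
J_history = zeros(numIters, 1);

for k = 1:numIters
    h = sigmoid(X * theta);                        % predictions for all examples
    theta = theta - (alpha / m) * (X' * (h - y));  % update every theta_j at once
    J_history(k) = costJ(theta);                   % cost after this step
end
```

Plotting `J_history` is a simple way to check that the cost decreases and that `alpha` is not too large.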

## Advanced Optimization

Optimization Algorithms:

- Gradient Descent
- Conjugate Gradient
- BFGS
- L-BFGS


An example in Octave with $J(\theta) = (\theta_{1} - 5)^{2} + (\theta_{2} - 5)^{2}$:

```octave
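function [jVal, gradient] = costFunction(theta)
    % Cost J(theta) at the given theta.
    jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
    % Partial derivatives of J with respect to theta(1) and theta(2).
    gradient = zeros(2,1);
    gradient(1) = 2*(theta(1)-5);
    gradient(2) = 2*(theta(2)-5);
end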
```

```octave
% 'GradObj','on': costFunction also returns the gradient; 'MaxIter': at most 100 iterations.
options = optimset('GradObj','on','MaxIter',100);
```

```octave
initialTheta = zeros(2,1);
% fminunc minimizes costFunction starting from initialTheta; optTheta should be close to [5; 5].
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```

## Multiclass Classification
 
Train a logistic regression classifier $h_{\theta}^{(i)}(\mathbf{x})$ for each class $i$ to predict the probability that $y = i$.

Given a new input $\mathbf{x}$, predict the class $i$ whose classifier outputs the highest probability:

> $\displaystyle \arg\max_{i}h_{\theta}^{(i)}(\mathbf{x})$
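
A sketch of the prediction step in Octave, assuming the per-class parameters have already been trained and stacked as the rows of a matrix. The name `allTheta` and its values are hypothetical.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical trained parameters: row i holds theta for class i (K x (n+1)).
allTheta = [ 2 -1  0;
             0  1 -1;
            -2  0  1];

% Three made-up inputs, each with a leading 1 for the intercept.
X = [1 0 0;
     1 3 0;
     1 0 5];

% probs(r, i) = h^{(i)}(x_r): probability that example r belongs to class i.
probs = sigmoid(X * allTheta');

% For each row, the index of the largest probability is the predicted class.
[~, pred] = max(probs, [], 2);
```

Because the sigmoid is increasing, taking the maximum of `X * allTheta'` directly would select the same classes.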

## The Problem of Overfitting

- An "Underfit" model is said to have "High Bias".
- An "Overfit" model is said to have "High Variance".

### Addressing Overfitting

Options:

1. Reduce number of features
    - Manually select which features to keep.
    - Model selection algorithm (later in the course).
2. Regularization
    - Keep all the features, but reduce magnitude/values of parameters $\theta_{j}$.
    - Works well when we have a lot of features, each of which contributes a bit to predicting $y$.

## Regularization

Regularization keeps the values of $\theta_{1},\dots,\theta_{n}$ small, which gives:

- "Simpler" hypothesis
- Less prone to overfitting

> $\displaystyle J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)})^{2} + \lambda\sum_{j=1}^{n}\theta_{j}^{2}\right]$
>
> $\lambda$ is the **regularization parameter**.
>
> $j$ starts at $1$: we **do not** regularize $\theta_{0}$!

If $\lambda$ is very large, $\theta_{1},\dots,\theta_{n}$ are driven towards $0$ and the model underfits:

> $h_{\theta}(\mathbf{x}) \approx \theta_{0}$

## Regularized Linear Regression

### Cost Function

> $\displaystyle J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)})^{2}\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}$

### Gradient Descent

Repeat until convergence:

> $\displaystyle \theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)})\mathbf{x}_{0}^{(i)}$, we **don't** penalize $\theta_{0}$!
>
> $\displaystyle \theta_{j} := \theta_{j} - \alpha\bigg[\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)})\mathbf{x}_{j}^{(i)} + \frac{\lambda}{m}\theta_{j}\bigg]$

or:

> $\displaystyle \theta_{j} := \theta_{j}\big(1 - \alpha\frac{\lambda}{m}\big) - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)})\mathbf{x}_{j}^{(i)}$
>
> $1 - \alpha\frac{\lambda}{m}$ is usually slightly $\lt 1$, so each update shrinks $\theta_{j}$ a little before applying the usual gradient step.
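
As a quick check, the two forms of the update can be compared on one gradient step in Octave. The data and the values of `alpha`, `lambda` and `theta` are assumptions chosen for illustration; here $h_{\theta}(\mathbf{x}) = \theta^{\top}\mathbf{x}$ because this is linear regression.

```octave
% Made-up data: intercept column plus one feature.
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];
m = size(X, 1);

alpha = 0.1;          % learning rate (assumed)
lambda = 1;           % regularization parameter (assumed)
theta = [0.5; 0.5];   % current parameters (assumed)

% Unregularized gradient of the squared-error cost.
grad = (X' * (X * theta - y)) / m;

% Form 1: add (lambda / m) * theta_j to the gradient, except for theta_0.
reg = (lambda / m) * theta;
reg(1) = 0;
thetaA = theta - alpha * (grad + reg);

% Form 2: shrink theta_j by (1 - alpha * lambda / m), then take the plain step.
shrink = (1 - alpha * lambda / m) * ones(size(theta));
shrink(1) = 1;        % theta_0 is not shrunk
thetaB = theta .* shrink - alpha * grad;
```

Both `thetaA` and `thetaB` hold the same updated parameters, up to rounding.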

### Normal Equation

> $\displaystyle \mathbf{X} = \begin{bmatrix}
    (\mathbf{x}^{(1)})^{\top} \\
    \vdots \\
    (\mathbf{x}^{(m)})^{\top}
\end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$
>
> $\displaystyle \mathbf{y} = \begin{bmatrix}
    y^{(1)} \\
    \vdots \\
    y^{(m)}
\end{bmatrix} \in \mathbb{R}^{m}$
>
> $\displaystyle \mathbf{\theta} = \Bigg(\mathbf{X}^{\top}\mathbf{X} + \lambda\begin{bmatrix}
    0 & 0 & 0 & \cdots & 0 \\
    0 & 1 & 0 & \cdots & 0 \\
    0 & 0 & 1 & \cdots & 0 \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & 0 & \cdots & 1
\end{bmatrix}\Bigg)^{-1}\mathbf{X}^{\top}\mathbf{y} \in \mathbb{R}^{n+1}$

If $m \leq n$, $\mathbf{X}^{\top}\mathbf{X}$ is singular (non-invertible). With $\lambda \gt 0$, the regularized matrix inside the parentheses above is invertible, so regularization also fixes this problem.
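
A sketch of the regularized normal equation in Octave, on invented data. The matrix `L` is an assumed name for the modified identity matrix shown above. The second part illustrates the $m \leq n$ case with a made-up `X2` that has 2 examples and 2 features.

```octave
% Made-up data: intercept column plus one feature (m = 3, n = 1).
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];
lambda = 1;
n = size(X, 2) - 1;

% Identity matrix with a 0 in the top-left corner: theta_0 is not regularized.
L = eye(n + 1);
L(1, 1) = 0;

% Solve (X'X + lambda * L) * theta = X'y with backslash instead of inv().
theta = (X' * X + lambda * L) \ (X' * y);

% m <= n example: X2'X2 is singular, but adding lambda * L2 restores full rank.
X2 = [1 1 0; 1 0 1];              % m = 2 examples, n = 2 features
L2 = eye(3);
L2(1, 1) = 0;
rankPlain = rank(X2' * X2);
rankReg = rank(X2' * X2 + lambda * L2);
```

The backslash operator solves the linear system directly, which is generally preferable to forming the inverse explicitly.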

## Regularized Logistic Regression

### Cost Function

> $\displaystyle J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\bigg[y^{(i)} \log(h_{\theta}(\mathbf{x}^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(\mathbf{x}^{(i)}))\bigg] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}$

### Gradient Descent

> $\displaystyle \theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)})\mathbf{x}_{0}^{(i)}$, we **don't** penalize $\theta_{0}$!
>
> $\displaystyle \theta_{j} := \theta_{j} - \alpha\bigg[\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)})\mathbf{x}_{j}^{(i)} + \frac{\lambda}{m}\theta_{j}\bigg]$

or:

> $\displaystyle \theta_{j} := \theta_{j}\big(1 - \alpha\frac{\lambda}{m}\big) - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)})\mathbf{x}_{j}^{(i)}$
>
> $1 - \alpha\frac{\lambda}{m}$ is usually slightly $\lt 1$, so each update shrinks $\theta_{j}$ a little before applying the usual gradient step.
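
The regularized cost and its gradient can be sketched together in Octave. The data and the values of `lambda` and `theta` are assumed for illustration. Octave indexes from 1, so `theta(1)` is $\theta_{0}$ and `theta(2:end)` holds the regularized parameters $\theta_{1},\dots,\theta_{n}$.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Made-up data: intercept column plus one feature.
X = [1 0; 1 1; 1 2; 1 3];
y = [0; 0; 1; 1];
m = size(X, 1);
lambda = 1;         % regularization parameter (assumed)
theta = [0; 1];     % point at which cost and gradient are evaluated (assumed)

% Regularized cost: cross-entropy plus (lambda / 2m) * sum of theta_j^2 for j >= 1.
costFn = @(t) -mean(y .* log(sigmoid(X * t)) + (1 - y) .* log(1 - sigmoid(X * t))) ...
              + (lambda / (2 * m)) * sum(t(2:end) .^ 2);
J = costFn(theta);

% Gradient: the unregularized term for every j, plus (lambda / m) * theta_j for j >= 1.
h = sigmoid(X * theta);
grad = (X' * (h - y)) / m;
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
```

A gradient descent step is then `theta - alpha * grad` for some learning rate `alpha`, which matches the update rule above.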
