# Week 3: Logistic Regression

* [Classification and Representation](#Classification-and-Representation)
    - [Classification](#Classification)
    - [Hypothesis Representation](#Hypothesis-Representation)
    - [Decision Boundary](#Decision-Boundary)
* [Logistic Regression Model](#Logistic-Regression-Model)
    - [Cost Function](#Cost-Function)
    - [Simplified Cost Function and Gradient Descent](#Simplified-Cost-Function-and-Gradient-Descent)
    - [Advanced Optimization](#Advanced-Optimization)
* [Multiclass Classification](#Multiclass-Classification)
    - [One-vs-all](#One-vs-all)

## Classification and Representation

### Classification

Despite the appearance of the term 'regression', logistic regression is actually a classification algorithm. First we'll consider binary classification, where the output $y$ is such that:

$$ y \in \{0, 1\} $$

That is, $y$ is either 1 (the positive class) or 0 (the negative class). For example, we might use logistic regression to predict whether an email is spam or not spam.

Unlike in linear regression, where $ h_\theta(x) $ can be greater than 1 or less than 0, in logistic regression $ h_\theta(x) $ always predicts a value between 0 and 1:

$$ 0 \leq h_\theta(x) \leq 1 $$

Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.

### Hypothesis Representation

We want the output of our hypothesis to be such that $ 0 \leq h_\theta(x) \leq 1 $.  In linear regression, our hypothesis was:

$$ h_\theta(x) = \theta^Tx $$

For logistic regression, our hypothesis will be:

$$ h_\theta(x) = g(\theta^Tx) $$

where:

$$ g(z) = \frac{1}{1 + e^{-z}} $$

The function $g(z)$ is called the sigmoid or logistic function. We're using the sigmoid function because it maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification. Graphically:

![Sigmoid Function](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/1WFqZHntEead-BJkoDOYOw_2413fbec8ff9fa1f19aaf78265b8a33b_Logistic_function.png?expiry=1489795200000&hmac=Y4G3nyImatzjeqX9-3GXqFAu-NGhMnfBLN8QQ4k65V4)

We can rewrite our hypothesis function as:

$$ h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}} $$

We can treat the output of $h_\theta(x)$ as the probability that $y = 1$ on input $x$. That is:

$$ h_\theta(x) = P(y = 1 | x; \theta) $$  

<center>*"The probability that $y$ equals 1, given $x$, parameterised by $\theta$."*</center>

Remember that:

$$ 1 = P(y = 1 | x; \theta) + P(y = 0 | x; \theta) $$  

That is, the probabilities that $y = 1$ and that $y = 0$ sum to 1, so $P(y = 0 | x; \theta) = 1 - h_\theta(x)$.
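The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the names `sigmoid` and `hypothesis` and the parameter values are assumptions made for the example, and the sigmoid is written in two branches only to avoid overflow in `math.exp` for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), branch-wise to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is expected to include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]   # hypothetical parameters, chosen for illustration
x = [1.0, 0.5]        # x_0 = 1, x_1 = 0.5, so theta^T x = 0

p1 = hypothesis(theta, x)   # P(y = 1 | x; theta)
p0 = 1.0 - p1               # P(y = 0 | x; theta)
print(p1, p0)               # both are 0.5 because theta^T x = 0
```

Because $\theta^Tx = 0$ for this input, the hypothesis sits exactly at the midpoint of the sigmoid.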

### Decision Boundary

Suppose we predict $y = 1$ if $ h_\theta(x) \geq 0.5 $ and $y = 0$ if $ h_\theta(x) < 0.5 $. When will it be the case that $ h_\theta(x) \geq 0.5 $? If we look at the graph of the sigmoid function above we can see that:

$$
\begin{align*}
z = 0,\ e^{0} = 1 &\Rightarrow g(z) = 0.5 \\
z \to \infty,\ e^{-z} \to 0 &\Rightarrow g(z) \to 1 \\
z \to -\infty,\ e^{-z} \to \infty &\Rightarrow g(z) \to 0
\end{align*}
$$

Since in our hypothesis $z = \theta^Tx$:

$$ h_\theta(x) = g(\theta^Tx) \geq 0.5 \quad \text{whenever} \quad \theta^Tx \geq 0 $$

Say we have two input features, $x_1$ and $x_2$, so that:

$$ h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2) $$

And say that our learning algorithm (which we'll cover later) has decided on the values $\theta_0 = -3$, $\theta_1 = 1$ and $\theta_2 = 1$.

Then we can say:

$$\text{Predict } y = 1 \text{ if } -3 + x_1 + x_2 \geq 0 $$

The line $x_1 + x_2 = 3$ on the graph of $x_1$ vs $x_2$ marks the *decision boundary*. On this line, $h_\theta(x) = 0.5$ exactly. Any sample that falls in the region to the top right of this line where $x_1 + x_2 > 3$ has a predicted output value of 1. Any sample that falls in the region to the bottom left of this line where $x_1 + x_2 < 3$ has a predicted output value of 0.
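As a quick check of this example, here is a small sketch in Python. It uses $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$, which matches the rule $-3 + x_1 + x_2 \geq 0$; the function name `predict` and the sample points are assumptions made for illustration.

```python
def predict(theta, x1, x2):
    """Predict y = 1 when theta^T x >= 0, which is equivalent to h_theta(x) >= 0.5."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]   # theta_0 = -3, theta_1 = 1, theta_2 = 1

print(predict(theta, 2.0, 2.0))   # x1 + x2 = 4 > 3, top right of the line: 1
print(predict(theta, 1.0, 1.0))   # x1 + x2 = 2 < 3, bottom left of the line: 0
print(predict(theta, 1.5, 1.5))   # x1 + x2 = 3, on the boundary: 1 (since >= 0)
```

Note that there is no need to evaluate the sigmoid to classify a point; the sign of $\theta^Tx$ is enough.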

(TODO: graph 2)

The input to the sigmoid function $g(z)$ (e.g. $\theta^Tx$) doesn't need to be linear, and could be a function that describes a circle (e.g. $z = \theta_0 + \theta_1x_1^2+\theta_2x_2^2$) or any shape to fit our data.
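To make the circular case concrete, here is a sketch that assumes the hypothetical values $\theta_0 = -1$, $\theta_1 = 1$, $\theta_2 = 1$, so that the decision boundary is the unit circle $x_1^2 + x_2^2 = 1$. These values and the name `predict_circle` are chosen only for illustration.

```python
def predict_circle(theta, x1, x2):
    """Predict y = 1 when z = theta_0 + theta_1*x1^2 + theta_2*x2^2 >= 0."""
    z = theta[0] + theta[1] * x1 ** 2 + theta[2] * x2 ** 2
    return 1 if z >= 0 else 0

theta = [-1.0, 1.0, 1.0]   # hypothetical: boundary is x1^2 + x2^2 = 1

inside = predict_circle(theta, 0.5, 0.5)    # 0.25 + 0.25 = 0.5 < 1
outside = predict_circle(theta, 1.0, 1.0)   # 1 + 1 = 2 > 1
print(inside, outside)                      # 0 inside the circle, 1 outside
```

The model is still linear in $\theta$; only the features ($x_1^2$ and $x_2^2$) are non-linear in the original inputs.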

(TODO: graph 3)

## Logistic Regression Model

### Cost Function

We have a training set:

$$ \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\} $$

with $m$ samples. Each $x$ is a feature vector in $\mathbb{R}^{n+1}$:

$$ x = \begin{bmatrix}
x_0 \\
x_1 \\
\vdots \\
x_n \\
\end{bmatrix}$$

Where $x_0 = 1$. The output values for $y$ can be 0 or 1, so: 

$$y \in \{0, 1\}$$

Our hypothesis looks like this:

$$ h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}} $$

And $\theta$ is a parameter vector of the same length:

$$ \theta = \begin{bmatrix}
\theta_0 \\
\theta_1 \\
\vdots \\
\theta_n \\
\end{bmatrix}$$

How do we choose values for the parameters $\theta$?

Our cost function for linear regression looks like this:

$$ J(\theta) = \frac{1}{m} \displaystyle \sum_{i=1}^{m} \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2 $$

We can't really use this cost function for logistic regression because, when combined with our sigmoid hypothesis, $J(\theta)$ is non-convex. It can have many local minima, so gradient descent isn't guaranteed to converge to the global minimum.

(TODO: graph 4)

Let's extract part of the cost function for linear regression above:

$$ \text{Cost}(h_\theta(x^{(i)}), y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2 $$

$$ J(\theta) = \frac{1}{m} \displaystyle \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)}) $$

We want to keep the idea of summing over the training set, but modify the specifics of how we calculate the cost itself. For logistic regression, the crucial part of the cost function will be:

$$ \text{Cost}(h_\theta(x), y) = \left\{
    \begin{array}{ll}
      -\log(h_\theta(x)) \text{ if } y = 1\\
      -\log(1 - h_\theta(x)) \text{ if } y = 0
    \end{array}
  \right. $$
  
To see why this works, first consider the case where $y = 1$ and so the cost is $-\log(h_\theta(x))$. The graph of $-\log(h_\theta(x))$ looks like this:

(TODO: graph 5)

You can see from the graph that when $y = 1$ and $h_\theta(x) = 1$ the cost is 0, yet as $h_\theta(x) \to 0$, cost $\to \infty$. That is, the less confident our hypothesis is that the correct value is 1 (which it is), the greater the cost.

Now let's consider the case where $y = 0$ and the cost is $-\log(1 - h_\theta(x))$.

(TODO: graph 6)

In this case, when $y = 0$ and $h_\theta(x) = 0$ the cost is 0, and as $h_\theta(x) \to 1$, cost $\to \infty$. The less confident the hypothesis is that $y = 0$, the higher the cost.

Notice as well that for both graphs the cost function is convex.
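The behaviour of the two branches can be checked numerically. The following is a minimal sketch; the function name `cost` and the sample values of $h_\theta(x)$ are assumptions made for illustration.

```python
import math

def cost(h, y):
    """Per-example cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

# A confident-and-right prediction is cheap; a confident-and-wrong one is expensive.
for h in (0.99, 0.5, 0.01):
    print(h, cost(h, 1), cost(h, 0))
```

For $y = 1$ the cost grows as $h_\theta(x)$ moves from 0.99 towards 0.01, and for $y = 0$ it grows in the opposite direction. At $h_\theta(x) = 0.5$ both branches give $\log 2$.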

### Simplified Cost Function and Gradient Descent

#### Cost Function

Our overall cost function is:

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)}) $$
  
Where:

$$ \text{Cost}(h_\theta(x), y) = \left\{
\begin{array}{ll}
  -\log(h_\theta(x)) \text{ if } y = 1\\
  -\log(1 - h_\theta(x)) \text{ if } y = 0
\end{array}
\right. $$

We can condense this onto one line:

$$\text{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$$

To see why this is equivalent, remember that $y$ can only be 1 or 0. Try substituting in $y = 0$. The first term on the right hand side vanishes and you're left with $-\log(1 - h_\theta(x))$. Then try substituting in $y = 1$. The second term on the right hand side vanishes, leaving $-\log(h_\theta(x))$.

In full, then, our cost function for logistic regression is:

$$ J(\theta) = -\frac{1}{m} \bigg[\sum_{i=1}^{m} y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\bigg] $$

Vectorised:

$$ h = g(X\theta) $$
$$ J(\theta) = \frac{1}{m} \cdot \left(-y^T\log(h) - (1 - y)^T\log(1 - h)\right) $$
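The full cost function can be sketched in plain Python as a loop over the training set. The toy data set and the parameter values below are hypothetical, and the simple `sigmoid` here would overflow for very large negative inputs, which doesn't arise with these values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_function(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1 - y)*log(1 - h))."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m

# Hypothetical toy data: each row is [x_0, x_1] with x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

j_zero = cost_function([0.0, 0.0], X, y)    # h = 0.5 everywhere, so J = log(2)
j_better = cost_function([-4.0, 2.0], X, y) # boundary at x_1 = 2 separates the classes
print(j_zero, j_better)
```

With $\theta = 0$ every prediction is 0.5 and the cost is $\log 2$; parameters that separate the two classes give a lower cost.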

#### Gradient Descent

Since $J(\theta)$ is convex, we can use the gradient descent algorithm:

$$\begin{align*}
&\text{Repeat } \{ \\
&\qquad\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) \\
&\}
\end{align*}$$

Finding the partial derivative of the cost function gives us:

$$ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$
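A sketch of where this result comes from, for a single training example and writing $h = h_\theta(x) = g(\theta^Tx)$. It relies on the identity $g'(z) = g(z)(1 - g(z))$, a standard property of the sigmoid that these notes don't derive:

$$\begin{align*}
\frac{\partial h}{\partial\theta_j} &= g(\theta^Tx)\,(1 - g(\theta^Tx))\,x_j = h(1 - h)\,x_j \\
\frac{\partial}{\partial\theta_j}\Big[-y\log(h) - (1 - y)\log(1 - h)\Big]
&= -\frac{y}{h}\,h(1 - h)\,x_j + \frac{1 - y}{1 - h}\,h(1 - h)\,x_j \\
&= -y(1 - h)\,x_j + (1 - y)\,h\,x_j \\
&= (h - y)\,x_j
\end{align*}$$

Averaging this over all $m$ training examples gives the expression above.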

Our gradient descent algorithm now looks like this:

$$\begin{align*}
&\text{Repeat } \{ \\
&\qquad\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \\
&\}
\end{align*}$$

This looks identical to the gradient descent update for linear regression. It isn't the same algorithm, though, because the hypothesis $h_\theta(x)$ is now the sigmoid $g(\theta^Tx)$ rather than $\theta^Tx$.

A vectorised implementation of the gradient descent algorithm:

$$\theta := \theta - \alpha\frac{1}{m}\sum_{i = 1}^m [(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}]$$

or more compactly:

$$\theta := \theta - \frac{\alpha}{m}X^T(g(X\theta) - \overrightarrow{y})$$
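The update rule can be sketched in plain Python as follows. This is an illustrative implementation under assumptions: the toy data, the learning rate `alpha=0.1`, the iteration count and all function names are chosen for the example, and lists stand in for the matrix operations in the vectorised formula.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, X, y):
    """(1/m) * sum((h_theta(x_i) - y_i) * x_i), one entry per parameter."""
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        error = sigmoid(sum(t * x for t, x in zip(theta, x_i))) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += error * x_ij / m
    return grad

def gradient_descent(X, y, alpha, iterations):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = gradient(theta, X, y)
        # Simultaneous update: every theta_j uses the gradient at the old theta.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data: each row is [x_0, x_1] with x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

theta = gradient_descent(X, y, alpha=0.1, iterations=1000)
predictions = [1 if sigmoid(sum(t * x for t, x in zip(theta, row))) >= 0.5 else 0
               for row in X]
print(theta, predictions)
```

Computing the whole gradient before touching `theta` is what makes the update simultaneous, which is the behaviour the vectorised formula describes.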


### Advanced Optimization

Gradient descent isn't the only optimization algorithm on the market. You could also use one of the following:

- Conjugate gradient
- BFGS
- L-BFGS

These are often faster than gradient descent, and they don't require you to choose a value for the learning rate $\alpha$. On the downside, they're more complex, so it's usually better to rely on an existing, well-tested implementation than to write your own. Plenty of implementations are available in various languages.

To use one of these advanced optimization algorithms in Octave, first you have to provide a function that evaluates the following two functions for a given input value $\theta$: 

$$ J(\theta) $$
$$ \frac{\partial}{\partial\theta_j}J(\theta) $$

Which might look like this:

```octave
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end
```

Then we can use Octave's "fminunc()" optimization function along with the "optimset()" function, which creates an object containing the options we want to send to "fminunc()".

```octave
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```

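We pass "fminunc()" a handle to our cost function, our initial vector of theta values, and the "options" object that we created beforehand. Setting 'GradObj' to 'on' tells "fminunc()" that our function returns the gradient as well as the cost, and 'MaxIter' caps the number of iterations.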

## Multiclass Classification

### One-vs-all

Say you have a number of discrete categories that you want to sort your data into. For example, you might want to train a classifier that can predict whether an email should be put into a work, friends or family folder.

$y$ can now take on a range of values, in this example 1, 2 or 3.

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$, i.e. go through each class in turn, treating it as the positive class and all the others as negative, until you have values for the $\theta$ parameters for each one.

$$h_\theta^{(i)}(x) = P(y = i|x;\theta) \qquad (i = 1, 2, 3)$$

To make a prediction on a new $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$, i.e. pick the classifier that is most confident.
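The prediction step can be sketched as follows. The three parameter vectors are hypothetical stand-ins for classifiers that have already been trained, and the name `predict_one_vs_all` is an assumption made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return the class whose classifier output is largest, plus all the scores."""
    scores = {label: sigmoid(sum(t * xi for t, xi in zip(theta, x)))
              for label, theta in thetas.items()}
    return max(scores, key=scores.get), scores

# Hypothetical, already-trained parameters for classes 1, 2 and 3.
thetas = {
    1: [2.0, -1.0, -1.0],
    2: [-3.0, 2.0, -1.0],
    3: [-3.0, -1.0, 2.0],
}

x = [1.0, 3.0, 0.5]   # x_0 = 1, then two features
label, scores = predict_one_vs_all(thetas, x)
print(label, scores)  # class 2 has the largest score for this x
```

Each classifier is trained independently, so the scores are not guaranteed to sum to 1; only their relative order matters for the prediction.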