# Logistic Regression Model

## 1. Cost function

We assume that we have a training set $\{(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \dots, (\boldsymbol{x}^{(m)}, y^{(m)})\}$ with $m$ examples, where 

$$\boldsymbol{x}=\left[
\begin{array}{c}
x_0 \\
x_1 \\
\vdots \\
x_n
\end{array}
\right]
$$

with $x_0 = 1$ and $y\in\{0, 1\}.$

The hypothesis function is:

$$
h_{\theta}(\boldsymbol{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^T\boldsymbol{x}}} = \frac{1}{1 + e^{-\boldsymbol{x}^T\boldsymbol{\theta}}}
$$
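As a minimal sketch of the hypothesis above, assuming NumPy is available: the names `sigmoid` and `hypothesis` and the numeric values are illustrative choices, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied componentwise."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x); x must include the entry x_0 = 1."""
    return sigmoid(np.dot(theta, x))

theta = np.array([0.0, 1.0, -1.0])  # illustrative parameters (assumption)
x = np.array([1.0, 2.0, 2.0])       # x_0 = 1 followed by two features
h = hypothesis(theta, x)            # theta^T x = 0 here, so h = 0.5
```

Because the output always lies strictly between 0 and 1, it can be read as an estimate of the probability that $y = 1$.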

How can we choose the parameters $\boldsymbol{\theta}$? Does the linear regression cost work?

Recall that the linear regression cost could be expressed as:

$$
J(\boldsymbol{\theta}) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(\boldsymbol{x}^{(i)}) - y^{(i)})^2.
$$

It turns out that, with this hypothesis, this cost function is not convex in $\boldsymbol{\theta}$, so gradient descent is not guaranteed to reach the global minimum.
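The non-convexity can be checked numerically on a toy case. The sketch below assumes a single training example with $x = 1$, $y = 1$ and a scalar parameter (a hypothetical setup chosen only for illustration), and looks at the sign of the discrete second differences of the squared-error cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup (assumption): one example with x = 1, y = 1, scalar theta.
theta = np.linspace(-6.0, 6.0, 241)
h = sigmoid(theta)

# Squared-error cost with m = 1: J(theta) = (1/2) * (h - y)^2.
squared_error = 0.5 * (h - 1.0) ** 2

# Second differences on a uniform grid approximate the curvature of J.
curvature = np.diff(squared_error, n=2)

has_negative_curvature = bool(np.any(curvature < 0))
has_positive_curvature = bool(np.any(curvature > 0))
```

A convex function has non-negative second differences everywhere on a uniform grid, so finding both signs shows that this cost is not convex even in one dimension.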

Now, let's consider the **logistic regression cost function**:

$$
J(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{cost}(h_{\theta}(\boldsymbol{x}^{(i)}), y^{(i)}),
$$

where

$$
\mathrm{cost}(h_{\theta}(\boldsymbol{x}), y) = \left\lbrace
\begin{array}{ccc}
-\log(h_{\theta}(\boldsymbol{x})) & \mathrm{if} & y=1 \\
-\log(1-h_{\theta}(\boldsymbol{x})) & \mathrm{if} & y=0.
\end{array}
\right.
$$

Note that $\mathrm{cost}(h_{\theta}(\boldsymbol{x}), y) = 0$:
- If $y=1$ and $h_{\theta}(\boldsymbol{x})=1$, or
- If $y=0$ and $h_{\theta}(\boldsymbol{x})=0$.

Otherwise, if for example $h_{\theta}(\boldsymbol{x})\to 0$ and $y=1$, then $\mathrm{cost}(h_{\theta}(\boldsymbol{x}), y)\to \infty$.

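Thus, this cost function captures the desired behavior: confident correct predictions cost almost nothing, while confident wrong predictions are penalized heavily.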
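To make this behavior concrete, here is a small sketch that evaluates the piecewise cost for a few predicted values. The function name `cost_piecewise` and the sample values of $h$ are illustrative assumptions.

```python
import numpy as np

def cost_piecewise(h, y):
    """Per-example cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# Sample predictions (assumption), from confident-positive to confident-negative.
for h in (0.9, 0.5, 0.1, 0.001):
    c1 = cost_piecewise(h, 1)
    c0 = cost_piecewise(h, 0)
    print(f"h = {h}: cost for y=1 is {c1:.4f}, cost for y=0 is {c0:.4f}")
```

For $y = 1$ the cost grows without bound as $h$ approaches 0, and symmetrically for $y = 0$ as $h$ approaches 1.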

## 2. Simplified Cost Function and Gradient Descent

Note that the term $\mathrm{cost}(h_{\theta}(\boldsymbol{x}), y)$ can be written as a single expression:

$$
\mathrm{cost}(h_{\theta}(\boldsymbol{x}), y) = \left\lbrace
\begin{array}{ccc}
-\log(h_{\theta}(\boldsymbol{x})) & \mathrm{if} & y=1 \\
-\log(1-h_{\theta}(\boldsymbol{x})) & \mathrm{if} & y=0
\end{array}
\right. = -y\log(h_{\theta}(\boldsymbol{x})) - (1-y)\log(1-h_{\theta}(\boldsymbol{x})).
$$
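The equivalence holds because exactly one of $y$ and $1-y$ is zero when $y\in\{0,1\}$. A quick numerical check, with hypothetical function names `cost_piecewise` and `cost_single`:

```python
import numpy as np

def cost_piecewise(h, y):
    """Two-case definition of the per-example cost."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

def cost_single(h, y):
    """Single-expression form: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * np.log(h) - (1 - y) * np.log(1.0 - h)

# Compare both forms on a grid of predictions strictly inside (0, 1).
hs = np.linspace(0.01, 0.99, 99)
max_gap = max(abs(cost_single(h, y) - cost_piecewise(h, y))
              for h in hs for y in (0, 1))
```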

Thus, the logistic regression cost function can be rewritten as:

$$
J(\boldsymbol{\theta}) = -\frac{1}{m}\sum_{i=1}^{m} \left[y^{(i)}\log(h_{\theta}(\boldsymbol{x}^{(i)})) + (1-y^{(i)})\log(1-h_{\theta}(\boldsymbol{x}^{(i)}))\right].
$$

Now, recalling that 

$$
h_{\theta}(\boldsymbol{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^T\boldsymbol{x}}} = \frac{1}{1 + e^{-\boldsymbol{x}^T\boldsymbol{\theta}}},
$$

we can write this cost function in a vectorized form as:

$$
J(\boldsymbol{\theta}) = -\frac{1}{m} \left[y^T \log\left(\frac{1}{1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}}\right) + (1- y)^T \log\left(\frac{e^{-\boldsymbol{X}\boldsymbol{\theta}}}{1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}}\right)\right],
$$

where

$$
\boldsymbol{X} = \left[
\begin{array}{c}
\boldsymbol{x}^{(1)} \ ^T \\
\boldsymbol{x}^{(2)} \ ^T \\
\vdots                    \\
\boldsymbol{x}^{(m)} \ ^T
\end{array}
\right] = \left[
\begin{array}{ccccc}
x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \dots  & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \dots  & x_n^{(2)} \\
\vdots    & \vdots    & \vdots    & \ddots & \vdots    \\
x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \dots  & x_n^{(m)}
\end{array}
\right] = \left[
\begin{array}{ccccc}
1         & x_1^{(1)} & x_2^{(1)} & \dots  & x_n^{(1)} \\
1         & x_1^{(2)} & x_2^{(2)} & \dots  & x_n^{(2)} \\
\vdots    & \vdots    & \vdots    & \ddots & \vdots    \\
1         & x_1^{(m)} & x_2^{(m)} & \dots  & x_n^{(m)}
\end{array}
\right] \in \mathbb{R}^{m \times (n+1)}
$$

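is the matrix whose rows are the training examples, $y = [y^{(1)}, \dots, y^{(m)}]^T$ is the vector of labels, $1$ denotes the vector of ones where it is added to or subtracted from a vector, and the functions $e^{(\cdot)}$ and $\log{(\cdot)}$, as well as the fractions, are understood componentwise.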
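The vectorized cost translates almost directly into NumPy. This is a sketch under the assumption that `X` already contains the leading column of ones; the tiny data set is made up for illustration, and the code uses $1 - h$ in place of the equivalent $e^{-\boldsymbol{X}\boldsymbol{\theta}} / (1 + e^{-\boldsymbol{X}\boldsymbol{\theta}})$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Vectorized J(theta) = -(1/m) [y^T log(h) + (1 - y)^T log(1 - h)]."""
    m = X.shape[0]
    h = sigmoid(X @ theta)  # componentwise hypothesis, shape (m,)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# Made-up data (assumption): m = 3 examples, n = 1 feature plus the x_0 = 1 column.
X = np.array([[1.0, 0.5],
              [1.0, -1.5],
              [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

# At theta = 0 every prediction is 0.5, so J equals log(2).
J0 = cost(np.zeros(2), X, y)
```

This direct form can lose precision when $|\boldsymbol{X}\boldsymbol{\theta}|$ is large, since $h$ rounds to 0 or 1 and the logarithm diverges; a production implementation would typically guard against that.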

With the above vectorization, the gradient of the cost function follows from the chain rule. Writing $\odot$ for the componentwise product and $\mathrm{diag}(\boldsymbol{v})$ for the diagonal matrix with the entries of $\boldsymbol{v}$ on its diagonal:

\begin{align}
\left(\frac{\partial}{\partial \boldsymbol{\theta}} J(\boldsymbol{\theta})\right)^T &= - \frac{1}{m} \left[y \odot \left(1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}\right) - (1- y) \odot \left(\frac{1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}}{e^{-\boldsymbol{X}\boldsymbol{\theta}}}\right)\right]^T \frac{\partial}{\partial \boldsymbol{\theta}}\left(\frac{1}{1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}}\right) \\
&= - \frac{1}{m} \left[y \odot \left(1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}\right) - (1- y) \odot \left(\frac{1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}}{e^{-\boldsymbol{X}\boldsymbol{\theta}}}\right)\right]^T \mathrm{diag}\left(\frac{e^{-\boldsymbol{X}\boldsymbol{\theta}}}{\left(1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}\right)^2}\right) \boldsymbol{X}\\
&= - \frac{1}{m} \left[y \odot \left(1 - \frac{1}{1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}}\right) - (1- y) \odot \left(\frac{1}{1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}}\right)\right]^T \boldsymbol{X}\\
&= \frac{1}{m} \left[\left(\frac{1}{1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}}\right)^T - y^T\right] \boldsymbol{X}
\end{align}

Transposing both sides gives:

$$
\frac{\partial}{\partial \boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{m} \boldsymbol{X}^T \left[\left(\frac{1}{1 + e^{-\boldsymbol{X}\boldsymbol{\theta}}}\right) - y\right].
$$
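A common sanity check for a closed-form gradient like this one is to compare it with central finite differences of the cost. The sketch below assumes NumPy; the helper `numerical_gradient`, the step size `eps`, and the data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / X.shape[0]

def gradient(theta, X, y):
    """Closed-form gradient: (1/m) X^T (sigmoid(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite-difference approximation of the gradient of the cost."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Made-up data and parameters (assumption).
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.array([0.3, -0.7])

max_abs_diff = np.max(np.abs(gradient(theta, X, y) - numerical_gradient(theta, X, y)))
```

If the two gradients agree to within the finite-difference error, the derivation and the implementation are consistent with each other.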

With the gradient in hand, we can apply a numerical optimization method, such as gradient descent, to minimize $J(\boldsymbol{\theta})$ and find the parameters.
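As one possible choice of optimizer, batch gradient descent repeats the update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \frac{\partial}{\partial \boldsymbol{\theta}} J(\boldsymbol{\theta})$. The sketch below is a minimal version under stated assumptions: the learning rate `alpha`, the iteration count `num_iters`, the zero initialization, and the data are all illustrative and would need tuning on a real problem.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / X.shape[0]

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent from theta = 0; returns theta and the cost history."""
    theta = np.zeros(X.shape[1])
    history = [cost(theta, X, y)]
    for _ in range(num_iters):
        theta = theta - alpha * gradient(theta, X, y)
        history.append(cost(theta, X, y))
    return theta, history

# Made-up, non-separable data (assumption): one feature plus the x_0 = 1 column.
features = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(features), features])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta_hat, history = gradient_descent(X, y)
print("initial cost:", history[0], "final cost:", history[-1])
```

Tracking the cost history is a simple way to confirm that the chosen step size actually decreases $J(\boldsymbol{\theta})$ at every iteration.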

<footer id="attribution" style="float:right; color:#808080; background:#fff;">
Created with Jupyter by Esteban Jiménez Rodríguez. Based on the content of the Machine Learning course offered through Coursera by Prof. Andrew Ng.
</footer>