# Logistic Regression

Linear regression is usually a poor choice for classification problems because class labels do not fall along a line. A more robust algorithm in this situation is logistic regression, in which we assume that the model's prediction satisfies $$h_{\theta} (\mathbf{x}) \in [0, 1]$$ and that the prediction follows a sigmoid function instead of a straight line: $$h_{\theta} (\mathbf{x}) = \frac{1}{1 + e^{-\theta^T \mathbf{x}}}$$

Furthermore, we assume that each label $y$ can **only** be $0$ or $1$, with:
- $P(y=1 \ |\  \mathbf{x}; \theta) = h_{\theta} (\mathbf{x})$
- $P(y=0 \ |\  \mathbf{x}; \theta) = 1 - h_{\theta} (\mathbf{x})$
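
As a sketch of the step from these two cases to a likelihood (assuming the $m$ training examples are independent, and writing $L(\theta)$ for the likelihood, a symbol introduced here only for illustration): both cases can be combined into a single expression, $$
P(y \ |\  \mathbf{x}; \theta) = h_{\theta}(\mathbf{x})^{y} \, (1 - h_{\theta}(\mathbf{x}))^{1 - y}
$$
so the likelihood of the whole training set is $$
L(\theta) = \prod_{i=1}^{m} h_{\theta}(x^{(i)})^{y^{(i)}} \, (1 - h_{\theta}(x^{(i)}))^{1 - y^{(i)}}
$$
Taking the logarithm turns this product into a sum over the training examples.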

From these assumptions, it can be shown that the best parameter $\theta$ is the one that maximizes the **log likelihood**: $$
l(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]
$$
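
To make the formula concrete, here is a minimal sketch of how the log likelihood could be evaluated with NumPy. The names `sigmoid`, `log_likelihood`, `eps`, `x_demo`, and `y_demo` are assumptions made for this illustration, and the clipping by `eps` is an added safeguard against `log(0)`, not part of the formula itself.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, x, y, eps=1e-12):
    """
    theta: shape (n+1,), x: shape (m, n+1), y: shape (m,) with entries 0 or 1.
    eps is a hypothetical safeguard that keeps log() away from 0.
    """
    p = np.clip(sigmoid(x @ theta), eps, 1.0 - eps)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Toy data (made up for illustration): a bias column of ones plus one feature.
x_demo = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]])
y_demo = np.array([0.0, 1.0, 1.0])
ll_at_zero = log_likelihood(np.zeros(2), x_demo, y_demo)
```

With $\theta = 0$ every prediction is $0.5$, so the value is $m \log 0.5$, which gives a quick sanity check for any implementation.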


In [None]:
import math, copy
import numpy as np
from numpy import ndarray
import matplotlib.pyplot as plt

In [None]:
def h(theta, x):
    """
    Compute the hypothesis
    Args:
        theta: Shape (n+1,).
        x:     All training examples of shape (m, n+1).
    
    Return: The hypothesis for all training examples. Shape (m,).

    Complexity: O(m * n)
    """
    S = - x @ theta

    return 1/(1 + np.exp(S))
    

### Variables:
* $n$ is the number of features.
* $m$ is the number of training examples.
* $\mathbf{\theta}$ is the parameter vector, an $(n+1)$-dimensional vector, where $\mathbf{\theta}_0$ is the bias.
* $\mathbf{x}^{(i)}$ is the feature vector of the $i^{th}$ example, an $(n+1)$-dimensional vector.
* $y^{(i)}$ is the actual target value for the $i^{th}$ example.
* $J(\theta)$ is the cost function.

## Newton's method
We can use Newton's method to implement logistic regression. Let $f$ be a real-valued function. Newton's method finds a zero of the function $f(\theta)$ by iteratively performing the update: $$ \theta := \theta - \frac{f(\theta)}{f'(\theta)} $$ 

Since the maxima of the log likelihood function $l$ occur at points where $l'(\theta) = 0$, we can use Newton's method to find a maximum of $l$ by setting $f = l'$, giving the update rule: $$\theta := \theta - \frac{l'(\theta)}{l''(\theta)}$$
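
The one-dimensional update can be sketched in a few lines. The function `newton_1d` and the toy objective $l(\theta) = \log \theta - \theta$ are assumptions chosen only for illustration (this objective is concave for $\theta > 0$ with its maximum at $\theta = 1$); it is not the logistic regression log likelihood.

```python
def newton_1d(f, f_prime, theta0, num_iters=20):
    """Repeat theta := theta - f(theta) / f'(theta) a fixed number of times."""
    theta = theta0
    for _ in range(num_iters):
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Toy objective l(theta) = log(theta) - theta, so that
# l'(theta) = 1/theta - 1 and l''(theta) = -1/theta**2.
l_prime = lambda t: 1.0 / t - 1.0
l_double_prime = lambda t: -1.0 / (t * t)

# Maximize l by finding a zero of f = l'.
theta_max = newton_1d(l_prime, l_double_prime, theta0=0.5)
```

For this toy objective the update simplifies to $\theta := 2\theta - \theta^2$, so the error $1 - \theta$ is squared at every step, which is the quadratic convergence behavior in miniature.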

More generally, the update rule for the multi-dimensional Newton's method is $$
\theta := \theta - H^{-1} \nabla_{\theta} l(\theta)
$$
where $$
H_{ij} = \frac{\partial^2 l(\theta)}{\partial \theta_i \partial \theta_j}
$$
is the Hessian matrix.
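
A minimal sketch of this update applied to logistic regression is shown here. It assumes the standard gradient $\nabla_{\theta} l(\theta) = \sum_{i} (y^{(i)} - h_{\theta}(x^{(i)})) x^{(i)}$ and the Hessian formula from the Hessian section of this notebook. The names `sigmoid`, `newton_logistic`, `x_demo`, and `y_demo` are hypothetical, the toy data is made up, and there is no step-size control or convergence test, so treat it as an illustration rather than a robust solver.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(x, y, num_iters=25):
    """
    x: shape (m, n+1) with a leading column of ones, y: shape (m,) of 0/1 labels.
    Returns theta of shape (n+1,) after a fixed number of Newton updates.
    """
    theta = np.zeros(x.shape[1])
    for _ in range(num_iters):
        p = sigmoid(x @ theta)
        grad = x.T @ (y - p)                        # gradient of l, shape (n+1,)
        H = -(x * (p * (1.0 - p))[:, None]).T @ x   # Hessian of l, shape (n+1, n+1)
        # Solve H d = grad instead of forming H^{-1} explicitly.
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Toy, non-separable data: a bias column plus one feature.
x_demo = np.column_stack([np.ones(6), np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])])
y_demo = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta_hat = newton_logistic(x_demo, y_demo)
grad_at_solution = x_demo.T @ (y_demo - sigmoid(x_demo @ theta_hat))
```

Solving the linear system with `np.linalg.solve` gives the same step as multiplying by $H^{-1}$ but avoids computing the inverse. If the classes were perfectly separable, the maximum would not be attained at a finite $\theta$ and this plain iteration would be expected to misbehave.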


Advantages:
- Compared to gradient descent/ascent, Newton's method takes a bigger jump at each step.
- Quadratic convergence

Disadvantage:
- In higher dimensions, each step requires inverting an $(n+1) \times (n+1)$ matrix, so it becomes very computationally expensive.

But overall, if there are around **10 - 50** parameters, Newton's method is a very good way to implement logistic regression.


## Hessian matrix for the log likelihood function
Recall that the log likelihood function is $$
l(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]
$$
The Hessian matrix of the log likelihood function can be computed as $$
H = - \sum_{i=1}^{m} h_{\theta}(x^{(i)})(1 - h_{\theta}(x^{(i)})) x^{(i)} (x^{(i)})^T
$$

Note:
This formula for the Hessian does not depend on $y$.
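
As a sketch of why this holds, write $h_{\theta}(\mathbf{x}) = \sigma(\theta^T \mathbf{x})$, where $\sigma$ is notation introduced here for the sigmoid, and use the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. Differentiating $l$ once gives $$
\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) x^{(i)}_j
$$
and differentiating again with respect to $\theta_k$ gives $$
\frac{\partial^2 l(\theta)}{\partial \theta_j \partial \theta_k} = - \sum_{i=1}^{m} h_{\theta}(x^{(i)}) (1 - h_{\theta}(x^{(i)})) \, x^{(i)}_j x^{(i)}_k
$$
The term $y^{(i)} x^{(i)}_j$ in the gradient does not involve $\theta$, so it drops out of the second derivative.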

In [None]:
def hessian(theta, x):
    """
    Compute the Hessian matrix for the log likelihood function l.
    Args:
        theta: Shape (n+1,).
        x:     All training examples of shape (m, n+1).

    Return: The Hessian matrix. Shape (n+1, n+1).
    """
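    m, d = x.shape  # d = n + 1
    # Use a name other than `h` so the hypothesis function is not shadowed.
    h_x = h(theta, x)
    H = np.zeros((d, d))

    for i in range(m):
        # x^{(i)} (x^{(i)})^T is an outer product with shape (n+1, n+1).
        H = H - h_x[i] * (1 - h_x[i]) * np.outer(x[i], x[i])

    return H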