# Logistic Regression

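The model first computes a linear signal $s$ from the inputs, where $x_0 = 1$ so that $w_0$ acts as the bias:

$$
s=\sum_{i=0}^d w_i x_i
$$

It then passes the signal through the logistic function $\theta$ to get the hypothesis:

$$
h(\mathbf{x})=\theta(s)
$$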


![image.png](../images/logistic-regression.png)

The logistic function $\theta$:

$$
\theta(s)=\frac{e^s}{1+e^s}
$$

Its output ranges from 0 to 1. It is called a "soft threshold" or a "sigmoid" (because its graph looks like an S).
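As a quick numerical illustration, here is a minimal sketch of the logistic function in NumPy. The function name `theta` and the sample inputs are chosen here for illustration only; the sketch uses the algebraically equivalent form $\frac{1}{1+e^{-s}}$.

```python
import numpy as np


def theta(s):
    """Logistic function e^s / (1 + e^s), written as 1 / (1 + e^-s)."""
    return 1.0 / (1.0 + np.exp(-s))


s = np.array([-5.0, 0.0, 5.0])
# Close to 0 for very negative s, exactly 0.5 at s = 0, close to 1 for very positive s
print(theta(s))  # roughly [0.0067, 0.5, 0.9933]
```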

# Target function

The data are pairs $(\mathbf{x}, y)$ with binary $y \in \{-1, +1\}$, generated by a noisy target:

$$
P(y \mid \mathbf{x})= \begin{cases}f(\mathbf{x}) & \text { for } y=+1 \\ 1-f(\mathbf{x}) & \text { for } y=-1\end{cases}
$$

The target $f: \mathbb{R}^d \rightarrow[0,1]$ is the probability that $y=+1$ given $\mathbf{x}$.

We will learn $g(\mathbf{x})=\theta\left(\mathbf{w}^{\top} \mathbf{x}\right) \approx f(\mathbf{x})$.

# Error measure

A plausible error measure is based on likelihood: if $h = f$, how likely is it that we would get $y$ from $\mathbf{x}$?

$$
P(y \mid \mathbf{x})= \begin{cases}h(\mathbf{x}) & \text { for } y=+1 \\ 1-h(\mathbf{x}) & \text { for } y=-1\end{cases}
$$

Substituting $h(\mathbf{x})=\theta\left(\mathbf{w}^{\top} \mathbf{x}\right)$ and noting that $\theta(-s)=1-\theta(s)$, both cases collapse into one expression:

$$
P(y \mid \mathbf{x})=\theta\left(y \mathbf{w}^{\top} \mathbf{x}\right)
$$
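The identity $\theta(-s)=1-\theta(s)$ used above follows directly from the definition of $\theta$. A short check, written out as a sketch:

$$
\theta(-s)=\frac{e^{-s}}{1+e^{-s}}=\frac{1}{e^{s}+1}=1-\frac{e^{s}}{1+e^{s}}=1-\theta(s)
$$

So for $y=+1$ we get $\theta\left(\mathbf{w}^{\top} \mathbf{x}\right)=h(\mathbf{x})$, and for $y=-1$ we get $\theta\left(-\mathbf{w}^{\top} \mathbf{x}\right)=1-h(\mathbf{x})$, matching the two cases.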

So if $h = f$, and the data points are generated independently, the likelihood of the entire data set $\mathcal{D}$ is:

$$
\prod_{n=1}^N P\left(y_n \mid \mathbf{x}_n\right)=\prod_{n=1}^N \theta\left(y_n \mathbf{w}^{\top} \mathbf{x}_n\right)
$$

Because $\ln$ is monotonically increasing and $\frac{1}{N}$ is a positive constant, maximizing this likelihood is the same as maximizing:

$$
\frac{1}{N} \ln \left(\prod_{n=1}^N \theta\left(y_n \mathbf{w}^{\top} \mathbf{x}_n\right)\right)
$$

This is the same as minimizing its negative:

$$
\begin{aligned}
& -\frac{1}{N} \ln \left(\prod_{n=1}^N \theta\left(y_n \mathbf{w}^{\top} \mathbf{x}_n\right)\right) \\
& =\frac{1}{N} \sum_{n=1}^N \ln \left(\frac{1}{\theta\left(y_n \mathbf{w}^{\top} \mathbf{x}_n\right)}\right)
\end{aligned}
$$

Substituting $\theta(s)=\frac{1}{1+e^{-s}}$ (the earlier definition with numerator and denominator divided by $e^s$), we get the "cross-entropy" error:

$$
E_{\text {in }}(\mathbf{w})=\frac{1}{N} \sum_{n=1}^N \underbrace{\ln \left(1+e^{-y_n \mathbf{w}^{\top} \mathbf{x}_n}\right)}_{\mathrm{e}\left(h\left(\mathbf{x}_n\right), y_n\right)}
$$
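The cross-entropy error above can be computed in a few lines. This is a minimal sketch, not the implementation used later in these notes: the function name `cross_entropy_error` and the two sample points are made up for illustration, and `np.logaddexp` is swapped in for the literal `log(1 + exp(...))` to avoid overflow when the exponent is large.

```python
import numpy as np


def cross_entropy_error(w, X, y):
    """Mean of ln(1 + exp(-y_n * w^T x_n)) over the rows of X."""
    margins = y * (X @ w)
    # logaddexp(0, -m) equals ln(e^0 + e^-m) = ln(1 + e^-m), computed stably
    return np.mean(np.logaddexp(0.0, -margins))


# Two hypothetical points, each with x_0 = 1 as the first coordinate
X = np.array([[1.0, 0.5, -0.2],
              [1.0, -0.3, 0.8]])
y = np.array([1, -1])

# With w = 0 every point contributes ln(1 + e^0) = ln 2
print(cross_entropy_error(np.zeros(3), X, y))  # about 0.693
```

At $\mathbf{w}=\mathbf{0}$ the model assigns probability $0.5$ to every label, which is why the error equals $\ln 2$ regardless of the data.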



# Learning Algorithm: Gradient Descent

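Starting from the current weights $\mathbf{w}(0)$, we take a step of size $\eta$ in the direction of a unit vector $\hat{\mathbf{v}}$. The change in error is:

$$
\Delta E_{\text {in }}=E_{\text {in }}(\mathbf{w}(0)+\eta \hat{\mathbf{v}})-E_{\text {in }}(\mathbf{w}(0))
$$

By a first-order Taylor expansion around $\mathbf{w}(0)$, this is the inner product of the gradient with the step $\eta \hat{\mathbf{v}}$, plus higher-order terms:

$$
=\eta \nabla E_{\text {in }}(\mathbf{w}(0))^{\mathrm{T}} \hat{\mathbf{v}}+O\left(\eta^2\right)
$$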

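We neglect the second-order term. Since $\hat{\mathbf{v}}$ is a unit vector, the inner product $\nabla E_{\text {in }}(\mathbf{w}(0))^{\mathrm{T}} \hat{\mathbf{v}}$ can be no smaller than the negative of the gradient's norm, with equality when $\hat{\mathbf{v}}$ points in the direction opposite to the gradient: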

$$
\geq-\eta\left\|\nabla E_{\text {in }}(\mathbf{w}(0))\right\|
$$

So we choose $\hat{\mathbf{v}}$ to be the unit vector pointing in the direction opposite to the gradient, which gives the lowest possible value of $\Delta E_{\text {in }}$:

$$
\hat{\mathbf{v}}=-\frac{\nabla E_{\text {in }}(\mathbf{w}(0))}{\left\|\nabla E_{\text {in }}(\mathbf{w}(0))\right\|}
$$

We also want the step size to scale with the steepness of the gradient. We can do this by dropping the normalizing denominator, so that the step is no longer a fixed length: it is large where the gradient is steep and small where it is shallow:

$$
\Delta \mathbf{w}=-\eta \nabla E_{\text {in }}(\mathbf{w}(0))
$$
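The update rule above can be sketched as a short loop. This is an illustrative sketch on a toy error surface, not the logistic regression error: the function name `gradient_descent`, the toy error $E(\mathbf{w})=\|\mathbf{w}\|^2$, and the values of `eta` and `steps` are all assumptions made for the example.

```python
import numpy as np


def gradient_descent(grad, w0, eta=0.1, steps=50):
    """Repeatedly apply w <- w - eta * grad(w), starting from w0."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


# Toy error E(w) = ||w||^2 has gradient 2w and its minimum at the origin
w_final = gradient_descent(lambda w: 2.0 * w, [3.0, -4.0])

# Each step multiplies w by (1 - 2 * eta) = 0.8, so the steps shrink
# automatically as the gradient flattens near the minimum
print(np.linalg.norm(w_final))
```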

The gradient is made up of the partial derivatives of the pointwise error $\mathrm{e}\left(h\left(\mathbf{x}_n\right), y_n\right)=\ln \left(1+e^{-y_n \mathbf{w}^{\top} \mathbf{x}_n}\right)$ with respect to each weight $w_k$. By the chain rule:

$$
\frac{\partial\, \mathrm{e}\left(h\left(\mathbf{x}_n\right), y_n\right)}{\partial w_k}=\frac{1}{1+e^{-y_n \mathbf{w}^{\top} \mathbf{x}_n}} \cdot e^{-y_n \mathbf{w}^{\top} \mathbf{x}_n} \cdot\left(-y_n x_{nk}\right)=-\frac{y_n x_{nk}}{1+e^{y_n \mathbf{w}^{\top} \mathbf{x}_n}}
$$

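Stacking these partial derivatives over all $k$ gives the gradient of the pointwise error:

$$
\nabla \mathrm{e}=\frac{-y_n \mathbf{x}_n}{1+e^{y_n \mathbf{w}^{\top} \mathbf{x}_n}}
$$

Since $E_{\text {in }}$ is the average of the pointwise errors, its gradient is the average of these pointwise gradients:

$$
\nabla E_{\text {in }}(\mathbf{w})=-\frac{1}{N} \sum_{n=1}^N \frac{y_n \mathbf{x}_n}{1+e^{y_n \mathbf{w}^{\top} \mathbf{x}_n}}
$$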
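One way to sanity-check a derived gradient is to compare it against central finite differences. The sketch below does that for $E_{\text{in}}$, taking its gradient to be the average of the pointwise gradients (which holds because $E_{\text{in}}$ is the average of the pointwise errors). The helper names `e_in`, `e_in_gradient`, and `numerical_gradient`, the seed, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def e_in(w, X, y):
    """Cross-entropy error: mean of ln(1 + exp(-y_n * w^T x_n))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))


def e_in_gradient(w, X, y):
    """Average over n of -y_n * x_n / (1 + exp(y_n * w^T x_n))."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))
    return (coeffs[:, None] * X).mean(axis=0)


def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for k in range(len(w)):
        step = np.zeros_like(w)
        step[k] = eps
        grad[k] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad


# Synthetic data: 20 points with x_0 = 1, labeled by an arbitrary line
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.uniform(-1, 1, (20, 2))])
y = np.where(X[:, 2] > X[:, 1], 1, -1)
w = rng.normal(size=3)

analytic = e_in_gradient(w, X, y)
numeric = numerical_gradient(lambda v: e_in(v, X, y), w)
print(np.max(np.abs(analytic - numeric)))  # should be tiny if the formula is right
```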


The code below runs the experiment from Problems 8-9 of the linked homework. It uses stochastic gradient descent: the weights are updated from the pointwise gradient $\nabla \mathrm{e}$ of one point at a time, visiting the points in a fresh random order every epoch.

```python
# Problems 8-9 on https://work.caltech.edu/homework/hw5.pdf
# https://nbviewer.org/github/homefish/edX_Learning_From_Data_2017/blob/master/homework_5/hw5_p8_9_logistic_regression.ipynb is helpful
import math

import numpy as np

RUNS = 100         # independent experiments to average over
TRAINING_N = 100   # training points per run
TESTING_N = 1000   # fresh points used to estimate E_out
ETA = 0.01         # learning rate
TOLERANCE = 0.01   # stop when an epoch moves the weights by less than this


def generate_line():
    """Return slope and intercept of a line through two random points in [-1, 1]^2."""
    rng = np.random.default_rng()
    point_1, point_2 = rng.uniform(-1, 1, (2, 2))
    return slope_and_intercept(point_1, point_2)


def slope_and_intercept(point_1, point_2):
    slope = (point_2[1] - point_1[1]) / (point_2[0] - point_1[0])
    intercept = point_1[1] - (slope * point_1[0])
    return slope, intercept


def generate_data(n, slope, intercept):
    """Sample n points, label them +1 above the line and -1 otherwise, and set x_0 = 1."""
    xs = np.random.uniform(-1, 1, (n, 2))
    line_heights = (slope * xs[:, 0]) + intercept
    ys = np.where(xs[:, 1] > line_heights, 1, -1)
    xs = np.hstack((np.ones((n, 1)), xs))  # Set x_0 = 1 for all xs
    return xs, ys


def runs():
    epoch_total = 0
    e_out_total = 0

    for run in range(RUNS):
        slope, intercept = generate_line()
        xs, ys = generate_data(TRAINING_N, slope, intercept)

        weights = np.zeros(3)

        while True:
            # One epoch: visit the N points in a random order
            input_perm = np.random.permutation(len(xs))
            old_weights = weights  # safe: weights is rebound below, never modified in place

            for input_id in input_perm:
                x_n = xs[input_id]
                y_n = ys[input_id]
                gradient = (-y_n * x_n) / (1 + math.exp(y_n * np.dot(weights, x_n)))
                weights = weights - (ETA * gradient)

            epoch_total += 1

            if np.linalg.norm(weights - old_weights) < TOLERANCE:
                break

        # Estimate E_out as the cross-entropy error on freshly generated points
        xs, ys = generate_data(TESTING_N, slope, intercept)
        e_out = 0
        for i in range(TESTING_N):
            e_out += math.log(1 + math.exp(-ys[i] * np.dot(weights, xs[i])))
        e_out_total += e_out / TESTING_N

    # (average E_out, average number of epochs until convergence)
    return (e_out_total / RUNS, epoch_total / RUNS)


runs()
```

Output from one execution, as (average $E_{\text{out}}$, average number of epochs). The data are random, so the exact values change from run to run:

```
(0.10293439143927566, 344.04)
```