# Gradient descent for L2 logistic regression

Author: Alexandre Gramfort

### Logistic regression model

In logistic regression, the log ratio of the class probabilities is modeled as a linear function of the features:

$$
\log \left(\frac{\mathbb{P}\{Y=+1 \mid X=x\}}{\mathbb{P}\{Y=-1 \mid X=x\}}\right)
= x^\top w
$$

Decision function:

$$
x^\top w > 0 \Rightarrow y = 1
$$

The decision function is a **linear** function of the features!

Solving the log-ratio equation for the two conditional probabilities, which must sum to one, gives:

$$
\mathbb{P}\{Y=1 \mid X=x\} = \frac{\exp(x^\top w)}{1 + \exp(x^\top w)}
$$

$$
\mathbb{P}\{Y=-1 \mid X=x\} = \frac{1}{1 + \exp(x^\top w)}
$$
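
As a quick numerical check of the model, the sketch below evaluates both class probabilities for one feature vector and confirms that they sum to one and that their log ratio is the linear score $x^\top w$. The values of `x` and `w` are made up for illustration, and `scipy.special.expit` is used here as a numerically stable sigmoid.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid 1 / (1 + exp(-t))

# Hypothetical values, chosen only for illustration
x = np.array([5.1, 3.5])
w = np.array([0.8, -1.2])

z = x @ w            # linear score x^T w
p_pos = expit(z)     # P(Y=+1 | X=x) = exp(z) / (1 + exp(z))
p_neg = expit(-z)    # P(Y=-1 | X=x) = 1 / (1 + exp(z))

print(p_pos + p_neg)             # the two probabilities sum to one
print(np.log(p_pos / p_neg), z)  # the log ratio recovers the linear score
```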

In practice, $w$ is computed by maximizing the likelihood of the training data under this model, or equivalently by minimizing the negative log-likelihood:

$$
\hat{w} = \arg\min_{w} \; -\sum_{i=1}^n \sum_k 1_{\{Y_i = k\}} \log \left(\mathbb{P}\{Y=k \mid X=x_i, w \}\right)
$$

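With labels $y_i \in \{-1, +1\}$, both conditional probabilities can be written as $\mathbb{P}\{Y=y \mid X=x\} = 1 / (1 + \exp(-y \, x^\top w))$, so the problem becomes:

$$
\hat{w} = \arg\min_{w} \sum_{i=1}^n \log \left(1 + \exp(-y_i \, x_i^\top w) \right)
$$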
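
For reference, here is the gradient of this objective, obtained by differentiating each term with the chain rule. It uses $\frac{d}{dt} \log(1 + e^{-t}) = -\frac{1}{1 + e^{t}}$ with $t = y_i \, x_i^\top w$:

$$
\nabla_w \sum_{i=1}^n \log \left(1 + \exp(-y_i \, x_i^\top w) \right)
= -\sum_{i=1}^n \frac{y_i \, x_i}{1 + \exp(y_i \, x_i^\top w)}
$$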

With L2 regularization and a hyperparameter $\lambda$, the problem becomes:

$$
\hat{w} = \arg\min_{w} \sum_{i=1}^n \log \left(1 + \exp(-y_i \, x_i^\top w) \right) + \frac{\lambda}{2} \|w\|^2
$$

In [None]:
%matplotlib inline
import math
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# Make it binary
X = X[y < 2]
y = y[y < 2]

# add intercept column of ones
# X = np.concatenate((X, np.ones((X.shape[0], 1))), axis=1)

y[y == 0] = -1

In [None]:
plt.scatter(X[y > 0, 0], X[y > 0, 1], color='r')
plt.scatter(X[y < 0, 0], X[y < 0, 1], color='b')

In [None]:
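def f(w):
    """Unregularized logistic loss summed over the training samples."""
    pobj = np.sum(np.log(1. + np.exp(- y * np.dot(X, w))))
    return pobj

def f_grad(w):
    """Gradient of f with respect to w."""
    ywTx = y * np.dot(X, w)
    # the derivative of log(1 + exp(-t)) with respect to t is -1 / (1 + exp(t))
    temp = 1. / (1. + np.exp(ywTx))
    grad = -np.dot(X.T, (y * temp))
    return grad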

from scipy.optimize import check_grad
check_grad(f, f_grad, np.random.randn(2))
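
`check_grad` returns the 2-norm of the difference between the analytical gradient and a finite-difference approximation, so a small value means the two agree. The sketch below repeats the check on a numerically safer variant of the loss, as an optional alternative and not as the notebook's own implementation: `np.logaddexp` and `scipy.special.expit` avoid overflow in `exp` for large scores. The names `X_demo`, `y_demo`, `f_stable` and `f_grad_stable` are hypothetical, and the data is synthetic.

```python
import numpy as np
from scipy.optimize import check_grad
from scipy.special import expit

# Synthetic data, for illustration only (not the iris subset)
rng = np.random.RandomState(0)
X_demo = rng.randn(50, 2)
y_demo = np.where(rng.randn(50) > 0, 1.0, -1.0)

def f_stable(w):
    # log(1 + exp(-z)) computed as logaddexp(0, -z) to avoid overflow
    z = y_demo * (X_demo @ w)
    return np.sum(np.logaddexp(0.0, -z))

def f_grad_stable(w):
    # 1 / (1 + exp(z)) is the sigmoid evaluated at -z
    z = y_demo * (X_demo @ w)
    return -X_demo.T @ (y_demo * expit(-z))

err = check_grad(f_stable, f_grad_stable, rng.randn(2))
print(err)  # small when the analytical gradient matches finite differences
```

Unlike the direct `np.log(1. + np.exp(...))` formula, `f_stable` stays finite even for very large weights.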

In [None]:
def grad_descent(f, f_grad, w0, step_size=0.01, max_iter=100):
    """Gradient descent with constant step size.

    Returns the final iterate and the objective value after each iteration.
    """
    w = w0.copy()
    fws = []
    for k in range(max_iter):
        w -= step_size * f_grad(w)
        fws.append(f(w))
    return w, fws

n_features = X.shape[1]
x0 = np.zeros(n_features)
w_hat, fws = grad_descent(f, f_grad, x0, step_size=0.001, max_iter=5000)

plt.plot(fws, 'b')
plt.xlabel('Iterations')
plt.ylabel('Objective')
plt.show()
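
The step size above is picked by hand. One standard alternative, sketched here on synthetic data, is to derive it from the smoothness of the loss: the Hessian of the unregularized logistic loss is $X^\top D X$ with diagonal entries of $D$ in $(0, 1/4]$, so the gradient is Lipschitz with constant at most $\|X\|_2^2 / 4$, and a step of $1/L$ guarantees that the objective never increases. The names `X_demo`, `y_demo`, `f_demo` and `f_grad_demo` are hypothetical.

```python
import numpy as np

# Synthetic data, for illustration only (not the iris subset)
rng = np.random.RandomState(0)
X_demo = rng.randn(50, 2)
y_demo = np.where(rng.randn(50) > 0, 1.0, -1.0)

def f_demo(w):
    # unregularized logistic loss
    return np.sum(np.logaddexp(0.0, -y_demo * (X_demo @ w)))

def f_grad_demo(w):
    z = y_demo * (X_demo @ w)
    return -X_demo.T @ (y_demo / (1.0 + np.exp(z)))

# Upper bound on the Lipschitz constant of the gradient:
# squared largest singular value of X, divided by 4
L = np.linalg.norm(X_demo, ord=2) ** 2 / 4.0
step_size = 1.0 / L

w = np.zeros(X_demo.shape[1])
objs = [f_demo(w)]
for _ in range(200):
    w = w - step_size * f_grad_demo(w)
    objs.append(f_demo(w))

print(step_size, objs[0], objs[-1])
```

With an L2 penalty $\frac{\lambda}{2}\|w\|^2$ added to the loss, the same bound becomes $\|X\|_2^2 / 4 + \lambda$.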

In [None]:
np.sign(np.dot(X, w_hat)) - y
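
In the array above, an entry is 0 where the prediction matches the label and +2 or -2 where it does not. A minimal sketch of turning this into an accuracy score, using made-up toy arrays (`X_toy`, `y_toy` and `w_toy` are hypothetical names, not the notebook's variables):

```python
import numpy as np

# Toy values, for illustration only
X_toy = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0], [-2.0, 1.0]])
y_toy = np.array([1, 1, -1, 1])
w_toy = np.array([1.0, 0.5])

y_pred = np.sign(X_toy @ w_toy)      # predicted labels (0 exactly on the boundary)
errors = y_pred - y_toy              # 0 where correct, +2 or -2 where wrong
accuracy = np.mean(y_pred == y_toy)  # fraction of correctly classified samples
print(errors, accuracy)
```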

In [None]:
plt.scatter(X[y > 0, 0], X[y > 0, 1], color='r')
plt.scatter(X[y < 0, 0], X[y < 0, 1], color='b')
xx = np.linspace(4, 8, 10)
# Decision boundary x^T w = 0 (no intercept): second feature = -first feature * w[0] / w[1]
plt.plot(xx, -xx * w_hat[0] / w_hat[1], 'k');

<div class="alert alert-success">
    <b>QUESTION 1:</b>
     <ul>
       <li>Modify f and f_grad to add support for the regularization.
           Check your gradient with scipy.optimize.check_grad</li>
       <li>Just for fun, check your gradient against the one computed by PyTorch with automatic differentiation. Which implementation is the most accurate?</li>
    </ul>
</div>