# Logistic Regression

Consider a coin that lands tails with probability $\pi$ and heads with probability $1-\pi$. We denote the outcome of a toss by $y$, encoding tails as $1$ and heads as $0$, so that $p(y = 1) = \pi$ and $p(y = 0) = 1-\pi$. More compactly, the probability of an outcome, provided we know $\pi$, is written as
\begin{eqnarray}
p(y|\pi) = \pi^y(1-\pi)^{1-y}
\end{eqnarray}
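
As a quick sanity check, here is a minimal sketch of the Bernoulli probability $\pi^y(1-\pi)^{1-y}$ evaluated for both outcomes. The function name `bernoulli_pmf` and the value $\pi = 0.3$ are illustrative assumptions, not part of the original notes.

```python
def bernoulli_pmf(y, pi):
    """Probability of outcome y in {0, 1} for a coin with tail probability pi."""
    return pi**y * (1 - pi) ** (1 - y)


pi = 0.3  # arbitrary illustrative value (assumption)
p_tail = bernoulli_pmf(1, pi)  # y = 1 selects the factor pi
p_head = bernoulli_pmf(0, pi)  # y = 0 selects the factor 1 - pi
print(p_tail, p_head)
```

Setting $y=1$ switches off the second factor and $y=0$ switches off the first, which is why a single expression covers both cases.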

In logistic regression, we are given a dataset of the form
\begin{eqnarray}
X & = &  \begin{pmatrix}
  x_{1,1} & x_{1,2} & \dots & x_{1,D} & x_{1,D+1}\\
  x_{2,1} & x_{2,2} & \dots & x_{2,D} & x_{2,D+1}\\
  \vdots & \vdots & \vdots & \vdots  &  \vdots\\
  x_{i,1} & x_{i,2} & \dots & x_{i,D}  & x_{i,D+1}\\
  \vdots & \vdots & \vdots & \vdots & \vdots \\
  x_{N,1} & x_{N,2} & \dots & x_{N,D} & x_{N,D+1}\\
\end{pmatrix} \\
\mathbf{y} & = & \begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_i \\
\vdots \\
y_N
\end{pmatrix}
\end{eqnarray}
where $x_{i,j}$ denotes the $j$'th feature of the $i$'th data point. The $(D+1)$'th column, where $x_{i,D+1}=1$ for all $i$, is artificially added to the dataset to allow for a bias term. The $y_i$ denotes the target class label of the $i$'th data point. In logistic regression, we consider the case of binary classification where $y_i \in \{0,1\}$ (or $y_i \in \{-1,1\}$); the derivations here use the $\{0,1\}$ encoding.

To understand logistic regression, consider the following metaphor: for each data instance $x_i$ (the $i$'th row of $X$, written as a column vector), we select a biased coin with probability $p(y_i = 1| w, x_i) = \pi_i = \sigma(w^\top x_i)$, toss the coin, and label the data item with class $y_i$ accordingly. Here, $\sigma(x)$ is the sigmoid function, defined as
\begin{eqnarray}
\sigma(x) & = & \frac{1}{1+e^{-x}}
\end{eqnarray}
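
The coin metaphor can be simulated directly. The sketch below is only an illustration: the weight vector `w_true`, the Gaussian features, and the sample size are assumptions chosen for the example.

```python
import numpy as np


def sigmoid(a):
    """Elementwise sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


rng = np.random.default_rng(0)
N, D = 10000, 2
features = rng.normal(size=(N, D))            # hypothetical feature values
X = np.hstack([features, np.ones((N, 1))])    # append the constant bias column
w_true = np.array([2.0, -1.0, 0.5])           # hypothetical weights (assumption)

pi = sigmoid(X @ w_true)                      # pi_i = sigma(w^T x_i), one coin per data point
y = (rng.random(N) < pi).astype(int)          # toss coin i: y_i = 1 with probability pi_i

print(y[:10], pi[:10].round(2))
```

Each row gets its own coin, so points far from the decision boundary $w^\top x = 0$ are labelled almost deterministically, while points near it are labelled close to a fair coin toss.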

### Properties of the sigmoid function
Note that

\begin{eqnarray}
\sigma(x) & = & \frac{e^x}{(1+e^{-x})e^x} = \frac{e^x}{1+e^{x}} \\
1 - \sigma(x) & = & 1 - \frac{e^x}{1+e^{x}} = \frac{1+e^{x} - e^x}{1+e^{x}} = \frac{1}{1+e^{x}}
\end{eqnarray}

\begin{eqnarray}
\sigma'(x) & = & \frac{e^x(1+e^{x}) - e^{x} e^x}{(1+e^{x})^2} = \frac{e^x}{1+e^{x}}\frac{1}{1+e^{x}} = \sigma(x) (1-\sigma(x))
\end{eqnarray}

\begin{eqnarray}
\log \sigma(x) & = & -\log(1+e^{-x}) = x - \log(1+e^{x}) \\
\log(1 - \sigma(x)) & = &  -\log({1+e^{x}})
\end{eqnarray}
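
These identities are easy to confirm numerically. The following sketch compares both sides of each identity on a grid and checks the derivative with central differences; the grid range and the step `h` are arbitrary choices.

```python
import numpy as np


def sigmoid(a):
    """Elementwise sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


a = np.linspace(-5.0, 5.0, 101)
s = sigmoid(a)

# Central-difference approximation of sigma'(a).
h = 1e-5
numeric_derivative = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)

checks = {
    "alternative form": np.allclose(s, np.exp(a) / (1 + np.exp(a))),
    "complement": np.allclose(1 - s, 1 / (1 + np.exp(a))),
    "derivative": np.allclose(numeric_derivative, s * (1 - s)),
    "log sigmoid": np.allclose(np.log(s), a - np.log1p(np.exp(a))),
    "log complement": np.allclose(np.log(1 - s), -np.log1p(np.exp(a))),
}
print(checks)
```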




Exercise: Plot the sigmoid function and its derivative.

Assuming the coin tosses are independent, the likelihood of the observations, that is, the probability of observing the class sequence, is
\begin{eqnarray}
p(y_1, y_2, \dots, y_N|w, x_1, x_2, \dots, x_N ) &=& \left(\prod_{i : y_i=1} \sigma(w^\top x_i) \right) \left(\prod_{i : y_i=0}(1- \sigma(w^\top x_i)) \right)
\end{eqnarray}
Here, the left product is the expression for examples from class $1$ and the right product is for examples from class $0$.
We look for the weight vector that maximizes this likelihood, the so-called maximum likelihood solution, denoted by $w^*$:
\begin{eqnarray}
w^* & = & \arg\max_{w} {\cal L}(w)
\end{eqnarray}
where the log-likelihood function is
\begin{eqnarray}
{\cal L}(w) & = & \log p(y_1, y_2, \dots, y_N|w, x_1, x_2, \dots, x_N ) \\
& = & \sum_{i : y_i=1} \log \sigma(w^\top x_i) + \sum_{i : y_i=0} \log (1- \sigma(w^\top x_i)) \\
& = & \sum_{i : y_i=1} w^\top x_i - \sum_{i : y_i=1} \log(1+e^{w^\top x_i}) - \sum_{i : y_i=0}\log({1+e^{w^\top x_i}}) \\
& = & \sum_i y_i w^\top x_i - \sum_{i} \log(1+e^{w^\top x_i}) \\
& = & y^\top X w - \mathbf{1}^\top \mathrm{logsumexp}(0, X w)
\end{eqnarray}
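
The sketch below evaluates the log-likelihood both as the two per-class sums and in the final vectorized form, to show that they agree. Here `np.logaddexp(0, a)` computes $\log(e^0 + e^a) = \log(1+e^a)$ elementwise, which plays the role of $\mathrm{logsumexp}(0, Xw)$. The six-point toy dataset and the weight vector are illustrative assumptions.

```python
import numpy as np


def sigmoid(a):
    """Elementwise sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


def loglik_by_class(w, X, y):
    """Sum of log sigma over class 1 plus sum of log(1 - sigma) over class 0."""
    p = sigmoid(X @ w)
    return np.sum(np.log(p[y == 1])) + np.sum(np.log(1 - p[y == 0]))


def loglik_vectorized(w, X, y):
    """y^T X w - 1^T logsumexp(0, X w)."""
    a = X @ w
    return y @ a - np.sum(np.logaddexp(0.0, a))


# Toy data with a trailing bias column of ones (illustrative assumption).
X = np.array([[-2, 1, 1], [-1, 2, 1], [1, 5, 1],
              [-1, 1, 1], [-3, -2, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 1, 0, 0, 1])
w = np.array([0.5, -0.25, 0.1])  # arbitrary weights, not a fitted solution

ll_class = loglik_by_class(w, X, y)
ll_vec = loglik_vectorized(w, X, y)
print(ll_class, ll_vec)
```

The vectorized form is usually preferable in practice because `np.logaddexp` avoids overflow when $w^\top x_i$ is large.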

Unlike the least-squares problem, no closed-form expression for $w^*$ is known, so we need to resort to numerical optimization.

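One option is gradient ascent, which repeatedly takes a step of size $\eta > 0$ in the direction of the gradient evaluated at the current weights: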
\begin{eqnarray}
w^{(\tau)} & \leftarrow & w^{(\tau-1)} + \eta \nabla_w {\cal L}
\end{eqnarray}
where
\begin{eqnarray}
\nabla_w {\cal L} & = &
\begin{pmatrix}
{\partial {\cal L}}/{\partial w_1} \\
{\partial {\cal L}}/{\partial w_2} \\
\vdots \\
{\partial {\cal L}}/{\partial w_{D+1}}
\end{pmatrix}
\end{eqnarray}
is the gradient vector.

### Evaluating the gradient
The partial derivative of the log-likelihood with respect to the $k$'th entry of the weight vector is given by the chain rule. Writing $u_i = w^\top x_i$ and $\sigma_i = \sigma(u_i)$,
\begin{eqnarray}
\frac{\partial{\cal L}}{\partial w_k} & = & \sum_i \frac{\partial{\cal L}}{\partial \sigma_i} \frac{\partial \sigma_i}{\partial u_i} \frac{\partial u_i}{\partial w_k}
\end{eqnarray}

Starting from
\begin{eqnarray}
{\cal L}(w) & = & \sum_{i : y_i=1} \log \sigma(w^\top x_i) + \sum_{i : y_i=0} \log (1- \sigma(w^\top x_i))
\end{eqnarray}
the three factors are

\begin{eqnarray}
\frac{\partial{\cal L}}{\partial \sigma_i} & = & \begin{cases} \dfrac{1}{\sigma(w^\top x_i)} & \text{if } y_i = 1 \\ -\dfrac{1}{1- \sigma(w^\top x_i)} & \text{if } y_i = 0 \end{cases}
\end{eqnarray}

\begin{eqnarray}
\frac{\partial \sigma_i}{\partial u_i} & = & \sigma(w^\top x_i) (1-\sigma(w^\top x_i))
\end{eqnarray}

\begin{eqnarray}
\frac{\partial u_i}{\partial w_k} & = & x_{i,k}
\end{eqnarray}


So the gradient is
\begin{eqnarray}
\frac{\partial{\cal L}}{\partial w_k} & = & \sum_{i : y_i=1} \frac{\sigma(w^\top x_i) (1-\sigma(w^\top x_i))}{\sigma(w^\top x_i)} x_{i,k} - \sum_{i : y_i=0} \frac{\sigma(w^\top x_i) (1-\sigma(w^\top x_i))}{1- \sigma(w^\top x_i)} x_{i,k} \\
& = & \sum_{i : y_i=1} {(1-\sigma(w^\top x_i))} x_{i,k} - \sum_{i : y_i=0} {\sigma(w^\top x_i)} x_{i,k}
\end{eqnarray}

We can write this expression more compactly by noting
\begin{eqnarray}
\frac{\partial{\cal L}}{\partial w_k} & = & \sum_{i : y_i=1} {(\underbrace{1}_{y_i}-\sigma(w^\top x_i))} x_{i,k} + \sum_{i : y_i=0} {(\underbrace{0}_{y_i} - \sigma(w^\top x_i))} x_{i,k} \\
& = & \sum_i (y_i - \sigma(w^\top x_i)) x_{i,k}
\end{eqnarray}
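
A finite-difference comparison is a simple way to check the result $\partial{\cal L}/\partial w_k = \sum_i (y_i - \sigma(w^\top x_i)) x_{i,k}$. The sketch below uses random data, and the helper names and step `h` are assumptions made for illustration.

```python
import numpy as np


def sigmoid(a):
    """Elementwise sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


def loglik(w, X, y):
    """Log-likelihood sum_i y_i w^T x_i - sum_i log(1 + exp(w^T x_i))."""
    a = X @ w
    return y @ a - np.sum(np.logaddexp(0.0, a))


def grad_loglik(w, X, y):
    """Analytic gradient: entry k is sum_i (y_i - sigma(w^T x_i)) x_{i,k}."""
    return X.T @ (y - sigmoid(X @ w))


def finite_difference_grad(w, X, y, h=1e-6):
    """Central-difference approximation of the gradient, one coordinate at a time."""
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = h
        g[k] = (loglik(w + e, X, y) - loglik(w - e, X, y)) / (2 * h)
    return g


# Random test problem (assumption): 20 points, 3 features plus a bias column.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(20, 3)), np.ones((20, 1))])
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=4)

analytic = grad_loglik(w, X, y)
numeric = finite_difference_grad(w, X, y)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```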

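Stacking the partial derivatives gives $\nabla_w {\cal L} = X^\top (y-\sigma(X w))$, where $\sigma$ is applied elementwise, so the update rule is
\begin{eqnarray}
w^{(\tau)} = w^{(\tau-1)} + \eta X^\top (y-\sigma(X w^{(\tau-1)}))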
\end{eqnarray}
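
A minimal sketch of this update in NumPy follows. The synthetic data, the number of iterations, and the step size are all assumptions: the text leaves $\eta$ unspecified, and here it is set from the bound that the curvature of ${\cal L}$ is at most $\frac{1}{4}\lambda_{\max}(X^\top X)$, a conservative choice that keeps the log-likelihood from decreasing.

```python
import numpy as np


def sigmoid(a):
    """Elementwise sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


def loglik(w, X, y):
    """Log-likelihood y^T X w - 1^T log(1 + exp(X w))."""
    a = X @ w
    return y @ a - np.sum(np.logaddexp(0.0, a))


# Synthetic data (assumption): labels are coin tosses with pi_i = sigmoid(w_true^T x_i).
rng = np.random.default_rng(1)
N = 200
X = np.hstack([rng.normal(size=(N, 2)), np.ones((N, 1))])
w_true = np.array([2.0, -1.0, 0.5])  # hypothetical generating weights
y = (rng.random(N) < sigmoid(X @ w_true)).astype(float)

# Step size (assumption): 1 / (0.25 * largest eigenvalue of X^T X).
eta = 1.0 / (0.25 * np.linalg.norm(X, 2) ** 2)

w = np.zeros(X.shape[1])
history = [loglik(w, X, y)]
for tau in range(500):
    w = w + eta * X.T @ (y - sigmoid(X @ w))  # w <- w + eta * X^T (y - sigma(X w))
    history.append(loglik(w, X, y))

print(w, history[0], history[-1])
```

On linearly separable data the log-likelihood has no finite maximizer and the weights keep growing, so in that setting some form of regularization or early stopping is needed.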




In [1]:
%matplotlib inline
from cvxpy import *
import numpy as np
import matplotlib as mpl
import matplotlib.pylab as plt



In [25]:
x = np.matrix('[-2,1; -1,2; 1,5; -1,1; -3,-2; 1,1] ')
y = np.matrix('[0,0,1,0,0,1]').T


N = x.shape[0]
#A = np.hstack((np.power(x,0), np.power(x,1), np.power(x,2)))
X = np.hstack((x, np.ones((N,1)) ))

K = X.shape[1]
z = np.zeros((N,1))

print(y)
print(X)

[[0]
 [0]
 [1]
 [0]
 [0]
 [1]]
[[-2.  1.  1.]
 [-1.  2.  1.]
 [ 1.  5.  1.]
 [-1.  1.  1.]
 [-3. -2.  1.]
 [ 1.  1.  1.]]


In [35]:
# Construct the problem: negative log-likelihood plus an L1 penalty on w.
import cvxpy as cp

# Plain ndarrays avoid np.matrix operator semantics inside cvxpy expressions.
X_arr = np.asarray(X, dtype=float)
y_arr = np.asarray(y, dtype=float).ravel()

w = cp.Variable(K)
# cp.logistic(a) = log(1 + exp(a)), i.e. logsumexp(0, a) applied elementwise.
objective = cp.Minimize(cp.norm(w, 1) - y_arr @ X_arr @ w + cp.sum(cp.logistic(X_arr @ w)))
prob = cp.Problem(objective)

# The optimal objective value is returned by prob.solve().
result = prob.solve()
# The optimal value for w is stored in w.value.
print(w.value)

#plt.show()

[[  1.35076889e+00]
 [  1.68577115e-09]
 [ -1.61815400e-10]]


In [38]:

log_sum_exp(w) 

Expression(CONVEX, UNKNOWN, (1, 1))