## Logistic Regression techniques

When working with logistic regression, I have seen
algorithms where $y_i \in \{-1, 1\}$, and others
where $y_i \in \{0, 1\}$.  The two conventions lead to
slightly different-looking equations, and in my code
the two versions also seem to perform differently.
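
As a quick sanity check on the two conventions, here is a minimal sketch that evaluates both forms of the negative log likelihood on the same scores. The helper names `nll_01` and `nll_pm` and the random data are illustrative assumptions, not part of the notebook. Under the mapping $y^{\pm} = 2y^{01} - 1$, the two values should agree.

```python
import numpy as np


def nll_01(y01: np.ndarray, z: np.ndarray) -> float:
    """Cross-entropy form for labels in {0, 1}; z holds the scores w^T x_i."""
    p = 1.0 / (1.0 + np.exp(-z))
    return float(-np.sum(y01 * np.log(p) + (1 - y01) * np.log(1 - p)))


def nll_pm(ypm: np.ndarray, z: np.ndarray) -> float:
    """Logistic-loss form for labels in {-1, 1}."""
    return float(np.sum(np.log1p(np.exp(-ypm * z))))


rng = np.random.default_rng(0)
z = rng.normal(size=8)             # arbitrary scores, illustrative only
y01 = rng.integers(0, 2, size=8)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # the same labels mapped to {-1, 1}

print(np.isclose(nll_01(y01, z), nll_pm(ypm, z)))
```

If the two losses agree, any difference in training behaviour would have to come from the gradient or update code rather than from the label convention itself.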

### Shared code

In [None]:
import typing
import numpy as np


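def num_incorrect(labels, predictions):
    # Correctly predicted positives.
    c1 = (predictions > 0.5) & (labels == 1)
    # Correctly predicted negatives; accounting for both label conventions here.
    # Each comparison needs its own parentheses because `|` binds tighter than `==`.
    c2 = (predictions < 0.5) & ((labels == 0) | (labels == -1))

    # Everything that is neither a correct positive nor a correct negative.
    return np.count_nonzero(~np.logical_or(c1, c2))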


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-z))


def init_params(num_features: int) -> np.ndarray:
    return np.random.randn(num_features)


alpha = 0.01
grad_desc_cycles = 100
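# Placeholder data; X is laid out as (num_features, num_samples), i.e. d x n.
X_train = np.zeros((5, 10))
Y_train = np.ones(X_train.shape[1])  # one label per sample (column of X)
w = init_params(X_train.shape[0])    # one weight per feature (row of X)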


### Method 1

Case: $y_i \in \{0, 1\}$

Step along the average of the actual errors $y_i - \sigma(\mathbf{w}^T\mathbf{x}_i)$,
which is gradient ascent on the log likelihood;
I think this is the cleaner approach.
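
For reference, a short derivation of that update, assuming the usual model $p_i = \sigma(\mathbf{w}^T\mathbf{x}_i)$ with $\sigma(z) = 1 / (1 + e^{-z})$ and the columns of $X$ as samples:

$$
\begin{aligned}
    \log P(\mathbf{y} | X, \mathbf{w})
    &= \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] \\
    \nabla_{\mathbf{w}} \log P(\mathbf{y} | X, \mathbf{w})
    &= \sum_{i=1}^{n} (y_i - p_i)\,\mathbf{x}_i
    = X(\mathbf{y} - \mathbf{p})
\end{aligned}
$$

Adding $\alpha$ times this gradient, divided by $n$, to $\mathbf{w}$ therefore moves uphill on the mean log likelihood, which should match the `errors.dot(X_train.T) / len(Y_train)` step in the cell that follows.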

In [None]:
for i in range(grad_desc_cycles):
    z = w.dot(X_train)
    predictions = sigmoid(z)
    errors = Y_train - predictions
    grad = errors.dot(X_train.T) / len(Y_train)
    w = w + alpha * grad

    if i % 10 == 0:
        # num_incorrect already returns a count, so print it directly.
        print(num_incorrect(Y_train, predictions))


### Method 2: MLE Negative Log Likelihood

Case: $y_i \in \{-1, 1\}$

#### MLE

Choose parameters that maximize the conditional likelihood.
The conditional data likelihood $P(\mathbf{y} | X, \mathbf{w})$
is the probability of the observed values $\mathbf{y} \in \{-1, 1\}^n$
in the training data conditioned on the feature values $\mathbf{x}_i$.
Note that $X = [\mathbf{x}_1,\dots,\mathbf{x}_n] \in \mathbb{R}^{d \times n}$.
We assume that the $y_i$ are independent given the input features
$\mathbf{x}_i$ and $\mathbf{w}$, so the likelihood factors over samples,
and each factor is the sigmoid of $y_i\mathbf{w}^T\mathbf{x}_i$.
Taking the log turns the product into a sum.

$$
\begin{aligned}
    P(\mathbf{y} | X, \mathbf{w})
    &= \prod_{i=1}^{n} P(y_i | \mathbf{x}_i, \mathbf{w}),
    \qquad
    P(y_i | \mathbf{x}_i, \mathbf{w}) = \frac{1}{1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i}} \\
    \hat{\mathbf{w}}_{\text{MLE}}
    &= \underset{\mathbf{w}}{\arg\max} \sum_{i=1}^{n} \log P(y_i | \mathbf{x}_i, \mathbf{w}) \\
    &= \underset{\mathbf{w}}{\arg\max}
    - \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i}) \\
    &= \underset{\mathbf{w}}{\arg\min} \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i})
\end{aligned}
$$

Use gradient descent on the _negative log likelihood_.

$$
    \ell(\mathbf{w}) = \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i})
$$
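
Because the gradient of $\ell(\mathbf{w})$ is easy to get subtly wrong, a finite-difference check can help. The sketch below is only an illustration under assumptions: the names `nll`, `nll_grad`, and `numeric_grad` and the random data are hypothetical, and $X$ is taken as $d \times n$ as above. It compares the analytic gradient $-\sum_i y_i\,\sigma(-y_i\mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$ with a numerical estimate.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def nll(w: np.ndarray, X: np.ndarray, Y: np.ndarray) -> float:
    """Sum over samples of log(1 + exp(-y_i w^T x_i)); X is d x n."""
    return float(np.sum(np.log1p(np.exp(-Y * w.dot(X)))))


def nll_grad(w: np.ndarray, X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Analytic gradient: sum_i -y_i * sigmoid(-y_i w^T x_i) * x_i."""
    return X.dot(-Y * sigmoid(-Y * w.dot(X)))


def numeric_grad(f, w: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return g


rng = np.random.default_rng(0)
d, n = 3, 20                         # illustrative sizes
X = rng.normal(size=(d, n))          # columns are samples
Y = rng.choice([-1.0, 1.0], size=n)  # labels in {-1, 1}
w = rng.normal(size=d)

analytic = nll_grad(w, X, Y)
numeric = numeric_grad(lambda v: nll(v, X, Y), w)
print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```

The same check can be pointed at any candidate gradient function; a large discrepancy usually indicates a dropped factor or sign.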

In [None]:
def loss(Y: np.ndarray, z: np.ndarray) -> np.ndarray:
    # Per-sample terms of the mean negative log likelihood; sum them for the total.
    return np.log1p(np.exp(-Y * z)) / len(Y)


def grad(Y, w, X):
    # Gradient of the mean negative log likelihood for Y in {-1, 1}:
    #   d/dw log(1 + exp(-y z)) = -y * sigmoid(-y z) * x
    #                           = y * (sigmoid(y z) - 1) * x
    # The commented out return statements are equivalent forms:
    # return np.sum(Y * (sigmoid(Y * w.dot(X)) - 1) * X, axis=1) / len(Y)
    # return (-Y * sigmoid(-Y * w.dot(X))).dot(X.T) / len(Y)
    z = w.dot(X)
    return (Y * (sigmoid(Y * z) - 1)).dot(X.T) / len(Y)


# Fresh weights so this run does not start from Method 1's result.
w = init_params(X_train.shape[0])

for i in range(grad_desc_cycles):
    w = w - alpha * grad(Y_train, w, X_train)

    if i % 10 == 0:
        z = w.dot(X_train)  # scores for the current weights
        predictions = sigmoid(z)
        print(num_incorrect(Y_train, predictions))
        print(np.sum(loss(Y_train, z)), np.around(w, 3))
