# Logistic Regression

Let:

$\Pr(C_1 \mid x) = \sigma(w^T x + w_0)$

and

$\Pr(C_2 \mid x) = 1 - \sigma(w^T x + w_0)$

Learn the parameters $w$ and $w_0$ by conditional likelihood maximization. Specifically, use Newton's algorithm
as derived in class to optimize the parameters; 10 iterations of Newton's algorithm should be sufficient for
convergence. Add a penalty of $0.5λ||w||_2^2$ to regularize the weights. Find the optimal hyperparameter $λ$
by 10-fold cross-validation.
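
For reference, the quantities Newton's algorithm needs can be written out as below. This is a sketch under stated assumptions: labels $y_i \in \{0, 1\}$ with $y_i = 1$ for class $C_1$, augmented inputs $\bar{x}_i = (1, x_i)$ stacked as the rows of $\bar{X}$, augmented weights $\bar{w} = (w_0, w)$, and, as a simplification, the penalty applied to all of $\bar{w}$ including $w_0$.

$$
\begin{aligned}
p_i &= \sigma(\bar{w}^T \bar{x}_i) \\
\ell(\bar{w}) &= \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] - \tfrac{1}{2} \lambda \|\bar{w}\|_2^2 \\
\nabla \ell(\bar{w}) &= \bar{X}^T (y - p) - \lambda \bar{w} \\
H(\bar{w}) &= -\bar{X}^T R \bar{X} - \lambda I, \qquad R = \mathrm{diag}\big(p_i (1 - p_i)\big) \\
\bar{w}_{\text{new}} &= \bar{w}_{\text{old}} - H(\bar{w}_{\text{old}})^{-1} \nabla \ell(\bar{w}_{\text{old}})
\end{aligned}
$$

Because $R$ has non-negative entries and $\lambda > 0$, $H$ is negative definite, so the penalized log-likelihood is strictly concave and the Newton step is well defined.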

## TODO
- Draw a graph that shows the cross-validation accuracy of logistic regression as λ varies. Report the best λ.
- Report the accuracy of logistic regression (with the best λ for regularization) on
the test set. Measure the accuracy as the fraction of correctly labeled images. An image
is correctly labeled when the probability of the correct label is greater than 0.5.
- Also print the parameters w, w0 found for logistic regression.
- Briefly discuss the results:
    - Mixture of Gaussians and logistic regression both find a linear separator, but they use different parameterizations and different objectives. Compare the number of parameters in each model and the amount of computation needed to find a solution with each model. Compare the results for each model.
    - Mixture of Gaussians and logistic regression find a linear separator, whereas k-Nearest Neighbours (in assignment 1) finds a non-linear separator. Compare the expressivity of the separators. Discuss under what circumstances each type of separator is expected to perform best. What could explain the results obtained with KNN in comparison to the results obtained with mixtures of Gaussians and logistic regression?
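
As a starting point for the parameter-count comparison, here is a small sketch. It assumes the mixture of Gaussians uses two class means, a single covariance matrix shared by both classes (the case that yields a linear separator), and one class prior; the function names are made up for this illustration.

```python
def logistic_regression_params(p: int) -> int:
    # weight vector w (p entries) plus the intercept w0
    return p + 1


def gaussian_mixture_params(p: int) -> int:
    # two class means (2p), one shared symmetric covariance (p(p+1)/2 free entries),
    # and one class prior Pr(C1)
    return 2 * p + p * (p + 1) // 2 + 1


p = 64  # number of features per image in this data set
print(logistic_regression_params(p))  # 65
print(gaussian_mixture_params(p))     # 2209
```

The generative model estimates many more parameters, but each has a closed-form estimate, while logistic regression needs an iterative solver over its 65 parameters.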

# Logistic Regression
- HTF: 4.4, 5.6
- N: the number of data points
- p: the number of features
- y: the vector of yi values (yi = 1 for class 1 and yi = 0 for class 2)
- X: the N × (p + 1) matrix of xi values
- p_xw: the vector of fitted probabilities, with ith element p(xi; βold) (βold is HTF's notation for the current weight estimate, called w in the code below)
- R: an N × N diagonal matrix of weights, with ith diagonal element p(xi; βold)(1 − p(xi; βold))
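
The notation above maps directly onto a few lines of NumPy. The following is a minimal sketch of penalized Newton (IRLS) iterations, not the TensorFlow implementation used later in this notebook; `newton_logistic`, `lam`, and the synthetic data are illustrative assumptions, and the penalty is applied to every weight, including the intercept.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def newton_logistic(X, y, lam=1.0, max_iter=10, tol=1e-8):
    """Penalized Newton iterations; X must already contain a leading column of ones."""
    n, d = X.shape
    w = np.zeros(d)  # w = 0 as the starting value
    for _ in range(max_iter):
        p = sigmoid(X @ w)                          # fitted probabilities
        gradient = X.T @ (y - p) - lam * w          # gradient of penalized log-likelihood
        if np.max(np.abs(gradient)) < tol:
            break
        r = p * (1.0 - p)                           # diagonal of R
        hessian = -(X.T * r) @ X - lam * np.eye(d)  # -X^T R X - lam * I
        w = w - np.linalg.solve(hessian, gradient)  # Newton step
    return w


# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 3))])
true_w = np.array([0.5, 2.0, -1.0, 0.0])
y = (rng.random(200) < sigmoid(X @ true_w)).astype(float)

w_hat = newton_logistic(X, y, lam=1.0)
grad_at_solution = X.T @ (y - sigmoid(X @ w_hat)) - 1.0 * w_hat
print(np.max(np.abs(grad_at_solution)))
```

Multiplying `X.T` by the vector `r` scales column i by `r[i]`, which gives `X.T @ R` without building the N × N matrix, and `np.linalg.solve` avoids forming the inverse Hessian explicitly.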

In [None]:
! git clone https://github.com/abnercorrea/machine-learning.git

Cloning into 'machine-learning'...
remote: Enumerating objects: 98, done.
remote: Counting objects: 100% (98/98), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 98 (delta 37), reused 84 (delta 23), pack-reused 0
Unpacking objects: 100% (98/98), done.


In [None]:
import sys
sys.path += ['/content/machine-learning/src']

In [None]:
import numpy as np

import tensorflow as tf

from abnercorrea.numpy.util.data_prep import read_train_data, read_test_data, norm, prepend_col, to_binary_classes
from abnercorrea.numpy.util.data_vis import plot_alpha_scores

from abnercorrea.tensorflow.util.stat import logistic_sigmoid
from abnercorrea.tensorflow.util.data_prep import split_train_validation_tf
from abnercorrea.tensorflow.util.tensorflow import tf_while_loop_body

# from abnercorrea.tensorflow.linear.logistic_regression import LogisticRegressionClassifierTF


# Data

In [None]:
# read data
xtrp, ytrp = read_train_data(num_partitions=10)
xtr, ytr = np.concatenate(xtrp), np.concatenate(ytrp)
xte, yte = read_test_data()

# each set is centered by its overall (scalar) mean and then rescaled with norm();
# note that .mean() without an axis is not a per-feature mean
xtr, xte = norm(xtr - xtr.mean()), norm(xte - xte.mean())

# yi will be used as a scalar
ytr, yte = ytr[:, 0], yte[:, 0]

# X_ ("X bar") has a leading column of ones so that w[0] acts as the intercept w0
xtr_, xte_ = prepend_col(xtr, 1), prepend_col(xte, 1)

# y denote the vector of yi values (yi = 1 for class 1 and yi = 0 for class 2)
ytrb, classes = to_binary_classes(ytr)
yteb, _ = to_binary_classes(yte)

In [None]:
xtr_.shape, xte_.shape, ytrb.shape, yteb.shape

((1000, 65), (110, 65), (1000,), (110,))
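
The extra column explains the 65 in the shapes above: 64 features plus one constant. A small sketch of why this works, using made-up random data rather than the notebook's helpers: with a leading column of ones, the intercept becomes just another weight.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 points, 3 features (illustrative sizes)
w = rng.normal(size=3)           # feature weights
w0 = 0.7                         # intercept

X_bar = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a column of ones
w_bar = np.concatenate([[w0], w])                 # intercept goes first

# The augmented product reproduces the affine score w^T x + w0 for every row.
assert np.allclose(X_bar @ w_bar, X @ w + w0)
```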

# Model

In [None]:
class LogisticRegressionClassifierTF:
    def __init__(self, optimizer='Newton-Raphson', max_iter=10, tol=1e-14, learning_rate=1e-2):
        assert optimizer in ['Newton-Raphson', 'sgd'], f'Optimizer {optimizer} not supported.'

        self.optimizer = optimizer
        self.tol = tol
        self.max_iter = max_iter
        self.learning_rate = learning_rate
        self.w, self.alpha, self.alpha_scores = None, None, None

    def fit(self, X_, y, alphas=None, folds=10):
        params = self.fit_tf(X_, y, alphas, folds, self.max_iter, self.tol, self.learning_rate)
        
        self.w, self.alpha, self.alpha_scores = params

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=(), dtype=tf.int32),
            tf.TensorSpec(shape=(), dtype=tf.int32),
            tf.TensorSpec(shape=(), dtype=tf.float64),
            tf.TensorSpec(shape=(), dtype=tf.float64),
        ]
    )
    def fit_tf(self, X_, y, alphas, folds, max_iter, tol, learning_rate):
        """
        Trains model and finds weights w and hyper-parameter alpha (regularization).

        :param y: denotes the vector of yi values (yi = 1 for class 1 and yi = 0 for class 2)
        """
        alphas_size = tf.size(alphas)
        scores = tf.TensorArray(tf.float64, size=alphas_size)
        scores = scores.unstack(tf.zeros_like(alphas))

        for fold in tf.range(folds, dtype=tf.int32):
            # splits train and validation sets
            xtr, ytr, xvl, yvl = split_train_validation_tf(X_, y, fold, folds)
            for i in tf.range(alphas_size):
                # fits model using training set
                w = self.optimize(xtr, ytr, alphas[i], max_iter=max_iter, tol=tol, learning_rate=learning_rate)
                # calculates score using validation set
                score = self.score_tf(xvl, yvl, w)
                # accumulates score of each alpha over all folds
                scores = scores.write(i, scores.read(i) + score)

        alpha_scores = scores.stack() / folds
        alpha = alphas[tf.argmax(alpha_scores)]
        # trains with best alpha using all train data
        w = self.optimize(X_, y, alpha, max_iter=max_iter, tol=tol, learning_rate=learning_rate)
        return w, alpha, alpha_scores

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=(), dtype=tf.float64),
            tf.TensorSpec(shape=(), dtype=tf.int32),
            tf.TensorSpec(shape=(), dtype=tf.float64),
            tf.TensorSpec(shape=(), dtype=tf.float64),
        ]
    )
    def optimize(self, X_, y, alpha, max_iter, tol, learning_rate):
        """
        This maximizes a penalized log-likelihood(w)

        The penalty used is: -.5 * alpha * w^2 (L2)

        We typically do not penalize the intercept term, and standardize the predictors for the penalty to be meaningful.
        Note: this implementation penalizes every entry of w, including the intercept w[0].

        It seems that w = 0 is a good starting value for the iterative procedure, although convergence is never guaranteed.
        Typically the algorithm does converge, since the log-likelihood is concave, but overshooting can occur.
        In the rare cases that the log-likelihood decreases, step size halving will guarantee convergence.

        This algorithm is referred to as iteratively reweighted least squares or IRLS.

        TODO: implement step-size halving in case of not converging
        """
        f = tf.shape(X_)[1]
        # w = 0 is a good starting value for the iterative procedure
        w0 = tf.zeros(shape=[f], dtype=tf.float64)
        # since gradient is used to check convergence, this guarantees at least 1 iteration.
        gradient0 = tf.fill(dims=[f], value=tol * 2)

        # checks convergence (gradient = 0, which numerically becomes |gradient| < tol in every component)
        def not_converged(wi, gradient):
            return tf.reduce_any(tf.abs(gradient) >= tol)

        @tf_while_loop_body()
        def newton_raphson(wi, gradient):
            """
            From HTF:
            The Newton–Raphson algorithm uses the first-derivative or Gradient and the second-derivative or Hessian matrix.

            Starting with w_old, the Newton step is:
            w_new = w_old - inverse(hessian(log_likelihood(w))) @ gradient(log_likelihood(w))
            """
            gradient, p = self.gradient(X_, y, wi, alpha)
            hessian = self.hessian(X_, p, alpha)
            hessian_inv = tf.linalg.inv(hessian)
            # w_new = w - inverse(hessian(log_likelihood(w))) @ gradient(log_likelihood(w))
            wi -= hessian_inv @ gradient
            return [wi, gradient]

        # TODO: SGD
        # @tf_while_loop_body()
        # def sgd(wi, gradient):
        #     """
        #     Gradient step: uses only the gradient to update the value of w.
        #
        #     Starting with w_old, the gradient-ascent step (the log-likelihood is maximized) is:
        #     w_new = w_old + learning_rate * gradient(log_likelihood(w))
        #     """
        #     gradient, _ = self.gradient(X_, y, wi, alpha)
        #     # w_new = w_old + learning_rate * gradient(log_likelihood(w))
        #     wi += learning_rate * gradient
        #     return [wi, gradient]

        # since back_prop is not needed, using tf.stop_gradient to prevent gradient computation.
        [w, _] = tf.nest.map_structure(
            tf.stop_gradient,
            tf.while_loop(
                cond=not_converged,
                body=newton_raphson,
                loop_vars=[w0, gradient0],
                maximum_iterations=max_iter
            )
        )
        return w

    def score(self, X_, y):
        return self.score_tf(X_, y, self.w)

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
        ]
    )
    def score_tf(self, X_, y, w):
        """
        Score used is the accuracy of the model: (true positives + true negatives) / number of samples.
        """
        # explicit casts so the comparison does not depend on implicit dtype promotion
        predictions = tf.cast(self.predict_tf(X_, w), y.dtype)
        correct = tf.cast(tf.equal(y, predictions), tf.float64)
        accuracy = tf.reduce_mean(correct)
        return accuracy

    def predict_proba(self, X_):
        return self.predict_proba_tf(X_, self.w)

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
        ]
    )
    def predict_proba_tf(self, X_, w):
        """
        Returns probability p of class 1
        If p >= 0.5, then the predicted class is class 1 otherwise it's class 2.
        """
        xw = X_ @ w
        p = logistic_sigmoid(xw)
        return p

    def predict(self, X_):
        return self.predict_tf(X_, self.w)

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
        ]
    )
    def predict_tf(self, X_, w):
        """
        The class is decided by the sign of the linear score X_ @ w:
        X_ @ w >= 0 is equivalent to sigmoid(X_ @ w) >= 0.5.
        Predicted values follow the same convention used to create the y vector:
        1 = Class 1
        0 = Class 2
        """
        xw = X_ @ w
        predictions = tf.where(xw >= 0, 1, 0)
        return predictions

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=(), dtype=tf.float64),
        ]
    )
    def gradient(self, X_, y, w, alpha):
        """
        First-derivative or Gradient of log-likelihood (w)

        y denotes the vector of yi values (yi = 1 for class 1 and yi = 0 for class 2)
        p is the vector of fitted probabilities p(xi; w)

        Gradient = X.T @ (y - p) - alpha * w
        """
        xw = X_ @ w
        # vector of fitted probabilities p(xi; w)
        p = logistic_sigmoid(xw)
        # gradient vector
        gradient = X_.T @ (y - p) - alpha * w
        return gradient, p

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=None, dtype=tf.float64),
            tf.TensorSpec(shape=(), dtype=tf.float64),
        ]
    )
    def hessian(self, X_, p, alpha):
        """
        Second-derivative or Hessian matrix of log-likelihood(w).

        Hessian = -(X.T @ R @ X) - alpha * I
        """
        # n: # of data points
        # f: # of features + 1
        x_shape = tf.shape(X_)
        n, f = x_shape[0], x_shape[1]
        # R is a N×N diagonal matrix of weights with ith diagonal element sigmoid(xi; w)(1 − sigmoid(xi; w))
        # The derivative of the sigmoid is sigmoid * (1 - sigmoid)
        sigmoid_deriv = p * (1 - p)
        R = tf.linalg.set_diag(tf.zeros([n, n], dtype=tf.float64), sigmoid_deriv)
        # X_.T @ R @ X_ shape is (f, f); tf.eye defaults to float32, so request float64 explicitly
        H = -X_.T @ R @ X_ - alpha * tf.eye(f, dtype=tf.float64)
        return H
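
The docstring of `optimize` leaves step-size halving as a TODO. A minimal NumPy sketch of the idea follows; it is not the author's implementation, and `halved_newton_step` and the toy objective are hypothetical names introduced here. Given a proposed step, it tries the full step first and halves it until the objective no longer decreases.

```python
import numpy as np


def halved_newton_step(objective, w, step, max_halvings=30):
    """Return w + t * step, halving t from 1 until the objective does not decrease."""
    current = objective(w)
    t = 1.0
    for _ in range(max_halvings):
        candidate = w + t * step
        if objective(candidate) >= current:
            return candidate
        t *= 0.5
    return w  # no acceptable step found; keep the old point


# Toy concave objective with its maximum at the origin.
def objective(w):
    return -np.sum(w ** 2)


w = np.array([1.0, 1.0])
overshooting_step = np.array([-3.0, -3.0])  # the full step lands at (-2, -2), which is worse
result = halved_newton_step(objective, w, overshooting_step)
print(result)
```

In a Newton iteration that maximizes the log-likelihood, `step` would be `-inverse(hessian) @ gradient` and `objective` the penalized log-likelihood.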


# Testing

In [None]:
lrc = LogisticRegressionClassifierTF(tol=1e-6, max_iter=500, learning_rate=1)

In [None]:
alphas = np.arange(0, 20, 1, dtype=np.float64)

In [None]:
%time lrc.fit(xtr_, ytrb, alphas=alphas)

CPU times: user 14.4 s, sys: 441 ms, total: 14.9 s
Wall time: 8.72 s
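
The call above runs the whole search: for each fold and each alpha it fits on the training part, scores on the validation part, and averages the scores over folds. A pure-Python sketch of that loop is shown below. How `split_train_validation_tf` actually splits the data is not shown in this notebook, so contiguous folds are an assumption here, and `kfold_indices`, `cross_validate`, and the toy `fit`/`score` callables are hypothetical.

```python
def kfold_indices(n, folds):
    """Yield (train_idx, val_idx) index lists for contiguous folds."""
    base, extra = divmod(n, folds)
    start = 0
    for fold in range(folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def cross_validate(alphas, n, folds, fit, score):
    """Return the mean validation score for each alpha."""
    totals = [0.0] * len(alphas)
    for train, val in kfold_indices(n, folds):
        for i, alpha in enumerate(alphas):
            model = fit(train, alpha)        # fit on the training indices
            totals[i] += score(model, val)   # evaluate on the held-out indices
    return [total / folds for total in totals]


# Toy stand-ins: the "model" is just alpha, and the score peaks at alpha = 1.
alphas = [0.0, 1.0, 2.0]
scores = cross_validate(
    alphas, n=10, folds=5,
    fit=lambda train, alpha: alpha,
    score=lambda model, val: -abs(model - 1.0),
)
best_alpha = alphas[scores.index(max(scores))]
print(best_alpha)  # 1.0
```

After the best alpha is chosen, the model is refit once on all training data, as `fit_tf` does.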


In [None]:
plot_alpha_scores(alphas, lrc.alpha_scores.numpy())

In [None]:
lrc.alpha.numpy()

array([1.])

In [None]:
lrc.score(xte_, yteb).numpy()

0.9090909090909091

In [None]:
lrc.predict_proba(xte_).numpy(), yteb, yte, lrc.w

(array([0.1082648 , 0.06054096, 0.5913902 , 0.06331777, 0.15059279,
        0.44515334, 0.90891174, 0.89695178, 0.93672377, 0.06562928,
        0.52674758, 0.00805518, 0.61544615, 0.24889922, 0.13934653,
        0.09331497, 0.10788664, 0.91536926, 0.89616293, 0.19206788,
        0.1714575 , 0.15241934, 0.86452804, 0.10124173, 0.48193289,
        0.31092134, 0.93900133, 0.9220255 , 0.73984795, 0.41429229,
        0.26368699, 0.65141384, 0.08352205, 0.67039227, 0.08296025,
        0.03331867, 0.73363472, 0.05103853, 0.49635781, 0.98890182,
        0.03049984, 0.5676605 , 0.91905741, 0.17436481, 0.07005026,
        0.76963967, 0.2154854 , 0.55858888, 0.06634599, 0.89633215,
        0.48406254, 0.00926959, 0.47770605, 0.57905203, 0.31815713,
        0.35217281, 0.95525607, 0.61309419, 0.8006296 , 0.50414201,
        0.53501233, 0.20793392, 0.03972402, 0.02601745, 0.26525062,
        0.06591946, 0.82025277, 0.93247321, 0.87355866, 0.14064811,
        0.9752388 , 0.12586524, 0.06018296, 0.70

In [None]:
%timeit lrc.predict(xte_)

The slowest run took 11.75 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 686 µs per loop


In [None]:
%timeit lrc.predict_proba(xte_)

The slowest run took 14.33 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 601 µs per loop


# Data pipeline with Datasets

In [None]:
record_defaults = [0] * 64  # Only provide defaults for the selected columns
dataset = tf.data.experimental.CsvDataset("testData.csv", record_defaults, header=False)
dataset = dataset.map(lambda *items: tf.stack(items))
dataset

<MapDataset shapes: (64,), types: tf.int32>

In [None]:
for line in dataset.take(1):
  print(line.numpy())

[ 0  0  0  1 13 10  0  0  0  8  0 10  2  5  0  5  5  2  2 14  0  5  0  0
  4  0  8  0  7 16  6  0  0  0 11 15  7  5  5  4  0  0  0  8  0  0 11  0
  0  5 16 12  0  9 16  0  0  0  4  0 14  0 15  0]
