# Logistic Regression with NumPy Implementation

## 1. Introduction

Logistic regression is one of the most fundamental machine learning models for binary classification. I will summarize its methodology and implement it from scratch using NumPy.

The problem we solve is **binary classification**. For example, a doctor would like to use a patient's features, such as mean radius and mean texture, to classify a breast tumor into one of the following two cases:

- "malignant":  𝑦=0
- "benign":  𝑦=1

which correspond to the serious and the mild case, respectively. This is the label encoding used by scikit-learn's copy of the dataset.

We will load the breast cancer data from scikit-learn as a toy dataset, and split the data into the training and test datasets.

## 2. Logistic Regression Model

[To be continued.]
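
A brief sketch of the model, read off the implementation in Section 3. The symbols $n$ (number of examples), $x_i$ (feature vector of example $i$), $X$ (matrix whose rows are the $x_i$) and $\hat{y}_i$ (predicted probability that $y_i = 1$) are notation assumed here for illustration:

```latex
\begin{aligned}
z_i &= x_i^\top w + b, \qquad \hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \\
L(w, b) &= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right], \\
\frac{\partial L}{\partial w} &= -\frac{1}{n} X^\top (y - \hat{y}), \qquad
\frac{\partial L}{\partial b} = -\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i).
\end{aligned}
```

The two gradient expressions are the ones `_optimize` uses for `dw` and `db`, with $n$ replaced by the mini-batch size.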

## 3. NumPy Implementation of Logistic Regression

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import random

import numpy as np

# Seed both generators: the dataset shuffle below uses the stdlib `random` module.
random.seed(71)
np.random.seed(71)

In [2]:
class LogisticRegression(object):
    """Numpy implementation of Logistic Regression."""
    def __init__(self, batch_size=64, lr=0.01, n_epochs=1000):
        self._batch_size = batch_size
        self._lr = lr
        self._n_epochs = n_epochs

    def get_dataset(self, X_train, y_train, shuffle=True):
        """Get dataset information."""
        self._X_train = X_train
        self._y_train = y_train

        # Get the numbers of examples and inputs.
        self._n_examples = self._X_train.shape[0]
        self._n_inputs = self._X_train.shape[1]

        if shuffle:
            idx = list(range(self._n_examples))
            random.shuffle(idx)
            self._X_train = self._X_train[idx]
            self._y_train = self._y_train[idx]

    def _create_weights(self):
        """Create model weights and bias."""
        self._w = np.zeros(self._n_inputs).reshape(self._n_inputs, 1)
        self._b = np.zeros(1).reshape(1, 1)

    def _sigmoid(self, logit):
        """Sigmoid function (stable version).

        sigmoid(z) = 1 / (1 + exp(-z))
                   = exp(z) / (1 + exp(z))
                   = exp(z - z_max) / (exp(-z_max) + exp(z - z_max)),
        where z is the logit and z_max = max(0, z).
        """
        logit_max = np.maximum(0, logit)
        logit_stable = logit - logit_max
        return np.exp(logit_stable) / (np.exp(-logit_max) + np.exp(logit_stable))

    def _logit(self, X):
        return np.matmul(X, self._w) + self._b
    
    def _model(self, X):
        """Logistic regression model (stable version)."""
        logit = self._logit(X)
        return self._sigmoid(logit)

    def _loss(self, y, logit):
        r"""Cross entropy loss (stable version).

        cross_entropy_loss(y, z)
          = - 1/n * \sum_{i=1}^n [y_i * log p(y_i = 1|x_i) + (1 - y_i) * log p(y_i = 0|x_i)]
          = - 1/n * \sum_{i=1}^n [y_i * (z_i - log(1 + exp(z_i))) + (1 - y_i) * (-log(1 + exp(z_i)))],
        where z is the logit, z_max = max(0, z), and log(1 + exp(z)) is computed as
          logsumexp(0, z) = log(exp(0) + exp(z))
                          = log((exp(0) + exp(z)) * exp(z_max) / exp(z_max))
                          = z_max + log(exp(-z_max) + exp(z - z_max)).
        """
        logit_max = np.maximum(0, logit)
        logit_stable = logit - logit_max
        logsumexp_stable = logit_max + np.log(np.exp(-logit_max) + np.exp(logit_stable))
        self._cross_entropy = -(y * (logit - logsumexp_stable) + (1 - y) * (-logsumexp_stable))
        return np.mean(self._cross_entropy)

    def _optimize(self, X, y):
        """Take one mini-batch gradient descent step."""
        m = X.shape[0]

        # Gradient of the mean cross entropy: dL/dw = -X^T (y - y_hat) / m.
        y_hat = self._model(X)
        dw = -1 / m * np.matmul(X.T, y - y_hat)
        db = -np.mean(y - y_hat)

        # Update the parameters in place.
        for (param, grad) in zip([self._w, self._b], [dw, db]):
            param[:] = param - self._lr * grad
            
    def _fetch_batch(self):
        """Fetch batch dataset."""
        idx = list(range(self._n_examples))
        for i in range(0, self._n_examples, self._batch_size):
            idx_batch = idx[i:min(i + self._batch_size, self._n_examples)]
            yield (self._X_train.take(idx_batch, axis=0), self._y_train.take(idx_batch, axis=0))

    def fit(self):
        self._create_weights()

        for epoch in range(self._n_epochs):
            total_loss = 0
            for X_train_b, y_train_b in self._fetch_batch():
                y_train_b = y_train_b.reshape((y_train_b.shape[0], -1))
                self._optimize(X_train_b, y_train_b)
                train_loss = self._loss(y_train_b, self._logit(X_train_b))
                total_loss += train_loss * X_train_b.shape[0]

            if epoch % 100 == 0:
                print('epoch {0}: training loss {1}'.format(epoch, total_loss))

        return self

    def get_coeff(self):
        return self._b, self._w.reshape((-1,))

    def predict(self, X_test):
        return self._model(X_test).reshape((-1,))
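
The class above rests on two numerical claims: the rearranged sigmoid and loss stay finite for extreme logits, and `dw`, `db` in `_optimize` are the gradients of the mean cross entropy. A minimal sketch for checking both follows. The functions `stable_sigmoid` and `stable_loss` are standalone copies written for illustration, and the toy data is randomly generated, not the breast cancer data.

```python
import numpy as np


def stable_sigmoid(z):
    """Same rearrangement as `_sigmoid`: shift by max(0, z) before exponentiating."""
    z_max = np.maximum(0, z)
    return np.exp(z - z_max) / (np.exp(-z_max) + np.exp(z - z_max))


def stable_loss(y, z):
    """Mean cross entropy computed from logits, following `_loss`."""
    z_max = np.maximum(0, z)
    softplus = z_max + np.log(np.exp(-z_max) + np.exp(z - z_max))
    return np.mean(-(y * (z - softplus) + (1 - y) * (-softplus)))


# 1) Extreme logits: the stable forms stay finite.
z = np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0])
labels = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
p = stable_sigmoid(z)
loss_extreme = stable_loss(labels, z)

# The naive log(1 + exp(z)) overflows to inf for a large logit.
with np.errstate(over='ignore'):
    naive_softplus = np.log(1 + np.exp(1000.0))

# 2) Finite-difference check of the gradients used in `_optimize`.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                       # hypothetical toy features
y = rng.integers(0, 2, size=(8, 1)).astype(float)  # hypothetical toy labels
w = rng.normal(size=(3, 1))
b = 0.1


def loss_at(w, b):
    return stable_loss(y, X @ w + b)


y_hat = stable_sigmoid(X @ w + b)
dw = -X.T @ (y - y_hat) / X.shape[0]
db = -np.mean(y - y_hat)

eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j] = eps
    dw_num[j] = (loss_at(w + step, b) - loss_at(w - step, b)) / (2 * eps)
db_num = (loss_at(w, b + eps) - loss_at(w, b - eps)) / (2 * eps)

print(p, loss_extreme, naive_softplus)
print(np.max(np.abs(dw - dw_num)), abs(db - db_num))
```

If the analytic and numerical gradients disagree by more than roughly `1e-6`, the sign or the `1 / m` scaling in the update is the first thing to check.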

## 4. Data Preparation and Preprocessing

In [3]:
import sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression as LogisticRegressionSklearn

# https://github.com/bowen0701/machine-learning/blob/master/metrics.py
from metrics import accuracy

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
# Read breast cancer data.
X, y = load_breast_cancer(return_X_y=True)

In [6]:
X.shape, y.shape

((569, 30), (569,))

In [7]:
X[:3]

array([[1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
        3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
        8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
        3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
        1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, 1.326e+03, 8.474e-02, 7.864e-02,
        8.690e-02, 7.017e-02, 1.812e-01, 5.667e-02, 5.435e-01, 7.339e-01,
        3.398e+00, 7.408e+01, 5.225e-03, 1.308e-02, 1.860e-02, 1.340e-02,
        1.389e-02, 3.532e-03, 2.499e+01, 2.341e+01, 1.588e+02, 1.956e+03,
        1.238e-01, 1.866e-01, 2.416e-01, 1.860e-01, 2.750e-01, 8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, 1.203e+03, 1.096e-01, 1.599e-01,
        1.974e-01, 1.279e-01, 2.069e-01, 5.999e-02, 7.456e-01, 7.869e-01,
        4.585e+00, 9.403e+01, 6.150e-03, 4.006e-02, 3.832e-02, 2.058e-02,
        2.250e-02, 4.571e-03, 2.357e

In [8]:
y[:3]

array([0, 0, 0])

In [9]:
# Split data into training and test datasets.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=71, shuffle=True, stratify=y)

In [10]:
print(X_train_raw.shape, y_train.shape)
print(X_test_raw.shape, y_test.shape)

(426, 30) (426,)
(143, 30) (143,)


In [11]:
# Rescale each feature to [0, 1] with a min-max scaler fitted on the training data only.
min_max_scaler = MinMaxScaler()

X_train = min_max_scaler.fit_transform(X_train_raw)
X_test = min_max_scaler.transform(X_test_raw)
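
With its default `feature_range=(0, 1)`, `MinMaxScaler` subtracts each column's training minimum and divides by the training range, and the same training statistics are reused for the test set. A NumPy-only sketch of that computation is below. The toy arrays and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(71)
X_train_demo = rng.uniform(0, 100, size=(6, 2))  # hypothetical toy training data
X_test_demo = rng.uniform(0, 100, size=(3, 2))   # hypothetical toy test data

# Statistics come from the training data only.
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)

X_train_scaled = (X_train_demo - col_min) / (col_max - col_min)
X_test_scaled = (X_test_demo - col_min) / (col_max - col_min)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled)
```

Scaled training columns span exactly [0, 1]. Scaled test values can fall slightly outside that interval when a test point lies beyond the training minimum or maximum, which is expected.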

## 5. Fitting Logistic Regression

In [12]:
# Fit our Logistic Regression.
clf = LogisticRegression(batch_size=64, lr=1, n_epochs=1000)

In [13]:
# Attach the training data to the model (shuffled once).
clf.get_dataset(X_train, y_train, shuffle=True)

In [14]:
clf.fit()

epoch 0: training loss 250.96447937427212
epoch 100: training loss 44.668697116763894
epoch 200: training loss 36.209731749496534
epoch 300: training loss 32.444012323538125
epoch 400: training loss 30.183066134771742
epoch 500: training loss 28.62232868981366
epoch 600: training loss 27.45694757432272
epoch 700: training loss 26.54273765051421
epoch 800: training loss 25.80090551663319
epoch 900: training loss 25.18382077461287


<__main__.LogisticRegression at 0x7f5e8227ecf8>
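
The values printed by `fit` are totals, not means: each batch's mean loss is multiplied by the batch size and summed over the epoch. Dividing by the number of training examples (426 here) gives a per-example loss that is easier to compare across datasets. A small sketch using two of the logged values above:

```python
n_train = 426  # number of training examples, from X_train.shape

# Totals copied from the training log above.
logged_totals = {0: 250.96447937427212, 900: 25.18382077461287}

mean_losses = {epoch: total / n_train for epoch, total in logged_totals.items()}
for epoch, mean_loss in mean_losses.items():
    print('epoch {0}: mean training loss {1:.3f}'.format(epoch, mean_loss))
```

For reference, a model with all-zero weights predicts 0.5 for every example and has a mean cross entropy of log 2, about 0.693.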

In [15]:
# Get the fitted bias and weights.
clf.get_coeff()

(array([[16.52201316]]),
 array([-1.19144577, -3.38372561, -1.25607229, -2.63248332, -1.47542685,
         1.54943377, -4.65400075, -6.75717894, -1.83555175,  2.89953247,
        -7.98049311, -0.75065766, -6.32759953, -4.76821632,  1.50983303,
         4.35688409,  1.92112779,  1.19876298,  2.97012969,  3.19716521,
        -5.61555198, -4.95308185, -4.97166847, -5.27033532, -3.35721015,
        -0.44892995, -4.5583347 , -4.92475631, -3.19141162, -1.24456652]))

In [16]:
# Predicted probabilities for training data.
p_pred_train = clf.predict(X_train)
p_pred_train[:3]

array([9.95419367e-01, 1.26388679e-12, 1.19980720e-04])

In [17]:
# Predicted labels for training data.
y_pred_train = (p_pred_train > 0.5) * 1
y_pred_train[:3]

array([1, 0, 0])

In [18]:
# Predicted label correctness for training data.
# y_pred_train == y_train

In [19]:
# Prediction accuracy for training data.
accuracy(y_train, y_pred_train)

0.9859154929577465
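
The `accuracy` helper is imported from an external `metrics.py` that is not shown here. Assuming it computes the fraction of labels predicted correctly, an equivalent sketch would look like the following. The name `accuracy_sketch` is hypothetical and not taken from that file.

```python
import numpy as np


def accuracy_sketch(y_true, y_pred):
    """Fraction of predictions equal to the true labels (assumed definition)."""
    y_true = np.asarray(y_true).reshape(-1)
    y_pred = np.asarray(y_pred).reshape(-1)
    return np.mean(y_true == y_pred)


demo_acc = accuracy_sketch([1, 0, 0, 1], [1, 0, 1, 1])
print(demo_acc)  # 3 of 4 correct -> 0.75
```

Under that definition, the training accuracy above corresponds to 420 of the 426 training examples being classified correctly.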

In [20]:
# Predicted probabilities and labels for test data.
p_pred_test = clf.predict(X_test)
y_pred_test = (p_pred_test > 0.5) * 1

# y_pred_test == y_test

In [21]:
# Prediction accuracy for test data.
accuracy(y_test, y_pred_test)

0.972027972027972

## 6. Fitting Sklearn's Logistic Regression as Benchmark

In [22]:
# Fit sklearn's Logistic Regression (C is the inverse regularization strength, so C=1e4 makes the penalty very weak).
clf2 = LogisticRegressionSklearn(C=1e4, solver='lbfgs', max_iter=500)

clf2.fit(X_train, y_train)

LogisticRegression(C=10000.0, max_iter=500)

In [23]:
# Get coefficients.
clf2.intercept_, clf2.coef_

(array([56.06250509]),
 array([[  53.5460616 ,  -27.2575739 ,   48.30697654,   10.5636878 ,
          -14.75837806,   98.5009966 ,  -52.51936527,  -52.16906591,
           -5.08742246,  -53.96348797,  -33.97198842,   -5.48905184,
          -19.38885928,  -43.89981909,   38.75665922,  -51.43678914,
           83.21007672,  -21.89925037,   14.96797392,   79.99757062,
          -59.04206865,   -3.91791317,  -63.58395555, -103.96747709,
           -7.9699581 ,   20.04904076,  -21.96650031,  -21.30939901,
          -21.55187209,  -11.69936363]]))

In [24]:
# Predicted labels for training data.
y_pred_train = clf2.predict(X_train)
y_pred_train[:3]

array([1, 0, 0])

In [25]:
# Predicted label correctness for training data.
# y_pred_train == y_train

In [26]:
# Prediction accuracy for training data.
accuracy(y_train, y_pred_train)

1.0

In [27]:
# Predicted labels for test data.
y_pred_test = clf2.predict(X_test) 
# y_pred_test == y_test

In [28]:
# Prediction accuracy for test data.
accuracy(y_test, y_pred_test)

0.965034965034965