# Logistic Regression with Implementation (WIP)

- Bowen Li
- 2018/08/20

## 1. Introduction

Logistic regression is one of the most fundamental machine learning models for binary classification. I will summarize its methodology and implement it from scratch using NumPy.

### Binary classification

For example, a doctor would like to use a patient's features, such as mean radius and mean texture, to classify a breast tumor into one of the following two cases:
- "malignant": $y = 0$
- "benign": $y = 1$

which correspond to the serious and the mild case respectively. This encoding follows the scikit-learn breast cancer dataset used below, whose `target_names` are `['malignant' 'benign']`.

We would like to load the breast cancer data from scikit-learn as a toy dataset, and split the data into the training and test datasets.

## 2. Methodology

In [2]:
# TODO: Summarize methodology.
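
As a brief sketch of the model that Section 4 implements, logistic regression maps a linear score to the probability of label 1, measures the fit with the mean cross entropy, and updates the parameters by gradient descent. The symbols $x_i$, $a_i$, $m$ and $\alpha$ are notation assumed here to match the code's `X`, `A`, `m` and `learning_rate`:

$$
a_i = \sigma(w^\top x_i + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log a_i + (1 - y_i) \log (1 - a_i) \right]
$$

$$
\frac{\partial J}{\partial w} = \frac{1}{m} X^\top (a - y), \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (a_i - y_i)
$$

$$
w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
$$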

## 3. Breast Cancer Dataset

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import sys
import itertools

import numpy as np
import scipy as sp
import pandas as pd

In [2]:
from sklearn.datasets import load_breast_cancer
bc_data = load_breast_cancer()

In [3]:
RANDOM_SEED = 71
TRAIN_PERCENT = 0.7

In [4]:
features = bc_data.get('feature_names')
features = ['_'.join(x.split()) for x in features]
X = bc_data.get('data')
# X = X.reshape((X.shape[1], X.shape[0]))

print('feature_names: \n{}'.format(features))
print('X: \n{}'.format(X))

print('X.shape: {}'.format(X.shape))

feature_names: 
['mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area', 'mean_smoothness', 'mean_compactness', 'mean_concavity', 'mean_concave_points', 'mean_symmetry', 'mean_fractal_dimension', 'radius_error', 'texture_error', 'perimeter_error', 'area_error', 'smoothness_error', 'compactness_error', 'concavity_error', 'concave_points_error', 'symmetry_error', 'fractal_dimension_error', 'worst_radius', 'worst_texture', 'worst_perimeter', 'worst_area', 'worst_smoothness', 'worst_compactness', 'worst_concavity', 'worst_concave_points', 'worst_symmetry', 'worst_fractal_dimension']
X: 
[[  1.79900000e+01   1.03800000e+01   1.22800000e+02 ...,   2.65400000e-01
    4.60100000e-01   1.18900000e-01]
 [  2.05700000e+01   1.77700000e+01   1.32900000e+02 ...,   1.86000000e-01
    2.75000000e-01   8.90200000e-02]
 [  1.96900000e+01   2.12500000e+01   1.30000000e+02 ...,   2.43000000e-01
    3.61300000e-01   8.75800000e-02]
 ..., 
 [  1.66000000e+01   2.80800000e+01   1.08300000e+02 ...,   1

In [5]:
target = bc_data.get('target_names')
y = bc_data.get('target')

print('target_names: {}'.format(target))
print('target: \n{}'.format(y))

print('y.shape: {}'.format(y.shape))

target_names: ['malignant' 'benign']
target: 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1

We perform basic EDA for the breast cancer data.

In [6]:
# EDA for numbers of malignant and benign.
print('Number of malignant: {}'.format((y == 0).sum()))
print('Number of benign: {}'.format((y == 1).sum()))

Number of malignant: 212
Number of benign: 357
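
The class counts above also give a useful reference point for the accuracies reported later. The following sketch computes the accuracy of a hypothetical classifier that always predicts the majority class; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the EDA output above.
n_malignant = 212
n_benign = 357
n_total = n_malignant + n_benign

# Accuracy of a hypothetical classifier that always predicts the majority class.
majority_baseline = max(n_malignant, n_benign) / n_total

print('Total examples: {}'.format(n_total))
print('Majority-class baseline accuracy: {:.1f} %'.format(majority_baseline * 100))
```

Any model worth keeping should beat this baseline of roughly 62.7 % by a clear margin.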


In [7]:
# EDA for feature matrix.
pd.DataFrame(X, columns=features).describe()

| | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | ... | worst_radius | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


In [8]:
def normalize_feature(x, axis=0):
    """Normalize each column or row of the matrix x to have unit L2 norm.
    
    Args:
      x: A numpy array of shape (n, m).
      axis: An integer in {0, 1}, 
        - 0: normalize each feature column.
        - 1: normalize each example row. 
    
    Returns:
      x_normalized: The normalized numpy array, with the same shape as x.
    """
    # Compute x_norm as the norm 2 of x.
    x_norm = np.linalg.norm(x, axis=axis, ord=2, keepdims=True)
    # Divide x by its norm.
    x_normalized = x / x_norm
    return x_normalized
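
To make the effect of this scaling concrete, here is a minimal sketch on a toy matrix. It repeats the same column-wise operation as `normalize_feature(x, axis=0)` and contrasts it with standardization (zero mean, unit variance), which is shown only as an assumed alternative and is not what this notebook uses. The names `X_toy`, `X_unit` and `X_std` are illustrative.

```python
import numpy as np

# Toy matrix: 3 examples (rows), 2 features (columns) on very different scales.
X_toy = np.array([[3.0, 100.0],
                  [4.0, 200.0],
                  [0.0, 200.0]])

# Unit-length scaling per column, the same operation as normalize_feature(x, axis=0).
col_norms = np.linalg.norm(X_toy, axis=0, ord=2, keepdims=True)
X_unit = X_toy / col_norms

# Alternative (an assumption, not used in this notebook): standardization.
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print('Column norms after unit scaling: {}'.format(np.linalg.norm(X_unit, axis=0)))
print('Column means after standardization: {}'.format(X_std.mean(axis=0)))
```

With 569 rows, a column of unit length has entries of the order of $1/\sqrt{569} \approx 0.042$, which is why the summary statistics of the normalized matrix sit around 0.04.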

In [9]:
X = normalize_feature(X)

print('Normalized X: {}'.format(X))
print('Normalized X.shape: {}'.format(X.shape))

# EDA for normalized feature matrix.
pd.DataFrame(X, columns=features).describe()

Normalized X: [[ 0.05180005  0.02201907  0.0541219  ...,  0.08423164  0.06503422
   0.05805201]
 [ 0.05922885  0.03769547  0.05857329 ...,  0.05903197  0.0388707
   0.04346333]
 [ 0.056695    0.04507758  0.05729517 ...,  0.07712242  0.05106903
   0.04276026]
 ..., 
 [ 0.04779771  0.05956605  0.04773128 ...,  0.04500395  0.03135099
   0.03818055]
 [ 0.05931523  0.06221767  0.06174656 ...,  0.08410469  0.05776893
   0.06054205]
 [ 0.02234399  0.05205665  0.02111988 ...,  0.          0.04058101
   0.03436738]]
Normalized X.shape: (569, 30)


| | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | ... | worst_radius | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 0.040678 | 0.040919 | 0.040534 | 0.036935 | 0.041483 | 0.03741 | 0.031208 | 0.032855 | 0.041451 | 0.04166 | ... | 0.040189 | 0.040772 | 0.040008 | 0.035214 | 0.041313 | 0.035658 | 0.033284 | 0.036373 | 0.041002 | 0.040986 |
| std | 0.010147 | 0.009124 | 0.010709 | 0.019848 | 0.006055 | 0.018936 | 0.028017 | 0.026061 | 0.006273 | 0.004684 | ... | 0.011939 | 0.00976 | 0.012534 | 0.022768 | 0.007126 | 0.022065 | 0.025511 | 0.020862 | 0.008745 | 0.008818 |
| min | 0.020101 | 0.020598 | 0.0193 | 0.008093 | 0.022657 | 0.006949 | 0.0 | 0.0 | 0.024254 | 0.033144 | ... | 0.019589 | 0.019086 | 0.018803 | 0.007406 | 0.022213 | 0.003827 | 0.0 | 0.0 | 0.022121 | 0.026873 |
| 25% | 0.033689 | 0.034301 | 0.03313 | 0.023705 | 0.037183 | 0.023276 | 0.010389 | 0.013641 | 0.037044 | 0.038278 | ... | 0.032138 | 0.033473 | 0.031373 | 0.020606 | 0.036392 | 0.020643 | 0.014001 | 0.020607 | 0.035394 | 0.03489 |
| 50% | 0.038497 | 0.039965 | 0.038009 | 0.031082 | 0.041272 | 0.033212 | 0.021628 | 0.0225 | 0.041002 | 0.040826 | ... | 0.03698 | 0.040348 | 0.036427 | 0.027452 | 0.04098 | 0.029717 | 0.027721 | 0.031715 | 0.039888 | 0.039079 |
| 75% | 0.045437 | 0.046244 | 0.04588 | 0.044144 | 0.045332 | 0.046754 | 0.045933 | 0.0497 | 0.044778 | 0.043864 | ... | 0.046416 | 0.047192 | 0.046774 | 0.043348 | 0.045568 | 0.047555 | 0.046822 | 0.051225 | 0.044935 | 0.044957 |
| max | 0.080939 | 0.083325 | 0.083078 | 0.141055 | 0.070344 | 0.12384 | 0.149994 | 0.135132 | 0.069557 | 0.064642 | ... | 0.089028 | 0.078664 | 0.093697 | 0.170113 | 0.069475 | 0.148372 | 0.153097 | 0.092356 | 0.093827 | 0.10131 |


In [10]:
np.random.seed(RANDOM_SEED)
train_flag = np.random.rand(X.shape[0]) < TRAIN_PERCENT

X_train = X[train_flag]
y_train = y[train_flag]
X_test = X[~train_flag]
y_test = y[~train_flag]

print('X_train.shape: {}'.format(X_train.shape))
print('y_train.shape: {}'.format(y_train.shape))
print('X_test.shape: {}'.format(X_test.shape))
print('y_test.shape: {}'.format(y_test.shape))

X_train.shape: (392, 30)
y_train.shape: (392,)
X_test.shape: (177, 30)
y_test.shape: (177,)
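
Because the mask above draws each row independently, the training set size is itself random (392 of 569 rows here, about 69 %). If an exact split size is preferred, one option is to shuffle the row indices and cut at a fixed position. The sketch below is an assumed alternative, not the method used in this notebook; `n_examples`, `train_idx` and `test_idx` are illustrative names.

```python
import numpy as np

RANDOM_SEED = 71
TRAIN_PERCENT = 0.7
n_examples = 569  # number of rows in the breast cancer data

# Shuffle the row indices, then cut at a fixed position.
rng = np.random.RandomState(RANDOM_SEED)
shuffled_idx = rng.permutation(n_examples)
n_train = int(TRAIN_PERCENT * n_examples)

train_idx = shuffled_idx[:n_train]
test_idx = shuffled_idx[n_train:]

print('n_train: {}, n_test: {}'.format(len(train_idx), len(test_idx)))
```

The index arrays can then be used as `X[train_idx]`, `y[train_idx]`, and so on, in place of the Boolean mask.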


## 4. Implementation from scratch

In [11]:
"""Logistic regression using function framework."""

def sigmoid(x):
    """Compute the sigmoid of x.

    Args:
      x: A scalar or numpy array of any size.

    Returns:
      s: sigmoid(x).
    """
    s = 1 / (1 + np.exp(-x))    
    return s


def initialize_weights(dim):
    """Initialize weights.

    This function creates a vector of zeros of shape (dim, 1) for w 
    and initializes b to 0.
    
    Args:
      dim: An integer. Size of the w vector (or number of parameters).
    
    Returns:
      w: A Numpy array. Initialized vector of shape (dim, 1).
      b: An integer. Initialized scalar (corresponds to the bias).
    """
    w = np.zeros(dim).reshape(dim, 1)
    b = 0
    assert(w.shape == (dim, 1))
    assert(isinstance(b, float) or isinstance(b, int))
    return w, b


def activation(w, b, X):
    """Activation function using sigmoid function."""
    A = sigmoid(np.dot(X, w) + b)
    return A

def cross_entropy(y, A, m):
    """Mean cross entropy between labels y and predicted probabilities A."""
    cost = - 1 / m * np.sum(y * np.log(A) + (1 - y) * np.log(1 - A))
    return cost

def gradient(X, y, A, m):
    """Gradient for weight vector and bias."""
    dw = 1 / m * np.dot(X.T, (A - y))
    db = 1 / m * np.sum(A - y)
    return dw, db

def propagate(w, b, X, y):
    """Forward & backward propagation.

    Implement the cost function and its gradient for the propagation.

    Args:
      w: A Numpy array. Weights of size (number of features, 1).
      b: A float. Bias.
      X: A Numpy array. Data of size (number of examples, number of features).
      y: A Numpy array. True "label" vector (containing 0 or 1) 
         of size (number of examples,) or (number of examples, 1).

    Returns:
      grads: A dictionary with
        - "dw": A Numpy array. Gradient of the cost w.r.t. w, same shape as w.
        - "db": A float. Gradient of the cost w.r.t. b.
      cost: A float. Negative log-likelihood cost for logistic regression.
    """
    m = X.shape[0]
    y = y.reshape((m, 1))

    # Forward propagation from X to cost.
    # Compute activation.
    A = activation(w, b, X)
    # Compute cost.
    cost = cross_entropy(y, A, m)
    
    # Backward propagation to find gradient.
    dw, db = gradient(X, y, A, m)
    assert(dw.shape == w.shape)
    assert(db.dtype == float)

    cost = np.squeeze(cost)
    assert(cost.shape == ())

    grads = {"dw": dw,
             "db": db} 

    return grads, cost


def gradient_descent(w, b, X, y, num_iterations, learning_rate, print_cost=True):
    """Optimize using gradient descent.

    This function optimizes w and b by running a gradient descent algorithm.
    It iterates over two steps:
      1. Calculate the cost and the gradient for the current parameters 
        using propagate().
      2. Update the parameters using the gradient descent rule for w and b.
    
    Args:
      w: A Numpy array. Weights of size (number of features, 1).
      b: A scalar. Bias.
      X: A Numpy array. Data of shape (number of examples, number of features).
      y: A Numpy array. True "label" vector (containing 0 or 1), 
        of shape (number of examples,) or (number of examples, 1).
      num_iterations: An integer. Number of iterations of the optimization loop.
      learning_rate: A scalar. Learning rate of the gradient descent update rule.
      print_cost: A Boolean. Print the cost every 200 steps. Default: True.
    
    Returns:
      params: A dictionary containing the weights w and bias b.
      grads: A dictionary containing the gradients of the weights and bias 
        with respect to the cost function
      costs: A list of all the costs computed during the optimization, 
        this will be used to plot the learning curve.
    """   
    costs = []

    for i in range(num_iterations):
        # Cost and gradient calculation.
        grads, cost = propagate(w, b, X, y)
        
        # Retrieve derivatives from grads
        dw = grads.get('dw')
        db = grads.get('db')
        
        # Update rule.
        w -= learning_rate * dw
        b -= learning_rate * db
        
        # Record and print the cost every 200 iterations.
        if i % 200 == 0:
            costs.append(cost)
        if print_cost and i % 200 == 0:
            print("Cost after iteration %i: %f" %(i, cost))
    
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads, costs


def predict(w, b, X):
    """Prediction.

    Predict whether the label is 0 or 1 using learned logistic regression 
    parameters (w, b).
    
    Args:
      w: A Numpy array. Learned weights of size (number of features, 1).
      b: A scalar. Learned bias.
      X: A Numpy array. New data of size (number of examples, number of features).
    
    Returns:
      y_pred: A Numpy array of shape (number of examples, 1) containing 
        all predictions (0/1) for the examples in X.
    """
    m = X.shape[0]
    
    # Compute vector "A" predicting the probabilities of a label 1.
    A = activation(w, b, X)
    
    # Convert probabilities to 0/1 predictions with a 0.5 threshold.
    y_pred = (A > 0.5).astype(float)
    
    assert(y_pred.shape == (m, 1))
    
    return y_pred


def accuracy(y_pred, y):
    """Fraction of correct 0/1 predictions; y_pred and y must share a shape."""
    acc = 1 - np.mean(np.abs(y_pred - y))
    return acc


def logistic_regression(X_train, y_train, X_test, y_test, 
                        num_iterations=2000, learning_rate=0.5, print_cost=True):
    '''Wrap-up function for logistic regression.

    Builds the logistic regression model by calling the functions 
    implemented above.
    
    Args:
      X_train: A Numpy array. Training set of shape (m_train, number of features).
      y_train: A Numpy array. Training labels of shape (m_train,).
      X_test: A Numpy array. Test set of shape (m_test, number of features).
      y_test: A Numpy array. Test labels of shape (m_test,).
      num_iterations: An integer. Hyperparameter for the number of iterations 
        to optimize the parameters. Default: 2000.
      learning_rate: A scalar. Hyperparameter for the learning rate used 
        in the update rule of gradient_descent(). Default: 0.5.
      print_cost: A Boolean. Print the cost every 200 iterations. Default: True.
    
    Returns:
      d: A dictionary containing information about the model.
    '''    
    # Initialize parameters with zeros.
    w, b = initialize_weights(X_train.shape[1])

    # Gradient descent.
    parameters, grads, costs = gradient_descent(
        w, b, X_train, y_train, 
        num_iterations=num_iterations, learning_rate=learning_rate, 
        print_cost=print_cost)
    
    # Retrieve parameters w and b from dictionary 'parameters'
    w = parameters.get('w')
    b = parameters.get('b')
    
    # Predict train/test set examples.
    y_pred_train = predict(w, b, X_train)
    y_pred_test = predict(w, b, X_test)

    # Print train/test accuracy.
    print('Train accuracy: {} %'
          .format(accuracy(y_pred_train.ravel(), y_train) * 100))
    print('Test accuracy: {} %'
          .format(accuracy(y_pred_test.ravel(), y_test) * 100))
    
    d = {'costs': costs,
         'y_pred_train': y_pred_train, 
         'y_pred_test': y_pred_test, 
         'w': w, 
         'b': b,
         'learning_rate' : learning_rate,
         'num_iterations': num_iterations}
    return d
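
One way to gain confidence in the closed-form gradient used by `gradient()` is to compare it with a finite-difference estimate of the cost. The sketch below does this on a small synthetic problem; the data and the helper names `cost_fn` and `analytic_grad` are hypothetical and only restate the formulas from the cell above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_fn(w, b, X, y):
    """Mean cross entropy, as in cross_entropy() above."""
    A = sigmoid(X.dot(w) + b)
    return -np.mean(y * np.log(A) + (1 - y) * np.log(1 - A))

def analytic_grad(w, b, X, y):
    """Closed-form gradient, as in gradient() above."""
    m = X.shape[0]
    A = sigmoid(X.dot(w) + b)
    return X.T.dot(A - y) / m, np.sum(A - y) / m

# Small synthetic problem (hypothetical data, for checking only).
rng = np.random.RandomState(0)
X_chk = rng.randn(20, 3)
y_chk = (rng.rand(20, 1) > 0.5).astype(float)
w_chk = rng.randn(3, 1) * 0.1
b_chk = 0.05

dw, db = analytic_grad(w_chk, b_chk, X_chk, y_chk)

# Central finite differences for each weight and for the bias.
eps = 1e-6
dw_num = np.zeros_like(w_chk)
for j in range(w_chk.shape[0]):
    step = np.zeros_like(w_chk)
    step[j] = eps
    dw_num[j] = (cost_fn(w_chk + step, b_chk, X_chk, y_chk)
                 - cost_fn(w_chk - step, b_chk, X_chk, y_chk)) / (2 * eps)
db_num = (cost_fn(w_chk, b_chk + eps, X_chk, y_chk)
          - cost_fn(w_chk, b_chk - eps, X_chk, y_chk)) / (2 * eps)

max_diff = max(np.max(np.abs(dw - dw_num)), abs(db - db_num))
print('Max abs difference: {:.2e}'.format(max_diff))
```

A very small difference suggests that the analytic gradient and the cost function agree with each other.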

In [12]:
d = logistic_regression(X_train, y_train, X_test, y_test, 
                        num_iterations=10000, learning_rate=0.75)

Cost after iteration 0: 0.693147
Cost after iteration 200: 0.564298
Cost after iteration 400: 0.499417
Cost after iteration 600: 0.452810
Cost after iteration 800: 0.417814
Cost after iteration 1000: 0.390538
Cost after iteration 1200: 0.368625
Cost after iteration 1400: 0.350578
Cost after iteration 1600: 0.335411
Cost after iteration 1800: 0.322446
Cost after iteration 2000: 0.311204
Cost after iteration 2200: 0.301337
Cost after iteration 2400: 0.292586
Cost after iteration 2600: 0.284755
Cost after iteration 2800: 0.277690
Cost after iteration 3000: 0.271273
Cost after iteration 3200: 0.265409
Cost after iteration 3400: 0.260020
Cost after iteration 3600: 0.255044
Cost after iteration 3800: 0.250430
Cost after iteration 4000: 0.246133
Cost after iteration 4200: 0.242119
Cost after iteration 4400: 0.238355
Cost after iteration 4600: 0.234817
Cost after iteration 4800: 0.231482
Cost after iteration 5000: 0.228329
Cost after iteration 5200: 0.225344
Cost after iteration 5400: 0.222510
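
The first printed cost, 0.693147, can be checked by hand. With zero-initialized `w` and `b`, every predicted probability is `sigmoid(0) = 0.5`, so each example contributes $-\log 0.5 = \log 2$ to the cross entropy regardless of its label. A minimal sketch of that arithmetic (variable names are illustrative):

```python
import math

# With w = 0 and b = 0, sigmoid(0) = 0.5 for every example.
a0 = 1 / (1 + math.exp(-0.0))

# Cross entropy of one example; identical for y = 1 and y = 0 when a0 = 0.5.
cost_if_y1 = -math.log(a0)
cost_if_y0 = -math.log(1 - a0)

print('Initial cost: {:.6f}'.format(cost_if_y1))
```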

## 5. Benchmark with sklearn's LogisticRegression

In [13]:
from sklearn.linear_model import LogisticRegression

# Default parameter setting.
print(LogisticRegression())

logist_reg = LogisticRegression(C=1e10, max_iter=100)
logist_reg.fit(X_train, y_train)
y_pred_train_skl = logist_reg.predict(X_train)
y_pred_test_skl = logist_reg.predict(X_test)

print('Train accuracy: {} %'
      .format(accuracy(y_pred_train_skl, y_train) * 100))
print('Test accuracy: {} %'
      .format(accuracy(y_pred_test_skl, y_test) * 100))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Train accuracy: 100.0 %
Test accuracy: 93.7853107345 %


The sklearn result is generally quite similar to that of our from-scratch implementation: both give a test accuracy of about 93 %. Nevertheless, the two differ in some respects. By default sklearn applies L2 regularization to reduce model complexity (see later), although setting `C=1e10` above makes that penalty negligible, and it fits the weights and bias with the `liblinear` solver rather than the batch gradient descent used in our implementation.
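
To illustrate what regularization does to the optimization, the sketch below adds an L2 penalty of $\frac{\lambda}{2} \lVert w \rVert^2$ to the mean cross entropy and shows the resulting change in the weight gradient. The function `l2_gradient` and the parameter `lam` are hypothetical and not part of the implementation above; in scikit-learn, `C` plays the role of an inverse regularization strength, so a very large `C` corresponds to a very small penalty.

```python
import numpy as np

def l2_gradient(X, y, A, w, lam):
    """Gradient of mean cross entropy plus (lam / 2) * ||w||^2 w.r.t. w.

    lam is a hypothetical penalty strength; lam = 0 recovers gradient().
    The bias is left unpenalized and is omitted here.
    """
    m = X.shape[0]
    dw = np.dot(X.T, (A - y)) / m
    return dw + lam * w

# Tiny hypothetical example: 4 examples, 2 features, bias fixed at 0.
X_reg = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.3]])
y_reg = np.array([[0.0], [1.0], [0.0], [1.0]])
w_reg = np.array([[1.0], [-2.0]])
A_reg = 1 / (1 + np.exp(-np.dot(X_reg, w_reg)))

dw_plain = l2_gradient(X_reg, y_reg, A_reg, w_reg, lam=0.0)
dw_l2 = l2_gradient(X_reg, y_reg, A_reg, w_reg, lam=0.1)

# The penalty adds lam * w, pulling every weight toward zero.
print('Extra gradient from the penalty: {}'.format((dw_l2 - dw_plain).ravel()))
```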

## References

In [1]:
# TODO: Add references.