Creative Commons CC BY 4.0 Lynd Bacon & Associates, Ltd. Not warranted to be suitable for any particular purpose. (You're on your own!)

# Classification Using Binary Logistic Regression

Many supervised ML problems involve learning to predict a target variable whose labels take one of a finite number of discrete values.

There are many different ML algorithms used for classification.  They include:

* logistic regression (binomial, multinomial)
* support vector machine (SVM)
* Ridge, Lasso, elasticNet
* CART (also for regression)
* AdaBoost (a "boosted" ensemble classifier, also for regression)
* RandomForest (an ensemble method that can also do regression and survival models)
* Neural networks of various sorts

We're going to start our exploration of classifiers with the simple case of predicting a target's labels that have only two values.  We'll use the famous WI breast cancer data set.  It's not large, but it's large enough for our present purposes.  It's also not that easy for classifiers to perform well on.

# Gradient Descent

The training of classifiers is often done using a _gradient descent_ method.  The gradient is the set of partial derivatives of a cost function to be minimized w.r.t. ("with respect to") the parameters to be estimated during training. These parameters are often referred to as _weights_, or _coefficients_.  The cost function is minimized iteratively: at each step the gradient is evaluated at the current values of the parameters, and the parameters are then adjusted in the direction of the negative gradient by an amount set by a specified "learning rate."

[Stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) ("SGD") is a gradient descent method that uses a randomly selected datapoint to evaluate the gradient at particular values of the parameters being learned.


## Gradient for a Simple Model

For example, assuming an L2 loss function, the gradient for a conventional _regression_ model could look like 

\begin{align}
\large
\frac{\partial}{\partial w} Loss(W) = \frac{\partial}{\partial w} \lVert y - h_w(x) \rVert^2
\end{align}

where:

y is a vector of target (dependent variable) values;  
W is the vector of weights to be estimated (learned);  
h is some activation function, possibly the identity (that is, no transformation);  
$h_w(x)$ is h applied to the product of the weight vector W and the input variables (features) __x__;  
y - $h_w$(x) is the vector of errors.
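
As a quick check on the definition above, here is a minimal sketch that assumes h is the identity, so that $h_w(x)$ is just the matrix product of the features and the weights. Under that assumption the gradient of the L2 loss is $-2X^T(y - XW)$. The toy data and the names `X_toy`, `w_toy`, and `l2_loss` are made up for illustration; the analytic gradient is compared with a finite-difference approximation.

```python
import numpy as np

# Toy data, made up for illustration only
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=50)
w_toy = np.array([0.5, -1.0, 2.0])   # arbitrary point at which to evaluate the gradient

def l2_loss(w):
    """Sum of squared errors for the linear model X_toy @ w."""
    errors = y_toy - X_toy @ w
    return errors @ errors

# Analytic gradient of the L2 loss for a linear model: -2 X^T (y - Xw)
analytic_grad = -2.0 * X_toy.T @ (y_toy - X_toy @ w_toy)

# Central finite differences, one weight at a time
eps = 1e-6
numeric_grad = np.array([
    (l2_loss(w_toy + eps * e) - l2_loss(w_toy - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(analytic_grad)
print(numeric_grad)
```

The two printed vectors should agree to several decimal places.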

##  Gradient Descent

* A model's **w's** (weights) can be solved for analytically if the model is a standard linear regression model:
    * **y** is a continuous measure;
    * The RHS of the model equation is _linear in its parameters_, e.g. for "P" predictor variables X<sub>p</sub>: 
    
    $\large {w_0+w_1 * X_1+w_2 * X_2 +...w_P*X_P}$ 
      
  
* L2 Loss is a quadratic function of the **w<sub>p</sub>'s**.
* For pretty much all other model forms, a closed-form analytical solution isn't available, so _iterative_ use of the gradient is what's done. In the simple, one **w** case,  
    * Initialize **w_i** with a starting value
    * Loop until Loss(**w**) is minimized:  
      
        * $\large {w_i \gets w_i - \alpha  \frac{\partial Loss}{\partial w_i}}$ 
        
where:
$\alpha$ is the step size, or _**learning rate**_ . 

Note that the _negative_ of the gradient is used in order to point in the direction of decreasing Loss.
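
The update rule above can be sketched for a one-weight model. This is only an illustration under stated assumptions: the toy data, the mean squared form of the L2 loss, the learning rate of 0.1, and the fixed count of 500 steps (in place of a convergence test) are all choices made here, not part of the original notebook.

```python
import numpy as np

# Toy data, made up for illustration: y is roughly 3 * x
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

def loss(w):
    """Mean squared (L2) loss for the one-weight model y_hat = w * x."""
    return np.mean((y - w * x) ** 2)

def grad(w):
    """Derivative of the loss with respect to w."""
    return -2.0 * np.mean((y - w * x) * x)

w = 0.0        # initial value of the weight
alpha = 0.1    # learning rate (step size)
for step in range(500):
    w = w - alpha * grad(w)   # step against the gradient

# For this linear model the minimizing weight is also available in closed form
w_closed_form = np.sum(x * y) / np.sum(x * x)
print(w, w_closed_form, loss(w))
```

Because this toy model is linear in **w**, the iterative answer can be compared with the closed-form one; for most classifiers only the iterative route is available.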
    


# Cross Entropy Loss 

A common type of cost function that ML classifiers minimize is based on [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy), an _Information Theory_ concept.  It provides a measure of the dissimilarity between the predicted probabilities of target variable class labels and what the actual class labels are.

The "conventional" definition of cross entropy for sets of "i" predicted and actual _discrete_ events (like target labels) is:

\begin{align}
\large
H(p,q) = - \sum_{i} p_i \log(q_i)
\end{align}

As applied for training binary logistic ML algorithms:

$p_i$ is the "true" probability of observation i's class label,  
$q_i$ is the algorithm's predicted label probability.

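Assuming that the target variable's labels are 0 and 1, the true label is known with certainty: for a case i whose label is 1, $p_i$ = 1 and 1-$p_i$ = 0 (and the reverse for a case whose label is 0). $q_i$ = the predicted probability of 1, and 1-$q_i$ = the predicted probability of 0.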

The cost function to be minimized can be calculated as the average of the cross-entropies across the N cases i:

\begin{align}
\large
C(\text{params to be learned}) = - \frac {1}{N} \sum_{i=1}^{N} \Big[ y_i \log(\hat {y_i})+(1-y_i)\log(1-\hat {y_i}) \Big]
\end{align}

where:

$y_i$ is the true class label, 0 or 1, for case i;  
$\hat {y_i}$ is the predicted _probability_ that case i's class label is 1, which will be in the range [0,1].
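
A small numeric sketch of this cost function may help. The five labels and predicted probabilities below are made up for illustration, and `cross_entropy_cost` is a hypothetical helper name. The result is compared with scikit-learn's `log_loss`, which computes the same average when given the predicted probabilities of label 1.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and predicted probabilities of label 1
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([0.9, 0.2, 0.6, 0.3, 0.1])

def cross_entropy_cost(y, y_hat):
    """Average binary cross entropy, following the formula above."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

cost = cross_entropy_cost(y_true, y_hat)
print(cost, log_loss(y_true, y_hat))
```

Note how the fourth case (true label 1, predicted probability 0.3) contributes the most to the cost: confident wrong predictions are penalized heavily.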

# Loading Modules To Use  

We'll get the data from the scikit-learn dataset collection.

In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import pandas as pd
from sklearn import linear_model  
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn import preprocessing
from sklearn.base import clone
from sklearn import datasets
from sklearn.metrics import accuracy_score
import pickle
import os

# WI Breast CA Data

We'll get them from scikit-learn's datasets collection.  What we'll be importing is a scikit-learn "Bunch" object, which behaves like a Python dict.

In [4]:
breastCA=datasets.load_breast_cancer()
type(breastCA)
breastCA.keys()

sklearn.utils.Bunch

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [5]:
# The description of this dataset:
print(breastCA['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [6]:
# feature names, target names
breastCA['feature_names']
breastCA['target_names']

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

array(['malignant', 'benign'], dtype='<U9')

The names of the target categories are "malignant" and "benign."  Let's see what the corresponding codes ("labels") are in the "target" variable:

In [7]:
target_values, value_counts = np.unique(breastCA['target'], return_counts=True)
print(target_values, value_counts)

[0 1] [212 357]


We know from what's in this dataset's "DESCR" that there are 212 cases classified as malignant.  So the target is coded 0=malignant, 1=benign.  As we prep this data for training our binary logistic classifier, we'll reverse this coding so that our models are predicting malignancy, target=1.
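
The coding can also be checked directly, without relying on the description. This is a small sketch that assumes, as in scikit-learn's Bunch objects, that position k in `target_names` names target code k; the names `bunch` and `label_counts` are illustrative.

```python
import numpy as np
from sklearn import datasets

bunch = datasets.load_breast_cancer()
codes, counts = np.unique(bunch['target'], return_counts=True)

# Pair each numeric code with its name by using the code as an index into target_names
label_counts = {str(bunch['target_names'][c]): int(n) for c, n in zip(codes, counts)}
print(label_counts)   # {'malignant': 212, 'benign': 357}
```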

# Doing CV'ed Binary Logistic Regression

We need to do the "usual" creation of numpy arrays. Then we're going to go about our CV, including doing MinMax rescaling of features within CV folds.

Here's where you can find the documentation on the [Logistic Regression Algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

## Munging the Data

In [8]:
# create the numpy arrays we need:

X=breastCA['data']  # features
y=breastCA['target'] # labels: 0=malignancy, 1=benign
y=1-y                # relabelled: 0=benign, 1=malignancy

## Setting up Stratified KFold CV

In [9]:
# Stratification will create similar proportions of target values in the folds
# The result may be decreased variance

skf = StratifiedKFold(n_splits=20, random_state=99,shuffle=True) 
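
To see what stratification does, here is a minimal sketch that computes the proportion of malignant cases in each test fold. The names `y_demo`, `X_placeholder`, `skf_demo`, and `fold_props` are illustrative; the placeholder feature array is used only because `split()` needs something with the right number of rows.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

y_demo = 1 - datasets.load_breast_cancer()['target']   # recoded so that 1 = malignant
X_placeholder = np.zeros((len(y_demo), 1))             # features are not used for stratifying

skf_demo = StratifiedKFold(n_splits=20, random_state=99, shuffle=True)
fold_props = [y_demo[test_ndx].mean()
              for _, test_ndx in skf_demo.split(X_placeholder, y_demo)]

# Overall proportion of malignant cases, then the smallest and largest fold proportions
print(round(y_demo.mean(), 3), round(min(fold_props), 3), round(max(fold_props), 3))
```

Each fold's proportion should sit close to the overall proportion, which is the point of stratifying.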

## Creating a MinMax Scaler and a logit Regression Model Instance

In [10]:
# MinMax scaler

scaler=preprocessing.MinMaxScaler()

#  logistic regression algorithm
 # logreg alg instance; using defaults except explicit spec for solver

logit_clf=linear_model.LogisticRegression(solver='lbfgs')   

## CV

In [11]:

#
cvres=[]  # list to hold fold results
#
for train_ndx, test_ndx in skf.split(X, y):
    clone_clf = clone(logit_clf)
    X_trainS=scaler.fit_transform(X[train_ndx])
    y_train = y[train_ndx]
    X_testS=scaler.fit_transform(X[test_ndx])
    y_test = y[test_ndx]

    foldfit=clone_clf.fit(X_trainS, y_train)

    y_pred_test=foldfit.predict(X_testS)
    y_pred_train=foldfit.predict(X_trainS)
    
    trainAcc=accuracy_score(y_train,y_pred_train)
    testAcc=accuracy_score(y_test,y_pred_test)
    cvres.append({'train_accuracy':trainAcc,'test_accuracy':testAcc})


In [12]:
pd.DataFrame(cvres)[['train_accuracy','test_accuracy']].describe()

Unnamed: 0,train_accuracy,test_accuracy
count,20.0,20.0
mean,0.972343,0.893783
std,0.002586,0.091995
min,0.968519,0.642857
25%,0.97037,0.882184
50%,0.972222,0.928571
75%,0.974074,0.939017
max,0.97963,1.0


Notice anything interesting about the results?
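
One thing the results above may reflect: the loop calls `scaler.fit_transform` on each test fold, so the test features are rescaled with the test fold's own minimum and maximum rather than the training fold's. A minimal sketch of the more usual arrangement follows, assuming a scikit-learn `Pipeline` so that the scaler is fit on the training folds only. `cross_validate` is swapped in here for the hand-written loop, and the names `pipe_cv`, `cv_demo`, and `scores` are illustrative, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

data = datasets.load_breast_cancer()
X_all, y_all = data['data'], 1 - data['target']   # 1 = malignant

# Inside each fold the pipeline fits the scaler on the training part only,
# then applies that same scaling to the held-out part.
pipe_cv = make_pipeline(MinMaxScaler(), LogisticRegression(solver='lbfgs'))
cv_demo = StratifiedKFold(n_splits=20, random_state=99, shuffle=True)

scores = cross_validate(pipe_cv, X_all, y_all, cv=cv_demo,
                        scoring='accuracy', return_train_score=True)
print(scores['train_score'].mean(), scores['test_score'].mean())
print(scores['test_score'].std())
```

Comparing the spread of these test accuracies with the table above is one way to judge how much the scaling step matters.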

In [13]:
foldfit.coef_

array([[ 1.94682499,  1.62151943,  1.89901363,  1.59851144,  0.5736114 ,
         0.34446353,  1.39732337,  2.06521426,  0.55338493, -0.93793789,
         1.24467273,  0.03424528,  0.96329828,  0.81705987,  0.06979314,
        -0.60236723, -0.25995023,  0.27001301, -0.17996763, -0.62788904,
         2.45716617,  2.18913158,  2.23106309,  1.73189442,  1.42796693,
         0.79268846,  1.32296286,  2.6326004 ,  1.32951989,  0.32321998]])

# Pickling

Let's [_pickle_ ](https://wiki.python.org/moin/UsingPickle) a random split into training and test data, along with _predicted_ labels and label _probabilities_ for the training and test target. We'll use these in another notebook that's about _classifier performance measurement_.

_Pickling_ is a Python method for _serializing_ (creating a nonvolatile, storable version of) Python objects.

We'll put our data and predictions into a dict, that we'll then pickle.

In [14]:
# First, let's get our training and test split 

X_train, X_test, y_train, y_test = \
    train_test_split(X,y,stratify=y,random_state=99,shuffle=True)

In [15]:
# checking to see whether training and test data have similar proportions of responses

unique_ytrain, counts_ytrain = np.unique(y_train, return_counts=True)
unique_ytest, counts_ytest = np.unique(y_test, return_counts=True)

print('y_train label proportions',counts_ytrain/len((y_train)))
print('y_test label proportions',counts_ytest/len((y_test)))


y_train label proportions [0.62676056 0.37323944]
y_test label proportions [0.62937063 0.37062937]


In [16]:
#  logistic regression algorithm
 # logreg alg instance; using defaults except explicit spec for solver, and max iterations=10000

logit_clf=linear_model.LogisticRegression(solver='lbfgs',max_iter=10000)   

In [17]:
# Get predictions

LogRegM=logit_clf.fit(X_train,y_train)  # fit the model (fit() returns the fitted estimator)
yTrainPredLabels=LogRegM.predict(X_train)   # pred training labels
yTestPredLabels=LogRegM.predict(X_test)     # pred test labels
yTrainPredProbs=LogRegM.predict_proba(X_train) # pred training probs
yTestPredProbs=LogRegM.predict_proba(X_test)   # pred test probs
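
For reference, here is a small sketch of how `predict_proba` output is laid out. It refits a model on the full data set purely for illustration; the names `X_demo`, `y_demo`, `clf_demo`, `probs`, and `p_malignant` are not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

data = datasets.load_breast_cancer()
X_demo, y_demo = data['data'], 1 - data['target']   # 1 = malignant
clf_demo = LogisticRegression(solver='lbfgs', max_iter=10000).fit(X_demo, y_demo)

probs = clf_demo.predict_proba(X_demo)   # shape (N, 2); columns follow clf_demo.classes_
p_malignant = probs[:, 1]                # column for label 1

# Thresholding the label-1 probability at 0.5 should reproduce predict()
labels_from_probs = (p_malignant > 0.5).astype(int)
print(clf_demo.classes_, probs.shape)
print(np.array_equal(labels_from_probs, clf_demo.predict(X_demo)))
```

Each row of `probs` sums to 1, so for a binary target the second column carries all of the information.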

In [18]:
# Get a quick look at the training and test accuracies

print('Training Accuracy: {0:1.2f}'.format(accuracy_score(y_train,yTrainPredLabels)))
print('Test Accuracy: {0:1.2f}'.format(accuracy_score(y_test,yTestPredLabels)))

Training Accuracy: 0.96
Test Accuracy: 0.94


## Create a Dictionary of Data to Serialize

The predicted class probabilities are N x 2 arrays, so we'll save just the column that's for the label of 1.  Then we'll create our dict.  After that, we'll write it to a pickle file.

In [29]:
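# keep only the column for label 1 (malignant); columns follow LogRegM.classes_
yTrainPredProbs=LogRegM.predict_proba(X_train)[:,1]
yTestPredProbs=LogRegM.predict_proba(X_test)[:,1]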

breastCADict = {'X_train':X_train,'y_train':y_train,
               'X_test': X_test, 'y_test':y_test,
               'yTestPredLabels':yTestPredLabels,
               'yTestPredProbs':yTestPredProbs,
               'yTrainPredLabels':yTrainPredLabels,
               'yTrainPredProbs':yTrainPredProbs}
breastCADict.keys()

dict_keys(['X_train', 'y_train', 'X_test', 'y_test', 'yTestPredLabels', 'yTestPredProbs', 'yTrainPredLabels', 'yTrainPredProbs'])

In [32]:
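# Pickle it to DATA/ML/ under the working directory, creating the directory if needed
os.makedirs('DATA/ML', exist_ok=True)
with open('DATA/ML/rBinLogData2.pkl','wb') as pickleOutFile: # write, binary format
    pickle.dump(breastCADict,pickleOutFile)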
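
Reading a pickle back is the mirror image of writing it. The sketch below is a self-contained round trip that uses a small made-up dict and a temporary directory instead of the notebook's file; `demo_dict`, `demo_path`, and `restored` are illustrative names.

```python
import os
import pickle
import tempfile

import numpy as np

# A small stand-in for the dict of data and predictions
demo_dict = {'y_test': np.array([0, 1, 1]),
             'yTestPredProbs': np.array([0.1, 0.8, 0.6])}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.pkl')
    with open(demo_path, 'wb') as pickle_out:   # write, binary format
        pickle.dump(demo_dict, pickle_out)
    with open(demo_path, 'rb') as pickle_in:    # read, binary format
        restored = pickle.load(pickle_in)

print(restored.keys())
print(restored['yTestPredProbs'])
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.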

Note that another way to serialize several objects is using the [shelve module](https://docs.python.org/3.5/library/shelve.html).  A shelf is a persistent, dictionary-like object. Items in it are stored and retrieved using keys.
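
A minimal sketch of the shelve approach, using a temporary directory and made-up contents; `shelf_path`, `restored_y`, and `stored_keys` are illustrative names.

```python
import os
import shelve
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    shelf_path = os.path.join(tmp_dir, 'demo_shelf')

    # Store items under string keys, as with a dict
    with shelve.open(shelf_path) as shelf:
        shelf['y_test'] = np.array([0, 1, 1])
        shelf['note'] = 'toy example'

    # Reopen the shelf and retrieve items by key
    with shelve.open(shelf_path) as shelf:
        restored_y = shelf['y_test']
        stored_keys = sorted(shelf.keys())

print(stored_keys, restored_y)
```

Unlike a single pickle file, a shelf lets you read back one item at a time without loading everything.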

# Adding Some Regularization

This logistic regression algorithm can apply some shrinkage to the weights (coefficients) that it learns when it's trained.  In this scikit-learn implementation, there's a parameter C, the inverse of the regularization strength, that when made smaller _increases_ the amount of regularization.  Let's do a grid search on values of this parameter to see if we can find an improved fit to test set data.

# A UDU 4U:  Grid Search for Improved Logistic Regression Accuracy

Adapt the grid search code in the Ridge Regression notebook to do it.

## The Ridge Regression Notebook with Grid Search: EX-Ridge-v1 

To get you started, here's most of the code for the Ridge grid search.  Note that the logistic regression regularization parameter, __C__, _increases_ the amount of regularization as it gets _smaller_.  So, if C=0.5, there is _more_ parameter shrinkage than if C=1.0. (C=1.0 is the scikit-learn default, and it still applies some shrinkage; the penalty fades only as C becomes very large.)
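
The Ridge code itself is not reproduced in this section. As a rough starting point only, and not the Ridge notebook's code, here is a sketch of what a grid search over C could look like for logistic regression. The grid of C values, the 5-fold CV, the pipeline step names `scale` and `logit`, and the variable names are all assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

data = datasets.load_breast_cancer()
X_gs, y_gs = data['data'], 1 - data['target']   # 1 = malignant

# Scaling lives inside the pipeline so it is refit on each training fold
pipe_gs = Pipeline([('scale', MinMaxScaler()),
                    ('logit', LogisticRegression(solver='lbfgs', max_iter=10000))])

# Hypothetical grid: smaller C means more regularization
param_grid = {'logit__C': [0.01, 0.1, 1.0, 10.0, 100.0]}
cv_gs = StratifiedKFold(n_splits=5, shuffle=True, random_state=99)

grid = GridSearchCV(pipe_gs, param_grid, cv=cv_gs, scoring='accuracy')
grid.fit(X_gs, y_gs)
print(grid.best_params_, grid.best_score_)
```

`grid.cv_results_` holds the mean CV accuracy for every value of C, which is worth inspecting rather than looking only at the single best value.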

# Next Notebook: EX-Classifier-Performance-Measurement-v1