# Logistic Regressions

In this notebook, we'll work through examples with scikit-learn's `LogisticRegression` class and with the statsmodels `glm` function, whose interface is much closer to R's `glm` function for logistic regression.

You can read about the scikit-learn `LogisticRegression` class here:

[http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [2]:
%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# sklearn.metrics has a bunch of really handy evaluation functions
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, roc_auc_score
from sklearn import datasets
import seaborn as sns

Let's load the famous iris dataset, which has measured features of different species of iris:

In [11]:
iris = datasets.load_iris()
type(iris)

sklearn.datasets.base.Bunch

In [8]:
iris.keys()

['target_names', 'data', 'target', 'DESCR', 'feature_names']

The three species are coded as 0, 1, 2:

In [6]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], 
      dtype='|S10')

In [7]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

The features are length and width measurements of different parts of the iris ([http://irisabramson.com/wp-content/uploads/2014/10/iris_petal_sepal.png](http://irisabramson.com/wp-content/uploads/2014/10/iris_petal_sepal.png)):

In [None]:
iris.feature_names

Let's make a dataset that contains only the first two predictors (sepal length and sepal width), so that we can visualize the decision boundaries:

In [None]:
X = iris.data[:, :2]
y = iris.target

Let's construct and fit our scikit-learn classifier, which should follow the by-now-familiar workflow of construct, fit, predict that we saw with k-nearest neighbors and linear regression:

In [None]:
# construct a logistic regression model with effectively no regularization
# (C is the inverse regularization strength, so a large C means a weak penalty)
logit = LogisticRegression(C=1e5)

In [None]:
logit.fit(X, y)

In [None]:
training_preds = logit.predict(X)
training_preds

In [None]:
training_probs = logit.predict_proba(X)
training_probs
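To see how `predict` and `predict_proba` relate, here is a small self-contained sketch that refits the same kind of model and checks that the predicted class is the one with the highest probability. The `max_iter=1000` setting and the `_demo` variable names are assumptions added for illustration; they are not part of the cells above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris_demo = datasets.load_iris()
X_demo = iris_demo.data[:, :2]
y_demo = iris_demo.target

# max_iter is raised (an assumption) to give the solver room to converge
# when the regularization is this weak
clf_demo = LogisticRegression(C=1e5, max_iter=1000).fit(X_demo, y_demo)

probs_demo = clf_demo.predict_proba(X_demo)

# each row holds one probability per class, and the row sums to 1
row_sums = probs_demo.sum(axis=1)

# the predicted label is the class whose column holds the largest probability
argmax_labels = clf_demo.classes_[np.argmax(probs_demo, axis=1)]
agreement = np.mean(argmax_labels == clf_demo.predict(X_demo))
print(agreement)
```

Working with the probabilities directly is useful when we want a decision threshold other than "most likely class".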

In [None]:
# indices of the training examples that the model misclassifies
np.where(training_preds != y)

In [None]:
# inspect one training example: its true label and its predicted class probabilities
num = 52
print(y[num])
print(training_probs[num])

### Evaluating the Classifier Performance

At the very top, we imported several functions from sklearn.metrics ([http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)):

In [None]:
?confusion_matrix

In [None]:
confusion_matrix(y, training_preds)

We can also use the crosstab function in pandas, which has the advantage that it's clear which are rows and which are columns:

In [None]:
pd.crosstab(index=y, columns=training_preds, rownames=['True'], colnames=['Predicted'])
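If the row and column convention of `confusion_matrix` is hard to remember, a tiny hand-built example can help. The sketch below uses made-up labels (the `_toy` arrays are assumptions for illustration) and rebuilds the matrix with a loop, showing that entry `[i, j]` counts examples whose true class is `i` and whose predicted class is `j`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up labels for illustration only
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# build the same table by hand: rows are true classes, columns are predictions
manual_cm = np.zeros((3, 3), dtype=int)
for true_label, pred_label in zip(y_true_toy, y_pred_toy):
    manual_cm[true_label, pred_label] += 1

print(cm_toy)
print(np.array_equal(cm_toy, manual_cm))
```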

In [None]:
print(accuracy_score(y, training_preds))

The `classification_report` function will easily give us some other metrics:

In [None]:
print(classification_report(y, training_preds, labels=[0, 1, 2], target_names=['class 0', 'class 1', 'class 2']))
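The precision and recall columns in that report can be derived from the confusion matrix. Here is a minimal sketch on made-up labels (the `_toy` arrays are assumptions for illustration): precision divides the diagonal by the column sums, and recall divides it by the row sums.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# made-up labels for illustration only
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy).astype(float)
true_pos = np.diag(cm_toy)

# precision: of everything predicted as class k, what fraction really is class k?
precision_manual = true_pos / cm_toy.sum(axis=0)
# recall: of everything that really is class k, what fraction did we find?
recall_manual = true_pos / cm_toy.sum(axis=1)

# average=None asks scikit-learn for one value per class
precision_sk = precision_score(y_true_toy, y_pred_toy, average=None)
recall_sk = recall_score(y_true_toy, y_pred_toy, average=None)

print(precision_manual, recall_manual)
```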

To make an ROC curve, let's simplify things and train a binary version rather than a multi-class version of the classifier:

In [None]:
y_bin = (y==2).astype("int")

logit_binary = LogisticRegression(C=1e5)
logit_binary.fit(X, y_bin)

bin_preds = logit_binary.predict_proba(X)[:, 1]
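For a binary model, the second column of `predict_proba` should just be the logistic (sigmoid) function applied to the fitted linear score. The sketch below checks this on a self-contained refit; the `_demo` names and `max_iter=1000` are assumptions added for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris_demo = datasets.load_iris()
X_demo = iris_demo.data[:, :2]
y_bin_demo = (iris_demo.target == 2).astype(int)

# max_iter is raised (an assumption) to give the solver room to converge
clf_bin_demo = LogisticRegression(C=1e5, max_iter=1000).fit(X_demo, y_bin_demo)

# linear score: intercept plus the dot product of coefficients and features
z = X_demo.dot(clf_bin_demo.coef_.ravel()) + clf_bin_demo.intercept_[0]

# the logistic function squashes the score into a probability in (0, 1)
sigmoid_z = 1.0 / (1.0 + np.exp(-z))

matches = np.allclose(sigmoid_z, clf_bin_demo.predict_proba(X_demo)[:, 1])
print(matches)
```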

The `roc_curve` function returns three arrays: the false positive rates, the true positive rates, and the score thresholds that correspond to each point on the curve:

In [None]:
fpr, tpr, thresholds = roc_curve(y_bin, bin_preds)

In [None]:
thresholds
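Each threshold defines one point on the curve: everything scoring at or above the threshold is called positive, and we measure the resulting true and false positive rates. The sketch below recomputes those rates by hand on four made-up scores (the `_toy` arrays are assumptions for illustration). The first threshold that `roc_curve` returns sits above every score, so the curve starts at (0, 0).

```python
import numpy as np
from sklearn.metrics import roc_curve

# made-up labels and scores for illustration only
y_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

fpr_toy, tpr_toy, thr_toy = roc_curve(y_toy, scores_toy)

n_pos = float((y_toy == 1).sum())
n_neg = float((y_toy == 0).sum())

manual_fpr, manual_tpr = [], []
for t in thr_toy:
    # call an example positive when its score is at or above the threshold
    called_pos = scores_toy >= t
    manual_tpr.append((called_pos & (y_toy == 1)).sum() / n_pos)
    manual_fpr.append((called_pos & (y_toy == 0)).sum() / n_neg)

print(list(zip(manual_fpr, manual_tpr)))
```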

In [None]:
# we want to draw the random baseline ROC line too
fpr_rand = tpr_rand = np.linspace(0, 1, 10)

plt.plot(fpr, tpr)
plt.plot(fpr_rand, tpr_rand, linestyle='--')
plt.show()

And we can easily calculate the AUC:

In [None]:
roc_auc_score(y_bin, bin_preds)
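There are two handy ways to think about this number. It is the area under the ROC curve, and it is also the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (counting ties as one half). The sketch below checks both views on four made-up scores; the `_toy` arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

# made-up labels and scores for illustration only
y_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

auc_direct = roc_auc_score(y_toy, scores_toy)

# view 1: trapezoidal area under the (fpr, tpr) curve
fpr_toy, tpr_toy, _ = roc_curve(y_toy, scores_toy)
auc_trapezoid = auc(fpr_toy, tpr_toy)

# view 2: fraction of (positive, negative) pairs that are ranked correctly
pos_scores = scores_toy[y_toy == 1]
neg_scores = scores_toy[y_toy == 0]
diff = pos_scores[:, None] - neg_scores[None, :]
auc_pairs = np.mean((diff > 0) + 0.5 * (diff == 0))

print(auc_direct, auc_trapezoid, auc_pairs)
```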

We can also plot cumulative gains and lift curves, though we have to calculate them by hand:

In [None]:
# argsort returns the indices that sort the scores from lowest to highest
order = np.argsort(bin_preds)
# [::-1] steps through the array with a stride of -1, which reverses it,
# so the indices now run from the highest score to the lowest
decreasing_order = order[::-1]

In [None]:
bin_preds[decreasing_order]

In [None]:
total_ones = y_bin.sum()
num_examples = len(y_bin)
percent_ones = float(total_ones)/float(num_examples)
print("We have %s total 1's out of %s training examples." % (total_ones, num_examples))

percent_targeted = np.linspace(0, 1, 100)

rands, cums = [], []
for p in percent_targeted:
    # for random targeting, we just get a constant fraction
    rands.append(p*percent_ones)
    
    # for a real model, we take the p percent highest scorers
    # and see how many ones there are
    n_ones = y_bin[decreasing_order[:int(p*num_examples)]].sum()
    cums.append(float(n_ones)/float(num_examples))
    
# when we're done, calculate lift too
# (the first entry is 0/0, which gives NaN and a NumPy warning; the plot simply skips it)
lifts = np.array(cums)/np.array(rands)

In [None]:
plt.plot(percent_targeted, cums)
plt.plot(percent_targeted, rands, linestyle='--')
plt.xlabel('Percent Targeted')
plt.ylabel('Cumulative Gain')
plt.show()

In [None]:
plt.plot(percent_targeted, lifts)
plt.plot(percent_targeted, np.ones(percent_targeted.shape), linestyle='--')
plt.ylim(0, 3)
plt.xlabel('Percent Targeted')
plt.ylabel('Lift')
plt.show()
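The same curves can be computed without a loop by using `np.cumsum`. The sketch below is one possible vectorized version on made-up data (the `_toy` arrays are assumptions for illustration). Note that it uses a different normalization from the loop above: gains are divided by the total number of ones, so the gain curve ends at 1, whereas the loop divides by the number of examples. The lift values agree under either convention.

```python
import numpy as np

# made-up labels and scores for illustration only
y_toy = np.array([1, 0, 1, 0, 0])
scores_toy = np.array([0.9, 0.8, 0.7, 0.2, 0.1])

# sort the labels from the highest score to the lowest
sorted_y = y_toy[np.argsort(scores_toy)[::-1]]

n = len(y_toy)
# fraction of the population targeted after taking the top 1, 2, ..., n scorers
frac_targeted = np.arange(1, n + 1) / float(n)

# fraction of all ones captured within the top k scorers
gains = np.cumsum(sorted_y) / float(y_toy.sum())

# lift: how much better than random targeting, which captures frac_targeted of the ones
lift = gains / frac_targeted

print(gains)
print(lift)
```

Starting the targeted counts at 1 rather than 0 also avoids the 0/0 at the left edge of the lift curve.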

### Visualizing the Decision Boundary

Let's visualize what the logistic regression classifier is doing by constructing a fine 2-D mesh in the 2-D feature space and predicting the output at each value:

In [None]:
# step size of the mesh
h = .02
# range of the mesh
x0_min, x0_max = X[:, 0].min() - .5, X[:, 0].max() + .5
x1_min, x1_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx0, xx1 = np.meshgrid(np.arange(x0_min, x0_max, h), np.arange(x1_min, x1_max, h))

In [None]:
# ravel is the same as reshape(-1), which we saw last week
all_preds = logit.predict(np.column_stack((xx0.ravel(), xx1.ravel())))

In [None]:
grid_preds = all_preds.reshape(xx0.shape)
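The meshgrid, ravel, predict, reshape pattern is easier to follow on a tiny grid. In the sketch below, the "prediction" is a made-up rule standing in for a fitted model (an assumption for illustration), so we can focus on how the shapes change at each step.

```python
import numpy as np

# a 2 x 3 grid: gx varies along columns, gy varies along rows
gx, gy = np.meshgrid(np.arange(0, 3), np.arange(0, 2))

# flatten both grids and pair them up: one row per grid point, one column per feature
points = np.column_stack((gx.ravel(), gy.ravel()))

# stand-in for model.predict(points): a hypothetical rule, not a fitted classifier
fake_preds = (points[:, 0] > points[:, 1]).astype(int)

# put the flat predictions back into the shape of the grid for plotting
grid = fake_preds.reshape(gx.shape)

print(points.shape, grid.shape)
print(grid)
```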

In [None]:
plt.pcolormesh(xx0, xx1, grid_preds, cmap=plt.cm.Paired)

plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx0.min(), xx0.max())
plt.ylim(xx1.min(), xx1.max())
plt.xticks(())
plt.yticks(())

plt.show()

Let's see what happens if we add quadratic features and an interaction term:

In [None]:
X_expanded = np.column_stack((X, X**2, X[:, 0]*X[:, 1]))
X_expanded
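scikit-learn can also generate these columns for us with `PolynomialFeatures`. The sketch below compares it with the manual `column_stack` approach on a made-up two-row array (an assumption for illustration). The features are the same, but the column order differs: `PolynomialFeatures` puts the interaction term between the two squared terms.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# made-up data for illustration only
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

# manual expansion, columns: x0, x1, x0^2, x1^2, x0*x1
manual = np.column_stack((X_toy, X_toy**2, X_toy[:, 0] * X_toy[:, 1]))

# include_bias=False drops the constant column of ones
poly = PolynomialFeatures(degree=2, include_bias=False)
auto = poly.fit_transform(X_toy)  # columns: x0, x1, x0^2, x0*x1, x1^2

# reorder the automatic columns to match the manual layout
same_features = np.allclose(auto[:, [0, 1, 2, 4, 3]], manual)
print(same_features)
```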

In [None]:
logit2 = LogisticRegression(C=1e5)
logit2.fit(X_expanded, y)
training_preds_2 = logit2.predict(X_expanded)

x0_flat = xx0.ravel()
x1_flat = xx1.ravel()
stacked = np.column_stack((x0_flat, x1_flat, x0_flat**2, x1_flat**2, x0_flat*x1_flat))

all_preds_2 = logit2.predict(stacked)
grid_preds_2 = all_preds_2.reshape(xx0.shape)

In [None]:
plt.pcolormesh(xx0, xx1, grid_preds_2, cmap=plt.cm.Paired)

plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx0.min(), xx0.max())
plt.ylim(xx1.min(), xx1.max())
plt.xticks(())
plt.yticks(())

plt.show()

In terms of training set accuracy, we do a little bit better, but it's unlikely that this quadratic decision boundary would do better on an independent test set:

In [None]:
print(accuracy_score(y, training_preds))
print(accuracy_score(y, training_preds_2))
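Training accuracy can't tell us how either model would do on new data. One way to check is cross-validation, sketched below with `cross_val_score`. The 5-fold setup, the `StandardScaler` step (added only to help the solver converge), `max_iter=5000`, and the `_demo` names are all assumptions for illustration rather than part of the analysis above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris_demo = datasets.load_iris()
X_demo = iris_demo.data[:, :2]
y_demo = iris_demo.target

# same quadratic expansion as above: squares plus the interaction term
X_quad_demo = np.column_stack((X_demo, X_demo**2, X_demo[:, 0] * X_demo[:, 1]))

def make_model():
    # scaling is an added assumption; it mainly helps the solver converge
    return make_pipeline(StandardScaler(),
                         LogisticRegression(C=1e5, max_iter=5000))

# each score is the accuracy on a held-out fold the model never saw during fitting
scores_linear = cross_val_score(make_model(), X_demo, y_demo, cv=5)
scores_quad = cross_val_score(make_model(), X_quad_demo, y_demo, cv=5)

print(scores_linear.mean(), scores_quad.mean())
```

If the quadratic model's held-out accuracy is no better than the linear model's, the extra flexibility is mostly fitting noise in the training set.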

In [None]:
pd.crosstab(index=y, columns=training_preds, rownames=['True'], colnames=['Predicted'])

In [None]:
pd.crosstab(index=y, columns=training_preds_2, rownames=['True'], colnames=['Predicted'])

## Statsmodels

### Using A Formula to Fit to a Pandas Dataframe

[http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/glm_formula.html](http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/glm_formula.html)

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

The Star98 dataset is an educational dataset covering California school districts. The column `NABOVE` represents "the number of 9th graders scoring over the national median value on the mathematics exam."

[http://statsmodels.sourceforge.net/0.6.0/datasets/generated/star98.html](http://statsmodels.sourceforge.net/0.6.0/datasets/generated/star98.html)

In [None]:
star98 = sm.datasets.star98.load_pandas().data

In [None]:
star98.head()

In [None]:
dta = star98[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP',
              'PCTCHRT', 'PCTYRRND', 'PERMINTE', 'AVYRSEXP', 'AVSALK',
              'PERSPENK', 'PTRATIO', 'PCTAF']]
percent_above = dta['NABOVE'] / (dta['NABOVE'] + dta['NBELOW'])

dta = dta.drop(['NABOVE', 'NBELOW'], axis=1, inplace=False)
dta["SUCCESS"] = percent_above>0.5
dta["SUCCESS"] = dta["SUCCESS"].astype("int")
dta.head()

In [None]:
formula = 'SUCCESS ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
           PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'

In [None]:
mod1 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
mod1.summary()

In [None]:
print(mod1.params)