[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/drbob-richardson/stat220/blob/main/Lecture_Code/11_1_Logistic_Regression.ipynb)

Load libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


The raisin data set can be used to build a model to classify raisins as either Kecimen or Besni.

In [2]:
raisin = pd.read_csv("https://richardson.byu.edu/220/raisin.csv")
raisin

,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter,Class
0,87524,442.246011,253.291155,0.819738,90546,0.758651,1184.040,Kecimen
1,75166,406.690687,243.032436,0.801805,78789,0.684130,1121.786,Kecimen
2,90856,442.267048,266.328318,0.798354,93717,0.637613,1208.575,Kecimen
3,45928,286.540559,208.760042,0.684989,47336,0.699599,844.162,Kecimen
4,79408,352.190770,290.827533,0.564011,81463,0.792772,1073.251,Kecimen
...,...,...,...,...,...,...,...,...
895,83248,430.077308,247.838695,0.817263,85839,0.668793,1129.072,Besni
896,87350,440.735698,259.293149,0.808629,90899,0.636476,1214.252,Besni
897,99657,431.706981,298.837323,0.721684,106264,0.741099,1292.828,Besni
898,93523,476.344094,254.176054,0.845739,97653,0.658798,1258.548,Besni


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, auc

# The y variable is the target. We code it as 0's and 1's
# or True's and False's. We can easily turn a variable with
# strings into a True/False
y = raisin.Class == "Kecimen"

# Set of predictors
X = raisin.drop(columns = ["Class"])

# The LogisticRegression class can be used to build
# a logistic regression model
mod = LogisticRegression()
mod.fit(X,y)


The fitting procedure is still iterative numerical optimization (scikit-learn's default solver here is L-BFGS, a gradient-based method), but the target function is a little more volatile than anything we've seen before. We could play around with different optimizers and optimization settings, but instead let's just standardize the data, which usually makes numerical algorithms behave better.
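As a rough illustration of what standardizing does, the sketch below rescales two columns by hand. The numbers are the Area and Eccentricity values of the first four rows shown above; the variable names are made up for this example. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Area and Eccentricity for the first four raisins in the table above.
X_demo = np.array([
    [87524.0, 0.819738],
    [75166.0, 0.801805],
    [90856.0, 0.798354],
    [45928.0, 0.684989],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)  # ddof=0, the population standard deviation
X_demo_scaled = (X_demo - col_mean) / col_std

# After scaling, both columns live on the same scale.
print(X_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_demo_scaled.std(axis=0))   # approximately 1 for each column
```

Before scaling, Area is roughly five orders of magnitude larger than Eccentricity, which is the kind of mismatch that can slow an optimizer down.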

In [5]:
# Scale the data
scale_for_X = StandardScaler()
scale_for_X.fit(X)
scaled_X = scale_for_X.transform(X)
scaled_X = pd.DataFrame(scaled_X,columns = X.columns)

# Fit a model. We will use the option penalty = None; I'll explain that later.
# (scikit-learn versions before 1.2 spell this penalty = "none".)
mod = LogisticRegression(penalty = None)
mod.fit(scaled_X,y)
# print the coefficients
mod.coef_



array([[-19.52470455,   5.17099581,   4.55080239,   0.35104722,
         16.65223921,   0.03647395,  -9.88413897]])

Predict the data.

In [6]:
mod.predict(scaled_X)

array([False,  True, False,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True, False,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True, False,  True,  True, False, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True, False, False, False,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,

For a classifier, the score is just the accuracy: the fraction of observations classified correctly.

In [7]:
mod.score(scaled_X, y)

0.8577777777777778

The logistic regression predictions are in reality more than just a single True/False or 1/0. The model actually gives a probability for each class; the second column of the output is the probability of the raisin being Kecimen.

In [8]:
mod.predict_proba(scaled_X)[0:20]

array([[0.70717711, 0.29282289],
       [0.47205672, 0.52794328],
       [0.75368935, 0.24631065],
       [0.04702848, 0.95297152],
       [0.12942462, 0.87057538],
       [0.06989084, 0.93010916],
       [0.06552195, 0.93447805],
       [0.04848009, 0.95151991],
       [0.07251451, 0.92748549],
       [0.10328657, 0.89671343],
       [0.7397053 , 0.2602947 ],
       [0.04162184, 0.95837816],
       [0.02346831, 0.97653169],
       [0.07737765, 0.92262235],
       [0.35246604, 0.64753396],
       [0.03593172, 0.96406828],
       [0.38526903, 0.61473097],
       [0.09550701, 0.90449299],
       [0.03903862, 0.96096138],
       [0.43171944, 0.56828056]])
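To see where these probabilities come from, here is a minimal sketch that rebuilds them from the fitted coefficients: the linear score (intercept plus predictors times coefficients) is passed through the logistic function. It uses synthetic data from `make_classification` rather than the raisin data, and the names `X_demo`, `y_demo`, and `demo_mod` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the raisin data set).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_mod = LogisticRegression().fit(X_demo, y_demo)

# Linear score: intercept + X @ coefficients.
linear_score = X_demo @ demo_mod.coef_.ravel() + demo_mod.intercept_[0]

# The logistic (sigmoid) function maps the score to a probability in (0, 1).
manual_prob = 1.0 / (1.0 + np.exp(-linear_score))

# Column 1 of predict_proba is the probability of the second class in classes_.
sklearn_prob = demo_mod.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_prob, sklearn_prob))
```

The two columns of `predict_proba` sum to 1 in every row, so the first column is simply one minus the second.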

The AUC is a metric that uses these probabilities. Accuracy only cares whether each prediction is right or wrong at the chosen cutoff. AUC instead measures how well the probabilities rank the raisins: it is the chance that a randomly chosen Kecimen raisin receives a higher predicted probability of being Kecimen than a randomly chosen Besni raisin. Two models with the same accuracy can therefore have different AUCs if one of them orders the raisins better than the other.

In [9]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y,mod.predict_proba(scaled_X)[:,1])

0.9279061728395063
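One way to read this number is as a count over pairs. The sketch below is a hand-rolled illustration (not how scikit-learn computes it internally): it takes every (positive, negative) pair and counts how often the positive case gets the higher probability. The labels and probabilities are made up for the example.

```python
def pairwise_auc(y_true, prob_positive):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, prob_positive) if y]
    neg = [p for y, p in zip(y_true, prob_positive) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (True = Kecimen) and predicted probabilities.
y_demo = [True, True, True, False, False]
p_demo = [0.9, 0.6, 0.4, 0.5, 0.2]
auc_demo = pairwise_auc(y_demo, p_demo)
print(auc_demo)  # 5 of the 6 pairs are ordered correctly

# Same ordering of the raisins, but probabilities squeezed toward 0.55.
p_squeezed = [0.59, 0.56, 0.54, 0.55, 0.52]
auc_squeezed = pairwise_auc(y_demo, p_squeezed)
print(auc_squeezed)
```

Both lists order the five raisins identically, so they give the same value: the calculation depends on the ranking of the probabilities, not on their absolute size.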

A confusion matrix can show us how many false negatives and false positives we have. Rows are the true classes and columns are the predicted classes. It can be especially helpful in cases where the classes are highly unbalanced.




In [10]:
confusion_matrix(y, mod.predict(scaled_X))

array([[379,  71],
       [ 57, 393]])
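The counts in this matrix can be turned into rates by hand. The short sketch below copies the four numbers from the output above and assumes scikit-learn's layout, with rows as the true class (False = Besni, True = Kecimen) and columns as the predicted class in the same order.

```python
# Counts copied from the confusion matrix above.
tn, fp = 379, 71   # true Besni: predicted Besni, predicted Kecimen
fn, tp = 57, 393   # true Kecimen: predicted Besni, predicted Kecimen

total = tn + fp + fn + tp
accuracy = (tn + tp) / total      # share of all raisins classified correctly
sensitivity = tp / (tp + fn)      # share of Kecimen raisins that were caught
specificity = tn / (tn + fp)      # share of Besni raisins that were caught

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```

The accuracy computed this way matches the value returned by `mod.score` earlier.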

The default cutoff is 0.5, but in many cases, especially where the number of 1's and 0's is highly unbalanced, you might want to choose a different cutoff.

In [13]:
confusion_matrix(y,mod.predict_proba(scaled_X)[:,1] > 0.5)

array([[379,  71],
       [ 57, 393]])
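To illustrate what moving the cutoff does, here is a small sketch on made-up labels and probabilities. The helper `confusion_at_cutoff` is hypothetical, written for this example; it returns counts in the same layout as `confusion_matrix` (rows are true classes, columns are predicted classes).

```python
import numpy as np

def confusion_at_cutoff(y_true, prob_positive, cutoff):
    """Return [[TN, FP], [FN, TP]] when predicting True if probability > cutoff."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(prob_positive) > cutoff
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tp = int(np.sum(y_true & y_pred))
    return np.array([[tn, fp], [fn, tp]])

# Hypothetical true labels and predicted probabilities of the True class.
y_demo = [True, True, True, False, False, False]
p_demo = [0.95, 0.70, 0.40, 0.60, 0.30, 0.10]

for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, confusion_at_cutoff(y_demo, p_demo, cutoff).tolist())
```

In this toy example a lower cutoff removes the false negatives, while a higher cutoff removes the false positives at the price of more false negatives. Which trade-off is preferable depends on the cost of each kind of mistake.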

We can use the statsmodels package to do things like find significance of variables.

In [14]:
import statsmodels.api as sm
scaled_X1 = sm.add_constant(scaled_X)
mod2 = sm.Logit(y, scaled_X1)
result = mod2.fit()

Optimization terminated successfully.
         Current function value: 0.338323
         Iterations 9


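We can get the p-values for the parameters and remove insignificant variables. In the summary below, Eccentricity (p = 0.428) and Extent (p = 0.802) are the candidates for removal.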

In [15]:
result.summary()

Dep. Variable:,Class,No. Observations:,900.0
Model:,Logit,Df Residuals:,892.0
Method:,MLE,Df Model:,7.0
Date:,"Thu, 07 Dec 2023",Pseudo R-squ.:,0.5119
Time:,08:27:29,Log-Likelihood:,-304.49
converged:,True,LL-Null:,-623.83
Covariance Type:,nonrobust,LLR p-value:,1.133e-133

,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.6897,0.230,-3.004,0.003,-1.140,-0.240
Area,-19.5311,4.851,-4.026,0.000,-29.039,-10.023
MajorAxisLength,5.1708,1.852,2.792,0.005,1.541,8.801
MinorAxisLength,4.5508,1.346,3.381,0.001,1.913,7.189
Eccentricity,0.3512,0.443,0.792,0.428,-0.518,1.220
ConvexArea,16.6605,4.852,3.434,0.001,7.152,26.169
Extent,0.0365,0.145,0.251,0.802,-0.248,0.321
Perimeter,-9.8858,1.810,-5.463,0.000,-13.433,-6.339
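As a reading aid, the z and interval columns can be checked from the estimate and its standard error. The worked check below uses the Area row. It assumes the usual Wald statistic and a normal-approximation 95% interval, which is what the z and [0.025, 0.975] column headers suggest.

```latex
z = \frac{\hat\beta}{\mathrm{SE}(\hat\beta)} = \frac{-19.5311}{4.851} \approx -4.026,
\qquad
\hat\beta \pm 1.96\,\mathrm{SE}(\hat\beta) = -19.5311 \pm 1.96 \times 4.851 \approx (-29.039,\ -10.023)
```

Both results agree with the Area row of the table. An interval that contains 0, as for Eccentricity and Extent, goes with a large p-value.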


Let's remove the penalty argument. By default, LogisticRegression applies regularization automatically: an L2 penalty, i.e. ridge-type shrinkage (not Lasso, which is the L1 penalty). This will often result in a better model.

In [16]:
mod = LogisticRegression()
mod.fit(scaled_X,y)
mod.score(scaled_X, y)

0.8666666666666667

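You can adjust the strength of the penalty with C = .... C is the inverse of the regularization strength, so smaller values of C shrink the coefficients more. The default is C = 1.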

In [17]:
mod = LogisticRegression(C = 5)
mod.fit(scaled_X,y)
mod.score(scaled_X, y)

0.8688888888888889
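The sketch below illustrates how C controls shrinkage by fitting the same model with a small, a default, and a large value and comparing the size of the coefficient vector. It uses synthetic data from `make_classification` as a stand-in for the raisin data, and the names are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the raisin data set).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means a stronger penalty on the coefficients.
    demo_mod = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(demo_mod.coef_))
    print(C, round(coef_norms[C], 3))
```

The coefficient norm grows as C grows, because a larger C weakens the penalty and moves the fit toward the unpenalized one.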

It is best to compare models on a holdout set to find a good value for C, since accuracy on the training data tends to be optimistic.

In [18]:
scaled_X_train, scaled_X_test, y_train, y_test = train_test_split(scaled_X,y,test_size=0.3, random_state=1357)


In [26]:
mod = LogisticRegression(penalty = None)
mod.fit(scaled_X_train,y_train)
mod.score(scaled_X_test, y_test)

0.8592592592592593
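A holdout comparison over several values of C might look like the sketch below. It is only an outline under stated assumptions: the data are synthetic (from `make_classification`), the grid of C values is arbitrary, and the scaler is fit on the training rows only, which is a slightly stricter choice than scaling the full data first.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the raisin data set).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1357
)

# Learn the scaling from the training rows, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit one model per candidate C and record its holdout accuracy.
holdout_scores = {}
for C in [0.01, 0.1, 1, 5, 10, 100]:
    demo_mod = LogisticRegression(C=C, max_iter=1000).fit(X_train_s, y_train)
    holdout_scores[C] = demo_mod.score(X_test_s, y_test)

best_C = max(holdout_scores, key=holdout_scores.get)
print(holdout_scores)
print("best C on this holdout set:", best_C)
```

With a single split the differences between values of C can be small and noisy, so the winner here should be treated as a rough guide rather than a firm answer.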