Load libraries

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


The heart disease data set can be used to build a model that classifies whether someone is at risk for heart disease. The target is TenYearCHD. The predictors are a mix of continuous and binary variables, all encoded as numeric.

In [39]:
heart = pd.read_csv("https://richardson.byu.edu/220/heart_disease.csv")
heart

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.10,85.0,85.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4233,1,50,1.0,1,1.0,0.0,0,1,0,313.0,179.0,92.0,25.97,66.0,86.0,1
4234,1,51,3.0,1,43.0,0.0,0,0,0,207.0,126.5,80.0,19.71,65.0,68.0,0
4235,0,48,2.0,1,20.0,,0,0,0,248.0,131.0,72.0,22.00,84.0,86.0,0
4236,0,44,1.0,1,15.0,0.0,0,0,0,210.0,126.5,87.0,19.16,86.0,,0


In [40]:
np.mean(heart.TenYearCHD)

0.1519584709768759

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, auc

heart_clean = heart.dropna(axis=0, how='any')

y = heart_clean.TenYearCHD

# Set of predictors
X = heart_clean.drop(columns = ["TenYearCHD"])


# The LogisticRegression class can be used to build
# a logistic regression model
mod = LogisticRegression()
mod.fit(X,y)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The fitting procedure is still iterative, gradient-based optimization, but the target function is a little more volatile than anything we've seen before, especially when the predictors are on very different scales. We can play around with different optimizers and different optimization settings, but instead let's just standardize the data, which usually makes numerical algorithms work better.
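
To make "standardize" concrete: each column is shifted to mean 0 and rescaled to standard deviation 1. A minimal sketch in plain Python, using the five sysBP values visible in the first rows of the table above. It assumes the population standard deviation (dividing by n), which is what StandardScaler uses.

```python
from statistics import fmean, pstdev

# sysBP values from the first five rows of the data set
sys_bp = [106.0, 121.0, 127.5, 150.0, 130.0]

mu = fmean(sys_bp)        # column mean
sigma = pstdev(sys_bp)    # population standard deviation (divide by n)

# z-score: subtract the mean, then divide by the standard deviation
z = [(v - mu) / sigma for v in sys_bp]

print([round(v, 3) for v in z])
print(round(fmean(z), 6), round(pstdev(z), 6))  # mean ~0, standard deviation ~1
```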

In [44]:
# Scale the data
scale_for_X = StandardScaler()
scale_for_X.fit(X)
scaled_X = scale_for_X.transform(X)
scaled_X = pd.DataFrame(scaled_X,columns = X.columns)

# Fit a model. We will use the option penalty = None; I'll explain that later.
mod = LogisticRegression(penalty = None,max_iter = 1000)
mod.fit(scaled_X,y)
# print the coefficients
coefficients = mod.coef_[0]
coeff_df = pd.DataFrame(coefficients, index=X.columns, columns=['Coefficient'])
print(coeff_df)

                 Coefficient
male                0.275757
age                 0.543173
education          -0.048554
currentSmoker       0.035402
cigsPerDay          0.213691
BPMeds              0.027831
prevalentStroke     0.052406
prevalentHyp        0.108658
diabetes            0.006417
totChol             0.102445
sysBP               0.340104
diaBP              -0.049495
BMI                 0.026846
heartRate          -0.038934
glucose             0.170297


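Predict the classes for the data. The mean of the 0/1 predictions is the fraction of people the model labels as at risk.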

In [46]:
np.mean(mod.predict(scaled_X))

0.019146608315098467

The score is just the accuracy: the fraction of predictions that match the true labels.

In [47]:
mod.score(scaled_X, y)

0.8564004376367614

The logistic regression predictions are in reality more than just a single True/False or 1/0. The model actually gives a probability, which in this case is the probability that someone is at risk.

In [48]:
mod.predict_proba(scaled_X)[0:20]

array([[0.96156493, 0.03843507],
       [0.95103568, 0.04896432],
       [0.84749775, 0.15250225],
       [0.64551186, 0.35448814],
       [0.90583299, 0.09416701],
       [0.87567541, 0.12432459],
       [0.80767068, 0.19232932],
       [0.93798715, 0.06201285],
       [0.79823754, 0.20176246],
       [0.76621215, 0.23378785],
       [0.91862967, 0.08137033],
       [0.95597916, 0.04402084],
       [0.82184518, 0.17815482],
       [0.92791523, 0.07208477],
       [0.93782197, 0.06217803],
       [0.83675373, 0.16324627],
       [0.92142732, 0.07857268],
       [0.96762644, 0.03237356],
       [0.93501176, 0.06498824],
       [0.94353346, 0.05646654]])
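
Each row above holds P(class 0) and P(class 1), so the two columns sum to 1. Logistic regression gets the second column by passing a linear score (intercept plus coefficient-weighted predictors) through the logistic function. A minimal sketch with hypothetical numbers: the intercept, coefficients, and input row below are made up for illustration and are not the fitted values.

```python
import math

def logistic(score):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical values for illustration only
intercept = -2.0
coefs = [0.5, 0.3]     # one coefficient per standardized predictor
row = [1.2, -0.4]      # one standardized observation

score = intercept + sum(c * x for c, x in zip(coefs, row))
p_risk = logistic(score)      # analogous to predict_proba(...)[:, 1]
p_no_risk = 1.0 - p_risk      # analogous to predict_proba(...)[:, 0]

print(round(p_no_risk, 4), round(p_risk, 4))
```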

The AUC is a metric that uses these probabilities directly. Accuracy only cares whether each thresholded prediction is right or not. AUC instead measures how well the probabilities rank people: it is the probability that a randomly chosen person with TenYearCHD = 1 receives a higher predicted probability than a randomly chosen person with TenYearCHD = 0. Two models that score the same accuracy at the 0.5 cutoff can therefore have different AUCs.

In [49]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y,mod.predict_proba(scaled_X)[:,1])

0.7391484946496321
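
One way to read the number above: AUC equals the fraction of (positive, negative) pairs in which the positive case gets the higher predicted probability, with ties counted as half. A minimal sketch of that definition in plain Python. Here pairwise_auc is a hypothetical helper written for illustration, and the labels and scores are toy values, not the heart data.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75: three of the four pairs are ordered correctly
```

The double loop costs one comparison per (positive, negative) pair, so it is only meant as an illustration on small inputs.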

A confusion matrix can show us how many false negatives and false positives we have. It can be helpful in cases where there is a large class imbalance. In scikit-learn's layout, rows are the true classes and columns are the predicted classes.

In [50]:
confusion_matrix(y, mod.predict(scaled_X))

array([[3080,   19],
       [ 506,   51]])

The default cutoff is 0.5, but in many cases, especially where the numbers of 1's and 0's are highly unbalanced, you might want to choose a different cutoff.

In [54]:
confusion_matrix(y,mod.predict_proba(scaled_X)[:,1] > 0.1)

array([[1485, 1614],
       [  86,  471]])
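
The trade-off between the two cutoffs is easier to see as rates. The sketch below takes the two confusion matrices printed above, which scikit-learn lays out as [[TN, FP], [FN, TP]], and computes sensitivity, specificity, and accuracy for each. The helper name rates is hypothetical.

```python
def rates(matrix):
    """Return (sensitivity, specificity, accuracy) from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    sensitivity = tp / (tp + fn)   # share of true CHD cases that are flagged
    specificity = tn / (tn + fp)   # share of non-cases correctly left unflagged
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return sensitivity, specificity, accuracy

cutoff_050 = [[3080, 19], [506, 51]]     # default cutoff of 0.5
cutoff_010 = [[1485, 1614], [86, 471]]   # cutoff of 0.1

for name, m in [("0.5", cutoff_050), ("0.1", cutoff_010)]:
    sens, spec, acc = rates(m)
    print(f"cutoff {name}: sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

Lowering the cutoff to 0.1 catches far more of the true cases (sensitivity rises from about 0.09 to about 0.85) at the cost of many more false positives and a lower overall accuracy (about 0.86 down to about 0.54).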

We can use the statsmodels package to do things like find significance of variables.

In [55]:
import statsmodels.api as sm
scaled_X = scaled_X.reset_index(drop=True)
y = y.reset_index(drop=True)

# Add constant to scaled X
scaled_X1 = sm.add_constant(scaled_X)

# Fit the model
mod2 = sm.Logit(y, scaled_X1)
result = mod2.fit()

Optimization terminated successfully.
         Current function value: 0.376668
         Iterations 7


We can get the p-values for the parameters and remove insignificant variables.
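
For reference, the P>|z| values in the summary are two-sided Wald tests: z is the coefficient divided by its standard error, and the p-value is the probability that a standard normal variable exceeds |z| in absolute value. A minimal sketch in plain Python, using z values copied from the summary table below. The helper name two_sided_p is hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# z statistics copied from the statsmodels summary
z_values = {"education": -0.962, "cigsPerDay": 2.874, "prevalentHyp": 1.700}

for name, z in z_values.items():
    print(f"{name}: z={z:+.3f}  p={two_sided_p(z):.3f}")
```

The printed values can be compared against the P>|z| column for the same rows.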

In [56]:
result.summary()

Dep. Variable:,TenYearCHD,No. Observations:,3656.0
Model:,Logit,Df Residuals:,3640.0
Method:,MLE,Df Model:,15.0
Date:,"Mon, 11 Dec 2023",Pseudo R-squ.:,0.1174
Time:,18:39:41,Log-Likelihood:,-1377.1
converged:,True,LL-Null:,-1560.3
Covariance Type:,nonrobust,LLR p-value:,8.027e-69

,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.9925,0.057,-34.908,0.000,-2.104,-1.881
male,0.2758,0.054,5.090,0.000,0.170,0.382
age,0.5432,0.057,9.499,0.000,0.431,0.655
education,-0.0486,0.051,-0.962,0.336,-0.148,0.050
currentSmoker,0.0354,0.078,0.452,0.651,-0.118,0.189
cigsPerDay,0.2137,0.074,2.874,0.004,0.068,0.359
BPMeds,0.0278,0.040,0.692,0.489,-0.051,0.107
prevalentStroke,0.0524,0.037,1.417,0.157,-0.020,0.125
prevalentHyp,0.1087,0.064,1.700,0.089,-0.017,0.234
