# Heart Disease Prediction using Logistic Regression

Logistic Regression estimates the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts class 1 (presence of heart disease); otherwise, it predicts class 0 (absence of heart disease). This thresholding makes it a binary classifier. In this analysis, we explore the theory behind `Logistic Regression` and use it to predict the presence of heart disease from clinical and demographic features.
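
To make the decision rule concrete, here is a minimal sketch of the sigmoid function and the 50% threshold. The weights, bias, and input rows are hypothetical values chosen only for illustration; they are not fitted to the heart disease data.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return (probabilities, predicted classes) for each feature row."""
    scores = features @ weights + bias      # linear score per row
    proba = sigmoid(scores)                 # estimated P(class 1)
    return proba, (proba > threshold).astype(int)


# Hypothetical parameters and inputs, for illustration only
toy_weights = np.array([0.8, -0.5])
toy_bias = -0.2
toy_features = np.array([[2.0, 1.0],
                         [0.0, 1.0]])

toy_proba, toy_labels = predict_class(toy_features, toy_weights, toy_bias)
print(toy_proba)
print(toy_labels)
```

The first row has a positive score, so its probability is above 0.5 and it is assigned class 1; the second row has a negative score and is assigned class 0.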

This dataset contains the following features:

### Demographic Factors
* `Age`: Patient's age
* `Sex`: Patient's sex (`M` or `F`)

### Clinical Measurements
* `RestingBP`: Resting blood pressure (mm Hg)
* `Cholesterol`: Cholesterol levels (mg/dL)
* `FastingBS`: Fasting blood sugar (1 if > 120 mg/dL, 0 otherwise)

### Cardiac Assessments
* `ChestPainType`: Type of chest pain experienced
* `RestingECG`: Results of resting electrocardiogram
* `MaxHR`: Maximum heart rate achieved

### Exercise-Related Indicators
* `ExerciseAngina`: Angina induced by exercise (`Y` = yes, `N` = no)
* `Oldpeak`: ST depression induced by exercise relative to rest

### Additional Cardiac Parameters
* `ST_Slope`: Slope of the peak exercise ST segment

### Target Variable
* `HeartDisease`: 0 or 1 indicating absence or presence of heart disease

Through this logistic regression approach, we aim to identify significant predictors and develop a model that could assist medical professionals in early risk assessment of heart disease.

Import libraries and load the CSV

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')
plt.style.use("fivethirtyeight")

# Load the dataset
df=pd.read_csv('dataset/heart.csv')
df.head()


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


EDA Verification

In [3]:
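# Only df.info() and the final print() produce output in this cell. The results
# of isnull().sum() and value_counts() are computed but not displayed, because a
# notebook cell shows only the value of its last expression.
df.isnull().sum()
df.info()
df['HeartDisease'].value_counts(normalize=True)
print(df.shape)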

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
(918, 12)
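
As a small illustration of the two checks whose results are not shown above (missing values per column and class balance), the sketch below prints both explicitly. `toy_df` is a hypothetical four-row frame standing in for `df`; its values are made up.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `df` (values are made up)
toy_df = pd.DataFrame({
    "Age": [40, 49, None, 48],
    "HeartDisease": [0, 1, 0, 1],
})

# Count missing values per column
missing_per_column = toy_df.isnull().sum()

# Share of each class in the target column
class_balance = toy_df["HeartDisease"].value_counts(normalize=True)

print(missing_per_column)
print(class_balance)
```

Wrapping each expression in `print()` makes the result visible even when it is not the last line of the cell.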


In [4]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    """Print accuracy, classification report, and confusion matrix for one split."""
    # Evaluate on the training split when train is True, otherwise on the test split
    if train:
        X_eval, y_eval, label = X_train, y_train, "Train"
    else:
        X_eval, y_eval, label = X_test, y_test, "Test"

    pred = clf.predict(X_eval)
    clf_report = pd.DataFrame(classification_report(y_eval, pred, output_dict=True))
    print(f"{label} Result:\n================================================")
    print(f"Accuracy Score: {accuracy_score(y_eval, pred) * 100:.2f}%")
    print("_______________________________________________")
    print(f"CLASSIFICATION REPORT:\n{clf_report}")
    print("_______________________________________________")
    print(f"Confusion Matrix: \n {confusion_matrix(y_eval, pred)}\n")

Encoding Data

In [5]:
df = pd.get_dummies(df, columns=['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'], drop_first=True)
df.head()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_M,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ExerciseAngina_Y,ST_Slope_Flat,ST_Slope_Up
0,40,140,289,0,172,0.0,0,True,True,False,False,True,False,False,False,True
1,49,160,180,0,156,1.0,1,False,False,True,False,True,False,False,True,False
2,37,130,283,0,98,0.0,0,True,True,False,False,False,True,False,False,True
3,48,138,214,0,108,1.5,1,False,False,False,False,True,False,True,True,False
4,54,150,195,0,122,0.0,0,True,False,True,False,True,False,False,False,True
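
The effect of `drop_first=True` can be seen on a single column. The sketch below uses a hypothetical four-row frame and assumes `ST_Slope` has a third category, `Down`, in addition to the `Up` and `Flat` values visible above.

```python
import pandas as pd

# Hypothetical toy column; "Down" is an assumed third category
toy = pd.DataFrame({"ST_Slope": ["Up", "Flat", "Down", "Flat"]})

# Categories are sorted alphabetically and the first one is dropped
encoded = pd.get_dummies(toy, columns=["ST_Slope"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The dropped category becomes the baseline: a row with `False` in every remaining dummy column belongs to it. This avoids perfectly collinear columns, which matters for linear models such as logistic regression.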


Split the data and scale numerical features

In [12]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split

# Split the data into features and target
X = df.drop(columns=['HeartDisease'])
y = df['HeartDisease']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Numerical columns
num_columns = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']

# Categorical (one-hot encoded) columns. They are not scaled; together with
# FastingBS they reach the model through remainder='passthrough'.
cat_columns = ['Sex_M', 'ChestPainType_ATA', 'ChestPainType_NAP', 'ChestPainType_TA', 
               'RestingECG_Normal', 'RestingECG_ST', 'ExerciseAngina_Y', 'ST_Slope_Flat', 'ST_Slope_Up']

# Create a column transformer to scale numerical features.
# Note: both scalers receive num_columns, so the output holds a min-max copy
# and a standardized copy of each numerical feature (10 scaled columns),
# followed by the passthrough columns.
ct = make_column_transformer(
    (MinMaxScaler(), num_columns),
    (StandardScaler(), num_columns),
    remainder='passthrough'
)
# Fit on the training data only, then apply the same transform to the test data
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)
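
As an alternative sketch, not the setup used for the results in this notebook, a single scaler can be combined with the model in a pipeline so that scaling statistics are always learned from the training rows only. The data below is synthetic, and the names `demo_X`, `demo_y`, `preprocess`, and `demo_pipeline` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the values and the target rule are made up
rng = np.random.default_rng(0)
n_rows = 200
demo_X = pd.DataFrame({
    "Age": rng.integers(30, 80, n_rows),
    "MaxHR": rng.integers(80, 200, n_rows),
    "Sex_M": rng.integers(0, 2, n_rows),
})
demo_y = (demo_X["MaxHR"] < 140).astype(int)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42
)

# One scaler for the numerical columns; everything else passes through
preprocess = make_column_transformer(
    (StandardScaler(), ["Age", "MaxHR"]),
    remainder="passthrough",
)

# The pipeline fits the scaler on the training rows only and reuses those
# statistics whenever it transforms or predicts on new rows
demo_pipeline = make_pipeline(preprocess, LogisticRegression(solver="liblinear"))
demo_pipeline.fit(demo_X_train, demo_y_train)

scaled_train = demo_pipeline[0].transform(demo_X_train)
print(scaled_train.shape)  # 3 columns in, 3 columns out
print(f"Test accuracy: {demo_pipeline.score(demo_X_test, demo_y_test):.3f}")
```

With one scaler per column, the number of features stays the same after preprocessing, which keeps the fitted coefficients easier to map back to the original columns.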






In [7]:
# Print the number of rows in each dataset
print(f"Total rows in the dataset: {len(X)}")
print(f"Rows in training set (X_train): {len(X_train)}")
print(f"Rows in testing set (X_test): {len(X_test)}")
# Calculate the proportion
train_percentage = len(X_train) / len(X) * 100
test_percentage = len(X_test) / len(X) * 100

print(f"Training set percentage: {train_percentage:.2f}%")
print(f"Testing set percentage: {test_percentage:.2f}%")


Total rows in the dataset: 918
Rows in training set (X_train): 642
Rows in testing set (X_test): 276
Training set percentage: 69.93%
Testing set percentage: 30.07%
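
The split above preserves the 70/30 row proportions, but it does not guarantee the same class proportions in both parts: the reports below show 298 vs. 344 instances in the training set and 112 vs. 164 in the test set. One option, sketched here on a hypothetical toy target, is the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 40 negatives and 60 positives
y_toy = np.array([0] * 40 + [1] * 60)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(f"Positive share in train: {y_tr.mean():.2f}")
print(f"Positive share in test:  {y_te.mean():.2f}")
```

Whether stratification changes the conclusions for this dataset is not tested here; it mainly makes train and test metrics easier to compare.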


In [8]:
from sklearn.linear_model import LogisticRegression


model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)

Train Result:
Accuracy Score: 86.14%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.864111    0.859155  0.861371    0.861633      0.861456
recall       0.832215    0.886628  0.861371    0.859421      0.861371
f1-score     0.847863    0.872675  0.861371    0.860269      0.861158
support    298.000000  344.000000  0.861371  642.000000    642.000000
_______________________________________________
Confusion Matrix: 
 [[248  50]
 [ 39 305]]

Test Result:
Accuracy Score: 87.68%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.819672    0.922078  0.876812    0.870875      0.880522
recall       0.892857    0.865854  0.876812    0.879355      0.876812
f1-score     0.854701    0.893082  0.876812    0.873891      0.877507
support    112.000000  164.000000  0.876812  276.000000    276.
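
Since one aim of this analysis is to identify significant predictors, a common follow-up is to read the fitted coefficients as odds ratios. The sketch below uses synthetic data with one informative and one uninformative feature; the names `signal` and `noise` and the true effect size are assumptions made for illustration, and nothing here is computed from the heart disease data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: only `signal` influences the label (assumed effect size 2.0)
rng = np.random.default_rng(42)
n_rows = 500
signal = rng.normal(size=n_rows)
noise = rng.normal(size=n_rows)
true_proba = 1.0 / (1.0 + np.exp(-2.0 * signal))
labels = (rng.random(n_rows) < true_proba).astype(int)

features = pd.DataFrame({"signal": signal, "noise": noise})
clf = LogisticRegression(solver="liblinear").fit(features, labels)

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=features.columns)
print(odds_ratios.sort_values(ascending=False))
```

An odds ratio well above 1 suggests the feature raises the odds of class 1, a value near 1 suggests little effect. The coefficients are only comparable across features when the features are on a common scale.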

In [9]:
from sklearn.ensemble import RandomForestClassifier


# No random_state is set, so the scores below can vary slightly between runs
rf_clf = RandomForestClassifier(n_estimators=1000)
rf_clf.fit(X_train, y_train)

print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
               0      1  accuracy  macro avg  weighted avg
precision    1.0    1.0       1.0        1.0           1.0
recall       1.0    1.0       1.0        1.0           1.0
f1-score     1.0    1.0       1.0        1.0           1.0
support    298.0  344.0       1.0      642.0         642.0
_______________________________________________
Confusion Matrix: 
 [[298   0]
 [  0 344]]

Test Result:
Accuracy Score: 86.59%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.826087    0.894410  0.865942    0.860248      0.866685
recall       0.848214    0.878049  0.865942    0.863132      0.865942
f1-score     0.837004    0.886154  0.865942    0.861579      0.866209
support    112.000000  164.000000  0.865942  276.000000    276.000000
_______________________________________________
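
The gap between 100.00% training accuracy and 86.59% test accuracy indicates that the forest fits the training rows almost perfectly, so training accuracy says little about generalization. A minimal sketch of a more robust estimate, k-fold cross-validation, follows. It runs on synthetic data from `make_classification`; the dataset parameters and the smaller `n_estimators=100` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 10% label noise
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Accuracy on 5 held-out folds, each scored by a model that never saw it
cv_scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy")

# Accuracy on the same rows the model was trained on, for comparison
forest.fit(X_demo, y_demo)
train_accuracy = forest.score(X_demo, y_demo)

print(f"Training accuracy:  {train_accuracy:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The cross-validated mean and its spread give a sense of how much a single train/test split, such as the one used above, might move the reported accuracy.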

In [10]:
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.1, max_depth=3, random_state=42)
gb_clf.fit(X_train, y_train)

print_score(gb_clf, X_train, y_train, X_test, y_test, train=True)
print_score(gb_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
               0      1  accuracy  macro avg  weighted avg
precision    1.0    1.0       1.0        1.0           1.0
recall       1.0    1.0       1.0        1.0           1.0
f1-score     1.0    1.0       1.0        1.0           1.0
support    298.0  344.0       1.0      642.0         642.0
_______________________________________________
Confusion Matrix: 
 [[298   0]
 [  0 344]]

Test Result:
Accuracy Score: 84.78%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.791667    0.891026  0.847826    0.841346      0.850706
recall       0.848214    0.847561  0.847826    0.847888      0.847826
f1-score     0.818966    0.868750  0.847826    0.843858      0.848548
support    112.000000  164.000000  0.847826  276.000000    276.000000
_______________________________________________

In [11]:
from sklearn.svm import SVC

svc_clf = SVC(kernel='linear', C=1, random_state=42)
svc_clf.fit(X_train, y_train)

print_score(svc_clf, X_train, y_train, X_test, y_test, train=True)
print_score(svc_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
Accuracy Score: 86.45%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.872792    0.857939  0.864486    0.865365      0.864833
recall       0.828859    0.895349  0.864486    0.862104      0.864486
f1-score     0.850258    0.876245  0.864486    0.863251      0.864182
support    298.000000  344.000000  0.864486  642.000000    642.000000
_______________________________________________
Confusion Matrix: 
 [[247  51]
 [ 36 308]]

Test Result:
Accuracy Score: 87.32%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.813008    0.921569  0.873188    0.867288      0.877515
recall       0.892857    0.859756  0.873188    0.876307      0.873188
f1-score     0.851064    0.889590  0.873188    0.870327      0.873956
support    112.000000  164.000000  0.873188  276.000000    276.