Heart disease is one of the leading causes of death worldwide, so detecting and predicting it early is important. This notebook uses logistic regression for the prediction.

In [26]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

### 2 Data Preparation

The dataset comes from an ongoing cardiovascular study of residents of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset includes over 4,000 records and 15 attributes.

In [27]:
disease_df = pd.read_csv("framingham.csv")
disease_df.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [28]:
disease_df.drop(columns=['education'], inplace=True)            # Education is not used in this analysis (columns= already implies axis=1)
disease_df.rename(columns={'male': 'Sex_male'}, inplace=True)   # Change the name of male to Sex_male
disease_df.dropna(axis=0, inplace=True)                         # Remove rows with NaN values from the DataFrame
disease_df.head()


Unnamed: 0,Sex_male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0
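Before dropping rows, it can help to check how many values each column is missing and how many rows `dropna` would remove. The following is a minimal sketch on a small made-up frame; `toy_df` and its values are illustrative assumptions, not records from the Framingham data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for disease_df; the values are made up
toy_df = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "cigsPerDay": [0.0, np.nan, 20.0, 30.0],
    "glucose": [77.0, 76.0, np.nan, np.nan],
})

missing_per_column = toy_df.isna().sum()     # number of NaN values in each column
rows_before = len(toy_df)
rows_after = len(toy_df.dropna(axis=0))      # rows that contain no NaN at all

print(missing_per_column)
print(f"Rows dropped: {rows_before - rows_after}")
```

Running the same two checks on `disease_df` before the `dropna` call would show which columns account for the removed rows.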


In [29]:
# Count the records in each class of the TenYearCHD target column:
# 1 = patient has a 10-year risk of coronary heart disease, 0 = no such risk
print(disease_df.TenYearCHD.value_counts())

TenYearCHD
0    3179
1     572
Name: count, dtype: int64
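The counts above show that the classes are imbalanced. As a rough sketch, the class proportions can be computed directly from those counts (the `counts` Series below is typed in by hand from the output, not read from the file):

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above
counts = pd.Series({0: 3179, 1: 572}, name="count")

# Share of each class among the 3751 remaining records
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 15% of the records are positive, so a model that always predicts 0 would already reach about 85% accuracy. Accuracy alone is therefore a weak measure on this target.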


### 3 Splitting the Dataset into Train and Test Sets

First we scale the data with `StandardScaler`, which transforms each feature to have a mean of 0 and a standard deviation of 1. The scaled data is then split as follows:

Train set --> 70% of data

Test set --> 30% of data
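To make the scaling step concrete, here is a small sketch showing that `StandardScaler` matches the manual formula z = (x - mean) / std, computed per column. The array `toy_X` holds made-up values for two features and is only an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features (for example age and sysBP); not real records
toy_X = np.array([[39.0, 106.0], [46.0, 121.0], [48.0, 127.5], [61.0, 150.0]])

# Manual standardization: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0, which is what StandardScaler uses)
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

scaled = StandardScaler().fit_transform(toy_X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```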

In [31]:
X = np.asarray(disease_df[['age', 'Sex_male', 'cigsPerDay', 'totChol', 'sysBP', 'glucose']])
y = np.asarray(disease_df['TenYearCHD'])

print(f"X dataset with a shape of: {X.shape}\n {X}")
print(f"\ny (target) with a shape of : {y.shape}\n {y}")
print("\n==========================\n")

# Feature scaling
X = preprocessing.StandardScaler().fit(X).transform(X)
print(f"X set after the feature scaling:\nShape: {X.shape} \n {X}")
print("\n==========================\n")

# Train Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

print(f"Train set: {X_train.shape}, {y_train.shape}")
print(f"Test set: {X_test.shape}, {y_test.shape}")

X dataset with a shape of: (3751, 6)
 [[ 39.    1.    0.  195.  106.   77. ]
 [ 46.    0.    0.  250.  121.   76. ]
 [ 48.    1.   20.  245.  127.5  70. ]
 ...
 [ 52.    0.    0.  269.  133.5 107. ]
 [ 40.    1.    0.  185.  141.   72. ]
 [ 39.    0.   30.  196.  133.   80. ]]

y (target) with a shape of : (3751,)
 [0 0 0 ... 0 0 0]


X set after the feature scaling:
Shape: (3751, 6) 
 [[-1.23390951  1.11629198 -0.75552698 -0.93997111 -1.19619549 -0.20436458]
 [-0.4170173  -0.89582297 -0.75552698  0.29305664 -0.51572536 -0.24624229]
 [-0.18361952  1.11629198  0.9218319   0.18096321 -0.22085497 -0.49750858]
 ...
 [ 0.28317603 -0.89582297 -0.75552698  0.71901168  0.05133307  1.05196682]
 [-1.11721063  1.11629198 -0.75552698 -1.16415797  0.39156814 -0.41375315]
 [-1.23390951 -0.89582297  1.76051134 -0.91755243  0.02865074 -0.07873144]]


Train set: (2625, 6), (2625,)
Test set: (1126, 6), (1126,)
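The cell above fits the scaler on the full dataset before splitting, so the test rows influence the mean and standard deviation used for scaling. One common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. The names `X_demo`, `y_demo`, and the use of `stratify` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (random values, roughly 15% positives)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 6))
y_demo = (rng.random(200) < 0.15).astype(int)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(f"Train set: {X_tr_scaled.shape}, {y_tr.shape}")
print(f"Test set: {X_te_scaled.shape}, {y_te.shape}")
```

With this ordering the training features have a mean of exactly 0 and a standard deviation of 1, while the test features are only approximately standardized, which mirrors how unseen data would be handled.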
