
The model's main purpose is to predict the 10-year risk of coronary heart disease (CHD), a disease of the blood vessels that supply the heart.
Heart disease has been the leading cause of death in the United States since 1921, and in 2008 an estimated 7.3 million people worldwide died from CHD.
The classification goal is to predict whether a patient will develop CHD within the next 10 years (the TenYearCHD column).
The dataset contains patient information: over 4,000 records and 15 attributes.
Variables
Sex: male or female (0=female, 1=male)
Age: age of the patient
Education: some high school (1), high school/GED (2), some college/vocational school (3), college (4)
Current Smoker: whether or not the patient is a current smoker (0=No, 1=Yes)
Cigs Per Day: the average number of cigarettes the patient smoked per day
BP Meds: whether or not the patient was on blood pressure medication (0=No, 1=Yes)
Prevalent Stroke: whether or not the patient had previously had a stroke (0=No, 1=Yes)
Prevalent Hyp: whether or not the patient was hypertensive (0=No, 1=Yes)
Diabetes: whether or not the patient had diabetes (0=No, 1=Yes)
Tot Chol: total cholesterol level
Sys BP: systolic blood pressure 
Dia BP: diastolic blood pressure
BMI: Body Mass Index
Heart Rate: heart rate
Glucose: glucose level

Train Decision Tree, Random Forest, AdaBoost, CatBoost, and XGBoost models, then evaluate each one using the confusion matrix and the accuracy, precision, recall, and F1 scores.


In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings
warnings.filterwarnings(action="ignore")

In [3]:
df = pd.read_csv("framingham.csv")
df.head()

   male  age  education  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP    BMI  heartRate  glucose  TenYearCHD
0     1   39        4.0              0         0.0     0.0                0             0         0    195.0  106.0   70.0  26.97       80.0     77.0           0
1     0   46        2.0              0         0.0     0.0                0             0         0    250.0  121.0   81.0  28.73       95.0     76.0           0
2     1   48        1.0              1        20.0     0.0                0             0         0    245.0  127.5   80.0  25.34       75.0     70.0           0
3     0   61        3.0              1        30.0     0.0                0             1         0    225.0  150.0   95.0  28.58       65.0    103.0           1
4     0   46        3.0              1        23.0     0.0                0             0         0    285.0  130.0   84.0  23.10       85.0     85.0           0


In [None]:
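# Drop a stray index column if the CSV has one (assumed name "Unnamed: 0"; no-op otherwise)
df = df.drop(columns=["Unnamed: 0"], errors="ignore")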

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4240 non-null   int64  
 1   age              4240 non-null   int64  
 2   education        4135 non-null   float64
 3   currentSmoker    4240 non-null   int64  
 4   cigsPerDay       4211 non-null   float64
 5   BPMeds           4187 non-null   float64
 6   prevalentStroke  4240 non-null   int64  
 7   prevalentHyp     4240 non-null   int64  
 8   diabetes         4240 non-null   int64  
 9   totChol          4190 non-null   float64
 10  sysBP            4240 non-null   float64
 11  diaBP            4240 non-null   float64
 12  BMI              4221 non-null   float64
 13  heartRate        4239 non-null   float64
 14  glucose          3852 non-null   float64
 15  TenYearCHD       4240 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 530.1 KB


In [5]:
df.isna().sum()

male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64

In [6]:
df['education'] = df['education'].fillna(df['education'].mean())

In [7]:
df['cigsPerDay'] = df['cigsPerDay'].fillna(df['cigsPerDay'].mean())

In [8]:
df['BPMeds'] = df['BPMeds'].fillna(df['BPMeds'].mean())

In [9]:
df['totChol'] = df['totChol'].fillna(df['totChol'].mean())

In [10]:
df['BMI'] = df['BMI'].fillna(df['BMI'].mean())

In [11]:
df['heartRate'] = df['heartRate'].fillna(df['heartRate'].mean())

In [12]:
df['glucose'] = df['glucose'].fillna(df['glucose'].mean())
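
Mean imputation leaves fractional values in columns that hold category codes: education takes the levels 1 to 4 and BPMeds is a 0/1 flag, so their filled entries fall between valid codes. A minimal sketch of one alternative, assuming the mode is acceptable for coded columns and the median for continuous ones. The helper name `impute_by_type` and the toy frame are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def impute_by_type(frame, coded_cols, continuous_cols):
    """Return a copy with coded columns filled by mode and continuous ones by median."""
    out = frame.copy()
    for col in coded_cols:
        # mode() can return several values; take the first (smallest) one
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in continuous_cols:
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy data standing in for a few framingham.csv columns (values are made up)
toy = pd.DataFrame({
    "education": [1.0, 2.0, 2.0, np.nan, 4.0],
    "BPMeds": [0.0, 0.0, np.nan, 1.0, 0.0],
    "glucose": [77.0, np.nan, 70.0, 103.0, 85.0],
})
filled = impute_by_type(toy, ["education", "BPMeds"], ["glucose"])
print(filled)
```

Whether mode or median is the right choice depends on the column; the point is only that the filled values stay within the codes the column already uses.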

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4240 non-null   int64  
 1   age              4240 non-null   int64  
 2   education        4240 non-null   float64
 3   currentSmoker    4240 non-null   int64  
 4   cigsPerDay       4240 non-null   float64
 5   BPMeds           4240 non-null   float64
 6   prevalentStroke  4240 non-null   int64  
 7   prevalentHyp     4240 non-null   int64  
 8   diabetes         4240 non-null   int64  
 9   totChol          4240 non-null   float64
 10  sysBP            4240 non-null   float64
 11  diaBP            4240 non-null   float64
 12  BMI              4240 non-null   float64
 13  heartRate        4240 non-null   float64
 14  glucose          4240 non-null   float64
 15  TenYearCHD       4240 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 530.1 KB


In [14]:
df['TenYearCHD'].value_counts()

TenYearCHD
0    3596
1     644
Name: count, dtype: int64
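
The counts above show a clear class imbalance. As a quick check, the class shares can be read off with `value_counts(normalize=True)`. The sketch below rebuilds the target from the printed counts instead of loading the CSV, so the series `y_demo` is a stand-in for `df['TenYearCHD']`.

```python
import pandas as pd

# Rebuild the target from the counts printed above (3596 negatives, 644 positives)
y_demo = pd.Series([0] * 3596 + [1] * 644, name="TenYearCHD")

# Share of each class among all records
shares = y_demo.value_counts(normalize=True)
print(shares.round(3))

# Ratio of majority to minority class
imbalance_ratio = (y_demo == 0).sum() / (y_demo == 1).sum()
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```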

In [15]:
df.corr()

                      male        age  education  currentSmoker  cigsPerDay     BPMeds  prevalentStroke  prevalentHyp   diabetes    totChol      sysBP      diaBP        BMI  heartRate    glucose  TenYearCHD
male              1.000000  -0.029014   0.017188       0.197026    0.316023  -0.052203        -0.004550      0.005853   0.015693  -0.070064  -0.035879   0.058199   0.081705  -0.116913   0.005718    0.088374
age              -0.029014   1.000000  -0.164081      -0.213662   -0.192534   0.122036         0.057679      0.306799   0.101314   0.260691   0.394053   0.205586   0.135578  -0.012839   0.116951    0.225408
education         0.017188  -0.164081   1.000000       0.018297    0.008197  -0.010689        -0.035139     -0.080753  -0.038214  -0.022993  -0.128126  -0.061362  -0.135876  -0.053603  -0.033837   -0.053571
currentSmoker     0.197026  -0.213662   0.018297       1.000000    0.767055  -0.048621        -0.032980     -0.103710  -0.044285  -0.046211  -0.130281  -0.107933  -0.167483   0.062678  -0.054062    0.019448
cigsPerDay        0.316023  -0.192534   0.008197       0.767055    1.000000  -0.045847        -0.032711     -0.066444  -0.037086  -0.026182  -0.088523  -0.056473  -0.092888   0.075257  -0.056020    0.057646
BPMeds           -0.052203   0.122036  -0.010689      -0.048621   -0.045847   1.000000         0.115008      0.259125   0.051584   0.078973   0.252023   0.192387   0.099586   0.015172   0.048925    0.086805
prevalentStroke  -0.004550   0.057679  -0.035139      -0.032980   -0.032711   0.115008         1.000000      0.074791   0.006955   0.000105   0.057000   0.045153   0.024856  -0.017674   0.018065    0.061823
prevalentHyp      0.005853   0.306799  -0.080753      -0.103710   -0.066444   0.259125         0.074791      1.000000   0.077752   0.162683   0.696656   0.615840   0.300599   0.146777   0.082757    0.177458
diabetes          0.015693   0.101314  -0.038214      -0.044285   -0.037086   0.051584         0.006955      0.077752   1.000000   0.040161   0.111265   0.050260   0.086282   0.048986   0.605709    0.097344
totChol          -0.070064   0.260691  -0.022993      -0.046211   -0.026182   0.078973         0.000105      0.162683   0.040161   1.000000   0.207436   0.163423   0.115013   0.090678   0.044710    0.081807


In [16]:
df.drop(['education', 'currentSmoker', 'heartRate'], axis=1, inplace=True)
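
The notebook does not state why these three columns are dropped. One plausible criterion is a weak linear association with the target: in the matrix above, education (-0.054) and currentSmoker (0.019) are among the weaker correlates of TenYearCHD. A minimal sketch of making such a ranking explicit, on synthetic data. The helper `rank_by_target_corr` and the columns `strong` and `weak` are hypothetical.

```python
import numpy as np
import pandas as pd

def rank_by_target_corr(frame, target):
    """Absolute Pearson correlation of each feature with the target, ascending."""
    corr = frame.corr()[target].drop(target)
    return corr.abs().sort_values()

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "strong": target + rng.normal(scale=0.5, size=n),  # built to track the target
    "weak": rng.normal(size=n),                        # independent noise
    "target": target,
})

# Features at the top of the ranking are the weakest linear predictors
ranking = rank_by_target_corr(demo, "target")
print(ranking)
```

Pearson correlation only captures linear association, so a low value does not prove that a feature is useless to a tree-based model.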

In [17]:
df['TenYearCHD'].value_counts()

TenYearCHD
0    3596
1     644
Name: count, dtype: int64

In [19]:
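from imblearn.over_sampling import SMOTE

# Separate the features from the target
x = df.drop(['TenYearCHD'], axis=1)
y = df['TenYearCHD']

# Oversample the minority class (TenYearCHD = 1) with synthetic samples
smote = SMOTE()
x_sampled, y_sampled = smote.fit_resample(x, y)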

In [20]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x_sampled,y_sampled, train_size=0.7, random_state=123456)
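
In the cells above, SMOTE runs before train_test_split, so the test set contains synthetic rows interpolated from samples that may sit in the training set, which tends to inflate test scores. A common alternative is to split first and resample only the training portion. The sketch below assumes NumPy inputs and any sampler exposing `fit_resample(x, y)`, as `imblearn.over_sampling.SMOTE` does. To stay free of extra dependencies it uses a hypothetical `RandomDuplicator` that merely duplicates minority rows, which is plain random oversampling and not SMOTE. The `stratify=y` argument is also an addition that the notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

class RandomDuplicator:
    """Hypothetical stand-in sampler: duplicates minority rows at random (not SMOTE)."""
    def __init__(self, random_state=0):
        self.rng = np.random.default_rng(random_state)

    def fit_resample(self, x, y):
        classes, counts = np.unique(y, return_counts=True)
        target_count = counts.max()
        keep = [np.arange(len(y))]  # start with every original row
        for cls, count in zip(classes, counts):
            if count < target_count:
                pool = np.flatnonzero(y == cls)
                keep.append(self.rng.choice(pool, size=target_count - count, replace=True))
        idx = np.concatenate(keep)
        return x[idx], y[idx]

def split_then_resample(x, y, sampler, train_size=0.7, random_state=0):
    """Split first, then balance only the training portion; the test set stays untouched."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, train_size=train_size, random_state=random_state, stratify=y
    )
    x_tr_bal, y_tr_bal = sampler.fit_resample(x_tr, y_tr)
    return x_tr_bal, x_te, y_tr_bal, y_te

# Toy imbalanced data: 85 negatives, 15 positives, every row unique
x_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([0] * 85 + [1] * 15)
x_tr, x_te, y_tr, y_te = split_then_resample(x_demo, y_demo, RandomDuplicator())
print(np.bincount(y_tr), np.bincount(y_te))
```

With imbalanced-learn installed, an instance of `SMOTE` could be passed in place of `RandomDuplicator()`. Scores measured this way are not comparable with the ones reported further down, which come from the resample-then-split ordering.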


Decision Tree, Random Forest, AdaBoost, CatBoost, XGBoost

In [21]:
from sklearn.tree import DecisionTreeClassifier

def create_model_decision_tree_classifier():
    model = DecisionTreeClassifier()
    model.fit(x_train,y_train)
    return model

In [22]:
from sklearn.ensemble import RandomForestClassifier

def create_model_random_forest():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(x_train, y_train)
    return model

In [23]:
from sklearn.ensemble import AdaBoostClassifier

def create_model_adaboost():
    model = AdaBoostClassifier()
    model.fit(x_train,y_train)
    return model

In [24]:
from catboost import CatBoostClassifier

def create_model_catboost():
    model = CatBoostClassifier(logging_level='Silent')
    model.fit(x_train,y_train)
    return model

In [25]:
from xgboost import XGBClassifier

def create_model_xgboost():
    model = XGBClassifier()
    model.fit(x_train,y_train)
    return model

Calculate the confusion matrix, accuracy score, precision score, recall score, and F1 score

In [26]:
from sklearn.metrics import confusion_matrix,accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def evaluate_model(model):
    """Return training accuracy plus test accuracy, precision, recall, and F1."""
    # Training accuracy: score the model on the data it was fitted on
    training_accuracy = accuracy_score(y_train, model.predict(x_train))

    # Test metrics: score the model on the held-out split
    y_pred = model.predict(x_test)
    testing_accuracy = accuracy_score(y_test, y_pred)
    # average='macro' takes the unweighted mean of the per-class scores
    testing_precision = precision_score(y_test, y_pred, average='macro')
    testing_recall = recall_score(y_test, y_pred, average='macro')
    testing_f1 = f1_score(y_test, y_pred, average='macro')

    return training_accuracy, testing_accuracy, testing_precision, testing_recall, testing_f1
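
confusion_matrix is imported above, but evaluate_model does not call it. As a sketch of how the same scores relate to the matrix, the block below derives accuracy, precision, recall, and F1 for the positive class from its four cells and compares them with the scikit-learn functions. The labels are made up for illustration, and the figures are for class 1 only, whereas evaluate_model reports macro averages over both classes.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

# For binary 0/1 labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("matches sklearn:",
      np.isclose(accuracy, accuracy_score(y_true_demo, y_pred_demo)),
      np.isclose(precision, precision_score(y_true_demo, y_pred_demo)),
      np.isclose(recall, recall_score(y_true_demo, y_pred_demo)),
      np.isclose(f1, f1_score(y_true_demo, y_pred_demo)))
```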

In [27]:
models = [
    ("Decision Tree", create_model_decision_tree_classifier(), (2, 0)),
    ("Random Forest", create_model_random_forest(), (2, 1)),
    ("CAT Boost", create_model_catboost(), (3, 0)),
    ("Ada Boost", create_model_adaboost(), (3, 1)),
    ("XG Boost", create_model_xgboost(), (4, 1))
]
models

[('Decision Tree', DecisionTreeClassifier(), (2, 0)),
 ('Random Forest', RandomForestClassifier(), (2, 1)),
 ('CAT Boost', <catboost.core.CatBoostClassifier at 0x215d00d7d40>, (3, 0)),
 ('Ada Boost', AdaBoostClassifier(), (3, 1)),
 ('XG Boost',
  XGBClassifier(base_score=None, booster=None, callbacks=None,
                colsample_bylevel=None, colsample_bynode=None,
                colsample_bytree=None, device=None, early_stopping_rounds=None,
                enable_categorical=False, eval_metric=None, feature_types=None,
                gamma=None, grow_policy=None, importance_type=None,
                interaction_constraints=None, learning_rate=None, max_bin=None,
                max_cat_threshold=None, max_cat_to_onehot=None,
                max_delta_step=None, max_depth=None, max_leaves=None,
                min_child_weight=None, missing=nan, monotone_constraints=None,
                multi_strategy=None, n_estimators=None, n_jobs=None,
                num_parallel_tree=None,

In [28]:
performance_data = []
for model_name, model, position in models:
    training_accuracy, test_accuracy, precision, recall, f1 = evaluate_model(model)
    performance_data.append([
        model_name,
        f"{training_accuracy * 100:0.2f}%", f"{test_accuracy * 100:0.2f}%",
        precision, 
        recall,        
        f1
    ])
    
performance_chart = pd.DataFrame(performance_data, 
            columns=["Model", "Train Accuracy", "Test Accuracy", "Precision", "Recall", "F1"])
performance_chart

           Model  Train Accuracy  Test Accuracy  Precision    Recall        F1
0  Decision Tree         100.00%         76.37%   0.764696  0.764123  0.763604
1  Random Forest         100.00%         85.77%   0.857871  0.857915  0.857738
2      CAT Boost          93.60%         87.30%   0.875034  0.872459  0.872712
3      Ada Boost          72.37%         72.34%   0.723426  0.723469  0.723350
4       XG Boost          98.65%         86.01%   0.860029  0.860006  0.860017
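
A large difference between training and test accuracy is a common sign of overfitting. As a rough reading aid, the sketch below recomputes that gap from the accuracies printed above. The frame `results` and the column `Gap` are introduced here for illustration.

```python
import pandas as pd

# Accuracies (percent) copied from the table above
results = pd.DataFrame({
    "Model": ["Decision Tree", "Random Forest", "CAT Boost", "Ada Boost", "XG Boost"],
    "Train Accuracy": [100.00, 100.00, 93.60, 72.37, 98.65],
    "Test Accuracy": [76.37, 85.77, 87.30, 72.34, 86.01],
})

# Gap in percentage points between training and test accuracy
results["Gap"] = (results["Train Accuracy"] - results["Test Accuracy"]).round(2)
print(results.sort_values("Gap", ascending=False).to_string(index=False))
```

Keeping the accuracies numeric, instead of the formatted strings stored in performance_chart, is what makes this kind of sorting and arithmetic possible.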
