<a href="https://colab.research.google.com/github/ayushambhore/Cardiovascular-risk-prediction-ML-classification/blob/main/Cardiovascular_risk_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Cardiovascular Risk Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Name -** Ayush Ambhore

# **Project Summary -**

This project focused on the prediction of Coronary Heart Disease (CHD) risk based on a dataset containing demographic, behavioral, and medical history features of patients. The dataset included information on gender, age, smoking status, blood pressure medication, previous stroke history, hypertension, diabetes prevalence, and various medical measurements.

The initial exploratory data analysis revealed interesting insights, such as a slightly higher number of females in the dataset and more non-smokers than smokers. Age was found to be a significant risk factor for CHD, and certain medical measurements like cholesterol levels, blood pressure, BMI, heart rate, and glucose levels showed distinct patterns.

To preprocess the data, rows with missing values in sparsely missing columns were dropped, missing glucose values were imputed with the median, and outliers in the continuous variables were capped using the Interquartile Range (IQR) method. Categorical variables were encoded, and feature manipulation and selection were performed to enhance the model's performance.

To handle the class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied, and the resampled dataset was then split into training and testing sets.

Eight different machine learning models, including Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and XGBoost, were trained and evaluated on the dataset.

The evaluation results demonstrated promising performance for all models, with high accuracy, precision, recall, F1 score, and ROC AUC values on both training and test datasets.

The Logistic Regression model displayed stable performance with no signs of overfitting, making it a reliable choice for the task.

The Decision Tree model exhibited perfect accuracy on the training dataset but slightly worse performance on the test dataset, indicating some overfitting.

The Random Forest model demonstrated robust performance on both datasets but also showed signs of overfitting.

The KNN model achieved perfect accuracy on the training dataset, but its performance on the test dataset was still promising, making it a viable option.

The SVM model demonstrated stable and consistent performance on both datasets, showcasing good generalization capabilities.

The XGBoost model emerged as the best-performing model with high accuracy, precision, recall, F1 score, and ROC AUC values on both datasets. It exhibited strong generalization capabilities and minimal overfitting, making it the most reliable and efficient model for this classification task.

In summary, this project successfully developed and evaluated machine learning models to predict the risk of Coronary Heart Disease. The XGBoost model proved to be the most robust and reliable model, offering accurate and precise predictions. The insights gained from this project could be valuable for medical practitioners and policymakers to identify individuals at higher risk of CHD and implement preventive measures accordingly. However, further research and analysis with a larger dataset could enhance the model's performance and provide more comprehensive insights into CHD risk factors.


# **GitHub Link -**

https://github.com/ayushambhore/Cardiovascular-risk-prediction-ML-classification

# **Problem Statement**


The goal of this project is to develop a machine learning model that can accurately predict the 10-year risk of Coronary Heart Disease (CHD) based on various demographic, behavioral, and medical history features of patients. The dataset contains information on gender, age, smoking status, blood pressure medication, previous stroke history, hypertension, diabetes prevalence, and medical measurements such as cholesterol levels, blood pressure, BMI, heart rate, and glucose levels.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# numpy.math was removed in NumPy 2.0; the standard-library math module is used instead
import math
from collections import Counter

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

# Importing libraries for modelling and evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
import xgboost as xgb


import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
url = 'https://raw.githubusercontent.com/ayushambhore/Cardiovascular-risk-prediction-ML-classification/main/data_cardiovascular_risk.csv'
df = pd.read_csv(url)

### Dataset First View

In [None]:
# Dataset First Look
df.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f' Row count = {df.shape[0]}\n Column count = {df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values
print(df.isnull().sum())

In [None]:
print(df.isnull().sum().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False);

### What did you know about your dataset?

This dataset contains information on patients' cardiovascular health. It has 3390 rows and 17 columns, with no duplicate rows and 510 missing values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

Demographic:

* Gender: Indicating the biological sex of the patient (categorical: "Male" or "Female").
* Age: The patient's age, measured in years (continuous variable).

Behavioral:

* Smoking Status: Whether or not the patient is a current smoker (categorical: "Yes" or "No").
* Average Cigarettes Per Day: The average number of cigarettes smoked per day by the patient (continuous variable).

Medical History:

* Blood Pressure Medication: Whether or not the patient is currently taking blood pressure medication (categorical: "Yes" or "No").
* Previous Stroke: Whether or not the patient has previously experienced a stroke (categorical: "Yes" or "No").
* Hypertension Prevalence: Whether or not the patient has been diagnosed with hypertension (categorical: "Yes" or "No").
* Diabetes Prevalence: Whether or not the patient has been diagnosed with diabetes (categorical: "Yes" or "No").

Current Medical Measurements:

* Total Cholesterol Level: The patient's total cholesterol level (continuous variable).
* Systolic Blood Pressure: The patient's systolic blood pressure reading (continuous variable).
* Diastolic Blood Pressure: The patient's diastolic blood pressure reading (continuous variable).
* BMI (Body Mass Index): The patient's calculated body mass index (continuous variable).
* Heart Rate: The patient's heart rate (continuous variable).
* Glucose Level: The patient's glucose level (continuous variable).

Predicted Variable (Target):

* 10-year Risk of Coronary Heart Disease (CHD): Indicating the likelihood of the patient developing coronary heart disease within the next 10 years (binary: "1" for "Yes" and "0" for "No").







### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## ***3. EDA***

### 1. Univariate analysis of Categorical Variables

In [None]:
categorical_var = ['sex','is_smoking','education','BPMeds','prevalentStroke', 'prevalentHyp', 'diabetes']

# Create a figure to hold the subplots
plt.figure(figsize=(15, 10))

# Set the title for the entire figure
plt.suptitle('Exploring Categorical Features', fontsize=20, fontweight='bold', y=1.02)

# Iterate through the categorical features and create individual subplots
for i, col in enumerate(categorical_var):
    plt.subplot(3, 3, i+1)  # Create a subplot with 3 rows and 3 columns

    # Use seaborn's countplot to visualize the distribution of the current categorical feature
    sns.countplot(data=df, x=col, palette="tab10")

# Display the plots
plt.tight_layout()  # Adjust the layout to prevent overlapping of subplots
plt.show()

**Observations**

* There are slightly more females than males in the dataset.

* The number of non-smokers is slightly higher than the number of smokers, with each group having around 1600 individuals.

* Approximately 1500 individuals have an education level of 1, while nearly 400 individuals have an education level of 4. However, the specific definitions of these education levels are not provided.

* Over 3000 individuals are not taking medication for blood pressure.

* Only a small number of people in the dataset have a history of stroke.

* Around 1000 individuals in the dataset have been identified as hypertensive.

* A large number, more than 3000 people, do not have diabetes.

### 2. Univariate analysis of Numerical Variables

In [None]:
continuous_var = ['age','cigsPerDay','totChol','sysBP','diaBP','BMI','heartRate','glucose']

# Create a figure to hold the subplots
plt.figure(figsize=(15, 15))

# Iterate through the continuous variables and create individual subplots
for i, col in enumerate(continuous_var):
    plt.subplot(4, 2, i+1)  # Create a subplot with 4 rows and 2 columns

    # Use seaborn's histplot to visualize the distribution of the current continuous variable
    # (distplot is deprecated in recent seaborn releases)
    sns.histplot(df[col], kde=False, color='skyblue', label='Distribution')

    # Calculate mean and median for the current continuous variable
    mean_val = df[col].mean()
    median_val = df[col].median()

    # Add vertical lines to mark mean and median
    plt.axvline(mean_val, color='magenta', linestyle='dashed', linewidth=2, label='Mean')
    plt.axvline(median_val, color='cyan', linestyle='dashed', linewidth=2, label='Median')

    plt.title(col + ' Distribution')
    plt.legend()

# Display the plots
plt.tight_layout()  # Adjust the layout to prevent overlapping of subplots
plt.show()


**Observations**

* Age: The age ranges from 35 to 70 years and follows an almost normal distribution. Most individuals belong to the age group around 40 years.

* Cigarettes Smoked per Day: The most common value is 0 (non-smokers), but a significant number of individuals smoke 20 cigarettes per day.

* Cholesterol: Cholesterol levels range from 100 to 700. However, the majority of individuals have cholesterol levels between 150 and 350.

* Systolic Blood Pressure (BP): Systolic BP ranges mainly from 100 to 200.

* Diastolic Blood Pressure (BP): Diastolic BP ranges mainly from 60 to 120.

* BMI (Body Mass Index): BMI varies from 16 to 40, indicating a range of body weights.

* Heart Rate: Heart rate values are observed between 40 to 110 beats per minute. The most common occurrence is around 75 beats per minute.

* Glucose: Glucose levels vary from 50 to 125. However, there are some extreme values that cannot be ignored, as they may pose a risk of heart disease.

### 3. Bivariate analysis between the dependent variable and continuous independent variables

In [None]:
dependent_var = ['TenYearCHD']

# Create subplots with a 3x3 grid
plt.figure(figsize=(15, 15))
plt.subplots_adjust(hspace=0.5)  # Adjust the vertical spacing between subplots

# Loop through each continuous variable and create a violin plot
for idx, i in enumerate(continuous_var):
    plt.subplot(3, 3, idx + 1)  # Create a subplot with 3 rows and 3 columns

    # Use seaborn's violinplot to visualize the relationship between dependent_var and the current continuous variable
    sns.violinplot(x=dependent_var[0], y=i, data=df, palette={0: "blue", 1: "magenta"})

    # Set labels and title
    plt.ylabel(i)
    plt.xlabel(dependent_var[0])
    plt.title(f"{dependent_var[0]} vs {i}")

# Display the plots
plt.show()


**Observations**

* The data analysis indicates a higher risk of Coronary Heart Disease (CHD) among older patients compared to younger ones. However, concerning other continuous variables, there is no definitive evidence supporting their association with CHD risk.

* The observations highlight the significance of age as a risk factor for CHD, which aligns with existing medical knowledge. Nonetheless, it is crucial to conduct further research and analysis to explore potential relationships between CHD risk and other continuous variables.


Understanding the factors contributing to CHD risk is essential for public health initiatives and individual patient care. Additional investigations can provide valuable insights, leading to better preventive measures and tailored interventions for various risk groups. As the medical landscape evolves, comprehensive data analysis continues to play a vital role in advancing our knowledge of cardiovascular diseases.

### 4. Bivariate analysis between the dependent variable and categorical independent variables

In [None]:
# Calculate the percentage distribution of the dependent variable
percent_distribution = df[dependent_var].value_counts(normalize=True) * 100

# Create subplots with a layout of 4 rows and 2 columns
plt.figure(figsize=(15, 20))
plt.subplots_adjust(hspace=0.5, wspace=0.3)  # Adjust the vertical and horizontal spacing between subplots

# Loop through each categorical variable and create a 100% stacked bar chart
for idx, cat_var in enumerate(categorical_var):
    plt.subplot(4, 2, idx + 1)  # Create a subplot with 4 rows and 2 columns

    # Calculate the percentage distribution of the dependent variable for the current categorical variable
    percent_distribution = df.groupby(cat_var)[dependent_var].value_counts(normalize=True).unstack() * 100

    # Create a stacked bar chart for the dependent variable
    percent_distribution.plot(kind='bar', stacked=True, ax=plt.gca(), width=0.8)

    # Set labels and title
    plt.xlabel(cat_var)
    plt.ylabel('Percentage')
    plt.title(f'Distribution of {dependent_var[0]} by {cat_var}')
    plt.xticks(rotation=0)

    # Annotate the bars with percentages
    for p in plt.gca().patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy()
        plt.gca().annotate(f'{height:.1f}%', (x + width / 2, y + height / 2), ha='center', va='center')

plt.show()


**Observations**

* Male patients face a significantly higher CHD risk (18%) compared to female patients (12%).

* Smokers have a somewhat higher CHD rate (16%) than non-smokers (13%).

* Patients with education level 1, 2, 3, and 4 had CHD percentages of 18%, 11%, 12%, and 14%, respectively.

* Patients taking BP medicines have a markedly higher CHD rate (33%) than those not on medication (14%). This is an association and does not imply that the medication raises risk.

* Patients who experienced a stroke in their life have a substantially higher CHD risk (45%) than other patients (14%).

* Hypertensive patients face a significantly higher CHD risk (23%) than non-hypertensive patients (11%).

* Diabetic patients have a considerably higher risk of CHD (37%) than other patients (14%).
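
The percentages above are descriptive. One way to check whether a difference between groups is larger than chance alone would explain is a chi-square test of independence on the contingency table. The sketch below is not part of the original analysis: the helper `chd_rate_and_chi2` and the `toy` DataFrame are hypothetical, the toy values are made up for illustration, and SciPy is assumed to be installed.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chd_rate_and_chi2(data, cat_col, target_col="TenYearCHD"):
    """Return the per-group positive rate (in %) and the chi-square p-value."""
    # Share of positive cases within each category, in percent
    rates = data.groupby(cat_col)[target_col].mean() * 100
    # Contingency table of category vs. target counts
    table = pd.crosstab(data[cat_col], data[target_col])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return rates, p_value

# Hypothetical toy data, NOT taken from the project dataset
toy = pd.DataFrame({
    "sex": ["M"] * 50 + ["F"] * 50,
    "TenYearCHD": [1] * 15 + [0] * 35 + [1] * 5 + [0] * 45,
})
rates, p_value = chd_rate_and_chi2(toy, "sex")
print(rates.round(1))
print(f"p-value: {p_value:.4f}")
```

Applied to the real `df`, the same helper could be called once per entry in `categorical_var` to see which of the observed gaps hold up statistically.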

### EDA summary








The dataset has a roughly balanced gender split, with slightly more females than males, and nearly equal numbers of smokers and non-smokers. Education level 1 is the most common, with fewer individuals at level 4. Most individuals are not taking BP medicines, few have a history of stroke, around 1000 are hypertensive, and more than 3000 are non-diabetic. Age is approximately normally distributed, with the most common age group around 40 years. Many smokers report around 20 cigarettes per day. Cholesterol, systolic/diastolic blood pressure, BMI, heart rate, and glucose levels span wide ranges. Age emerges as a crucial CHD risk factor, and the CHD rate also differs by gender, smoking, education level, BP medication, stroke history, hypertension, and diabetes.


## ***4. Feature Engineering & Data Pre-processing***

### 1. Handling the NULL values

In the medical dataset we are working with, the values are person-specific and essential for accurate predictions. Therefore, imputing null values using advanced methods could introduce inaccuracies. To ensure data integrity and minimize risks, we have decided on a two-step approach.

Firstly, we will identify features with less than 5% null values and drop the corresponding rows. This step ensures that we retain most of the data while removing only a small portion with missing values.

In [None]:
# Calculate the percentage of missing values in each column of the DataFrame 'df'
missing_percentages = df.isna().sum() / len(df) * 100

# Round the missing_percentages to 2 decimal places
rounded_missing_percentages = missing_percentages.round(2)

# Print the rounded missing percentages for each column
print("Missing value percentages:")
print(rounded_missing_percentages)


In [None]:
df.dropna(subset=['education', 'cigsPerDay', 'BPMeds', 'totChol', 'BMI', 'heartRate'], inplace=True)
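
The column list in the cell above is written out by hand. As a rough sketch of the "less than 5% nulls" rule, the list could instead be derived from the missing-value percentages. The helper name `columns_below_missing_threshold` and the `toy` frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def columns_below_missing_threshold(data, threshold_pct=5.0):
    """List columns that have some missing values, but fewer than threshold_pct percent."""
    missing_pct = data.isna().mean() * 100
    mask = (missing_pct > 0) & (missing_pct < threshold_pct)
    return missing_pct[mask].index.tolist()

# Hypothetical toy frame: 'a' is 2% missing, 'b' is 10% missing, 'c' is complete
toy = pd.DataFrame({
    "a": [np.nan] * 2 + [1.0] * 98,
    "b": [np.nan] * 10 + [1.0] * 90,
    "c": [1.0] * 100,
})
cols_to_drop_rows = columns_below_missing_threshold(toy)
cleaned = toy.dropna(subset=cols_to_drop_rows)
print(cols_to_drop_rows, len(cleaned))
```

Deriving the list this way keeps the code in step with the stated rule if the data changes.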

Secondly, for the remaining column with missing values ('glucose'), we will impute the nulls with the median. Although this step may introduce some level of uncertainty in the predictions, we believe it will not significantly affect the overall outcome.

In [None]:
# Impute missing values in 'glucose' column with the median value from dataset
median_glucose = df['glucose'].median()
df['glucose'] = df['glucose'].fillna(median_glucose)

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False);

### 2. Handling Outliers

In [None]:
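# Restrict to numeric columns: 'sex' and 'is_smoking' are still strings at this point
numeric_df = df.select_dtypes(include='number')

Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

# Count values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] in each column
outliers_count = ((numeric_df < Q1 - 1.5 * IQR) | (numeric_df > Q3 + 1.5 * IQR)).sum(axis=0)

outliers_count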

Several columns contain outliers, so we treat the continuous variables below.

In [None]:
for col in continuous_var:
    # Step 1: Calculate the lower and upper thresholds for outliers using the Interquartile Range (IQR).
    lower_threshold = df[col].quantile(0.25) - 1.5 * IQR[col]
    upper_threshold = df[col].quantile(0.75) + 1.5 * IQR[col]

    # Step 2: Replace the values below the lower threshold with the lower threshold value itself.
    df[col] = df[col].mask(df[col] < lower_threshold, lower_threshold)

    # Step 3: Replace the values above the upper threshold with the upper threshold value itself.
    df[col] = df[col].mask(df[col] > upper_threshold, upper_threshold)


In [None]:
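# Recount the outliers after capping, again on numeric columns only
numeric_df = df.select_dtypes(include='number')

Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

outliers_count = ((numeric_df < Q1 - 1.5 * IQR) | (numeric_df > Q3 + 1.5 * IQR)).sum(axis=0)

outliers_count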

##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the Interquartile Range (IQR) method to handle outliers. This technique is robust, non-parametric, and simple. It calculates the IQR, determines lower and upper thresholds, and replaces potential outliers with the corresponding threshold values using the mask function. By using the IQR method, extreme values are identified and treated, improving data accuracy and analysis reliability.
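
As a compact illustration of the capping described above, the two `mask` calls can be expressed as a single `clip`. This is a minimal sketch on made-up numbers; the helper name `cap_outliers_iqr` and the `toy` series are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical values with one extreme point (100.0)
toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 100.0])
capped = cap_outliers_iqr(toy)
print(capped.max())  # the extreme value is pulled down to the upper fence, 23.5
```

Here Q1 = 12.25 and Q3 = 16.75, so the IQR is 4.5 and the upper fence is 16.75 + 1.5 * 4.5 = 23.5. Values inside the fences are left unchanged.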

### 3. Categorical Encoding

In [None]:
# Encode 'is_smoking' column with 'YES' as 1 and 'NO' as 0
df['is_smoking'] = df['is_smoking'].map({'YES': 1, 'NO': 0})

# Encode 'sex' column with 'M' as 1 and 'F' as 0
df['sex'] = df['sex'].map({'M': 1, 'F': 0})

We will perform one-hot encoding on the 'education' feature in the DataFrame. This process will convert the categorical variable 'education' into binary columns for each unique category, and we will then drop the original 'education' feature to avoid redundancy.

In [None]:
# Perform one-hot encoding for the 'education' feature
education_onehot = pd.get_dummies(df['education'], prefix='edu')

# Drop the original 'education' feature from the DataFrame
df = df.drop('education', axis=1)

# Concatenate the one-hot encoded 'education' feature with the rest of the data
df = pd.concat([df, education_onehot], axis=1)

# Display the first three rows of the updated DataFrame
df.head(3)


#### What all categorical encoding techniques have you used & why did you use those techniques?

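Two techniques were used. The two-valued columns 'is_smoking' and 'sex' were mapped directly to 0 and 1, and 'education' was one-hot encoded into one binary column per level, since the meaning of its levels is not provided.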

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

This code calculates the 'pulse_pressure' by subtracting 'diaBP' from 'sysBP' and then drops the original 'sysBP' and 'diaBP' columns from the DataFrame.

In [None]:
# Create a new column 'pulse_pressure' by calculating the pulse pressure as the difference between 'sysBP' and 'diaBP'
df['pulse_pressure'] = df['sysBP'] - df['diaBP']

# Drop the 'sysBP' and 'diaBP' columns from the DataFrame
df = df.drop(['sysBP', 'diaBP'], axis=1)

#### 2. Feature Selection

In [None]:
plt.figure(figsize=(20,15))
heatmap = sns.heatmap(df.corr(), cmap="YlGnBu", annot=True)

heatmap.set_title('Correlation Heatmap');

In [None]:
# Dropping the is_smoking column due to multicollinearity
df.drop('is_smoking', axis=1, inplace=True)
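
The heatmap is read by eye in the cells above. A hedged sketch of listing strongly correlated column pairs programmatically is shown below. The helper `highly_correlated_pairs`, the 0.7 threshold, and the synthetic `toy` data are all assumptions for illustration. The notebook does not state which column 'is_smoking' is collinear with; 'cigsPerDay' is used here as the presumed partner.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(data, threshold=0.7):
    """Return (col_a, col_b, corr) for column pairs with |corr| >= threshold."""
    corr = data.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Hypothetical toy data: smokers have a positive cigarette count, non-smokers have 0
rng = np.random.default_rng(0)
is_smoking = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "is_smoking": is_smoking,
    "cigsPerDay": is_smoking * rng.integers(5, 30, size=200),
    "age": rng.integers(35, 70, size=200),
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

A pair flagged this way is a candidate for dropping one of its columns, which is the decision taken for 'is_smoking' above.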

### 5. Data Transformation

In [None]:
continuous_var = ['age','cigsPerDay','totChol','BMI','heartRate','glucose']
# skewness along the index axis
(df[continuous_var]).skew(axis = 0)

In [None]:
# Skew for log10 transformation
np.log10(df[continuous_var]+1).skew(axis = 0)


The log transformation of the continuous variables has noticeably reduced their skewness.

In [None]:
for col in continuous_var:
    # Apply log10(x + 1); adding 1 keeps zero values (e.g. cigsPerDay = 0) defined, since log(0) is undefined.
    df[col] = np.log10(df[col] + 1)

In [None]:
# Skewness after the transformation (the columns are already log-transformed, so the log is not applied again)
df[continuous_var].skew(axis = 0)

### 6. Data Splitting

In [None]:
y = df['TenYearCHD'] #dependent variable

In [None]:
X = df.drop(columns='TenYearCHD') #independent variables

### 7. Handling Imbalanced Dataset

In [None]:
# Dependent Column Value Counts
print(df['TenYearCHD'].value_counts())
print(" ")

# Dependent Variable Column Visualization - Pie Chart
plt.figure(figsize=(6, 6))
df['TenYearCHD'].value_counts().plot(kind='pie', autopct="%1.1f%%", startangle=90)
plt.ylabel('')
plt.title('TenYearCHD Distribution')
plt.show()


##### Do you think the dataset is imbalanced? Explain Why.

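Yes. Far fewer patients have TenYearCHD = 1 than TenYearCHD = 0, as the value counts and pie chart above show, so a model could reach high accuracy simply by predicting the majority class.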

In [None]:
# Display class distribution before handling imbalance
class_distribution_before = Counter(y)
print(f'Before Handling Imbalanced class: {class_distribution_before}')

# Resample the minority class using SMOTE
smote = SMOTE(random_state=42)
X, y = smote.fit_resample(X, y)

# Display class distribution after handling imbalance
class_distribution_after = Counter(y)
print(f'After Handling Imbalanced class: {class_distribution_after}')


In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 0)
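
In the cells above, SMOTE is applied before the train-test split, so synthetic points derived from what later become test rows can end up in the training set, and the test set itself contains synthetic samples. A common alternative is to split first and resample only the training portion. The sketch below illustrates that ordering on synthetic data. It is not the notebook's method: plain random oversampling via `sklearn.utils.resample` stands in for SMOTE, and all `_demo`, `_tr`, and `_te` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data standing in for X and y
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=0
)

# 1. Split first, stratifying so both sets keep the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# 2. Oversample the minority class in the training set only.
#    Random oversampling is used here as a stand-in; with imbalanced-learn,
#    SMOTE(random_state=42).fit_resample(X_tr, y_tr) would take this place.
minority = y_tr == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority], replace=True, n_samples=n_majority, random_state=0
)
X_tr_bal = np.vstack([X_tr[~minority], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority], y_min_up])

print(np.bincount(y_tr_bal))  # balanced training classes
print(np.bincount(y_te))      # test set keeps the original imbalance
```

With this ordering the test metrics describe performance on untouched, realistically imbalanced data, so they would likely differ from the figures reported later in this notebook.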

In [None]:
print(X_train.shape)
print(X_test.shape)

### 8. Data scaling

In [None]:
# Scaling Data
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## ***5. ML Model Implementation***

### Evaluation metrics

In [None]:
# Define the columns for the evaluation_results DataFrame
columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Train Precision', 'Test Precision',
           'Train Recall', 'Test Recall', 'Train F1 Score', 'Test F1 Score',
           'Train ROC AUC', 'Test ROC AUC']

# Create an empty DataFrame to store the evaluation results of different models
evaluation_results = pd.DataFrame(columns=columns)

# Define a function to evaluate a model and store its performance metrics in the evaluation_results DataFrame
def evaluate_model(y_train, y_train_pred, y_test, y_test_pred, model_name):
    global evaluation_results  # Use the global evaluation_results DataFrame

    # Calculate various evaluation metrics for the training and test sets
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    train_precision = precision_score(y_train, y_train_pred)
    test_precision = precision_score(y_test, y_test_pred)
    train_recall = recall_score(y_train, y_train_pred)
    test_recall = recall_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred)
    test_f1 = f1_score(y_test, y_test_pred)
    train_roc_auc = roc_auc_score(y_train, y_train_pred)
    test_roc_auc = roc_auc_score(y_test, y_test_pred)

    # Create a DataFrame to store the evaluation metrics for the current model
    model_evaluation = pd.DataFrame({
        'Model': [model_name],
        'Train Accuracy': [train_accuracy],
        'Test Accuracy': [test_accuracy],
        'Train Precision': [train_precision],
        'Test Precision': [test_precision],
        'Train Recall': [train_recall],
        'Test Recall': [test_recall],
        'Train F1 Score': [train_f1],
        'Test F1 Score': [test_f1],
        'Train ROC AUC': [train_roc_auc],
        'Test ROC AUC': [test_roc_auc]
    })

    # Concatenate the model_evaluation DataFrame with the global evaluation_results DataFrame
    # This step appends the results of the current model to the overall evaluation_results DataFrame
    evaluation_results = pd.concat([evaluation_results, model_evaluation], ignore_index=True)

    # Print the evaluation metrics for the current model
    print(f"Evaluation Metrics for {model_name}:")
    print(f"Train Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Train Precision: {train_precision:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    print(f"Train Recall: {train_recall:.4f}")
    print(f"Test Recall: {test_recall:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}")
    print(f"Test F1 Score: {test_f1:.4f}")
    print(f"Train ROC AUC: {train_roc_auc:.4f}")
    print(f"Test ROC AUC: {test_roc_auc:.4f}")
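
Note that `evaluate_model` passes hard 0/1 predictions to `roc_auc_score`. With hard labels the ROC curve has a single operating point, and the resulting value equals balanced accuracy, which on a balanced dataset is close to plain accuracy. ROC AUC is usually computed from scores such as `predict_proba`. The sketch below contrasts the two on synthetic data; the `_demo`, `_tr`, `_te`, and `clf` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the project's features and target
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC AUC from hard 0/1 labels: a single operating point, equal to balanced accuracy
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# ROC AUC from scores: uses the predicted probability of the positive class
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels: {auc_from_labels:.4f}")
print(f"Balanced accuracy: {balanced_accuracy_score(y_te, clf.predict(X_te)):.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

To adopt this in `evaluate_model`, the function would need the predicted probabilities as extra arguments; that change is left as a suggestion.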

In [None]:
# Define a function to plot the confusion matrix
def plot_confusion_matrix(y_true, y_pred, labels):
    # Calculate the confusion matrix using sklearn's confusion_matrix function
    cm = confusion_matrix(y_true, y_pred, labels=labels)

    # Create a DataFrame to visualize the confusion matrix with row and column labels
    cm_df = pd.DataFrame(cm, index=labels, columns=labels)

    # Set the size of the plot
    plt.figure(figsize=(4, 3))  # You can adjust the figsize to make the plot smaller or larger

    # Create a heatmap using seaborn to visualize the confusion matrix
    sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues', annot_kws={"size": 8})
    # The 'annot=True' parameter adds the numerical values to the heatmap.
    # The 'fmt='d'' parameter formats the values as integers in the heatmap.
    # The 'cmap='Blues'' parameter sets the color map to shades of blue.
    # The 'annot_kws={"size": 8}' parameter adjusts the font size for the annotations.

    # Set the x-axis label and its font size
    plt.xlabel('Predicted Labels', fontsize=10)

    # Set the y-axis label and its font size
    plt.ylabel('True Labels', fontsize=10)

    # Set the title of the plot and its font size
    plt.title('Confusion Matrix', fontsize=12)

    # Display the plot
    plt.show()

### Model 1: **Logistic Regression**

In [None]:
# Fitting model
logr_model = LogisticRegression()
# training the model
logr_model.fit(X_train, y_train)

In [None]:
# Train predictions
logr_train_pred = logr_model.predict(X_train)

In [None]:
# training set recall
logr_train_recall = recall_score(y_train,logr_train_pred)
logr_train_recall

In [None]:
# Test predictions
logr_test_pred = logr_model.predict(X_test)

In [None]:
# Test recall
logr_test_recall = recall_score(y_test,logr_test_pred)
logr_test_recall

In [None]:
evaluate_model(y_train,logr_train_pred,y_test,logr_test_pred,'Logistic regression')

In [None]:
plot_confusion_matrix(y_test,logr_test_pred,[True,False])

* The Logistic Regression model achieved an 80.6% training accuracy and 79.9% test accuracy.
* It demonstrated good precision with 86.3% for training and 84.8% for testing, indicating its ability to correctly identify positive cases.
* The recall values were 72.7% for training and 72.9% for testing, suggesting that it captures a reasonable portion of actual positive cases.
* The model's F1 scores were 78.9% for training and 78.4% for testing, which represent a balance between precision and recall.
* Its ROC AUC scores were 80.6% for training and 79.9% for testing, highlighting its ability to discriminate between positive and negative classes.

### Model 2: **Decision Tree**

In [None]:
# Fitting model
DT_model = DecisionTreeClassifier()
# training the model
DT_model.fit(X_train, y_train)

In [None]:
# Train predictions
DT_train_pred = DT_model.predict(X_train)

In [None]:
# training set recall
DT_train_recall = recall_score(y_train,DT_train_pred)
DT_train_recall

In [None]:
# Test predictions
DT_test_pred = DT_model.predict(X_test)

In [None]:
# Test recall
DT_test_recall = recall_score(y_test,DT_test_pred)
DT_test_recall

In [None]:
evaluate_model(y_train,DT_train_pred,y_test,DT_test_pred,'Decision Tree')

In [None]:
plot_confusion_matrix(y_test,DT_test_pred,[True,False])

* The Decision Tree model reached 100% training accuracy but 82.6% testing accuracy; the gap of about 17 points indicates overfitting.
* It achieved perfect precision of 100% on the training set and 81.5% on the testing set, indicating its ability to correctly classify positive instances.
* The model showed perfect recall of 100% in training and 84.3% in testing, indicating that it effectively identifies true positive cases.
* With a perfect F1 score of 100% in training and 82.9% in testing, it strikes a balance between precision and recall.
* The model displayed excellent discriminative power with perfect ROC AUC of 100% in training and 82.6% in testing.
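
A decision tree with default settings keeps splitting until the training data is fit exactly, which explains the 100% training scores. Limiting the depth is one common way to trade training fit for generalization. The sketch below shows the effect on synthetic data; `max_depth=4` is an arbitrary illustrative value, not a tuned setting for this project, and all names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data standing in for the project's features and target
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Unconstrained tree: grows until every training sample is classified correctly
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Depth-limited tree: max_depth=4 is an arbitrary illustrative value
small_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unconstrained", full_tree), ("max_depth=4", small_tree)]:
    print(f"{name}: train={tree.score(X_tr, y_tr):.3f}, test={tree.score(X_te, y_te):.3f}")
```

In practice the depth (and related settings such as `min_samples_leaf`) would be chosen by cross-validation, as is done with `GridSearchCV` for the later models.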

### Model 3: **Random Forest**

In [None]:
# Create a RandomForestClassifier object
random_forest = RandomForestClassifier()

# Define the grid of hyperparameters to search
param_grid = {
    'n_estimators': [50, 80,  100],  # Number of trees in the ensemble
    'max_features': ["log2", "sqrt"],  # Maximum number of features considered when splitting a node
    'max_depth': [10, 15, 20],  # Maximum number of levels allowed in each tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples necessary in a node to cause node splitting
    'min_samples_leaf': [1, 2, 4]  # Minimum number of samples required in a leaf node
}

# Create a GridSearchCV object
random_forest_cv = GridSearchCV(random_forest, param_grid=param_grid, scoring='roc_auc', cv=5)

# Fit the GridSearchCV object to the training dataset
random_forest_cv.fit(X_train, y_train)


In [None]:
RF_test_pred = random_forest_cv.predict(X_test)
RF_train_pred = random_forest_cv.predict(X_train)

In [None]:
evaluate_model(y_train,RF_train_pred,y_test,RF_test_pred,'Random Forest')

In [None]:
plot_confusion_matrix(y_test,RF_test_pred,[True,False])

* The Random Forest model achieved 100% training accuracy and 87.9% testing accuracy. The test score is the highest so far, but the 12-point gap to the training score indicates some overfitting.
* It showed perfect precision of 100% in training and 90.5% in testing, suggesting its ability to accurately classify positive instances.
* The model achieved perfect recall of 100% in training and 84.7% in testing, indicating it effectively identifies true positive cases.
* With a perfect F1 score of 100% in training and 87.5% in testing, it successfully balances precision and recall.
* The model's excellent discriminative power is evident with a perfect ROC AUC of 100% in training and 87.9% in testing.
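
The gap between training and test scores can be examined through the fitted search object itself. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` in place of the CHD data and a deliberately small grid. It shows how `best_params_`, `best_score_`, and the cross-validated training score can be read from a fitted `GridSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the notebook's data (assumption, not the CHD dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Small illustrative grid so the sketch runs quickly
grid = {"max_depth": [3, 10], "min_samples_leaf": [1, 4]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=grid,
    scoring="roc_auc",
    cv=5,
    return_train_score=True,  # also record scores on the training folds
)
search.fit(X_tr, y_tr)

best = search.best_index_
cv_train_auc = search.cv_results_["mean_train_score"][best]
cv_valid_auc = search.best_score_
# ROC AUC on held-out data, computed from predicted probabilities
test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])

print("Best parameters:", search.best_params_)
print(f"CV train AUC: {cv_train_auc:.3f}")
print(f"CV validation AUC: {cv_valid_auc:.3f}")
print(f"Test AUC: {test_auc:.3f}")
```

A large difference between the cross-validated training and validation scores suggests that the selected trees are too deep or the leaves too small. Adding shallower `max_depth` values to the grid is one way to test that.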

### Model 4: KNN (K-Nearest Neighbours)

In [None]:
# Create a KNeighborsClassifier object
knn = KNeighborsClassifier()

# Define the grid of hyperparameters to search
param_grid = {
    'n_neighbors': [3, 5, 7],  # Number of neighbors to use in the classification
    'weights': ['uniform', 'distance'],  # Weight function used in prediction
    'p': [1, 2]  # Power parameter for the Minkowski metric (1 for Manhattan distance, 2 for Euclidean distance)
}

# Create a GridSearchCV object
knn_cv = GridSearchCV(knn, param_grid=param_grid, scoring='roc_auc', cv=5)

# Fit the GridSearchCV object to the training dataset
knn_cv.fit(X_train, y_train)


In [None]:
KNN_test_pred = knn_cv.predict(X_test)
KNN_train_pred = knn_cv.predict(X_train)

In [None]:
evaluate_model(y_train,KNN_train_pred,y_test,KNN_test_pred,'KNN')

In [None]:
plot_confusion_matrix(y_test,KNN_test_pred,[True,False])


* The K-Nearest Neighbors (KNN) model achieved 100% training accuracy and 87.5% testing accuracy. The 12.5-point gap indicates that the model memorizes the training set, although test performance is still strong.
* Precision was 100% in training and 85.2% in testing, so about 15% of positive predictions on unseen data are false positives.
* Recall was 100% in training and 90.6% in testing, meaning the model identifies about 9 in 10 true positive cases on unseen data.
* The F1 score was 100% in training and 87.8% in testing, reflecting the balance between test precision and recall.
* ROC AUC was 100% in training and 87.5% in testing, indicating good discriminative power on unseen data.
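
A perfect training score is expected from KNN whenever `weights='distance'` is selected, because each training point is its own nearest neighbor at distance zero and therefore determines its own prediction. The notebook does not show which parameters the grid search selected, so this is one plausible explanation rather than a confirmed one. The sketch below illustrates the effect on synthetic data (an assumption, not the CHD dataset). It also wraps the classifier in a scaling pipeline, since KNN distances are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with some label noise (assumption, not the CHD dataset)
X, y = make_classification(n_samples=300, n_features=8, flip_y=0.1, random_state=0)

scores = {}
for weights in ("uniform", "distance"):
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, weights=weights),
    )
    model.fit(X, y)
    # Accuracy measured on the same data the model was fitted on
    scores[weights] = model.score(X, y)
    print(f"weights={weights!r}: training accuracy = {scores[weights]:.3f}")
```

With distance weighting, training accuracy says little about generalization, so the cross-validated and test scores are the more informative figures.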

### Model 5: SVM (Support Vector Machine)

In [None]:
# Create a Support Vector Machine object
svm_model = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True)

# Train the SVM model on the training data
svm_model.fit(X_train, y_train)


In [None]:
SVM_test_pred = svm_model.predict(X_test)
SVM_train_pred = svm_model.predict(X_train)

In [None]:
evaluate_model(y_train,SVM_train_pred,y_test,SVM_test_pred,'SVM')

In [None]:
plot_confusion_matrix(y_test,SVM_test_pred,[True,False])


* The Support Vector Machine (SVM) model reached an accuracy of 85.7% in training and 82.9% in testing. The small gap indicates stable generalization.
* Precision was 91.7% in training and 88.4% in testing, so most positive predictions are correct.
* Recall was 78.4% in training and 75.7% in testing, meaning roughly a quarter of true positive cases are missed. This is the model's weakest metric.
* The F1 score was 84.5% in training and 81.5% in testing, pulled down by the lower recall.
* The ROC AUC score was 85.7% in training and 82.9% in testing, indicating satisfactory discriminative power.
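
The reported ROC AUC matches the accuracy exactly, which is what happens when ROC AUC is computed from hard class predictions on balanced data. The definition of `evaluate_model` is outside this section, so this is an assumption. The sketch below, on synthetic data, shows that ROC AUC computed from `predict` output equals balanced accuracy, while ROC AUC computed from `decision_function` scores uses the full ranking of the test points.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's data (assumption, not the CHD dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)

hard_labels = svm.predict(X_te)            # 0/1 class predictions
margins = svm.decision_function(X_te)      # continuous scores

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, margins)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"ROC AUC from hard labels: {auc_from_labels:.3f}")
print(f"Balanced accuracy:        {bal_acc:.3f}")
print(f"ROC AUC from scores:      {auc_from_scores:.3f}")
```

If the notebook's metric is computed from labels, passing `decision_function` or `predict_proba` output instead would give a threshold-independent ROC AUC for every model.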

### Model 6: XGBoost

In [None]:
# Create an XGBClassifier object
xgb_model = xgb.XGBClassifier()

# Define the grid of hyperparameters to search (kept small to limit search time)
param_grid = {
    'n_estimators': [100, 200],  # Number of boosting rounds
    'learning_rate': [0.1],      # Step size shrinkage applied to each tree's contribution
    'max_depth': [3, 4],         # Maximum depth of each tree
    'subsample': [0.8, 0.9],     # Fraction of training rows sampled for each tree
    'min_child_weight': [1, 2]   # Minimum sum of instance weights needed in a child node
}

# Create a GridSearchCV object for XGBoost
xgb_cv = GridSearchCV(xgb_model, param_grid=param_grid, scoring='roc_auc', cv=5)

# Fit the GridSearchCV object to the training dataset
xgb_cv.fit(X_train, y_train)


In [None]:
xgb_test_pred = xgb_cv.predict(X_test)
xgb_train_pred = xgb_cv.predict(X_train)

In [None]:
evaluate_model(y_train,xgb_train_pred,y_test,xgb_test_pred,'XG boost')

In [None]:
plot_confusion_matrix(y_test,xgb_test_pred,[True,False])

* The XGBoost model reached a training accuracy of 94.5% and a testing accuracy of 88.0%. The 6.5-point gap indicates moderate overfitting, noticeably less than for Random Forest and KNN.
* Precision was 98.5% in training and 91.4% in testing, so fewer than 1 in 10 positive predictions on unseen data is a false positive.
* Recall was 90.5% in training and 83.9% in testing, meaning about 16% of true positive cases are missed on unseen data.
* The F1 score was 94.3% in training and 87.5% in testing, reflecting the balance between test precision and recall.
* The ROC AUC score was 94.5% in training and 88.0% in testing, indicating strong discriminative power.

### **Evaluation Results**

In [None]:
evaluation_results

Based on the evaluation results, we can observe that all models perform reasonably well, achieving high accuracy scores on both the training and test datasets. The models have also demonstrated good precision, recall, and F1 scores, indicating a balanced performance between positive and negative classes. However, certain differences stand out among the models.

* Logistic Regression: This model exhibits stable performance, with accuracy, precision, recall, and F1 scores consistently high across both training and test datasets. It seems to generalize well and avoid overfitting.

* Decision Tree: The model fits the training dataset perfectly but scores markedly lower on the test dataset (82.9% F1, 82.6% ROC AUC), a clear sign of overfitting. Nevertheless, it still provides good predictive power.

* Random Forest: Like the Decision Tree, the Random Forest fits the training data perfectly, but it overfits less severely and reaches 87.9% test accuracy with 90.5% test precision.

* K-Nearest Neighbors (KNN): The KNN model also shows signs of overfitting, with perfect accuracy on the training dataset. It still performs well on the test dataset (87.5% accuracy, 90.6% recall), making it a promising choice.

* Support Vector Machine (SVM): The SVM model demonstrates stable performance and generalization, with accuracy, precision, recall, and F1 scores at consistent levels between training and test datasets.

* XGBoost: The XGBoost model reaches 88.0% test accuracy and 91.4% test precision. Its 6.5-point gap between training and test accuracy is roughly half that of Random Forest and KNN, so it overfits less than either.

Considering the overall performance, XGBoost emerges as the best-performing model: it combines 88.0% test accuracy and ROC AUC with a smaller train-test gap than Random Forest (87.9% test accuracy) and KNN (87.5%), which score similarly on the test set but fit the training set perfectly. The margin over these two models is small, so XGBoost is preferred mainly for its more consistent behavior between training and test data.
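
The comparison can be made concrete by tabulating the gap between training and test accuracy. The sketch below uses only the accuracy figures quoted in the per-model notes for four of the models; the remaining models can be added the same way from `evaluation_results`. The dictionary layout is illustrative and does not reflect the structure of `evaluation_results`.

```python
# Accuracy figures (in %) as reported in the per-model notes
reported_accuracy = {
    "Random Forest": {"train": 100.0, "test": 87.9},
    "KNN": {"train": 100.0, "test": 87.5},
    "SVM": {"train": 85.7, "test": 82.9},
    "XGBoost": {"train": 94.5, "test": 88.0},
}

# Train-test gap in percentage points; a larger gap suggests more overfitting
gaps = {
    name: round(acc["train"] - acc["test"], 1)
    for name, acc in reported_accuracy.items()
}

# List models from smallest to largest gap
for name in sorted(gaps, key=gaps.get):
    test_acc = reported_accuracy[name]["test"]
    print(f"{name:<14} test={test_acc:.1f}%  gap={gaps[name]:.1f} pts")
```

SVM has the smallest gap but the lowest test accuracy of the four, while XGBoost pairs the highest test accuracy with the second-smallest gap.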

# **Conclusion**

1. Dataset Exploration: The dataset provided valuable information on various demographic, behavioral, and medical history factors related to heart health, enabling us to gain insights into CHD risk factors.

2. Data Preprocessing: We handled missing values, treated outliers, and encoded categorical variables to prepare the data for model training.

3. Model Selection: Six machine learning models (Logistic Regression, Decision Tree, Random Forest, KNN, SVM, and XGBoost) were evaluated to predict CHD risk.

4. Model Evaluation: The XGBoost model performed best overall, with 88.0% accuracy, 91.4% precision, 83.9% recall, an 87.5% F1 score, and 88.0% ROC AUC on the test dataset.

5. Interpretability: SHAP analysis provided interpretable insights into feature importance, enhancing our understanding of the impact of different factors on CHD risk.

6. Practical Implications: The developed XGBoost model can aid medical practitioners in identifying patients at higher risk of CHD and tailoring preventive interventions.

7. Public Health Significance: This project contributes to advancing knowledge on cardiovascular diseases and supports public health strategies to combat CHD.

8. Future Directions: Further research could explore additional features and incorporate longitudinal data to enhance predictive capabilities and identify potential new risk factors for CHD.
