# <span style="color:Red;"> Project Name - Cardiovascular Risk Prediction </span>

The dataset comes from an ongoing cardiovascular study of residents of Framingham,
Massachusetts. The classification goal is to predict whether a patient is at risk of developing
coronary heart disease (CHD) within 10 years. The dataset provides patient information and
includes over 4,000 records and 15 attributes.

Variables: each attribute is a potential risk factor. The risk factors are demographic,
behavioral, and medical.

##### Project Type   - Classification Algorithm 
##### Contribution   - Individual

In [1]:
from IPython.display import Image
Image("Heart.jpeg") 

<IPython.core.display.Image object>

# **Data Description**

Demographic:

• Sex: male or female ("M" or "F")

• Age: age of the patient (Continuous: although the recorded ages have been truncated to whole numbers, the concept of age is continuous)

Behavioral:

• is_smoking: whether or not the patient is a current smoker ("YES" or "NO")

• Cigs Per Day: the average number of cigarettes the person smoked per day (can be considered continuous, since one can smoke any number of cigarettes, even half a cigarette)

Medical (history):

• BP Meds: whether or not the patient was on blood pressure medication (Nominal)

• Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)

• Prevalent Hyp: whether or not the patient was hypertensive (Nominal)

• Diabetes: whether or not the patient had diabetes (Nominal)

Medical (current):

• Tot Chol: total cholesterol level (Continuous)

• Sys BP: systolic blood pressure (Continuous)

• Dia BP: diastolic blood pressure (Continuous)

• BMI: Body Mass Index (Continuous)

• Heart Rate: heart rate (Continuous: in medical research, variables such as heart rate, though in fact discrete, are treated as continuous because of the large number of possible values)

• Glucose: glucose level (Continuous)

Predicted variable (desired target):

• 10-year risk of coronary heart disease, CHD (binary: “1” means “Yes”, “0” means “No”); this is the dependent variable (DV)

In [2]:
# Libraries used in this project

import pandas as pd  # Data analysis and manipulation
import numpy as np  # Numerical computing: arrays, linear algebra, random number generation
import seaborn as sns  # Statistical data visualization built on top of Matplotlib
import plotly.express as px  # Interactive visualizations that can be rendered in web browsers
import matplotlib.pyplot as plt  # Static visualizations
from sklearn.preprocessing import StandardScaler, MinMaxScaler  # Scaling feature data
from sklearn.impute import KNNImputer, SimpleImputer  # Imputing missing values

# Importing libraries for modelling and evaluation

from sklearn.tree import DecisionTreeClassifier  # Decision Tree Classifier for classification tasks
from sklearn.linear_model import LogisticRegression  # Logistic Regression for binary classification
from sklearn.neighbors import KNeighborsClassifier  # K-Nearest Neighbors for simple, instance-based learning
from sklearn.svm import SVC  # Support Vector Classifier for high-dimensional classification
from xgboost import XGBClassifier  # XGBoost Classifier for scalable and accurate gradient boosting
from sklearn.ensemble import RandomForestClassifier  # Random Forest Classifier for ensemble learning and robustness
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix, roc_auc_score  # Metrics for evaluating model performance

# Importing libraries for hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV  # Randomized Search for hyperparameter optimization
from sklearn.model_selection import GridSearchCV  # Grid Search for hyperparameter tuning
from sklearn.model_selection import train_test_split  # Splitting data into training and test sets
from sklearn.feature_selection import mutual_info_classif  # Feature selection using mutual information
from imblearn.over_sampling import SMOTE  # SMOTE for handling class imbalance via oversampling
import warnings  # Used to suppress unnecessary runtime warnings
warnings.filterwarnings('ignore')  # Ignore warnings for cleaner output

In [3]:
# Loading the dataset to the dataframe named "data_df"
data_df=pd.read_csv("data_cardiovascular_risk.csv")

## <span style ="color:red;"> Data Preprocessing </span>

In [4]:
data_df.head()  # Preview the first five rows along with the column names

Unnamed: 0,id,age,education,sex,is_smoking,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,0,64,2.0,F,YES,3.0,0.0,0,0,0,221.0,148.0,85.0,,90.0,80.0,1
1,1,36,4.0,M,NO,0.0,0.0,0,1,0,212.0,168.0,98.0,29.77,72.0,75.0,0
2,2,46,1.0,F,YES,10.0,0.0,0,0,0,250.0,116.0,71.0,20.35,88.0,94.0,0
3,3,50,1.0,M,YES,20.0,0.0,0,1,0,233.0,158.0,88.0,28.26,68.0,94.0,1
4,4,64,1.0,F,YES,30.0,0.0,0,0,0,241.0,136.5,85.0,26.42,70.0,77.0,0


In [5]:
data_df=data_df.drop(columns=["id"])  # The "id" column is not used in this analysis, so drop it

In [6]:
data_df

Unnamed: 0,age,education,sex,is_smoking,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,64,2.0,F,YES,3.0,0.0,0,0,0,221.0,148.0,85.0,,90.0,80.0,1
1,36,4.0,M,NO,0.0,0.0,0,1,0,212.0,168.0,98.0,29.77,72.0,75.0,0
2,46,1.0,F,YES,10.0,0.0,0,0,0,250.0,116.0,71.0,20.35,88.0,94.0,0
3,50,1.0,M,YES,20.0,0.0,0,1,0,233.0,158.0,88.0,28.26,68.0,94.0,1
4,64,1.0,F,YES,30.0,0.0,0,0,0,241.0,136.5,85.0,26.42,70.0,77.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3385,60,1.0,F,NO,0.0,0.0,0,0,0,261.0,123.5,79.0,29.28,70.0,103.0,0
3386,46,1.0,F,NO,0.0,0.0,0,0,0,199.0,102.0,56.0,21.96,80.0,84.0,0
3387,44,3.0,M,YES,3.0,0.0,0,1,0,352.0,164.0,119.0,28.92,73.0,72.0,1
3388,60,1.0,M,NO,0.0,,0,1,0,191.0,167.0,105.0,23.01,80.0,85.0,0


In [7]:
#Checking the dataset info for total data, columns and datatypes 
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3390 entries, 0 to 3389
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              3390 non-null   int64  
 1   education        3303 non-null   float64
 2   sex              3390 non-null   object 
 3   is_smoking       3390 non-null   object 
 4   cigsPerDay       3368 non-null   float64
 5   BPMeds           3346 non-null   float64
 6   prevalentStroke  3390 non-null   int64  
 7   prevalentHyp     3390 non-null   int64  
 8   diabetes         3390 non-null   int64  
 9   totChol          3352 non-null   float64
 10  sysBP            3390 non-null   float64
 11  diaBP            3390 non-null   float64
 12  BMI              3376 non-null   float64
 13  heartRate        3389 non-null   float64
 14  glucose          3086 non-null   float64
 15  TenYearCHD       3390 non-null   int64  
dtypes: float64(9), int64(5), object(2)
memory usage: 423.9+ KB


The dataset contains 7 categorical features, namely 'education', 'sex', 'is_smoking', 'BPMeds', 'prevalentStroke', 'prevalentHyp' and 'diabetes'.

The dataset contains 8 numerical features, namely 'age', 'totChol', 'cigsPerDay', 'sysBP', 'diaBP', 'BMI', 'heartRate' and 'glucose'.

Here the target feature is 'TenYearCHD'

In [8]:
# Lists of categorical and numerical feature names for later analysis
# (names must match the DataFrame columns exactly, including case)
cat_features = ["education", "sex", "is_smoking", "BPMeds", "prevalentStroke", "prevalentHyp", "diabetes"]
num_features = ["age", "totChol", "cigsPerDay", "sysBP", "diaBP", "BMI", "heartRate", "glucose"]
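Because pandas column labels are case-sensitive, a misspelled entry in hand-written feature lists like these usually only surfaces later as a `KeyError`. The following is a minimal sketch of an early check. The helper name `missing_columns` and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def missing_columns(df, names):
    """Return the entries of `names` that are not columns of `df` (case-sensitive)."""
    return [name for name in names if name not in df.columns]


# Toy frame standing in for data_df (hypothetical values)
toy_df = pd.DataFrame({"prevalentHyp": [0, 1], "diabetes": [0, 0]})

# A lowercase "h" does not match the real column name
bad = missing_columns(toy_df, ["prevalenthyp", "diabetes"])
print(bad)
```

In the notebook, the same check could be run as `missing_columns(data_df, cat_features + num_features)`, where an empty list means every name matches a column.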

In [9]:
# Summary statistics for the numerical features, transposed for readability
data_df[num_features].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,3390.0,49.542183,8.592878,32.0,42.0,49.0,56.0,70.0
totChol,3352.0,237.074284,45.24743,107.0,206.0,234.0,264.0,696.0
cigsPerDay,3368.0,9.069477,11.879078,0.0,0.0,0.0,20.0,70.0
sysBP,3390.0,132.60118,22.29203,83.5,117.0,128.5,144.0,295.0
diaBP,3390.0,82.883038,12.023581,48.0,74.5,82.0,90.0,142.5
BMI,3376.0,25.794964,4.115449,15.96,23.02,25.38,28.04,56.8
heartRate,3389.0,75.977279,11.971868,45.0,68.0,75.0,83.0,143.0
glucose,3086.0,82.08652,24.244753,40.0,71.0,78.0,87.0,394.0


## <span style = "color:red;"> Encoding Categorical Variables to Numerical Format </span>

In [10]:
#Label encoding categorical features for further analysis 
data_df["sex"]=np.where(data_df["sex"]=="M",1,0)
data_df["is_smoking"]=np.where(data_df["is_smoking"]=="YES",1,0)
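One caveat with `np.where(condition, 1, 0)` is that every value that fails the condition becomes 0, including unexpected labels such as a lowercase "m" or a missing value. The following is a hedged sketch of a stricter alternative using `Series.map`. The helper `encode_binary` and the toy series are hypothetical and are not part of the original analysis.

```python
import pandas as pd


def encode_binary(series, mapping):
    """Map labels to integers; raise if any value is not a key of `mapping`."""
    encoded = series.map(mapping)  # values outside the mapping become NaN
    unexpected = series[encoded.isna()].unique().tolist()
    if unexpected:
        raise ValueError(f"Unexpected labels: {unexpected}")
    return encoded.astype(int)


# Toy series standing in for data_df["sex"] (hypothetical values)
toy_sex = pd.Series(["M", "F", "F"])
encoded_sex = encode_binary(toy_sex, {"M": 1, "F": 0})
print(encoded_sex.tolist())
```

For this dataset, `info()` reports no nulls in 'sex' or 'is_smoking', so the simpler `np.where` calls give the same result when the labels are exactly "M"/"F" and "YES"/"NO". The stricter version only matters if the raw labels are inconsistent.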

## <span style ="color:red ;"> Null Value Handling </span>

In [12]:
# Checking the null value count in the data
data_df.isnull().sum()

age                  0
education           87
sex                  0
is_smoking           0
cigsPerDay          22
BPMeds              44
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             38
sysBP                0
diaBP                0
BMI                 14
heartRate            1
glucose            304
TenYearCHD           0
dtype: int64
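Raw counts are easier to judge relative to the number of rows. For example, 304 missing glucose values out of 3,390 rows is roughly 9%. The following is a small sketch that reports counts and percentages together. The helper `missing_summary` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns with at least one null."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with missing values, largest count first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for data_df (hypothetical values)
toy_df = pd.DataFrame({
    "glucose": [80.0, np.nan, np.nan, 75.0],
    "BMI": [np.nan, 29.77, 20.35, 28.26],
    "age": [64, 36, 46, 50],
})
summary = missing_summary(toy_df)
print(summary)
```

Calling `missing_summary(data_df)` in the notebook would list only the seven columns that contain nulls.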

In [None]:
# Implementing SimpleImputer on the categorical features