<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:Purple;
           font-size:210%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
          color:white;
          text-align:center;"
          >
       WELCOME TO MY NOTEBOOK
</p>
</div>

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:Red;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
       Risk Factors for Cardiovascular Heart Disease
</p>
</div>

![](https://media.tenor.com/WNgt8c1nsxQAAAAM/waiting-for-an-answer-the-heart.gif)


This dataset offers an opportunity to delve into the factors that contribute to cardiovascular disease risk among adults. The primary objective is to gain insights into how specific demographic variables, health behaviors, and biological indicators influence the development of heart ailments.

To initiate the exploration, it's advisable to peruse the various data columns and acquaint oneself with their meanings. Each field provides distinct information pertaining to heart health:

- **Age**: Represents the age of the participant (integer).
- **Gender**: Indicates the gender of the participant (male/female).
- **Height**: Specifies the participant's height in centimeters (integer).
- **Weight**: Indicates the participant's weight in kilograms (integer).
- **Ap_hi**: Denotes the systolic blood pressure reading obtained from the patient (integer).
- **Ap_lo**: Signifies the diastolic blood pressure reading obtained from the patient (integer).
- **Cholesterol**: Reflects the total cholesterol level measured in mg/dL on a scale ranging from 0 to 5+ units (integer). Each unit signifies an increment or decrement of 20 mg/dL, respectively.
- **Gluc**: Represents the glucose level measured in mmol/l on a scale from 0 to 16+ units (integer). Each unit signifies an increase or decrease of 1 mmol/L, respectively.
- **Smoke**: Indicates whether the individual is a smoker or not (binary; 0 = No, 1 = Yes).
- **Alco**: Indicates whether the individual consumes alcohol or not (binary; 0 = No, 1 = Yes).
- **Active**: Indicates whether the individual engages in regular physical activity or not (binary; 0 = No, 1 = Yes).
- **Cardio**: Indicates whether the individual suffers from cardiovascular diseases or not (binary; 0 = No, 1 = Yes).

In summary, this dataset offers a comprehensive foundation for investigating the intricate interplay between these factors and their impact on cardiovascular health.
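
Before any analysis, it can help to confirm that the columns behave as described. The sketch below is only an illustration on a hypothetical three-row sample (the values are made up, not taken from the dataset): it checks that the binary flags contain only 0 and 1 and flags rows where the systolic reading is not above the diastolic one.

```python
import pandas as pd

# Hypothetical sample using the column names described above;
# the values are invented for illustration only.
sample = pd.DataFrame({
    "ap_hi":  [120, 140, 90],
    "ap_lo":  [80, 90, 120],   # last row is implausible: diastolic > systolic
    "smoke":  [0, 1, 0],
    "alco":   [0, 0, 1],
    "active": [1, 1, 0],
    "cardio": [0, 1, 1],
})

binary_cols = ["smoke", "alco", "active", "cardio"]

# Every binary column should contain only 0 and 1.
binary_ok = sample[binary_cols].isin([0, 1]).all()

# Systolic pressure is normally higher than diastolic pressure.
suspicious_bp = sample[sample["ap_hi"] <= sample["ap_lo"]]

print(binary_ok)
print(suspicious_bp)
```

The same two checks could be run on the full `dataframe` once it has been loaded.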

In [None]:
# Import all the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,VotingClassifier
from sklearn.metrics import accuracy_score,roc_auc_score,precision_score,recall_score,f1_score,ConfusionMatrixDisplay,classification_report


import warnings 
warnings.filterwarnings("ignore")

In [None]:
# Read the dataset
dataframe=pd.read_csv("/kaggle/input/exploring-risk-factors-for-cardiovascular-diseas/heart_data.csv")

In [None]:
# check the top 5 rows of data
dataframe.head()

In [None]:
# check the bottom 5 rows of the dataframe
dataframe.tail()

In [None]:
# check the shape of data
dataframe.shape

In [None]:
dataframe.info()

In [None]:
dataframe.describe()

In [None]:
# Drop the index and id columns, which carry no information about the target variable
dataframe.drop(["index", "id"], axis=1, inplace=True)

In [None]:
dataframe.shape

In [None]:
# Lets make the correlation heatmap
plt.figure(figsize=(10,8))
sns.heatmap(dataframe.corr(), annot=True, cmap="Purples", fmt=".2f")
plt.show()

In [None]:
# Check whether there are any null values in the dataset
dataframe.isna().sum()

In [None]:
# Check whether there are any duplicate rows in the dataset
dataframe.duplicated().sum()

In [None]:
# Remove the duplicate rows from the dataset
dataframe.drop_duplicates(inplace=True)

# Exploratory Data Analysis

# Univariate Analysis

In [None]:
# Check whether the dataset is balanced
dataframe["cardio"].value_counts()

# Here we can see that our dataset is balanced
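
One way to put a number on "balanced" is to look at class shares rather than raw counts. This is a minimal sketch on a made-up target column standing in for `dataframe["cardio"]`; the name `balance_ratio` is just an illustrative helper.

```python
import pandas as pd

# Made-up target column standing in for dataframe["cardio"].
cardio = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="cardio")

# normalize=True returns class shares instead of raw counts.
shares = cardio.value_counts(normalize=True)

# Ratio of the smaller class to the larger one: 1.0 means perfectly balanced.
balance_ratio = shares.min() / shares.max()

print(shares)
print(f"minority/majority ratio: {balance_ratio:.2f}")
```

A ratio close to 1 supports using plain accuracy as one of the metrics later on.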

In [None]:
dataframe["gender"].value_counts()

In [None]:
dataframe["smoke"].value_counts()

In [None]:
dataframe["alco"].value_counts()

In [None]:
dataframe["active"].value_counts()

In [None]:
dataframe["cholesterol"].value_counts()

In [None]:
dataframe["gluc"].value_counts()

In [None]:
color_list = ["red", "green", "blue", "violet", "purple", "grey", "orange"]
col_list = ["cardio", "gender", "smoke", "alco", "active", "cholesterol", "gluc"]
# Draw one count plot per categorical column, each in its own colour
for col, color in zip(col_list, color_list):
    plt.figure(figsize=(5, 5))
    sns.countplot(data=dataframe, x=col, color=color)
    plt.title(f"Data Distribution of {col}")
    plt.show()

# Let's see the distribution of the numerical columns

In [None]:
# Age values are in the tens of thousands (apparently recorded in days); dividing by 1000 rescales them to thousands of days
dataframe["age"]=dataframe["age"]/1000
dataframe["age"]
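
If the raw `age` column really is stored in days (an assumption based on the size of the values, not something the column list above states), an alternative is to express it in years, which makes the plots easier to read. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical ages in days (assumption: the raw column stores days).
age_days = pd.Series([18250, 21900, 14600], name="age")

# 365.25 accounts for leap years on average.
age_years = (age_days / 365.25).round(1)

print(age_years.tolist())
```

The rest of this notebook keeps the `/1000` scale, so an age of 18 on the plots corresponds to roughly 49 years under this assumption.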

# Bivariate Analysis

In [None]:
# Lets see the data distribution of age 
plt.figure(figsize=(7,7))
sns.histplot(x=dataframe["age"], kde=True, color="red", bins=20)
plt.title("Data Distribution of Age")
plt.show()

# Let's check the relationship of Cardio with Age

In [None]:
plt.figure(figsize=(8,8))
sns.histplot(data= dataframe, x=dataframe["age"], hue="cardio", multiple='stack',bins=20)
plt.title("Distribution of Age Vs Cardio")
plt.show()

> # Here we can see that as the age of the person increases, the risk of cardiovascular disease also increases.
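
A stacked histogram mixes group size with disease rate, so it can be worth computing the rate per age band directly. The sketch below uses made-up rows (not results from the dataset) on the notebook's rescaled age axis; the bin edges are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up rows; "age" is on the rescaled axis used in this notebook (raw age / 1000).
toy = pd.DataFrame({
    "age":    [14.5, 15.2, 16.8, 17.1, 18.9, 19.4, 21.0, 22.3],
    "cardio": [0,    0,    0,    1,    1,    0,    1,    1],
})

# Cut age into bands, then take the mean of the 0/1 target in each band.
toy["age_bin"] = pd.cut(toy["age"], bins=[14, 17, 20, 23])
rate_by_age = toy.groupby("age_bin", observed=True)["cardio"].mean()

print(rate_by_age)
```

Because `cardio` is coded 0/1, the mean per band is the share of participants with the disease in that band.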

In [None]:
# Lets check the outliers in the age column
plt.figure(figsize=(5,5))
sns.boxplot(data= dataframe, x=dataframe["cardio"], y=dataframe["age"])
plt.show()

> # This plot shows that the middle percentiles (the box) for participants with cardiovascular disease lie mostly above 18 on the rescaled age axis (raw age / 1000).

In [None]:
plt.figure(figsize=(8,8))
sns.histplot(data= dataframe, x=dataframe["weight"], hue="cardio", multiple='stack',kde=True, bins=20)
plt.title("Distribution of Weight Vs Cardio")
plt.show()

> # Here we can see that as the weight of the person increases, the risk of cardiovascular disease also increases.

In [None]:
plt.figure(figsize=(8,8))
sns.histplot(data= dataframe, x=dataframe["cholesterol"], hue="cardio", multiple='stack')
plt.title("Distribution of Cholesterol Vs Cardio")
plt.show()

> # Here we can notice that as the cholesterol level of the person increases, the risk of cardiovascular disease also increases.

In [None]:
plt.figure(figsize=(8,8))
sns.histplot(data= dataframe, x=dataframe["gluc"], hue="cardio", multiple='stack')
plt.title("Distribution of Glucose Vs Cardio")
plt.show()

> # Here we can notice that the higher the level of glucose, the higher the chances of cardiovascular disease.
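
For the categorical columns (`cholesterol`, `gluc`), a row-normalised cross-tabulation gives the share of `cardio` outcomes within each level, which is easier to compare than stacked bar heights. A minimal sketch on made-up rows (the level codes and values are illustrative only):

```python
import pandas as pd

# Made-up rows; the level codes 1-3 are placeholders for the real category values.
toy = pd.DataFrame({
    "cholesterol": [1, 1, 1, 1, 2, 2, 3, 3],
    "cardio":      [0, 0, 1, 0, 1, 0, 1, 1],
})

# normalize="index" makes every row sum to 1, so each cell is a within-level share.
table = pd.crosstab(toy["cholesterol"], toy["cardio"], normalize="index")

print(table)
```

Swapping `"cholesterol"` for `"gluc"` gives the equivalent table for glucose.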

# Split the data into Train and Test 

In [None]:
def train_test_split_data(dataframe,target,test_size, random_state):
    x_train,x_test, y_train, y_test= train_test_split(dataframe.drop([target], axis=1),
                                                      dataframe[target],
                                                      test_size=test_size,
                                                      random_state=random_state,
                                                      stratify=dataframe[target]
                                                      )
    
    return x_train,x_test, y_train, y_test

In [None]:
x_train,x_test, y_train, y_test= train_test_split_data(dataframe,target="cardio",test_size=0.3, random_state=42)

In [None]:
x_train.shape,x_test.shape, y_train.shape, y_test.shape

In [None]:
numerical_cols=["age","weight","height","ap_hi","ap_lo"]

# Let's detect and remove outliers in the training set using percentile cutoffs
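
Percentile-based outlier handling comes in two flavours: capping (winsorizing) replaces extreme values with the cutoff, while trimming drops the rows entirely. The function in the next cell trims. The sketch below contrasts the two on a made-up weight column; the 10th/90th cutoffs are chosen only so the effect is visible on eight values.

```python
import numpy as np
import pandas as pd

# Made-up weights with one low and one high extreme value.
weights = pd.Series([40, 62, 65, 70, 72, 75, 80, 200], name="weight")

# Percentile cutoffs (10th and 90th here, purely for illustration).
lower, upper = np.percentile(weights, [10, 90])

# Capping: extreme values are pulled in to the cutoffs, no rows are lost.
capped = weights.clip(lower=lower, upper=upper)

# Trimming: rows outside the cutoffs are removed.
trimmed = weights[(weights >= lower) & (weights <= upper)]

print(len(weights), len(capped), len(trimmed))
```

Capping keeps the sample size intact, which can matter when many columns are checked and a row only needs one extreme value to be dropped.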

In [None]:
def Winsorization_Method(columns, x_train, y_train , a, b):
    # Note: despite the name, this function trims (drops) rows that fall outside
    # the [a, b] percentile range instead of capping them.
    outlier_mask = pd.Series(False, index=x_train.index)

    for col in columns:
        lower = np.percentile(x_train[col], a)    # lower percentile cutoff
        upper = np.percentile(x_train[col], b)    # upper percentile cutoff

        # Flag a row if it is outside the cutoffs in any of the columns
        outlier_mask |= (x_train[col] < lower) | (x_train[col] > upper)

    ratio = round(outlier_mask.mean() * 100, 2)   # percentage of rows flagged as outliers

    # Remove the flagged rows from the training features and target
    x_train = x_train[~outlier_mask].copy()
    y_train = y_train[~outlier_mask].copy()

    return ratio, x_train, y_train

In [None]:
ratio, x_train, y_train= Winsorization_Method(numerical_cols, x_train, y_train , a=1, b=99)

In [None]:
print(f"Percentage of training rows removed as outliers: {ratio}%")

# Data Preprocessing
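
A scaler should learn its statistics from the training set only and then reuse them on the test set; refitting on the test set would put the two sets on different scales. The sketch below shows this with `RobustScaler` on made-up single-column arrays.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up single-feature training and test sets.
train = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])
test = np.array([[70.0], [110.0]])

scaler = RobustScaler()
train_scaled = scaler.fit_transform(train)  # learns the median and IQR from train
test_scaled = scaler.transform(test)        # reuses the training median and IQR

print(scaler.center_, scaler.scale_)
print(test_scaled.ravel())
```

`RobustScaler` subtracts the median and divides by the interquartile range, so it is less affected by extreme values than mean/standard-deviation scaling.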

In [None]:
# Scale the numerical values: fit on the training set only, then reuse those statistics on the test set
scaler=RobustScaler()
x_train[numerical_cols]= scaler.fit_transform(x_train[numerical_cols])
x_test[numerical_cols]= scaler.transform(x_test[numerical_cols])

In [None]:
x_train

In [None]:
x_test

# Lets do the Modelling
> # Voting Classifier
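
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below reproduces that average by hand on synthetic data; the two member models and the dataset are illustrative choices, not the configuration used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

lr = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=20, random_state=0)
voter = VotingClassifier([("lr", lr), ("rf", rf)], voting="soft").fit(X, y)

# Soft voting: average the per-class probabilities of the fitted members...
manual_proba = np.mean([est.predict_proba(X) for est in voter.estimators_], axis=0)

# ...and predict the class with the highest averaged probability.
manual_pred = voter.classes_[np.argmax(manual_proba, axis=1)]

print(np.allclose(manual_proba, voter.predict_proba(X)))
```

One consequence is that adding the same model twice to a soft-voting ensemble doubles its weight in the average.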

In [None]:
# Create the list to store the result
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
auc_roc_scores = []

# Create a Model
gbc= GradientBoostingClassifier(max_features=11, min_samples_split=5)
gbc1= GradientBoostingClassifier(max_features=11, min_samples_split=5)
rf=RandomForestClassifier(max_features=11, min_samples_leaf=10, n_estimators=300)
lr= LogisticRegression(penalty='l1' , solver='liblinear')


voting_clf= VotingClassifier(estimators=[('gradient_boosting', gbc),('gradient_boosting1', gbc1),('random_forest',rf),('logistic_regression',lr)], voting='soft', n_jobs=-1)
voting_clf.fit(x_train, y_train)
y_pred= voting_clf.predict(x_test)


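# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
accuracy_scores.append(accuracy_score(y_test, y_pred))
precision_scores.append(precision_score(y_test, y_pred))
recall_scores.append(recall_score(y_test, y_pred))
f1_scores.append(f1_score(y_test, y_pred))
# ROC AUC is computed from the predicted probability of the positive class
auc_roc_scores.append(roc_auc_score(y_test, voting_clf.predict_proba(x_test)[:, 1]))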


# Print the Results
print(f"Accuracy:{accuracy_scores}")
print(f"ROC AUC:{auc_roc_scores}")
print(f"Recall:{recall_scores}")
print(f"Precision:{precision_scores}")
print(f"F1-Score:{f1_scores}")


print("Classification_Report")
print("-----------------------")
print(classification_report(y_test,y_pred))
print("Confusion_Matrix")
print("----------------------")
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
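
The order of arguments matters for precision and recall: scikit-learn expects the true labels first and the predictions second. The sketch below, on made-up labels, shows that swapping the two arguments silently turns precision into recall.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 2 true positives, 2 false negatives, 1 false positive, 1 true negative.
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 0, 0, 0, 1]

precision = precision_score(y_true, y_hat)          # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_hat)                # TP / (TP + FN) = 2 / 4

# Swapping the arguments treats the predictions as the ground truth.
swapped_precision = precision_score(y_hat, y_true)  # equals the recall above

print(precision, recall, swapped_precision)
```

Accuracy and F1 are symmetric in the binary case, so only precision and recall change when the arguments are swapped.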

In [None]:
# create a result dataframe
model_names = ['VotingClassifier']
result_df=pd.DataFrame({"Accuracy": accuracy_scores, "Precision_Score":precision_scores, "Recall_Score":recall_scores, "F1_Score":f1_scores, "AUC_ROC_Score":auc_roc_scores}, index=model_names)
result_df

# Visualise the Results

In [None]:
result_df.T.sort_values(by="VotingClassifier", ascending=False).plot(kind="bar", grid=True, figsize=(5,5), color="blue").legend(bbox_to_anchor=(1.5,1));

# Simple Neural Network
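
As a rough picture of what a dense network computes, the sketch below runs a forward pass in NumPy through one ReLU hidden layer and a sigmoid output, then thresholds the probabilities at 0.5. The weights are random and the 11-feature input is an illustrative choice; this is not the Keras model trained in the next cell.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # ReLU keeps positive values and zeroes out negative ones.
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Sigmoid squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Four made-up rows with 11 features each.
X = rng.normal(size=(4, 11))

# Randomly initialised weights for a 11 -> 10 -> 1 network (untrained).
W1, b1 = rng.normal(scale=0.1, size=(11, 10)), np.zeros(10)
W2, b2 = rng.normal(scale=0.1, size=(10, 1)), np.zeros(1)

hidden = relu(X @ W1 + b1)                  # hidden layer activations
proba = sigmoid(hidden @ W2 + b2).ravel()   # probability of the positive class
labels = (proba >= 0.5).astype(int)         # hard 0/1 predictions

print(proba, labels)
```

Training adjusts `W1`, `b1`, `W2`, and `b2` so that `proba` moves towards the true labels; the Keras model does the same with more layers.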

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential



accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
auc_roc_scores = []


# create a model
model=Sequential()
model.add(Dense(10, input_dim=x_train.shape[1],activation="relu"))
model.add(Dense(100, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(200, activation="relu"))
model.add(Dense(300, activation="relu"))
model.add(Dense(400, activation="relu"))
model.add(Dense(1, activation="sigmoid"))         # Sigmoid in the output layer gives a probability for the binary target

# Compile the model
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),loss=tf.keras.losses.BinaryCrossentropy(), metrics=["accuracy"])


# Callback: stop training automatically once the training loss has not improved for 4 epochs
callback= tf.keras.callbacks.EarlyStopping(monitor="loss", patience=4)

# Fit the model
model.fit(x_train, y_train,epochs=500, batch_size=32, callbacks=[callback],verbose=1)

# Predictions 
y_pred= model.predict(x_test)                    # predicted probabilities, shape (n, 1)
y_preds= (y_pred.ravel() >= 0.5).astype(int)     # threshold at 0.5 to get 0/1 labels



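# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
accuracy_scores.append(accuracy_score(y_test, y_preds))
precision_scores.append(precision_score(y_test, y_preds))
recall_scores.append(recall_score(y_test, y_preds))
f1_scores.append(f1_score(y_test, y_preds))
# ROC AUC is computed from the predicted probabilities rather than the rounded labels
auc_roc_scores.append(roc_auc_score(y_test, np.ravel(y_pred)))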


# Print the Results
print(f"Accuracy:{accuracy_scores}")
print(f"ROC AUC:{auc_roc_scores}")
print(f"Recall:{recall_scores}")
print(f"Precision:{precision_scores}")
print(f"F1-Score:{f1_scores}")


print("Classification_Report")
print("-----------------------")
print(classification_report(y_test,y_preds))
print("Confusion_Matrix")
print("----------------------")
ConfusionMatrixDisplay.from_predictions(y_test, y_preds)
plt.show()

In [None]:
# create a result dataframe
model_names = ['SimpleNeuralNetwork']
result_df=pd.DataFrame({"Accuracy": accuracy_scores, "Precision_Score":precision_scores, "Recall_Score":recall_scores, "F1_Score":f1_scores, "AUC_ROC_Score":auc_roc_scores}, index=model_names)
result_df

In [None]:
result_df.T.sort_values(by="SimpleNeuralNetwork", ascending=False).plot(kind="bar", figsize=(5,5), color="green", grid=True).legend(bbox_to_anchor=(1.6,1));

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:Red;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
       Conclusion
</p>
</div>


> "If my notebook brings you value, a small gesture like upvoting on Kaggle would immensely boost my confidence. Your support fuels my motivation to contribute more. Thank you for being a part of this journey!"😊

![](https://i.gifer.com/6F8w.gif)