<a href="https://colab.research.google.com/github/arjun101sharma/Arjun-Sharma/blob/main/Classification_Cardiovascular_Risk_Prediction_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**  -  Cardiovascular Risk Prediction.

#### **Project Type**    - Classification
#### **Contribution**    - Individual
#### **Team Member 1**-  $\color{orange}{\text{Arjun Sharma}}$

# **Project Summary -**

**Data Preprocessing** :

1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Data Cleaning and Feature Engineering

**Exploratory data analysis(EDA) :**

1. Count of Presence or Absence of Cardiovascular Disease.

2. The number of individuals with/without cardiovascular disease, for both males and females.

3. Distribution of the number of people with/without cardiovascular disease, for males and females across age groups (count plot).

4. Distribution of heart rate for males and females across age groups (line plot).

5. Distribution of the number of people with/without cardiovascular disease, for males and females across age groups (line plot).

6. Age distribution of people with and without cardiovascular disease (violin plot).

7. Distribution of cholesterol for males and females across age groups (line plot).

8. Bar plot for cardiac risk disease for different cholesterol level.

9. Bar plot for cardiac risk disease based on daily cigarettes consumption.

10. Correlation Heatmap.

11. Pair Plot.

**Supervised Classification Machine learning algorithms and implementation :**

1. Logistic regression.

2. Decision Tree.

3. Random Forest.

4. SVM (Support Vector Machine).

5. Extreme Gradient Boosting (XGBoost).

6. Naive Bayes.

7. Neural Network.

8. Selection of best model.

Cardiovascular diseases (CVDs) are a leading cause of mortality and morbidity worldwide. Early prediction and identification of individuals at high risk of developing CVDs are crucial for effective preventive strategies and personalized healthcare interventions. This project aims to develop and evaluate a robust cardiovascular risk prediction model using advanced machine learning techniques.

Objectives:
The primary objective of this project is to create an accurate and reliable machine learning model capable of predicting an individual's risk of developing cardiovascular diseases within a specified time frame. Secondary objectives include identifying key risk factors, assessing the model's performance, and comparing it with existing risk assessment tools.

Methodology:

Data Collection: A diverse and comprehensive dataset containing demographic information, medical history, lifestyle factors, and clinical measurements will be collected from electronic health records, surveys, and relevant literature.

Feature Engineering: Relevant features will be selected and engineered from the dataset. Feature scaling, normalization, and handling missing values will be performed to ensure data quality.

Model Development: Various machine learning algorithms, including but not limited to logistic regression, random forests, support vector machines, and gradient boosting, will be employed to develop the risk prediction model. Hyperparameter tuning will be conducted to optimize model performance.

Validation and Evaluation: The model will be validated using cross-validation techniques and evaluated using metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC).

Comparison with Existing Models: The developed model's performance will be compared with established cardiovascular risk assessment tools, such as the Framingham Risk Score and the SCORE risk charts, to determine its superiority in accuracy and reliability.

Interpretability and Visualization: Model interpretability techniques, such as feature importance analysis and SHAP (SHapley Additive exPlanations) values, will be employed to gain insights into the factors driving the predictions. Visualization tools will be used to communicate these findings effectively.
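
As a rough illustration of the evaluation metrics named above, the sketch below computes accuracy, precision, recall, and F1-score directly from the confusion-matrix counts. The helper name `binary_metrics` and the label vectors are hypothetical and serve only as an example; they are not taken from the project's results.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Confusion-matrix cells
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels, for illustration only
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On an imbalanced target, recall and F1 are usually more informative than accuracy alone, because a model that always predicts the majority class can still score a high accuracy.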

Results and Implications:
The successful development of a robust cardiovascular risk prediction model holds several potential implications for healthcare and clinical practice:

Personalized Risk Assessment: The model can provide individuals with a personalized risk score, enabling them to take proactive steps to mitigate their risk factors and adopt healthier lifestyles.

Resource Allocation: Healthcare systems can allocate resources more efficiently by identifying individuals with high predicted risk, allowing targeted interventions and reducing the overall burden of CVDs.

Preventive Strategies: The model's identification of key risk factors can guide the development of more effective preventive strategies, including public health campaigns and early interventions.

Clinical Decision Support: Clinicians can utilize the model's predictions to make informed decisions about patient care, focusing on high-risk individuals who may benefit from closer monitoring and more aggressive interventions.

Conclusion:
The development of an accurate and reliable cardiovascular risk prediction model using advanced machine learning techniques has the potential to revolutionize the way we approach cardiovascular disease prevention and management. By leveraging comprehensive datasets and state-of-the-art algorithms, this project aims to provide a valuable tool for personalized healthcare and public health initiatives, ultimately contributing to the reduction of cardiovascular disease burden in the population.

# **GitHub Link -**

https://github.com/arjun101sharma/Arjun-Sharma

# **Problem Statement**


Cardiovascular diseases (CVDs) are a leading cause of mortality and morbidity worldwide. Early identification and accurate prediction of cardiovascular risk factors are crucial for effective prevention and intervention strategies. Traditional risk assessment models often rely on limited clinical parameters and may not provide personalized predictions.

The challenge lies in developing a robust and accurate cardiovascular risk prediction model that integrates a wide range of patient-specific data, including demographic information, medical history, lifestyle factors, and biomarker measurements. This model should surpass the limitations of existing approaches by leveraging advanced machine learning techniques to handle complex interactions between various risk factors and improve prediction accuracy.

Key Objectives:

Data Integration: Gather and preprocess diverse datasets encompassing patient demographics, medical records, genetic information, lifestyle habits, and relevant biomarker measurements.

Feature Engineering: Identify informative features and develop novel feature engineering strategies that capture intricate relationships between risk factors, potentially involving nonlinear interactions and higher-order correlations.

Model Development: Explore and implement state-of-the-art machine learning algorithms, such as deep learning, ensemble methods, and feature selection techniques, to construct a comprehensive risk prediction model.

Personalization: Create a model that generates personalized risk assessments, accounting for individual variations in genetics, lifestyle, and medical history. This should move beyond one-size-fits-all approaches.

Interpretability: Strive to achieve transparency and interpretability in the model's predictions, enabling medical professionals to understand the rationale behind risk assessments and facilitating patient-doctor communication.

Evaluation: Rigorously evaluate the developed model using appropriate performance metrics and benchmark it against existing risk prediction models. Cross-validation and external validation should be employed to assess the model's generalization ability.

Clinical Applicability: Ensure the model's feasibility for integration into clinical workflows. Develop user-friendly interfaces that allow healthcare providers to input patient data easily and obtain risk predictions promptly.

Ethical Considerations: Address potential biases in the data and model predictions, and implement measures to mitigate these biases. Ensure patient data privacy and compliance with relevant regulations (e.g., GDPR, HIPAA).

The successful completion of this project will yield a cutting-edge cardiovascular risk prediction model that contributes to more accurate and personalized preventive healthcare strategies. By providing clinicians and patients with actionable insights, this model has the potential to reduce the burden of cardiovascular diseases and improve overall public health.

# ***Let's Begin !***

## ***1. Know our Data***

### Import Libraries

In [None]:
# Import Libraries
# Importing pandas library for data manipulation and analysis
import pandas as pd

# Importing numpy library for numerical operations
import numpy as np

# Importing matplotlib for basic plotting and visualization
import matplotlib.pyplot as plt

# Importing rcParams for customizing plot parameters
from matplotlib import rcParams

# Importing SMOTE and SMOTETomek from imblearn for oversampling techniques
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

# Importing rainbow colormap for visualizations
from matplotlib.cm import rainbow

# Magic command to display plots inline in Jupyter notebooks
%matplotlib inline

# Importing warnings to manage and suppress warnings
import warnings

# Suppressing warnings for cleaner output
warnings.filterwarnings('ignore')

# Importing plotly express for statistical data visualization
import plotly.express as px

# Importing seaborn for statistical data visualization
import seaborn as sns

# Importing xgboost for gradient boosting algorithm
import xgboost as xgb

# Importing the standard-library math module
# (the `numpy.math` alias is deprecated and was removed in NumPy 2.0)
import math

# Importing train_test_split from sklearn for splitting data into training and testing sets
from sklearn.model_selection import train_test_split

# Importing StandardScaler from sklearn for feature scaling
from sklearn.preprocessing import StandardScaler

# Importing various metrics from sklearn for model evaluation
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, classification_report

# Importing RepeatedStratifiedKFold for cross-validation
from sklearn.model_selection import RepeatedStratifiedKFold

# Importing GridSearchCV and RandomizedSearchCV from sklearn for hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Importing classifiers from sklearn for modeling
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Importing XGBClassifier from xgboost for extreme gradient boosting
# Note: this rebinds the name `xgb` from the xgboost module (imported above) to the XGBClassifier class
from xgboost import XGBClassifier as xgb




### Dataset Loading

In [None]:
# This code snippet demonstrates the use of the chardet library to detect the character encoding of a given CSV file.
file = "/content/data_cardiovascular_risk.csv"
import chardet

# The file path is specified, and the chardet library is used to analyze the first 100,000 bytes of the file's raw binary data.
with open(file, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))

# The detected character encoding information is then stored in the 'result' variable, which can be used to determine the appropriate encoding for reading the file.
result

In [None]:
# Load Dataset
hd_df = pd.read_csv("/content/data_cardiovascular_risk.csv")

### Dataset First View

In [None]:
# Dataset First Look
# Dataset head Look
hd_df.head() # Display the first 5 rows of the DataFrame.


In [None]:
# Dataset tail Look
hd_df.tail()  # Display the last 5 rows of the DataFrame.

### Dataset Rows & Columns count

In [None]:
# Display the number of rows in the dataset using the 'shape' attribute
print(f'Number of rows in the data set is {hd_df.shape[0]}')

# Display the number of columns in the dataset using the 'shape' attribute
print(f'Number of Columns in the data set is {hd_df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
hd_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_counts = hd_df.duplicated(keep=False).sum()

# Display the count of duplicate rows
print("Number of duplicate rows:", duplicate_counts)
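
A small sketch of how the `keep` argument changes the duplicate count, using a hypothetical toy DataFrame (not the project data). With `keep=False`, every row that belongs to a duplicated group is counted; with the default `keep='first'`, only the repeated copies are counted.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({"age": [40, 52, 40, 61], "sysBP": [120, 135, 120, 150]})

# keep=False flags every member of a duplicated group
rows_in_duplicate_groups = int(toy.duplicated(keep=False).sum())

# default keep='first' flags only the second and later copies
extra_copies = int(toy.duplicated().sum())

print(rows_in_duplicate_groups, extra_copies)
```

Both counts are zero when the data has no duplicates, so either form works as a check; they differ only when duplicates exist.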

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count missing values in each column
missing_values_count = hd_df.isnull().sum()
print(missing_values_count)
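
Raw counts are easier to judge alongside the share of rows they represent. The following sketch, on a hypothetical toy DataFrame, builds a summary table with both the count and the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "glucose": [80.0, np.nan, 95.0, np.nan],
    "BMI": [22.5, 27.1, np.nan, 30.2],
    "age": [40, 52, 47, 61],
})

# isnull().mean() gives the fraction of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_percent": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```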

In [None]:
# Visualizing the missing values

# Setting the figure size for the visualization
plt.figure(figsize=(15, 5))

# Creating a heatmap to visualize missing values in the DataFrame
sns.heatmap(hd_df.isnull(), cmap='plasma', annot=False, yticklabels=False)

# Adding a title to the visualization
plt.title("Visualizing Missing Values")

# Displaying the visualization
plt.show()

### What did you know about your dataset?

**Answer Here.**
#### The dataset has 3390 rows and 17 columns.
#### Data types in the dataset: float64(9), int64(6), object(2).
#### There are no duplicate rows.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hd_df.columns

In [None]:
# Dataset Describe
hd_df.describe(include='all')

### Variables Description

Answer Here
#### **Age:**
The age of the individual is a significant factor in cardiovascular risk assessment, as the risk of CVD tends to increase with age.

#### **Previous Cardiovascular Events:**
A history of heart attack, stroke, or other cardiovascular events increases the risk of future events.
#### **Diabetes:**
Individuals with diabetes are at higher risk of CVD.
#### **Hypertension:**
High blood pressure is a major risk factor for CVD.
#### **Hyperlipidemia:**
Abnormal lipid levels can contribute to CVD risk.

#### **Cholesterol Levels:**
Levels of total cholesterol, HDL cholesterol (good cholesterol), LDL cholesterol (bad cholesterol), and triglycerides are important predictors of CVD risk.
#### **Body Mass Index (BMI):**
BMI is used as an indicator of obesity, which is associated with increased CVD risk.
#### **Smoking Status:**
 Whether the individual is a smoker or not is a significant risk factor for CVD.


#### **Demographic Information:**

#### **Gender:**
The patient's sex, indicated as either "M" for male or "F" for female.
#### **Age:**
 The patient's age, represented as a continuous variable. Although the recorded ages are rounded to whole numbers, age is actually a continuous concept.
#### **Education:**
The patient's level of education categorized as values 1, 2, 3, or 4.

#### **Behavioral Information:**

#### **Smoking Status:**
Indicates whether the patient is a current smoker, with values "YES" for yes and "NO" for no.
#### **Cigarettes Per Day:**
The average number of cigarettes smoked by the individual per day. This can be considered continuous, as any quantity, even a fraction such as half a cigarette, can be smoked.

#### **Medical History:**

#### **Blood Pressure Medication:**
Indicates whether the patient was taking blood pressure medication; a nominal (binary) variable.
#### **Previous Stroke:**
Indicates whether the patient had previously experienced a stroke; a nominal (binary) variable.
#### **Hypertension Status:**
Indicates whether the patient had hypertension (high blood pressure); a nominal (binary) variable.
#### **Diabetes Status:**
Indicates whether the patient had diabetes; a nominal (binary) variable.

#### **Current Medical Information:**

#### **Total Cholesterol:**
The patient's total cholesterol level, represented as a continuous variable.
#### **Systolic Blood Pressure:**
The patient's systolic blood pressure, represented as a continuous variable.
#### **Diastolic Blood Pressure:**
The patient's diastolic blood pressure, represented as a continuous variable.
#### **Body Mass Index (BMI):**
The patient's Body Mass Index, represented as a continuous variable.
#### **Heart Rate:**
The patient's heart rate, treated as a continuous variable. Although heart rate is discrete in reality, it is considered continuous in medical research due to its large range of possible values.
#### **Glucose Level:**
The patient's glucose level, represented as a continuous variable.

#### **Predictive Variable (Target):**

10-Year Risk of Coronary Heart Disease (CHD): A binary variable where "1" indicates a "Yes" for a 10-year risk of coronary heart disease, and "0" indicates a "No."

### Check Unique Values for each variable.


In [None]:
# Check Unique Values for each variable.
# Display unique values for each variable
for column in hd_df.columns:
    unique_values = hd_df[column].nunique()
    print(f"Number of Unique values for {column}:\n{unique_values}.")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Renaming columns in the hd_df DataFrame for improved readability and clarity.
# The new column names are chosen to provide more intuitive names and consistency.
hd_df.rename(columns={
    'sex': 'gender',
    'age': 'Age',
    'is_smoking': 'Weather_smoking_or_not',
    'cigsPerDay': 'Cigrets_smoked_per_day',
    'BPMeds': 'Weather_taking_BP_meds_or_not',
    'prevalentStroke': 'If_the_patient_has_a_history_of_stroke',
    'prevalentHyp': 'If_the_patient_has_a_history_of_hypertension',
    'diabetes': 'Patient_has_diabeties_or_not',
    'totChol': 'Cholesterol_measure',
    'sysBP': 'Systolic_BP_measure',
    'diaBP': 'Diastolic_BP_measure',
    'BMI': 'Body_Mass_Index',
    'heartRate': 'Heart_Rate_measure',
    'TenYearCHD': 'Presence_or_absence_of_cardiovascular_disease',
}, inplace=True)

In [None]:
# Convert 'gender' column values from 'M' and 'F' to 1 and 0 respectively
hd_df['gender'] = hd_df['gender'].replace({'M': 1, 'F': 0})

# Convert 'Weather_smoking_or_not' column values from 'YES' and 'NO' to 1 and 0 respectively
hd_df['Weather_smoking_or_not'] = hd_df['Weather_smoking_or_not'].replace({'YES': 1, 'NO': 0})

# Fill missing values in specific columns with their median values
hd_df['Cigrets_smoked_per_day'] = hd_df['Cigrets_smoked_per_day'].fillna(hd_df['Cigrets_smoked_per_day'].median())
hd_df['Body_Mass_Index'] = hd_df['Body_Mass_Index'].fillna(hd_df['Body_Mass_Index'].median())
hd_df['Heart_Rate_measure'] = hd_df['Heart_Rate_measure'].fillna(hd_df['Heart_Rate_measure'].median())
hd_df['Cholesterol_measure'] = hd_df['Cholesterol_measure'].fillna(hd_df['Cholesterol_measure'].median())
hd_df['glucose'] = hd_df['glucose'].fillna(hd_df['glucose'].median())
hd_df['education'] = hd_df['education'].fillna(hd_df['education'].median())
hd_df['Weather_taking_BP_meds_or_not'] = hd_df['Weather_taking_BP_meds_or_not'].fillna(hd_df['Weather_taking_BP_meds_or_not'].median())

# Convert all columns to integer data type
# (note: astype(int) truncates any fractional part, e.g. in continuous measurements such as BMI)
for var in hd_df.columns:
    hd_df[var] = hd_df[var].astype(int)

# Create a copy of the dataset and perform one-hot encoding on the 'education' column
data_set = hd_df.copy()
data_set = pd.get_dummies(data_set, columns=['education'])
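
To make the one-hot encoding step concrete, here is a minimal sketch on a hypothetical toy DataFrame. `pd.get_dummies` replaces the `education` column with one indicator column per level, so each row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical toy data: education levels coded as integers
toy = pd.DataFrame({"education": [1, 2, 4, 2], "Age": [40, 52, 47, 61]})

# One indicator column is created per distinct education level
encoded = pd.get_dummies(toy, columns=["education"])
indicator_columns = [c for c in encoded.columns if c.startswith("education_")]

print(encoded.columns.tolist())
print(encoded[indicator_columns].sum(axis=1).tolist())
```

Only levels that actually occur in the data get a column, which is why level 3 is absent in this toy example.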



In [None]:
# Create a list of continuous columns: those with more than 4 unique values
continuous_variable = []
for col in hd_df.columns:
    if hd_df[col].nunique() > 4:
        continuous_variable.append(col)

In [None]:
print(continuous_variable)
continuous_variable.remove('id')
print(continuous_variable)

In [None]:
# Write your code to make your dataset analysis ready.

#Checking for outliers

fig = plt.figure(figsize=(8,25))
c=1
for i in continuous_variable :
    plt.subplot(10, 4, c)
    plt.xlabel('Distribution of {}'.format(i))
    sns.boxplot(x=i,data=hd_df,color="tomato")
    c = c + 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

In [None]:
# Inspecting distributions and outliers using histograms with kernel density estimates (KDE)

# Set the figure size
plt.figure(figsize=(12, 10))

# Number of columns and rows for subplots
num_cols = 4
num_rows = len(continuous_variable) // num_cols + 1

# Iterate through continuous variables and create subplots
for index, variable in enumerate(continuous_variable, start=1):
    plt.subplot(num_rows, num_cols, index)

    # Plot a density histogram with a kernel density estimate (KDE) overlay
    sns.histplot(x=variable, data=hd_df, kde=True, color="skyblue", stat="density")

    # Set plot title and labels
    #plt.title(f'{variable}', fontsize=14)
    plt.xlabel(variable, fontsize=12)
    plt.ylabel('Density', fontsize=12)

    # Add mean and median lines for reference
    plt.axvline(hd_df[variable].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
    plt.axvline(hd_df[variable].median(), color='green', linestyle='dashed', linewidth=1, label='Median')

    # Show legend
    plt.legend()

# Adjust layout to prevent overlapping of subplots
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)

# Show the plots
plt.show()


In [None]:
# Box plot and distribution of each continuous variable before outlier treatment
# Set the figure size
plt.figure(figsize=(20, 15))

# Number of columns and rows for subplots
num_cols = 2
num_rows = len(continuous_variable)

# Iterate through continuous variables and create subplots
for index, variable in enumerate(continuous_variable, start=1):
    # Box Plot
    plt.subplot(num_rows, num_cols, index * 2 - 1)
    sns.boxplot(x=hd_df[variable], color="tomato")
    plt.xlabel('Boxplot of {}'.format(variable))

    # Distribution Plot with KDE
    plt.subplot(num_rows, num_cols, index * 2)
    sns.histplot(x=hd_df[variable], kde=True, color="skyblue", stat="density")

    # Set plot title and labels for distribution plot
    plt.title(f'Distribution of {variable}', fontsize=14)
    plt.xlabel(variable, fontsize=12)
    plt.ylabel('Density', fontsize=12)

    # Add mean and median lines for reference
    plt.axvline(hd_df[variable].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
    plt.axvline(hd_df[variable].median(), color='green', linestyle='dashed', linewidth=1, label='Median')

    # Show legend for distribution plot
    plt.legend()

# Adjust layout to prevent overlapping of subplots
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)

# Show the plots
plt.show()


In [None]:
# Managing extreme values: cap outliers at the IQR fences
for column in continuous_variable:
    Q_1, Q_3 = hd_df[column].quantile([0.25, 0.75])
    IQR = Q_3 - Q_1
    Lower_Limit = Q_1 - 1.5 * IQR
    Upper_Limit = Q_3 + 1.5 * IQR

    # Replace values beyond the fences with the nearest fence value
    hd_df[column] = np.where(hd_df[column] > Upper_Limit, Upper_Limit,
                             np.where(hd_df[column] < Lower_Limit, Lower_Limit, hd_df[column]))
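
The capping rule above can be checked by hand on a small example. The sketch below uses hypothetical values and `Series.clip`, which is an equivalent way to express the nested `np.where` calls: values above `Q3 + 1.5 * IQR` or below `Q1 - 1.5 * IQR` are replaced by the corresponding limit.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (100)
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100], dtype=float)

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# clip() caps values at the limits and leaves in-range values unchanged
capped = values.clip(lower=lower, upper=upper)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, lower={lower}, upper={upper}")
print(capped.tolist())
```

Capping keeps every row, unlike dropping outliers, but it also flattens genuinely extreme clinical readings, so the limits are worth reviewing per variable.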

In [None]:
# Box plot and distribution of each continuous variable after outlier treatment
# Set the figure size
plt.figure(figsize=(20, 15))

# Number of columns and rows for subplots
num_cols = 2
num_rows = len(continuous_variable)

# Iterate through continuous variables and create subplots
for index, variable in enumerate(continuous_variable, start=1):
    # Box Plot
    plt.subplot(num_rows, num_cols, index * 2 - 1)
    sns.boxplot(x=hd_df[variable], color="tomato")
    plt.xlabel('Boxplot of {}'.format(variable))

    # Distribution Plot with KDE
    plt.subplot(num_rows, num_cols, index * 2)
    sns.histplot(x=hd_df[variable], kde=True, color="skyblue", stat="density")

    # Set plot title and labels for distribution plot
    plt.title(f'Distribution of {variable}', fontsize=14)
    plt.xlabel(variable, fontsize=12)
    plt.ylabel('Density', fontsize=12)

    # Add mean and median lines for reference
    plt.axvline(hd_df[variable].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
    plt.axvline(hd_df[variable].median(), color='green', linestyle='dashed', linewidth=1, label='Median')

    # Show legend for distribution plot
    plt.legend()

# Adjust layout to prevent overlapping of subplots
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)

# Show the plots
plt.show()

In [None]:
hd_df.head()

In [None]:
hd_df.info()

### What all manipulations have you done and insights you found?

Answer Here.
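Renamed the columns to make them more descriptive and convenient to work with.
Encoded gender and smoking status as binary (0/1) values and one-hot encoded the `education` column in a copy of the dataset (`data_set`).
Filled missing values with column medians and capped outliers in the continuous variables at the IQR limits.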

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.
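I used median imputation: missing values in cigarettes per day, BMI, heart rate, cholesterol, glucose, education, and blood pressure medication were filled with the column median using `fillna`. The median is less sensitive to outliers than the mean, and imputing instead of dropping rows keeps every row of the dataset.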
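
For comparison, a minimal sketch contrasting median imputation (as in the `fillna` calls of the wrangling code) with dropping rows, on hypothetical toy data. It also shows why the median is often preferred over the mean when a column contains extreme values.

```python
import numpy as np
import pandas as pd

# Hypothetical glucose readings with one missing value and one extreme value
toy = pd.DataFrame({"glucose": [70.0, 80.0, np.nan, 90.0, 300.0]})

median_value = toy["glucose"].median()  # NaN is ignored; robust to the extreme value
mean_value = toy["glucose"].mean()      # pulled upward by the extreme value

imputed = toy["glucose"].fillna(median_value)  # keeps all rows
dropped = toy.dropna(subset=["glucose"])       # loses the row with the missing value

print(median_value, mean_value, len(imputed), len(dropped))
```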

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### Chart - 1
## Count of Presence or Absence of Cardiovascular Disease.


In [None]:
# Chart - 1 visualization code.

# Set the plotting style to 'darkgrid' for better aesthetics
sns.set_style('darkgrid')

# Set the size of the figure before plotting
# (calling plt.figure after plotting would open a new, empty figure)
plt.figure(figsize=(12, 6))

# Create a countplot using seaborn for the target variable with an appropriate color palette
ax = sns.countplot(x=hd_df['Presence_or_absence_of_cardiovascular_disease'], palette='summer')

# Set the label for the x-axis and adjust the font size and padding
plt.xlabel('Presence or Absence of Cardiovascular Disease', labelpad=15, size=14, fontweight='bold')

# Set the label for the y-axis and adjust the font size
plt.ylabel('Count', size=14, fontweight='bold')

# Set the title of the plot and adjust the font size
plt.title('Distribution of Cardiovascular Disease',  size=16, fontweight='bold')

# Improve visibility of x-axis labels
ax.set_xticklabels(['No Cardiovascular Disease', 'Cardiovascular Disease'], fontsize=12)

# Annotate each bar with its count to provide detailed information on the plot
for p in ax.patches:
    # Annotate the bar with count at an appropriate position and adjust font size
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=12, color='black', xytext=(0, 10),
                textcoords='offset points')

# Show the plot for visualization
plt.show()


##### **1. Why did you pick the specific chart?**

**Answer Here.**

A count plot shows the number of observations in each category as a bar, which makes it well suited to checking how balanced the two classes of the target variable are.

##### **2. What is/are the insight(s) found from the chart?**

#### **Answer Here.**

The chart shows that 511 of the 3390 individuals (15.1%) are classified as positive for 10-year CHD, whereas the remaining 2879 (84.9%) are classified as negative. The target classes are therefore imbalanced.
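
The percentages quoted above can be reproduced from the class counts. This sketch rebuilds a target column from the counts stated in the text (2879 negative, 511 positive) and uses `value_counts(normalize=True)`; the Series name `TenYearCHD` is used only for illustration.

```python
import pandas as pd

# Rebuild the target from the counts reported above: 2879 negative, 511 positive
target = pd.Series([0] * 2879 + [1] * 511, name="TenYearCHD")

# Share of each class, as a percentage rounded to one decimal place
class_percent = (target.value_counts(normalize=True) * 100).round(1)

# Number of negative cases per positive case
imbalance_ratio = 2879 / 511

print(class_percent)
print(round(imbalance_ratio, 2))
```

A ratio of more than five negatives per positive is the kind of imbalance that resampling methods such as SMOTE, imported at the top of the notebook, are meant to address.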


#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**

The insights from the chart show that 15.1% of the population has a positive classification for 10-year CHD, while 84.9% have a negative classification. This information can help businesses in the healthcare industry develop targeted strategies. There are no specific insights in the chart that indicate negative growth, but failure to address high CHD prevalence could have negative implications for public health and healthcare businesses.


## Chart-2
## The number of individuals with/without cardiovascular disease, for both males and females.



In [None]:
# Visualization code

# Set the seaborn style
sns.set_style('whitegrid')

# Create the countplot with appropriate color palette
plt.figure(figsize=(8, 6))  # Set the figure size for better visibility
ax = sns.countplot(data=hd_df, x='gender', hue='Presence_or_absence_of_cardiovascular_disease', palette='Set2')

# Annotate the bars with counts
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=12, color='black')

# Set labels and title with appropriate font size and style
plt.xlabel('Gender', size=14, fontweight='bold')
plt.ylabel('Count', size=14, fontweight='bold')
plt.title('Distribution of Cardiovascular Disease by Gender', size=16, fontweight='bold')

# Customize legend
legend_labels = ['No Cardiovascular Disease', 'Cardiovascular Disease']
plt.legend(title='Health Status', labels=legend_labels, title_fontsize='14', fontsize='12')

# Improve visibility of x-axis labels
# (gender is encoded as 0 = Female, 1 = Male, and categories are drawn in ascending order)
ax.set_xticklabels(['Female', 'Male'], fontsize=12)

# Improve grid visibility
ax.grid(visible=True, linestyle='--', linewidth=0.5)

# Add a horizontal line at y=0 for better reference
plt.axhline(y=0, color='black', linewidth=0.5, linestyle='--')

# Remove the plot spines, including the left and bottom ones
sns.despine(left=True, bottom=True)

# Show the plot
plt.show()


#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

We have chosen an appropriate chart type (grouped bar chart) and made several design decisions to ensure clarity and readability, making it an effective way to represent and compare the distribution of cardiovascular disease cases by gender.

#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

The heights of the bars represent the number of individuals in each group. Because `gender` is encoded as 0 for female and 1 for male, the first pair of bars corresponds to females and the second to males.
Out of 1923 females, 1684 do not have cardiovascular disease and 239 do (about 12.4%).
Out of 1467 males, 1195 do not have cardiovascular disease and 272 do (about 18.5%).

#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**


**Insights:**

The chart shows that more males (272) than females (239) have cardiovascular disease, even though the dataset contains fewer males. This suggests that cardiovascular disease is more prevalent among males in the given dataset.
The number of individuals without cardiovascular disease is higher for females than for males.
**Potential Positive Business Impact:**

**Targeted Marketing and Healthcare Services:** If a business operates in the healthcare industry, especially in areas related to cardiovascular health, this insight can guide targeted marketing efforts and the development of specialized healthcare services. For instance, screening and prevention programmes could be prioritized for men.

**Product Development:** Companies manufacturing products related to cardiovascular health might consider tailoring products to the group with the higher prevalence in the dataset. These products could include specialized medications, health supplements, or medical devices.

**Health Insurance and Risk Assessment:** Insurance companies could utilize this information to refine their risk assessment models, taking into account the higher likelihood of cardiovascular disease among males in this dataset.

**Potential Negative Impact:**

**Health Disparities:** If not addressed properly, this disparity in cardiovascular disease prevalence between genders could contribute to existing health disparities. Businesses and policymakers need to ensure that this information does not lead to discrimination in access to healthcare or insurance based on gender.

**Increased Healthcare Costs:** For businesses providing health insurance to their employees, a higher prevalence of cardiovascular disease, especially among males, could lead to increased healthcare costs. This might result in higher premiums for health insurance plans, affecting both employers and employees.



## Chart-3
## Distribution of number of people with/without cardiovascular disease of Males and Females for the different age groups in count plot??



In [None]:
# Visualization code

# Set the figure size for the visualization
rcParams['figure.figsize'] = 14, 6

# Create a count plot using Seaborn
sns.set(style="whitegrid")  # Set the style of the plot
plot = sns.countplot(x='Age', hue='Presence_or_absence_of_cardiovascular_disease', data=hd_df, palette="Set2")

# Customize the plot for better understanding
plt.title('Distribution of Cardiovascular Disease Across Age Groups', fontsize=16, fontweight='bold')  # Add a title to the plot
plt.xlabel('Age', fontsize=14, fontweight='bold')  # Label for X-axis
plt.ylabel('Count', fontsize=14, fontweight='bold')  # Label for Y-axis

# Rotate the existing x-axis tick labels (the actual ages) for better visibility.
# Passing plot.get_xticks() as labels would show tick positions 0, 1, 2, ... instead of ages.
plot.tick_params(axis='x', labelrotation=45, labelsize=12)

# Customize legend for better interpretation.
legend_labels = ['No Cardiovascular Disease', 'Cardiovascular Disease']
plt.legend(title='Presence or Absence of Cardiovascular Disease',labels=legend_labels, title_fontsize='x-large', fontsize='large')

# Add labels to the bars for detailed information
for p in plot.patches:
    height = p.get_height()
    if height > 0:  # skip empty bars and NaN heights
        plot.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height),
                      ha='center', va='center', xytext=(0, 10), textcoords='offset points')

# Show the plot
plt.show()


#### **1. Why did you pick the specific chart?**


#### **Answer Here.**
A Seaborn count plot is used to visualize the distribution of cardiovascular disease across different age groups. A count plot was chosen because it displays the number of observations in each age group while differentiating between individuals with cardiovascular disease and those without it.

#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

**Insights from the Chart:**

**Prevalence Across Age Groups:**

The chart provides a clear view of how the prevalence of cardiovascular disease varies across different age groups. It appears that the incidence of cardiovascular disease generally increases with age. This aligns with the common understanding that the risk of cardiovascular issues tends to rise as individuals grow older.

**Higher Risk in Older Age:**

There is a significant increase in the count of individuals with cardiovascular disease in the older age groups, indicating that older individuals are more likely to have cardiovascular issues compared to younger age groups.

**No Cardiovascular Disease Dominates in Younger Age:**

In the younger age groups, there are noticeably more individuals without cardiovascular disease compared to those with the disease. This suggests that cardiovascular problems are relatively uncommon in younger populations.
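
Counts alone depend on how many people were sampled at each age, so a rate per age band is a useful cross-check on the insights above. The sketch below uses a small made-up frame with the notebook's column names (`Age`, `Presence_or_absence_of_cardiovascular_disease`). The `toy_df` values and the 10-year band edges are assumptions for illustration only, not results from the dataset.

```python
import pandas as pd

# Toy data standing in for hd_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "Age": [35, 38, 42, 47, 51, 55, 58, 62, 66, 69],
    "Presence_or_absence_of_cardiovascular_disease": [0, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})

# Hypothetical 10-year age bands: [30, 40), [40, 50), [50, 60), [60, 70)
toy_df["Age band"] = pd.cut(toy_df["Age"], bins=[30, 40, 50, 60, 70], right=False)

# For a 0/1 target, the mean within a band is the share of people with the disease
target = "Presence_or_absence_of_cardiovascular_disease"
rate_by_band = toy_df.groupby("Age band", observed=False)[target].agg(["count", "mean"])
rate_by_band = rate_by_band.rename(columns={"count": "n", "mean": "disease_rate"})
print(rate_by_band)
```

Applied to `hd_df`, a table like this would show whether the share of cases, and not only the count, rises with age.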

#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**


Each bar represents the count of individuals in a specific age group, segmented by their cardiovascular disease status.

**Potential Positive Business Impacts:**


**Targeted Marketing and Product Development:** Understanding which age groups are more prone to cardiovascular diseases can aid businesses in creating targeted marketing campaigns for specific age demographics. Additionally, it can inform the development of products or services catering to the health needs of these age groups.

**Healthcare Services:** Hospitals, clinics, and healthcare providers can use this information to improve their services for specific age groups. They can offer preventive health check-ups and early screening programs for the age groups most at risk.

**Insurance and Wellness Programs:** Insurance companies can design specialized insurance plans or wellness programs targeting individuals in age groups with higher cardiovascular disease prevalence. This targeted approach can lead to better customer engagement and satisfaction.

**Research and Development:** Pharmaceutical companies and research institutions can utilize this information to focus their research efforts on developing medications or therapies that specifically address cardiovascular issues prevalent in certain age groups.

**Potential Negative Business Impacts:**

**Increased Healthcare Costs:** If the data shows a significant prevalence of cardiovascular diseases in a particular age group, it could indicate higher healthcare costs for insurance providers and individuals within that demographic.

**Decreased Workforce Productivity:** If a specific age group within the workforce is more affected by cardiovascular diseases, it might lead to decreased productivity, increased sick leaves, and potential early retirements, impacting businesses relying heavily on that demographic.

**Challenges in Insurance Industry:** Insurance companies might face challenges in providing affordable coverage to individuals in age groups with higher cardiovascular risks, potentially leading to a reduction in the number of insured individuals.

## Chart - 4
## Distribution of Heart rate measure of Males and Females for the different age groups in line plot??

In [None]:
# Visualization code

# Calculate the mean heart rate measure for males.
mean_heart_rate_male = hd_df[hd_df['gender'] == 'M']['Heart_Rate_measure'].mean()

# Print the mean heart rate measure for males.
print("Mean Heart Rate Measure for Males: {:.2f}".format(mean_heart_rate_male))

In [None]:
# Group and aggregate the data: mean heart rate per gender and age
heart_rate_data = hd_df.groupby(['gender','Age'])['Heart_Rate_measure'].mean().reset_index()

# Create pivot table (one column per gender, one row per age)
pivot_table = heart_rate_data.pivot(index='Age', columns='gender', values='Heart_Rate_measure')

# Create the plot
pivot_table.plot(kind='line', marker='o', figsize=(15, 5),color=sns.color_palette('bright',12))

# Set labels and title
plt.xlabel('Age',size=15, fontweight='bold')
plt.ylabel('Heart Rate measure',size=15, fontweight='bold')
plt.title('Mean Heart Rate Measure of Males and Females at Different Ages', size=15, fontweight='bold')

# Customize legend for better interpretation.
legend_labels = ['Females', 'Males']
plt.legend(title='Gender',labels=legend_labels, title_fontsize='x-large', fontsize='large')

# Display the plot
plt.grid(True)
plt.show()

#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

Line charts are excellent for comparing trends over a continuous variable (in this case, age). By plotting the heart rate measures over age, we can easily compare how the heart rate measures change for males and females as they age.

#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

The line chart plots the mean heart rate measure of males and females at each age:

**Trend of Heart Rate with Age:**

The chart shows how the average heart rate measure changes with age for both males and females. We can observe whether heart rate generally increases, decreases, or remains stable with age for each gender.

**Gender Differences:**

By comparing the lines for males and females, we can identify any significant differences in heart rate between genders across age groups. For example, if one gender consistently has a higher or lower heart rate than the other, it could indicate a gender-specific trend.

**Variability and Outliers:**

Because each point is a mean, the chart does not show the spread of individual heart rate measures within an age group. What it does show is how much the mean fluctuates from one age to the next. Sharp spikes or dips at a single age may reflect a small number of observations or unusual values at that age, and are worth checking against the underlying data.
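
To see the spread that a line of means hides, the same grouping can report the standard deviation and group size next to the mean. This is a minimal sketch on invented data that reuses the notebook's column names; `toy_df` and `summary` are hypothetical names, and the numbers are not from the dataset.

```python
import pandas as pd

# Toy data standing in for hd_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "M"],
    "Age": [40, 40, 50, 40, 40, 50, 50],
    "Heart_Rate_measure": [70.0, 80.0, 75.0, 60.0, 64.0, 72.0, 78.0],
})

# Mean (what the line chart plots), plus spread and group size per gender and age
summary = (
    toy_df.groupby(["gender", "Age"])["Heart_Rate_measure"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
print(summary)
```

Groups with a small `count` produce unstable means, and a group with a single observation has no defined standard deviation, so spikes at such ages should be read with caution.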

#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**

Analyzing this visualization can provide valuable insights into cardiovascular health patterns among different age groups and genders. The possible positive business impacts and negative growth insights are considered below.

**Potential Positive Business Impacts:**

**Healthcare Services and Products:**

**Targeted Services:** With insights into average heart rates across age and gender groups, healthcare providers can develop targeted services and products. For example, they could offer specialized health checkups or heart rate monitoring devices tailored for specific age and gender groups.

**Preventive Care:** Businesses can create preventive care programs and awareness campaigns aimed at specific demographics, encouraging healthier lifestyles and early detection of cardiovascular issues.

**Pharmaceuticals:**

**Medication Development:** Pharmaceutical companies can use this data to develop medications that are specifically tailored for certain age and gender groups, potentially leading to more effective treatments.

**Fitness and Wellness Industry:**

**Fitness Programs:** Gyms and fitness centers can design age and gender-specific fitness programs. Understanding average heart rates can help in creating workout routines that are safe and effective for different demographics.

**Insurance Companies:**

**Risk Assessment:** Insurance companies can refine their risk assessment models based on gender and age, potentially leading to more accurate underwriting and pricing strategies for health insurance policies.

**Potential Negative Growth Insights:**

**Healthcare Disparities:**

**Identifying Disparities:** If there are significant disparities in heart rates among different age and gender groups, it could indicate underlying health disparities in the population. This might require interventions to address these gaps in healthcare access and quality.

**Public Health Concerns:**

**Potential Health Issues:** Unusually high or low heart rates in specific demographic groups could indicate potential health issues or risk factors. If these issues are not addressed, it could lead to a negative impact on public health and, consequently, economic productivity.

**Medical Costs:**

**Increased Healthcare Costs:** If certain demographic groups show consistently higher heart rates, it might indicate a higher prevalence of cardiovascular diseases. This could lead to increased healthcare costs for both individuals and the healthcare system, potentially impacting economic growth.

In summary, the insights gained from analyzing the heart rate data can lead to positive business impacts by enabling targeted healthcare services, tailored products, and improved risk assessment strategies. However, it is crucial to address any negative growth insights promptly, focusing on interventions, awareness programs, and policies to mitigate disparities and potential health issues. Businesses and policymakers can collaborate to ensure that the insights derived from such data lead to positive outcomes for public health and the economy.







## Chart-5
## Distribution of number of people with/without cardiovascular disease of Males and Females for the different age groups in line plot??

In [None]:
# Visualization code

# Group and aggregate the data
data_grouped = hd_df.groupby(['gender','Age'])['Presence_or_absence_of_cardiovascular_disease'].sum().reset_index()

# Create pivot table
pivot_table = data_grouped.pivot(index='Age', columns='gender', values='Presence_or_absence_of_cardiovascular_disease')

# Create the plot
pivot_table.plot(kind='line', marker='o', figsize=(15, 5),color=sns.color_palette('bright',12))

# Set labels and title
plt.xlabel('Age',size=14,fontweight='bold')
plt.ylabel('Number of people having cardiovascular disease',size=12,fontweight='bold')
plt.title('Distribution of Males and Females having cardiovascular disease at different ages',size=14,fontweight='bold')


# Customize legend for better interpretation.
legend_labels = ['Females', 'Males']
plt.legend(title='Gender',labels=legend_labels, title_fontsize='x-large', fontsize='large')

# Display the plot
plt.grid(True)
plt.show()
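
The `legend_labels` list in the cell above is applied to the lines in the order of the pivot table's columns, so `['Females', 'Males']` is only correct if the female code sorts first in the `gender` column. The check below is a sketch on toy data, assuming `gender` is coded `'F'`/`'M'` as the Chart 4 filter `hd_df['gender'] == 'M'` suggests; the `toy_df` values are invented.

```python
import pandas as pd

# Toy data standing in for hd_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "Age": [40, 40, 40, 50, 50, 50],
    "Presence_or_absence_of_cardiovascular_disease": [1, 0, 1, 1, 0, 1],
})

target = "Presence_or_absence_of_cardiovascular_disease"

# Summing a 0/1 target counts the people with the disease in each group
data_grouped = toy_df.groupby(["gender", "Age"])[target].sum().reset_index()
pivot_table = data_grouped.pivot(index="Age", columns="gender", values=target)

# Columns come out sorted, so 'F' precedes 'M' and matches ['Females', 'Males']
print(list(pivot_table.columns))
print(pivot_table)
```

If the column were encoded differently (for example as 0/1), the label order would need to be checked against `pivot_table.columns` in the same way.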

#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

The line chart is an appropriate choice for visualizing the distribution of males and females with cardiovascular disease at different ages because it effectively communicates the trends, allows for easy comparison, and provides insights into how the disease prevalence varies with age for both genders.

#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

The line chart shows the number of males and females with cardiovascular disease at each age. Because it plots counts rather than rates, a peak can reflect either higher risk or simply more people of that age in the dataset.

**Age Profile of Cases:** The chart shows the ages at which the most cases are recorded for males and females. Peaks in the lines mark the ages that contribute the most cases.

**Gender Comparison:** By comparing the lines for males and females, you can observe whether the number of cases differs between genders at different ages. For example, if one line consistently stays above the other, it indicates a gender-specific trend in case counts.

**Age Range of Concern:** The chart can identify the specific age ranges where there is a notable increase or decrease in the number of people with cardiovascular diseases. This information is vital for healthcare professionals and policymakers to target specific age groups for preventive measures and interventions.

#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**

The chart is built by grouping the data by gender and age and counting the individuals with cardiovascular disease in each group.

**Potential Positive Business Impact:**

**Targeted Marketing and Prevention Programs:** If the chart shows that a particular age group and gender have a higher incidence of cardiovascular diseases, businesses and healthcare providers can tailor their marketing and prevention programs to target this demographic specifically. For example, they can promote relevant healthcare services, insurance plans, or wellness products to the at-risk population, potentially increasing sales and revenue.

**Product Development:** Insights into the age and gender groups most affected by cardiovascular diseases can lead to the development of new products or services catering to the specific needs of this demographic. This targeted approach can enhance customer satisfaction and loyalty.

**Healthcare Services Optimization:** Hospitals and healthcare providers can optimize their services based on the demographics most affected. For instance, if a specific age group and gender are found to have a higher incidence of cardiovascular diseases, hospitals can allocate resources, staff, and facilities accordingly to improve patient care.

**Research and Development:** Pharmaceutical companies and research institutions can focus their research efforts on developing medications and treatments that cater to the predominant demographic affected by cardiovascular diseases, potentially leading to breakthroughs and new revenue streams.

**Potential Negative Impact:**


**Increased Healthcare Costs:** If the chart indicates a rise in cardiovascular diseases among a particular demographic, it could lead to increased healthcare costs for both individuals and the government. Businesses might face challenges related to employee health insurance costs and reduced productivity due to employee health issues.

**Reputation Damage:** If a business is directly related to the healthcare industry and the insights from the chart reveal negative health trends, it could harm the business's reputation. Patients might avoid healthcare providers or services associated with high incidences of cardiovascular diseases.

**Decreased Workforce Productivity:** Cardiovascular diseases can lead to absenteeism and reduced productivity among employees. Businesses might face challenges in maintaining an efficient workforce, impacting overall productivity and profitability.

To draw specific conclusions about the positive or negative impact, it's crucial to analyze the actual chart and data. Understanding the trends, patterns, and their implications in the context of the business or industry is essential for making informed decisions that can lead to positive outcomes and mitigate potential negative impacts.

## Chart-6
## Age Distribution of people having Cardiovascular disease and not having Cardiovascular disease in violin plot??



In [None]:
# Visualization code

# Assuming hd_df is the DataFrame containing the data
# Split the rows by outcome: 1 = cardiovascular disease, 0 = no cardiovascular disease
cardio_disease_df = hd_df[hd_df['Presence_or_absence_of_cardiovascular_disease'] == 1]
non_cardio_disease_df = hd_df[hd_df['Presence_or_absence_of_cardiovascular_disease'] == 0]

In [None]:
# Set the style for the plots
sns.set(style="whitegrid")

# Create a figure with subplots
plt.figure(figsize=(10, 6))

# Plot 1: Age distribution in non_cardio_disease_df
plt.subplot(1, 2, 1)
sns.violinplot(x=non_cardio_disease_df.Age, color='orange', orient='h')
plt.title('Age Distribution in Non-Cardiovascular diseases Cases')
plt.xlabel('Age')
plt.ylabel('')

# Add statistical information if desired
# mean_age = hd_df.Age.mean()
# median_age = hd_df.Age.median()
# plt.axvline(mean_age, color='blue', linestyle='dashed', linewidth=2, label=f'Mean Age: {mean_age:.2f}')
# plt.axvline(median_age, color='green', linestyle='dashed', linewidth=2, label=f'Median Age: {median_age}')

# Print observations
print("Observations have been recorded mostly for people aged between 38 and 55 in non-cardiovascular cases.")

# Plot 2: Age distribution in cardio_disease_df
plt.subplot(1, 2, 2)
sns.violinplot(x=cardio_disease_df.Age, color='red', orient='h')
plt.title('Age Distribution in Cardiovascular Cases')
plt.xlabel('Age')
plt.ylabel('')

# Add statistical information if desired
# mean_age = cardio_disease_df.Age.mean()
# median_age = cardio_disease_df.Age.median()
# plt.axvline(mean_age, color='blue', linestyle='dashed', linewidth=2, label=f'Mean Age: {mean_age:.2f}')
# plt.axvline(median_age, color='green', linestyle='dashed', linewidth=2, label=f'Median Age: {median_age}')

# Print observations
print("Observations have been recorded mostly for people with cardiovascular disease between the ages of 48 and 65.")

# Adjust layout and show the plot
plt.tight_layout()
plt.show()


#### **1. Why did you pick the specific chart?**


#### **Answer Here.**


A violin plot is used to visualize the distribution of ages in two subsets of the dataset: cases without cardiovascular disease (`non_cardio_disease_df`) and cases with cardiovascular disease (`cardio_disease_df`). Violin plots display the distribution and probability density of numeric data across categories or groups, so the two panels make it easy to compare the age distribution of the two groups.

#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

The violin plots visualize the age distribution in two subsets: `non_cardio_disease_df` (non-cardiovascular cases) and `cardio_disease_df` (cardiovascular cases). The following insights can be drawn from the charts and the observations printed by the code:

**Non-Cardiovascular Cases (Left Plot - Orange Violin Plot):**

The age distribution for non-cardiovascular cases shows that observations have been recorded mostly for people aged between 38 and 55. This group represents individuals without cardiovascular disease.

**Cardiovascular Cases (Right Plot - Red Violin Plot):**

The age distribution for cardiovascular cases shows that observations have been recorded mostly for people aged between 48 and 65. This group represents individuals with cardiovascular disease.
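
Age ranges read off a violin plot can be cross-checked numerically with the quartiles of each group, which is what the box drawn inside each violin summarizes. This is a minimal sketch on invented data using the notebook's column names; `toy_df` and `age_quartiles` are hypothetical names and the values are not from the dataset.

```python
import pandas as pd

# Toy data standing in for hd_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "Age": [36, 40, 44, 48, 52, 50, 54, 58, 62, 66],
    "Presence_or_absence_of_cardiovascular_disease": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

target = "Presence_or_absence_of_cardiovascular_disease"

# 25th, 50th and 75th percentiles of age within each outcome group
age_quartiles = toy_df.groupby(target)["Age"].quantile([0.25, 0.5, 0.75]).unstack()
print(age_quartiles)
```

Running the same grouping on `hd_df` would give the interquartile range of age for each group, which can be compared with the ranges quoted above.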


#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**


The given code generates two violin plots comparing the age distribution between cases with cardiovascular disease and non-cardiovascular cases. Before discussing the positive and negative business impacts, let's analyze the insights gained from the plots and the provided observations:

**Non-Cardiovascular Cases (Left Plot):**

**Insight:** Most observations are recorded for people aged between 38 and 55 in non-cardiovascular cases.


**Potential Positive Impact:** Understanding the age group of people without cardiovascular disease can help in targeted marketing for preventive healthcare services, fitness products, and lifestyle-related interventions. Businesses can tailor their products and services to cater to the needs of this demographic, potentially leading to increased sales and positive business impact.

**Cardiovascular Cases (Right Plot):**

**Insight:** Observations have been recorded mostly for people with cardiovascular disease between the ages of 48 and 65.

**Potential Negative Impact:** This insight indicates a higher incidence of cardiovascular diseases in the age group of 48 to 65. From a business perspective, this could imply a potential market for pharmaceuticals, healthcare services, or specialized treatments targeting this age group. However, it also highlights a negative aspect: the prevalence of cardiovascular diseases in this age group, indicating a higher demand for healthcare services and potentially higher healthcare costs for individuals and insurance providers.

**Positive Business Impact:**

**Targeted Marketing:** Businesses can target the age group of 38-55 with products and services promoting healthy lifestyles, fitness equipment, and preventive healthcare services.

**Healthcare Industry Opportunities:** Pharmaceutical companies, healthcare providers, and insurance companies can develop specialized services and products for the age group of 48-65, creating business opportunities in the healthcare sector.

**Negative Impact:**

**Healthcare Costs:** The higher incidence of cardiovascular diseases in the age group of 48-65 indicates potential negative consequences for individuals and insurance providers due to increased healthcare costs. Businesses in the healthcare sector might face challenges in managing the costs associated with treating cardiovascular diseases.

Overall, while there are opportunities for businesses in the healthcare industry to address the specific needs of the identified age groups, the prevalence of cardiovascular diseases in the older age group also indicates potential challenges and increased costs in the healthcare sector. Businesses need to consider both the opportunities and challenges when strategizing their products and services, emphasizing preventive measures and innovative healthcare solutions to mitigate the negative impacts associated with the prevalence of cardiovascular diseases.


## Chart - 7
## Distribution of Cholesterol measure of Males and Females for the different age groups in line plot??



In [None]:
# Visualization code

# Group and aggregate the data: mean cholesterol per gender and age
cholesterol_data = hd_df.groupby(['gender','Age'])['Cholesterol_measure'].mean().reset_index()

# Create pivot table (one column per gender, one row per age)
pivot_table = cholesterol_data.pivot(index='Age', columns='gender', values='Cholesterol_measure')

# Create the plot
pivot_table.plot(kind='line', marker='o', figsize=(15, 5),color=sns.color_palette('bright',12))

# Set labels and title
plt.xlabel('Age',size=14,fontweight='bold')
plt.ylabel('Cholesterol measure',size=14,fontweight='bold')
plt.title('Mean Cholesterol Measure of Males and Females at Different Ages',size=14,fontweight='bold')

# Customize legend for better interpretation.
legend_labels = ['Females', 'Males']
plt.legend(title='Gender',labels=legend_labels, title_fontsize='x-large', fontsize='large')

# Display the plot
plt.grid(True)
plt.show()

#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

A line chart is used to visualize the mean cholesterol measure of males and females at different ages. It suits this data for two reasons:

**Ordered Axis:** Line charts are well suited to displaying data points along an ordered axis, here ascending age. The chart shows how cholesterol measures change with age for both males and females.

**Comparison of Trends:** Line charts are effective for comparing trends over a continuous variable (in this case, age) between different categories (males and females). The lines for males and females can be compared easily, allowing viewers to discern patterns and differences in cholesterol measures as age increases.

#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**


The line chart above compares the average cholesterol measures of males and females at different ages. The following insights can be derived from it:

**Cholesterol Trends with Age:**

The chart allows you to observe how average cholesterol measures change with age for both males and females. You can identify if there are specific age ranges where cholesterol levels tend to increase or decrease for either gender.

**Gender Differences:**

By comparing the lines representing males and females, you can determine if there are significant differences in cholesterol measures between the genders. For example, you might find that females have higher cholesterol levels than males in certain age groups or vice versa.

**Critical Age Groups:**

Peaks or valleys in the lines could indicate critical age groups where cholesterol levels spike or drop. Identifying these age groups is essential for understanding when individuals might be at higher risk for cardiovascular issues related to cholesterol.

**Gender Disparities:**

If there are consistent gaps between the lines, it suggests a consistent disparity in cholesterol levels between males and females across different ages. This insight could be crucial for health policymakers and practitioners to design targeted interventions.
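
Whether the gap between the two lines is consistent can be checked directly by subtracting one gender's mean from the other at each age. This is a sketch on invented numbers that reuses the notebook's column names and assumes `gender` is coded `'F'`/`'M'`; `toy_df`, `means` and `gap` are hypothetical names.

```python
import pandas as pd

# Toy data standing in for hd_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "Age": [40, 40, 50, 50, 60, 60],
    "Cholesterol_measure": [210.0, 220.0, 240.0, 235.0, 260.0, 240.0],
})

# Mean cholesterol per age and gender, one column per gender
means = toy_df.pivot_table(index="Age", columns="gender",
                           values="Cholesterol_measure", aggfunc="mean")

# Positive gap: females higher on average at that age; negative: males higher
gap = means["F"] - means["M"]
print(gap)
print("Share of ages where females are higher:", (gap > 0).mean())
```

A gap that keeps the same sign across most ages would support the reading of a consistent disparity, while a gap that changes sign points to a crossover between the two lines.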




#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**


The provided code generates a line chart showing the distribution of cholesterol measures for males and females at different ages. It calculates the average cholesterol measure for each gender at various ages and presents the data in a line chart format.

Regarding the impact on business decisions, the insights gained from this chart can indeed be valuable. Here's how:

**Positive Business Impact:**

**Targeted Marketing:** If there are age groups where both males and females consistently show higher cholesterol measures, companies related to healthcare, pharmaceuticals, or health insurance can target these demographics with relevant products and services. For example, they could market cholesterol-lowering medications, health supplements, or health check-up packages.

**Health and Wellness Programs:** Employers and health insurance companies could use this information to design targeted health and wellness programs. For instance, they could offer cholesterol screening and awareness campaigns for specific age groups, leading to healthier employees and reduced healthcare costs.

**Product Development:** Companies in the pharmaceutical sector might invest in research and development of new drugs or treatments for managing cholesterol, especially if certain age groups consistently show higher cholesterol levels.

**Potential Negative Growth:**


**Increased Healthcare Costs:** If there are specific age groups where both genders show a consistent rise in cholesterol measures, this could lead to increased healthcare costs. Insurance companies might see higher claims from individuals within these age groups, potentially impacting their profitability.

**Public Health Concerns:** From a broader perspective, consistently high cholesterol levels in specific age groups could indicate a public health concern. If not addressed, this might lead to an increase in the prevalence of cardiovascular diseases, which could strain healthcare resources and negatively impact overall societal health.

On the other hand, if the high cholesterol levels are due to genetic factors, businesses could focus on developing targeted medications and therapies to address these specific needs, turning a potential negative impact into a positive one by providing solutions to a pressing health issue.






## Chart-8
## Bar plot for cardiac risk disease for different cholesterol level.



In [None]:
# Visualization code

# Define the bin edges for categorization (left-inclusive because right=False below).
# Values below 113 or at/above 600 fall outside every bin and become NaN.
bins = [113, 200, 400, 600]  # Adjust the bin edges as needed

# Define the corresponding labels for the categories
labels = ['Normal', 'Above Normal', 'Well Above Normal']

# Use pd.cut to categorize the 'Cholesterol_measure' column
hd_df['Cholesterol Category'] = pd.cut(hd_df['Cholesterol_measure'], bins=bins, labels=labels, right=False)

# Group by 'Cholesterol Category'; the mean of the 0/1 target is the share with the disease
age_chol = (
    hd_df.groupby('Cholesterol Category', observed=False)['Presence_or_absence_of_cardiovascular_disease']
    .mean()
    .reset_index(name='Cardiac Disease Risk')
)

# Create a bar plot using Plotly Express
fig = px.bar(age_chol, x='Cholesterol Category', y='Cardiac Disease Risk',
             color='Cholesterol Category', title='Cardiac Disease Risk by Cholesterol Level')

# Customize the appearance of the plot
fig.update_traces(texttemplate='%{y:.2f}', textposition='outside')
fig.update_layout(title_text='Cardiac Disease Risk by Cholesterol Level', title_x=0.5)  # Centered title

# Show the plot
fig.show()


#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

The specific chart chosen for this analysis is a bar chart. This choice is appropriate for several reasons:

**Categorical Data:** The analysis involves categorizing cholesterol levels into discrete groups: 'Normal', 'Above Normal', and 'Well Above Normal'. Bar charts are excellent for displaying and comparing categorical data.

**Comparison of Categories:** Bar charts are great for comparing different categories or groups. In this case, the chart compares the risk of cardiovascular disease across different cholesterol categories.

**Easy Interpretation:** Bar charts are easy to understand and interpret, making them suitable for conveying the relationship between cholesterol levels and the risk of cardiovascular disease to a wide audience.

#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

The provided code categorizes individuals into different cholesterol levels (Normal, Above Normal, and Well Above Normal) based on the 'Cholesterol_measure' column in the dataset. It then calculates the mean of 'Presence_or_absence_of_cardiovascular_disease' for each cholesterol category and creates a bar chart using Plotly Express to visualize the cardiac disease risk associated with these cholesterol levels.
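
As a small illustration of the binning and averaging described above, the sketch below applies the same bin edges to a handful of made-up values. The toy DataFrame and its shortened column names (`cholesterol`, `disease`) are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy data; column names are shortened stand-ins for the notebook's columns
toy = pd.DataFrame({
    "cholesterol": [150, 199, 200, 250, 399, 400, 450],
    "disease":     [0,   0,   0,   1,   0,   1,   1],
})

bins = [113, 200, 400, 600]
labels = ["Normal", "Above Normal", "Well Above Normal"]

# right=False makes each interval closed on the left: [113, 200), [200, 400), [400, 600)
toy["category"] = pd.cut(toy["cholesterol"], bins=bins, labels=labels, right=False)

# The mean of a 0/1 target within a group is the share of positive cases in that group
risk = toy.groupby("category", observed=False)["disease"].mean()
print(risk)
```

Because the intervals are closed on the left, a value of exactly 200 is placed in 'Above Normal' rather than 'Normal'.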

**Insights from the Chart:**

**Cardiac Disease Risk Increases with Higher Cholesterol Levels:**

The chart likely shows that individuals with higher cholesterol levels (Above Normal and Well Above Normal) have a higher mean probability of having cardiovascular diseases compared to those with Normal cholesterol levels. This suggests a positive correlation between cholesterol levels and the risk of cardiovascular diseases.


**Clear Categorization of Risk Levels:**

By categorizing cholesterol levels into distinct groups (Normal, Above Normal, and Well Above Normal), the chart provides a clear and easy-to-understand representation of how different cholesterol levels relate to the risk of cardiovascular diseases. This categorization helps in identifying thresholds beyond which the risk significantly increases.

**Potential Threshold Identification:**

The chart might help in identifying specific cholesterol level thresholds where the risk of cardiovascular diseases starts to significantly rise. This information can be valuable for healthcare professionals to set guidelines for patients regarding cholesterol management.


#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**

Analyzing the impact of cholesterol levels on cardiovascular disease risk using the provided chart can indeed provide valuable insights for a business, especially if the business is related to healthcare, insurance, or wellness services. Let's evaluate potential positive and negative impacts based on the insights gained from the chart:

**Positive Business Impact:**

**Targeted Marketing and Product Development:** If the chart shows a clear trend that individuals with higher cholesterol levels have a significantly higher risk of cardiovascular diseases, businesses in the healthcare industry can target these individuals with specific products and services. This could include cholesterol-lowering medications, specialized diet plans, exercise programs, or regular health check-ups.

**Insurance and Health Services:** Insurance companies can use this information to refine their risk assessment models. Individuals with well above normal cholesterol levels might be charged higher premiums, creating a new revenue stream. Additionally, health services can be tailored to focus on preventive care for individuals with high cholesterol.

**Educational Campaigns:** Businesses can run educational campaigns to raise awareness about the risks associated with high cholesterol. This can lead to increased sales of health foods, gym memberships, and other wellness products and services.

**Partnerships and Collaborations:** Companies producing cholesterol-lowering medications or health foods might collaborate with healthcare providers to offer bundled services, creating a win-win situation for both businesses.

**Negative Business Impact:**


**Stigmatization Concerns:** Charging higher premiums or targeting specific products towards individuals with high cholesterol might lead to stigmatization or dissatisfaction among customers. This could harm the company's reputation and customer trust.

**Limited Market Scope:** Focusing solely on individuals with high cholesterol might limit the market scope. Businesses should be careful not to exclude individuals with normal cholesterol levels, as they also need healthcare products and services.

In summary, while the insights gained from the provided chart can lead to positive business impacts, there are potential negative consequences related to ethics, stigmatization, and regulatory compliance. Businesses must strike a balance between capitalizing on these insights and acting responsibly and ethically to ensure a positive impact on society and their bottom line.


## Chart - 9
## Bar plot for cardiac risk disease based on daily cigarettes consumption.




In [None]:
# Visualization code

# Define the bin edges for categorization
bins = [0, 1, 6, 15, 70]  # Adjust the bin edges as needed

# Define the corresponding labels for the categories
labels = ['Do not smoke','Normal', 'Above Normal', 'Well Above Normal']

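# Use pd.cut to categorize the 'Cigrets_smoked_per_day' column
# (right=False gives [0, 1), [1, 6), [6, 15), [15, 70), so a value of exactly 70 would become NaN)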
hd_df['Cigrets_smoked_per_day_category'] = pd.cut(hd_df['Cigrets_smoked_per_day'], bins=bins, labels=labels, right=False)

In [None]:
# Group the data by 'Cigarettes_smoked_per_day_category' and calculate the mean of 'Presence_or_absence_of_cardiovascular_disease'
cigarette_groups = hd_df.groupby('Cigrets_smoked_per_day_category')['Presence_or_absence_of_cardiovascular_disease'].mean()

# Create a DataFrame for visualization
cigarette_data = pd.DataFrame({
    'Cardiac Disease Risk': cigarette_groups.values,
    'Cigarettes Smoked per Day': ['Do not smoke(0-1)', 'Normal(1-6)', 'Above Normal(6-15)', 'Well Above Normal(15-70)']
})

# Create a bar plot using Plotly Express
fig = px.bar(cigarette_data,
             x='Cigarettes Smoked per Day',
             y='Cardiac Disease Risk',
             color='Cigarettes Smoked per Day')

# Annotate the bars with risk percentages
for i, row in cigarette_data.iterrows():
    risk_percentage = row['Cardiac Disease Risk']
    fig.add_annotation(
        x=row['Cigarettes Smoked per Day'],
        y=risk_percentage,
        text=f'Risk: {risk_percentage:.2%}',  # Display risk as percentage
        showarrow=True,
        arrowhead=2,
        arrowcolor='black'
    )

# Customize the layout
fig.update_layout(
    title_text='Cardiac Disease Risk Based on Daily Cigarette Consumption',  # Updated title
    title_x=0.5,  # Center the title
    xaxis_title='Number of Cigarettes Smoked per Day',
    yaxis_title='Cardiac Disease Risk',
    yaxis_tickformat='.0%'  # risk values are fractions in [0, 1]; format the ticks as percentages
)

# Show the plot
fig.show()


#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

The chart is a bar chart of cardiac disease risk by daily cigarette consumption. The x-axis shows the consumption categories ('Do not smoke', 'Normal', 'Above Normal', 'Well Above Normal'), the y-axis shows the cardiac disease risk (expressed as a percentage), and each bar represents one category. Annotations on the bars give the exact risk percentage for each category.

A bar chart is appropriate here for the following reason:

**Comparison of Discrete Categories:** Bar charts are well suited to comparing discrete categories or groups of data. Here the comparison is between levels of cigarette consumption, which makes a bar chart an intuitive choice.

#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

**Insights from the Chart:**

**Risk Variation with Cigarette Consumption:**

The chart provides a clear visual representation of how cardiac disease risk increases with the number of cigarettes smoked per day.
Individuals who do not smoke have the lowest cardiac disease risk, while those categorized as 'Well Above Normal' in daily cigarette consumption have the highest risk.

**Gradual Increase in Risk:**

There seems to be a gradual increase in cardiac disease risk as the number of cigarettes smoked per day rises. This suggests a dose-response relationship between smoking and the risk of cardiovascular diseases.


#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**

The code computes the mean of 'Presence_or_absence_of_cardiovascular_disease' for each category of cigarettes smoked per day and then visualizes the result using a bar chart. The chart aims to show the cardiac disease risk associated with different levels of daily cigarette consumption.

### Positive Business Impact:

1. **Smoking Cessation Programs:** If the analysis shows a significant increase in cardiac disease risk for individuals who smoke more, this insight can be used by businesses and health organizations to emphasize the importance of smoking cessation programs. Companies offering such programs can target heavy smokers and potentially reduce the risk of cardiovascular diseases among their employees.

2. **Health Insurance and Wellness Programs:** Insurance companies and employers providing health insurance can utilize this information to encourage healthier lifestyles among policyholders and employees. This insight might lead to the development of wellness programs specifically tailored towards reducing smoking habits, potentially leading to better health outcomes and lower insurance claims related to cardiovascular diseases.

3. **Pharmaceutical and Healthcare Industries:** Pharmaceutical companies might find this data valuable for developing smoking cessation medications. Additionally, healthcare providers can use this information to counsel patients about the health risks associated with smoking, potentially leading to increased patient engagement and compliance with prescribed treatments.

### Negative Business Impact:

1. **Increased Healthcare Costs:** If a large number of individuals in a workforce or customer base are heavy smokers, there might be an increase in healthcare costs due to a higher prevalence of cardiovascular diseases. This could impact businesses providing health insurance to their employees or customers, leading to higher premiums or increased claims, potentially affecting the company's bottom line negatively.

2. **Reduced Workforce Productivity:** Employees with cardiovascular diseases might experience reduced productivity due to health-related issues, leading to increased absenteeism or presenteeism (being present at work but not fully productive). This can impact the overall productivity and efficiency of a business.

3. **Potential Legal and Ethical Issues:** In some jurisdictions, businesses might face legal or ethical challenges if they are aware of the health risks among their employees (such as high rates of cardiovascular diseases due to smoking) but do not take appropriate actions to mitigate these risks, such as providing smoking cessation programs or health education.

In summary, the insights gained from the analysis can have both positive and negative impacts on businesses. Utilizing these insights to promote healthier lifestyles, develop targeted interventions, and encourage smoking cessation can lead to positive outcomes. However, ignoring these insights or failing to address the associated risks might result in negative consequences, such as increased healthcare costs and reduced workforce productivity. Business strategies should be designed with a focus on employee well-being and health to mitigate potential negative impacts.


### **Chart - 10**
### **Correlation Heatmap**


In [None]:
# Visualization code

# Columns to drop from the DataFrame for correlation analysis
columns_to_drop = ['id',]

# Create a new DataFrame by dropping specified columns
filtered_df = hd_df.drop(columns=columns_to_drop)

# Calculate the correlation matrix for the numeric columns of the filtered DataFrame
# (the category columns created with pd.cut above are not numeric and are excluded)
correlation_matrix = filtered_df.select_dtypes(include='number').corr()

# Set the size of the heatmap figure
plt.figure(figsize=(12, 10))

# Plot a heatmap of the pairwise correlations
# Use Seaborn's heatmap function ('sns.heatmap') with annotations and a specific color map ('RdYlGn')
heatmap = sns.heatmap(correlation_matrix, annot=True, cmap='RdYlGn')

# Set the title and labels for the heatmap
heatmap.set_title('Correlation Heatmap', fontsize=18)
plt.xlabel('Features', fontsize=14)
plt.ylabel('Features', fontsize=14)

# Display the heatmap
plt.show()


#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

A correlation heatmap visualizes the pairwise correlations between the features in the dataset. It is commonly used in data analysis to understand how variables are related to each other.

#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**


The heatmap highlights the pairwise relationships between the features. Correlation heatmaps are well suited to datasets with many features.

**Strength of Correlation:**

The heatmap shows how strongly different features are correlated with each other. Positive values (closer to 1) indicate a positive correlation, while negative values (closer to -1) indicate a negative correlation.

**Feature Relationships:**

Features with a strong positive correlation (close to 1) tend to increase or decrease together.

#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**

The correlation matrix through a heatmap visualization can provide valuable insights into relationships between different variables in the dataset. Positive correlations (values close to 1) indicate that as one variable increases, the other variable tends to increase as well. Negative correlations (values close to -1) indicate that as one variable increases, the other variable tends to decrease. Correlations close to 0 suggest a weak or no linear relationship between variables.
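
The three cases described above can be sketched with synthetic data. The variables below are generated for illustration only and are not taken from the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(size=500)

y_pos = 2 * x + 0.1 * noise     # moves with x
y_neg = -2 * x + 0.1 * noise    # moves against x
y_none = noise                  # generated independently of x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the Pearson coefficient
for name, y in [("positive", y_pos), ("negative", y_neg), ("none", y_none)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{name:>8}: r = {r:+.2f}")
```

The Pearson coefficient only measures linear association, so a value near 0 does not rule out a non-linear relationship.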

**Positive Business Impacts:**

**Identifying Positive Correlations:**

Positive correlations between lifestyle factors (e.g., physical activity, healthy diet) and cardiovascular health indicators (e.g., lower cholesterol levels, healthy blood pressure) could suggest that promoting healthy lifestyles among customers might lead to positive health outcomes. Businesses in the health and wellness industry could leverage this insight for marketing strategies, promoting fitness products, and health-related services.



## **Chart -11**
## **Pair Plot**

In [None]:
# Pair Plot
sns.pairplot(hd_df, hue="Presence_or_absence_of_cardiovascular_disease")
plt.show()

#### **1. Why did you pick the specific chart?**

**Answer Here.**

A pairplot, often referred to as a scatterplot matrix, serves as a visualization method enabling the examination of connections between every pair of variables within a dataset. This visualization tool proves valuable for data exploration as it provides a swift and comprehensive means to discern the interrelationships among all variables present in the dataset.

Therefore, we employed a pair plot to scrutinize data patterns and unveil the associations among different features. It essentially accomplishes the same purpose as a correlation map but does so through graphical representation, allowing for a visual exploration of the data's intricate relationships.

#### **2. What is/are the insight(s) found from the chart?**

**Answer Here.**

The distribution of the 'Cigrets_smoked_per_day' column is strongly skewed and includes a large number of zero values. It may therefore be advisable to convert this column into a categorical one.

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

####  **Null Hypothesis:-** The level of education does not appear to be linked to the outcome of coronary heart disease (CHD).
####  **Alternate Hypothesis:-** A correlation exists between the level of education and the outcome of coronary heart disease (CHD).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value.
from scipy.stats import chi2_contingency

# create contingency table.
contingency_table = pd.crosstab(hd_df['education'] , hd_df['Presence_or_absence_of_cardiovascular_disease'])

# Perform chi-squared test.
chi1,p,dof,expected = chi2_contingency(contingency_table)

# print contingency table.
print(contingency_table)

# print p-value.
print(f'p-value: {p}')

#### The p-value is well below 0.05, leading us to reject the null hypothesis decisively.

#### Which statistical test have you done to obtain P-Value?

To investigate whether the 'education' column is associated with the occurrence of coronary heart disease (CHD), I conducted a chi-squared test of independence. This test assesses whether there is a statistically meaningful connection between the level of education and the presence of CHD. By computing the chi-squared statistic and examining the associated p-value, I could draw a statistical conclusion regarding the relationship between these two variables within our dataset.

##### Why did you choose the specific statistical test?

I opted for the chi-squared test of independence to examine whether the 'education' column is associated with the occurrence of coronary heart disease (CHD). This test is suitable for assessing whether two categorical variables are related. In our case, both education levels and CHD outcomes are categorical variables, making the chi-squared test an appropriate method.

The chi-squared test operates by comparing the actual frequency distribution of data in a contingency table with the expected frequency distribution, assuming the null hypothesis is true. If there is a noteworthy disparity between the observed and expected frequencies, it implies a relationship between the two variables.
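
The comparison of observed and expected frequencies can be worked through by hand on a small table. The 2x2 counts below are hypothetical and chosen only to make the arithmetic easy to follow; `correction=False` is passed so that SciPy's result matches the plain Pearson statistic.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (rows: two education levels, columns: no CHD / CHD)
observed = np.array([[90, 10],
                     [60, 40]])

# Expected count under independence: row_total * column_total / grand_total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# correction=False disables Yates' continuity correction, which SciPy applies to 2x2 tables by default
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print(expected_manual)
print(chi2_manual, chi2, dof, p)
```

Here every expected count is 75 or 25, so the statistic is 2 × (15² / 75) + 2 × (15² / 25) = 24 with one degree of freedom.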

In summary, I selected the chi-squared test of independence due to its widespread use and established reputation for examining the link between two categorical variables. This allowed me to draw a statistical inference regarding the relationship between education level and CHD outcome within our dataset.


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Already handled

In [None]:
# Check the count of missing values in each column of the 'hd_df' DataFrame
missing_value_counts = hd_df.isnull().sum()
print(missing_value_counts)

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments.
# Already handled.

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.
I used the interquartile range (IQR) method to identify and remove outliers in the continuous columns of the dataset. I chose this technique because it is a robust way to detect outliers: the quartiles it relies on are not affected by extreme values. The IQR is the difference between the 75th and 25th percentiles of the data, and any value that falls below the 25th percentile minus 1.5 times the IQR, or above the 75th percentile plus 1.5 times the IQR, is considered an outlier. This method let me identify and remove outliers in a consistent and objective manner.
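
The IQR rule can be sketched as follows. The helper name `iqr_bounds` and the sample values are hypothetical, and the notebook's own outlier-handling cell (run earlier) may differ in detail.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious extreme
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]

print(lower, upper)
print(outliers.tolist())
```

Removing the flagged rows is only one option; clipping values to the fences keeps the row count unchanged at the cost of altering the extreme values.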

### 3. Categorical Encoding

In [None]:
# Find the mode of the 'education' column in the hd_df DataFrame.
# mode() returns a Series holding the most frequently occurring value(s).
v = hd_df['education'].mode()

# Calculate the median of the 'Cigrets_smoked_per_day' column in the hd_df DataFrame.
# The median is the middle value of a dataset when it is ordered from smallest to largest.
s = hd_df['Cigrets_smoked_per_day'].median()

# Convert the values in the Series 'v' to integers using the 'astype' method.
v = v.astype(int)

# Convert the scalar median 's' to an integer.
s = int(s)

In [None]:
# Transform your categorical columns into a suitable format for computational analysis.
hd_df = pd.get_dummies(hd_df, columns=['education'])

In [None]:
hd_df_copy=hd_df.copy()

In [None]:
hd_df_copy.rename(columns={'Weather_taking_BP_meds_or_not':'On_BP_meds',
 'Presence_or_absence_of_cardiovascular_disease':'cardiovascular_disease',
 'Patient_has_diabeties_or_not':'Diabeties',
 'If_the_patient_has_a_history_of_hypertension':'Had_Hypertension',
 'Weather_smoking_or_not':'Person_smokes',
 'If_the_patient_has_a_history_of_stroke':'Had_stroke'},
                inplace=True)

In [None]:
categorical_columns=list(set(hd_df.columns)-set(continuous_variable))

In [None]:
categorical_variables=list(set(hd_df_copy.columns)-set(continuous_variable))

In [None]:
type(categorical_variables)

In [None]:
# List of items we want to remove from the categorical_variables list
items_to_remove = ['diff_sys_dis', 'Cholesterol Category', 'Cigrets_smoked_per_day_category','id']
# Using list comprehension to remove specified items from the list
categorical_variables = [col for col in categorical_variables if col not in items_to_remove]

In [None]:
categorical_variables

In [None]:
# Set the size of the plot.
plt.figure(figsize=(15, 10))

# Define the number of rows and columns for subplots.
rows = 4
cols = 3
count = 1

# List of binary categorical columns to be visualized.
categorical_variables

# Labels for binary categories.
Sex_label = ['Females', 'Males']
Labels = ['No','Yes']
# Loop through each binary categorical variable and create a subplot.
for idx, var in enumerate(categorical_variables):
    # Create subplots.
    plt.subplot(rows, cols, count)
    if var == 'gender':
        hd_df_copy[var].value_counts().plot.pie(autopct='%1.1f%%', fontsize=12, labels=Sex_label, colors=['skyblue', 'lightcoral'])
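    else:
        # Generate a pie chart for the current categorical variable.
        # sort_index() orders the counts as 0, 1 so that 'No' and 'Yes' land on the matching slices;
        # value_counts() alone orders by frequency, which would swap the labels when 1 is the majority.
        hd_df_copy[var].value_counts().sort_index().plot.pie(autopct='%1.1f%%', fontsize=12, labels=Labels, colors=['skyblue', 'lightcoral'])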

    # Increase the count for the next subplot.
    count += 1

# Adjust layout to prevent overlapping of subplots.
plt.tight_layout()

# Display the plots.
plt.show()


#### What all categorical encoding techniques have you used & why did you use those techniques?

#### **Answer Here.**
#### One-hot encoding is used to encode the education column. All the remaining categorical columns are binary (0/1), so they need no encoding.
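
A minimal sketch of one-hot encoding with `pd.get_dummies` on a toy DataFrame (the values are made up). Passing `dtype=int` is an optional choice made here so the indicator columns hold 0/1 rather than booleans; the notebook's cell does not set it.

```python
import pandas as pd

# Toy data: 'education' has several levels, 'gender' is already binary
toy = pd.DataFrame({"education": [1, 2, 3, 2], "gender": [0, 1, 1, 0]})

# Each education level becomes its own indicator column; 'gender' is left untouched
encoded = pd.get_dummies(toy, columns=["education"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, which is what distinguishes one-hot encoding from assigning the levels arbitrary numeric codes.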

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Refine features to reduce the correlation between them and generate novel features.
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_varnc_inflasn_factr(df):
  varnc_inflasn_factr = pd.DataFrame()
  varnc_inflasn_factr['colmns_nam'] = df.columns
  varnc_inflasn_factr['varnc_inflasn_factr'] = [variance_inflation_factor(df.values,i) for i in range(df.shape[1])]

  return(varnc_inflasn_factr)

In [None]:
# Choose your features thoughtfully to prevent overfitting.
cont_var_df = pd.DataFrame(hd_df[continuous_variable])
cont_var_df

In [None]:
calculate_varnc_inflasn_factr(hd_df[[c for c in cont_var_df]])

In [None]:
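# Creating a new column 'diff_sys_dis': half the difference between systolic and diastolic BP
# (i.e. half the pulse pressure). The two BP columns themselves are dropped in a later cell.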
hd_df['diff_sys_dis'] = (hd_df['Systolic_BP_measure']-hd_df['Diastolic_BP_measure'])/2

In [None]:
# columns
hd_df.columns

In [None]:
new_column = hd_df['diff_sys_dis']

# Concatenate the new column to the DataFrame of continuous variables
cont_var_df = pd.concat([cont_var_df, new_column], axis=1)

In [None]:
# Updating the continuous_var list

continuous_variable.remove('Systolic_BP_measure')
continuous_variable.remove('Diastolic_BP_measure')
continuous_variable.append('diff_sys_dis')

#### 2. Feature Selection

In [None]:
categorical_columns

In [None]:
# Columns to be removed from the list
columns_to_remove = ['id', 'Weather_smoking_or_not', 'Cholesterol Category', 'Cigrets_smoked_per_day_category']

# Remove specified columns from the list of categorical columns
categorical_columns = [column for column in categorical_columns if column not in columns_to_remove]


In [None]:
hd_df.drop(columns=columns_to_remove,inplace=True)

In [None]:
#dropping excess columns
hd_df.drop(columns=['Systolic_BP_measure','Diastolic_BP_measure'],axis=1,inplace=True)

In [None]:
cont_var_df = pd.DataFrame(hd_df[continuous_variable])

In [None]:
calculate_varnc_inflasn_factr(hd_df[[c for c in cont_var_df]])

In [None]:
corr = hd_df[continuous_variable].corr()
plt.figure(figsize=(25,10))
sns.heatmap(corr,annot=True, cmap=plt.cm.Accent_r)

##### What all feature selection methods have you used  and why?

We used the variance inflation factor (VIF) to detect multicollinearity. Systolic and diastolic blood pressure both showed elevated VIF values, so we replaced them with a single derived variable, diff_sys_dis, computed from the difference between the two.

Furthermore, the "is smoking" column (Weather_smoking_or_not) contained only binary values indicating smoking status (yes or no). The same information is already conveyed by the "cigs per day" column (Cigrets_smoked_per_day), where a value of 0 denotes a non-smoker and any other value gives the daily cigarette consumption, so the binary column was dropped.
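
To show why combining two correlated columns lowers the VIF, here is a sketch on synthetic data. The `vif` helper is a hypothetical NumPy implementation of the textbook definition, 1 / (1 - R²), with an intercept in each auxiliary regression; statsmodels' `variance_inflation_factor` does not add an intercept on its own, so its numbers on raw columns will generally differ from these.

```python
import numpy as np

def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
# Synthetic stand-ins for blood pressure and age (assumption: not the notebook's data)
diastolic = rng.normal(80, 10, size=1000)
systolic = diastolic + rng.normal(40, 5, size=1000)  # strongly tied to diastolic
age = rng.normal(50, 8, size=1000)

before = vif(np.column_stack([systolic, diastolic, age]))
after = vif(np.column_stack([systolic - diastolic, age]))

print("before:", before.round(2))
print("after: ", after.round(2))
```

In this synthetic setup the two pressure columns inflate each other's VIF, while their difference is unrelated to the remaining column and its VIF stays close to 1.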

##### Which all features you found important and why?


The key columns include 'age', 'sex', 'cigarettes per day', 'blood pressure medications', 'history of stroke', 'prevalent hypertension', 'diabetes status', 'total cholesterol level', 'body mass index (BMI)', 'heart rate', 'glucose level', 'ten-year coronary heart disease risk', 'education level 1', 'education level 2', 'education level 3', 'education level 4', and 'pulse pressure'. These columns encompass demographic details, behavioral patterns, existing medical conditions, and historical health data.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Skewness of each continuous column before any transformation.
(hd_df[continuous_variable]).skew(axis=0)

In [None]:
# Skewness when applying the square root transformation.
np.sqrt(hd_df[continuous_variable]).skew(axis=0)

In [None]:
# Skewness after applying a logarithm base 10 transformation.
np.log10(hd_df[continuous_variable] + 1).skew(axis=0)

In [None]:
# Applying logarithmic or square-root transformations to the continuous variables.

# Apply log transformation to the 'Age' column using np.log10
hd_df['Age'] = np.log10(hd_df['Age'] + 1)

# Apply square root transformation to the 'Cigrets_smoked_per_day' column using np.sqrt
hd_df['Cigrets_smoked_per_day'] = np.sqrt(hd_df['Cigrets_smoked_per_day'])

# Apply log transformation to the 'Cholesterol_measure' column using np.log10
hd_df['Cholesterol_measure'] = np.log10(hd_df['Cholesterol_measure'] + 1)

# Apply square root transformation to the 'Body_Mass_Index' column using np.sqrt
hd_df['Body_Mass_Index'] = np.sqrt(hd_df['Body_Mass_Index'])

# Apply log transformation to the 'Heart_Rate_measure' column using np.log10
hd_df['Heart_Rate_measure'] = np.log10(hd_df['Heart_Rate_measure'] + 1)

# Apply square root transformation to the 'glucose' column using np.sqrt
hd_df['glucose'] = np.sqrt(hd_df['glucose'])

In [None]:
# Assessing the skewness after applying the logarithmic and square-root transformations.
hd_df[continuous_variable].skew(axis=0)

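Yes. Several continuous columns were skewed, so the data needed to be transformed.
We applied a log10(x + 1) transformation to Age, Cholesterol_measure, and Heart_Rate_measure, and a square-root transformation to Cigrets_smoked_per_day, Body_Mass_Index, and glucose, to reduce the skewness of their distributions.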
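
The effect of these transformations on skewness can be illustrated on a synthetic right-skewed sample. The lognormal data below is an assumption chosen for illustration; the notebook's columns will show different numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed sample (lognormal), not taken from the dataset
raw = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=5000))

# Compare the sample skewness of the raw values with the two transformations used above
skews = pd.Series({
    "raw": raw.skew(),
    "sqrt": np.sqrt(raw).skew(),
    "log10(x + 1)": np.log10(raw + 1).skew(),
})
print(skews.round(2))
```

For data like this the logarithm compresses the long right tail more strongly than the square root, which is one reason to pick the transformation per column by comparing the resulting skewness.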

### 6. Data Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Assuming 'continuous_variable' is a list of column names representing continuous variables in hd_df

# Create an instance of the MinMaxScaler class
scaler = MinMaxScaler()
features = [i for i in hd_df.columns if i not in ['Presence_or_absence_of_cardiovascular_disease']]

In [None]:
features

In [None]:
continuous_variable

In [None]:
hd_df[continuous_variable] = scaler.fit_transform(hd_df[continuous_variable])

In [None]:
# defining the x and y
# Separating the target variable 'Presence_or_absence_of_cardiovascular_disease' from the dataset and creating the feature matrix 'X'.
y = hd_df['Presence_or_absence_of_cardiovascular_disease']  # Target variable: Whether an individual develops coronary heart disease in the next ten years.
X = hd_df.drop(['Presence_or_absence_of_cardiovascular_disease'], axis=1)  # Feature matrix: Excluding 'Presence_or_absence_of_cardiovascular_disease' and 'id' columns for model training.
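
The cells above fit `MinMaxScaler` on the full dataset before it is split. A commonly used alternative, sketched here on synthetic data, is to fit the scaler on the training rows only and reuse it on the test rows, so that no information from the test set enters preprocessing. All variable names below are hypothetical and the data is random.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(50, 10, size=(200, 2))   # synthetic features
y_demo = rng.integers(0, 2, size=200)        # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # min and max learned from the training rows only
X_te_scaled = scaler_demo.transform(X_te)      # same min and max reused on the test rows

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

With this ordering the scaled test values can fall slightly outside [0, 1], which is expected and harmless for most models.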

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.
Dimensionality reduction is unnecessary as we have already reduced the quantity of features, retaining only the essential ones.

In [None]:
# Dimensionality Reduction (If needed)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=3697,stratify=y,shuffle=True)
y_train.value_counts()

##### What data splitting ratio have you used and why?

Answer Here.
For model training, we divided our dataset into a training set and a testing set using the "train_test_split" technique. With test_size=0.3, 70% of the data was allocated for training and the remaining 30% was reserved for testing, and stratify=y keeps the proportion of positive and negative cases the same in both sets. This split strikes a suitable balance between providing enough data for effective model training and keeping a sufficient amount for evaluating the model's performance on data it has not seen before.
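
The effect of the `stratify` argument used in the split cell can be checked on a synthetic imbalanced target. The 85/15 class ratio and the variable names below are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 850 negatives and 150 positives
y_toy = np.array([0] * 850 + [1] * 150)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=3697, stratify=y_toy, shuffle=True
)

print(len(y_a), len(y_b))        # sizes of the two parts
print(y_a.mean(), y_b.mean())    # share of positives in each part
```

Without stratification the share of positives can drift between the two parts, which matters most when the minority class is small.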

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.
Yes, the dataset is imbalanced and the number of positive cases is very low compared to the negative cases.

In [None]:
# Handling Imbalanced Dataset (If needed)
# Assuming you have already imported y_train and it's a pandas Series
value_counts = y_train.value_counts()

# Create a bar plot
ax = value_counts.plot(kind='bar', title='Target variable before SMOTE')

# Annotate each bar with its count
for i, v in enumerate(value_counts):
    ax.text(i, v, str(v), ha='center', va='bottom')


# Improve visibility of x-axis labels
ax.set_xticklabels(['No Cardiovascular Disease', 'Cardiovascular Disease'], fontsize=12,rotation=0)

# Show the plot
plt.show()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.


In [None]:
# Oversampling using SMOTETomek
# Fit on the training predictors and target only, so the test set is left untouched
x_smote,y_smote = SMOTETomek(random_state=0).fit_resample(X_train,y_train)

print('Sample in the original dataset:' , len(y_train))
print('Sample in the resampled dataset:' , len(y_smote))

In [None]:
import matplotlib.pyplot as plt

# Assuming y_smote is a pandas Series or DataFrame
value_counts = y_smote.value_counts()

# Set plot style
plt.style.use('ggplot')

# Create a bar plot with customized properties
ax = value_counts.plot(kind='bar', color='skyblue', width=0.6, figsize=(8, 6))

# Annotate each bar with its count and adjust text properties
for p in ax.patches:
    ax.annotate(f'{p.get_height()}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=12, color='black', xytext=(0, 5), textcoords='offset points')

# Add X and Y axis labels
plt.xlabel('Target Classes', fontsize=14)
plt.ylabel('Count', fontsize=14)

# Improve visibility of x-axis labels and set rotation to 0 degrees
ax.set_xticklabels(['No Cardiovascular Disease', 'Cardiovascular Disease'], fontsize=12, rotation=0)

# Add a title
plt.title('Distribution of Target Variable after SMOTE', fontsize=16)

# Add grid lines for reference
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()




*   **We address the class imbalance with SMOTETomek: SMOTE first oversamples the minority class with synthetic samples, and Tomek links are then removed to clean up the boundary between the classes. The class distribution is plotted before and after resampling to confirm that the training set is balanced.**
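The two ideas behind this resampling step can be sketched in a few lines of NumPy. This is a simplified illustration, not imbalanced-learn's implementation: the helper names `smote_point` and `tomek_links` and the toy arrays are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_point(x, neighbor, rng):
    """Create a synthetic sample on the segment between x and a minority-class neighbor."""
    lam = rng.uniform(0.0, 1.0)
    return x + lam * (neighbor - x)

def tomek_links(X, y):
    """Return index pairs (i, j) of opposite-class points that are mutual nearest neighbors."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = dist.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nearest)
            if nearest[j] == i and y[i] != y[j] and i < j]

# Toy data: points 1 and 2 sit very close together but belong to different classes
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [1.1, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_toy = np.array([0, 0, 1, 1, 1])

new_point = smote_point(X_toy[3], X_toy[4], rng)  # lies between two minority samples
links = tomek_links(X_toy, y_toy)

print("synthetic sample:", new_point)
print("Tomek links:", links)
```

Oversampling adds points like `new_point`, and removing Tomek links discards borderline pairs such as the one found in `links`.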



## ***7. ML Model Implementation***

### **Here we will be experimenting with seven algorithms and then selecting the best model**

1. Logistic regression.

2. Decision Tree.

3. Random Forest.

4. SVM (Support Vector Machine).

5. Xtreme Gradient Boosting.

6. Naive Bayes.

7. Neural Network.

8. Selection of best model.

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report, accuracy_score

def evaluate_classification_model(classification_model, X_train, X_test, y_train, y_test):
    """
    Evaluates a classification machine learning model using various metrics and visualizations.

    Parameters:
        classification_model (object): The classification model to be evaluated.
        X_train, X_test (DataFrame): Feature datasets for training and testing.
        y_train, y_test (Series): Target datasets for training and testing.
        (Feature names for the importance plot are read from the notebook-level `features` variable.)

    Returns:
        model_metrics (list): List of evaluation metrics for the model.
    """

    # Fit the classification model on the training data
    classification_model.fit(X_train, y_train)

    # Make predictions on the training and test data
    y_pred_train = classification_model.predict(X_train)
    y_pred_test = classification_model.predict(X_test)

    # Predict probabilities for ROC curve
    pred_prob_train = classification_model.predict_proba(X_train)[:, 1]
    pred_prob_test = classification_model.predict_proba(X_test)[:, 1]

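    # Calculate ROC AUC from the predicted probabilities for training and test sets
    # (hard 0/1 predictions would reduce the score to balanced accuracy at one threshold)
    roc_auc_train = roc_auc_score(y_train, pred_prob_train)
    roc_auc_test = roc_auc_score(y_test, pred_prob_test)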
    print("\nTrain ROC AUC:", roc_auc_train)
    print("Test ROC AUC:", roc_auc_test)

    # Plot ROC curve
    plt.figure(figsize=(8, 6))
    fpr_train, tpr_train, thresholds_train = roc_curve(y_train, pred_prob_train)
    fpr_test, tpr_test, thresholds_test = roc_curve(y_test, pred_prob_test)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr_train, tpr_train, label="Train ROC AUC: {:.2f}".format(roc_auc_train))
    plt.plot(fpr_test, tpr_test, label="Test ROC AUC: {:.2f}".format(roc_auc_test))
    plt.legend()
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.show()

    # Calculate confusion matrix for training and test sets
    confusion_matrix_train = confusion_matrix(y_train, y_pred_train)
    confusion_matrix_test = confusion_matrix(y_test, y_pred_test)

    # Plot confusion matrices
    fig, ax = plt.subplots(1, 2, figsize=(12, 4))
    print("\nConfusion Matrix:")

    # Plot confusion matrix for training set
    sns.heatmap(confusion_matrix_train, annot=True, fmt='d', cmap="Blues", ax=ax[0])
    ax[0].set_xlabel("Predicted Label")
    ax[0].set_ylabel("True Label")
    ax[0].set_title("Train Confusion Matrix")

    # Plot confusion matrix for test set
    sns.heatmap(confusion_matrix_test, annot=True, fmt='d', cmap="Blues", ax=ax[1])
    ax[1].set_xlabel("Predicted Label")
    ax[1].set_ylabel("True Label")
    ax[1].set_title("Test Confusion Matrix")

    plt.tight_layout()
    plt.show()

    # Calculate classification report for training and test sets
    classification_report_train = classification_report(y_train, y_pred_train, output_dict=True)
    classification_report_test = classification_report(y_test, y_pred_test, output_dict=True)
    print("\nTrain Classification Report:")
    print(pd.DataFrame(classification_report_train).transpose())
    print("\nTest Classification Report:")
    print(pd.DataFrame(classification_report_test).transpose())

    # Convert the nested dictionary to a flat DataFrame for both training and test reports
    df_train = pd.DataFrame(classification_report_train).transpose()
    df_test = pd.DataFrame(classification_report_test).transpose()

    # Create a subplot grid
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))

    # Plot heatmap for training set classification report
    sns.heatmap(data=df_train, annot=True, cmap='coolwarm', fmt=".2f", ax=axes[0])
    axes[0].set_title("Train Classification Report")
    axes[0].set_xlabel("Metrics")
    axes[0].set_ylabel("Classes")

    # Plot heatmap for test set classification report
    sns.heatmap(data=df_test, annot=True, cmap='coolwarm', fmt=".2f", ax=axes[1])
    axes[1].set_title("Test Classification Report")
    axes[1].set_xlabel("Metrics")
    axes[1].set_ylabel("Classes")

    # Adjust layout
    plt.tight_layout()

    # Display the heatmaps
    plt.show()

    # Check if the classification model has feature importances attribute
    if hasattr(classification_model, 'feature_importances_'):
        # Get feature importances
        feature_importance = classification_model.feature_importances_
        # Create a Series for feature importances and sort it
        feature_importance_series = pd.Series(feature_importance, index=features)
        feature_importance_series = feature_importance_series.sort_values(ascending=False)
        # Plot feature importances
        plt.figure(figsize=(10, 6))
        feature_importance_series[:15].plot(kind='barh', color='skyblue')
        plt.title('Top 15 Feature Importances')
        plt.xlabel('Relative Importance')
        plt.ylabel('Features')
        plt.show()
    else:
        print("\nThe classification model does not have a feature importances attribute.")

    # Calculate additional evaluation metrics
    precision_train = classification_report_train['1']['precision']
    precision_test = classification_report_test['1']['precision']

    recall_train = classification_report_train['1']['recall']
    recall_test = classification_report_test['1']['recall']

    accuracy_train = accuracy_score(y_true=y_train, y_pred=y_pred_train)
    accuracy_test = accuracy_score(y_true=y_test, y_pred=y_pred_test)

    f1_score_train = classification_report_train['1']['f1-score']
    f1_score_test = classification_report_test['1']['f1-score']

    # Store the evaluation metrics in a list
    model_metrics = [precision_train, precision_test, recall_train, recall_test, accuracy_train, accuracy_test, roc_auc_train, roc_auc_test, f1_score_train, f1_score_test]

    # Return the list of evaluation metrics
    return model_metrics
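
ROC AUC is a ranking metric, so it is normally computed from predicted probabilities. With hard 0/1 predictions the curve has a single operating point and the score equals balanced accuracy. The pure-Python sketch below illustrates the difference on made-up toy values; `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (illustrative values only)
y_toy = [0, 0, 0, 1, 1, 1]
prob_toy = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# Hard predictions at the usual 0.5 threshold
label_toy = [1 if p >= 0.5 else 0 for p in prob_toy]

auc_from_probabilities = pairwise_auc(y_toy, prob_toy)
auc_from_labels = pairwise_auc(y_toy, label_toy)

print("AUC from probabilities:", round(auc_from_probabilities, 4))
print("AUC from hard labels:  ", round(auc_from_labels, 4))
```

On this toy data the probability-based score is 8/9 while the label-based score drops to 2/3, so the two should not be mixed in one comparison.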


### ML Model - 1

## **First ML Model**

## **Logistic Regression**

In [None]:
# Columns to drop from the DataFrame for correlation analysis
columns_to_drop = ['id',]

# Create a new DataFrame by dropping specified columns
filtered_df = data_set.drop(columns=columns_to_drop)

# Calculate the correlation matrix for the filtered DataFrame
correlation_matrix = filtered_df.corr()

# Feature names to plot (every column of the correlation matrix; no top-k filtering is applied)
top_corr_features = correlation_matrix.index

# Set the size of the heatmap figure
plt.figure(figsize=(12, 10))

# Plot a heatmap of the pairwise correlations between these features
# Use Seaborn's heatmap function ('sns.heatmap') with annotations and a specific color map ('RdYlGn')
heatmap = sns.heatmap(filtered_df[top_corr_features].corr(), annot=True, cmap='RdYlGn')

# Set the title and labels for the heatmap
heatmap.set_title('Correlation Heatmap', fontsize=18)
plt.xlabel('Features', fontsize=14)
plt.ylabel('Features', fontsize=14)

# Display the heatmap
plt.show()

In [None]:
# Calculate VIF for the selected continuous variables using the previously defined function
calculate_varnc_inflasn_factr(hd_df[[c for c in cont_var_df]])

In [None]:
# Calculate the correlation matrix for the specified continuous variable in the DataFrame (hd_df)
corr = hd_df[continuous_variable].corr()

# Set the size of the heatmap figure for better visualization
plt.figure(figsize=(25, 10))

# Plot a heatmap of correlations for the continuous variable
# Use Seaborn's heatmap function ('sns.heatmap') with annotations ('annot=True') and a specific color map ('cmap=plt.cm.Accent_r')
sns.heatmap(corr, annot=True, cmap=plt.cm.Accent_r)

# Each cell is annotated with its correlation value: values close to 1 indicate a strong positive
# correlation, while values close to -1 indicate a strong negative correlation
# Accent_r is a qualitative colormap, so read the annotated values rather than the color intensity

# Display the heatmap visualization to explore the correlation relationships among the variables
plt.show()


In [None]:
# Create a Logistic Regression classifier with specified parameters
classification_model = LogisticRegression(fit_intercept=True, max_iter=10000)

# Call the custom function 'evaluate_classification_model' to evaluate the Logistic Regression model
# Pass the classifier, SMOTE transformed training features (x_smote), original test features (X_test),
# SMOTE transformed training labels (y_smote), and original test labels (y_test)
# Returns a list of classification evaluation metrics
LR_classification_metrics = evaluate_classification_model(classification_model, x_smote, X_test, y_smote, y_test)


## **2. Cross-Validation & Hyperparameter Tuning**

In [None]:
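# 'l1' is only supported by the 'liblinear' and 'saga' solvers, so the grid is split
# into two parts to avoid failed fits for unsupported penalty/solver pairs
C_values = [100,10,1,0.1,0.01,0.001,0.0001]
param_grid = [{'C': C_values,
               'penalty': ['l2'],
               'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
              {'C': C_values,
               'penalty': ['l1'],
               'solver': ['liblinear', 'saga']}]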

# Initializing the logistic regression model
logreg = LogisticRegression(fit_intercept=True, max_iter=10000, random_state=0)

# repeated stratified kfold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=4, random_state=0)

# Using GridSearchCV to tune the hyperparameters using cross-validation
grid = GridSearchCV(logreg, param_grid, cv=rskf)
grid.fit(x_smote, y_smote)

best_params = grid.best_params_
# The best hyperparameters found by GridSearchCV
print("Best hyperparameters: ", best_params)

In [None]:
# Initiate model with best parameters
# (despite its name, LR_classification_metrics1 holds the tuned model, not its metrics)
LR_classification_metrics1 = LogisticRegression(C=best_params['C'],
                                  penalty=best_params['penalty'],
                                  solver=best_params['solver'],
                                  max_iter=10000, random_state=0)

In [None]:
# Visualizing evaluation Metric Score chart
LR_classification_metrics2 = evaluate_classification_model(LR_classification_metrics1, x_smote, X_test, y_smote, y_test)

**Which hyperparameter optimization technique have you used and why?**

The method employed for optimizing hyperparameters is GridSearchCV. GridSearchCV involves a thorough exploration across a defined set of parameters to identify the most suitable ones for a model. It's widely used due to its simplicity and effectiveness in finding optimal hyperparameters.

Selecting an appropriate hyperparameter optimization method depends on factors like the complexity of the parameter options, available computational power, and time limitations. GridSearchCV is particularly useful when the parameter space isn't overly large, and computational resources are reasonably abundant.
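The cost of an exhaustive search can be estimated by counting candidates and cross-validation fits. The sketch below uses the `C`, penalty and solver values from the tuning cell and its 3-split, 4-repeat cross-validation. It only counts combinations and does not train anything. The validity filter assumes scikit-learn's rule that the `l1` penalty is supported by `liblinear` and `saga` only.

```python
from itertools import product

# Values taken from the Logistic Regression tuning cell
C_values = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001]
penalties = ['l1', 'l2']
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Full cross product of all values
candidates = list(product(C_values, penalties, solvers))

# Keep only penalty/solver pairs that the solver can actually handle
valid = [c for c in candidates if c[1] == 'l2' or c[2] in ('liblinear', 'saga')]

# RepeatedStratifiedKFold(n_splits=3, n_repeats=4) fits each candidate 12 times
n_splits, n_repeats = 3, 4
fits_per_candidate = n_splits * n_repeats
total_fits = len(candidates) * fits_per_candidate

print("candidates in the full cross product:", len(candidates))
print("valid penalty/solver candidates:", len(valid))
print("model fits for the full cross product:", total_fits)
```

Counts like these are a quick way to decide whether a grid is small enough for an exhaustive search.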

**Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
# Create a score dataframe with one row per metric
# Note: precision, recall and F1 are the positive-class ('1') values returned by
# evaluate_classification_model, so the 'F1 macro' rows hold positive-class F1, not a macro average
score_classification = pd.DataFrame(index = ['Precision Train', 'Precision Test','Recall Train','Recall Test','Accuracy Train', 'Accuracy Test','ROC-AUC Train', 'ROC-AUC Test','F1 macro Train', 'F1 macro Test'])

# Add the scores of the baseline Logistic Regression model
lr_score = LR_classification_metrics

score_classification['Logistic regression'] = lr_score


# Add the scores of the tuned Logistic Regression model
lr_score2 = LR_classification_metrics2

score_classification['Logistic regression tuned'] = lr_score2

In [None]:
score_classification

It seems that adjusting the hyperparameters did not enhance the Logistic Regression model's performance on the test data. The precision, recall, accuracy, ROC-AUC, and F1 scores remain unchanged between the original and optimized Logistic Regression models when evaluated on the test set.
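For reference, the positive-class metrics compared in the score chart all derive from the confusion matrix. The sketch below recomputes them by hand. The four counts are hypothetical and are not results from this notebook.

```python
# Hypothetical confusion-matrix counts for the positive class (not notebook results)
tp, fp, fn, tn = 30, 20, 10, 140

# Precision: share of predicted positives that are truly positive
precision = tp / (tp + fp)

# Recall: share of actual positives that the model finds
recall = tp / (tp + fn)

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + fp + fn + tn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f} f1={f1:.3f}")
```

Two models with identical confusion matrices on the test set therefore report identical precision, recall, accuracy and F1.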

## **Second ML Model**
## **Decision Tree**


**1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

In [None]:
# ML Model - 2 Implementation
dt_classification = DecisionTreeClassifier(random_state=20)

In [None]:
# Visualizing evaluation Metric Score chart
dt_classification_score = evaluate_classification_model(dt_classification, x_smote, X_test, y_smote, y_test)

### **2. Cross-Validation & Hyperparameter Tuning**

In [None]:
# Define the hyperparameter grid
grid = {'max_depth' : [3,4,5,6,7,8],
        'min_samples_split' : np.arange(2,8),
        'min_samples_leaf' : np.arange(10,20)}

# Initialize the model (same random_state as the final model, for reproducible tuning)
model = DecisionTreeClassifier(random_state=20)

# repeated stratified kfold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize GridSearchCV
grid_search = GridSearchCV(model, grid, cv=rskf)

# Fit the GridSearchCV to the training data
grid_search.fit(x_smote, y_smote)

# Select the best hyperparameters
best_params = grid_search.best_params_
best_params

In [None]:
# Train a new model with the best hyperparameters
dt2_classification = DecisionTreeClassifier(max_depth=best_params['max_depth'],
                                 min_samples_leaf=best_params['min_samples_leaf'],
                                 min_samples_split=best_params['min_samples_split'],
                                 random_state=20)

In [None]:
dt2_classification_score = evaluate_classification_model(dt2_classification, x_smote, X_test, y_smote, y_test)

In [None]:
score_classification['Decision Tree'] = dt_classification_score
score_classification['Decision Tree tuned'] = dt2_classification_score

#### ***Which hyperparameter optimization technique have you used and why?***


The approach employed for hyperparameter optimization is GridSearchCV, a method that thoroughly explores a predefined set of parameters to identify the optimal ones for a model. GridSearchCV stands out for its simplicity and effectiveness in finding the best hyperparameters.

Selecting a hyperparameter optimization technique involves considering factors like the parameter space's complexity, available computational resources, and time constraints. GridSearchCV is suitable when the parameter space is manageable in size, and computational resources are not severely limited, making it a practical choice under these conditions.

#### ***Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.***

In [None]:
score_classification

Hyperparameter tuning seems to have enhanced the Decision Tree model's performance on the test set. The optimized Decision Tree model demonstrates improved precision and F1 score compared to the original model. However, the recall, accuracy, and ROC-AUC scores experienced marginal declines after tuning.

Unlike the original model, the tuned version doesn't exhibit overfitting.
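
One way to check for overfitting is to compare training and test accuracy for an unconstrained tree and a constrained one. The sketch below does this on synthetic data from `make_classification`. The data, the helper `train_test_gap` and the chosen `max_depth` and `min_samples_leaf` values are assumptions for illustration, not this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

def train_test_gap(model):
    """Fit the model and return (train accuracy, test accuracy, gap between them)."""
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    return train_acc, test_acc, train_acc - test_acc

# An unconstrained tree can grow until it memorizes the training set
deep = train_test_gap(DecisionTreeClassifier(random_state=20))

# Depth and leaf-size limits restrict how closely the tree can fit the training set
shallow = train_test_gap(
    DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=20)
)

print("unconstrained (train, test, gap):", deep)
print("constrained   (train, test, gap):", shallow)
```

A large gap between training and test scores is the usual sign of overfitting, so the same comparison can be read off the Train and Test rows of `score_classification`.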

## ***Third ML Model***
## ***Random Forest***


In [None]:
# Initialize the model
rf_classification = RandomForestClassifier(random_state=0)

### ***1. Explain the ML Model used and its performance using Evaluation metric Score Chart.***

In [None]:
# Visualizing evaluation Metric Score chart
rf_classification_score = evaluate_classification_model(rf_classification, x_smote, X_test, y_smote, y_test)

In [None]:
score_classification['Random Forest'] = rf_classification_score

### ***2. Cross-Validation & Hyperparameter Tuning***

In [None]:
# Define the hyperparameter grid
grid = {'n_estimators': [10, 50, 100, 200],
              'max_depth': [8, 9, 10, 11, 12,13, 14, 15],
              'min_samples_split': [2, 3, 4, 5]}

# Initialize the model
rf_classification = RandomForestClassifier(random_state=0)

# repeated stratified kfold
rskf_classification = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize RandomizedSearchCV with the splitter defined above
random_classification_search = RandomizedSearchCV(rf_classification, grid, cv=rskf_classification, n_iter=10, n_jobs=-1)

# Fit the RandomizedSearchCV to the training data
random_classification_search.fit(x_smote, y_smote)

# Select the best hyperparameters
best_params = random_classification_search.best_params_
best_params

In [None]:
# Initialize model with best parameters
rf2_classification = RandomForestClassifier(n_estimators = best_params['n_estimators'],
                                 min_samples_split = best_params['min_samples_split'],
                                 max_depth = best_params['max_depth'],
                                 random_state=0)

In [None]:
# Visualizing evaluation Metric Score_classification chart
rf2_classification_score = evaluate_classification_model(rf2_classification, x_smote, X_test, y_smote, y_test)

### **Which hyperparameter optimization technique have you used and why?**

The hyperparameter optimization technique used is RandomizedSearchCV. RandomizedSearchCV is a method that performs a random search over a specified parameter grid to find the best hyperparameters for a model. It is a popular method for hyperparameter tuning because it can be more efficient than exhaustive search methods like GridSearchCV when the parameter space is large.

The choice of hyperparameter optimization technique depends on various factors such as the size of the parameter space, the computational resources available, and the time constraints. RandomizedSearchCV can be a good choice when the parameter space is large and computational resources are limited.
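The saving from random sampling can be made concrete by counting. The sketch below enumerates the Random Forest grid from the tuning cell and draws 10 candidates with Python's `random` module. This is a stand-in for the sampling idea, not scikit-learn's own sampler.

```python
import random
from itertools import product

# Grid values taken from the Random Forest tuning cell
grid = {'n_estimators': [10, 50, 100, 200],
        'max_depth': [8, 9, 10, 11, 12, 13, 14, 15],
        'min_samples_split': [2, 3, 4, 5]}

# Every combination an exhaustive grid search would evaluate
all_candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

# Draw 10 distinct candidates, mirroring n_iter=10
random.seed(0)
sampled = random.sample(all_candidates, k=10)

# 3 splits repeated 3 times gives 9 fits per candidate
cv_fits = 3 * 3
exhaustive_fits = len(all_candidates) * cv_fits
randomized_fits = len(sampled) * cv_fits

print("grid size:", len(all_candidates))
print("fits, exhaustive search:", exhaustive_fits)
print("fits, randomized search:", randomized_fits)
```

The trade-off is coverage: with 10 of 128 candidates evaluated, the best combination in the grid may not be among those tried.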

#### **Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
score_classification['Random Forest'] = rf_classification_score
score_classification['Random Forest tuned'] = rf2_classification_score
score_classification

The Random Forest model's performance on the test set significantly enhanced after hyperparameter tuning. The tuned model exhibited superior precision, recall, accuracy, and F1 score in comparison to the original, untuned version. Additionally, the ROC-AUC score on the test set also saw a slight improvement after the tuning process.

## **Fourth ML Model**  
## **SVM (Support Vector Machine)**


In [None]:
# Initialize the model
svm_classification = SVC(kernel='linear', random_state=0, probability=True)

### **1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

In [None]:
# Visualizing evaluation Metric Score_classification chart
svm_classification_score = evaluate_classification_model(svm_classification, x_smote, X_test, y_smote, y_test)

### **2. Cross-Validation & Hyperparameter Tuning**

In [None]:
param_grid = {'C': np.arange(0.1, 10, 0.1),
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'degree': np.arange(2, 6, 1)}

# Initialize the model
svm2_classification = SVC(random_state=0, probability=True)

# repeated stratified kfold
rskf_classification = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize RandomizedSearchCV with 3-fold cross-validation repeated 3 times
random_classification_search = RandomizedSearchCV(svm2_classification, param_grid, n_iter=10, cv=rskf_classification, n_jobs=-1)

# Fit the RandomizedSearchCV to the training data
random_classification_search.fit(x_smote, y_smote)

# Select the best hyperparameters
best_params = random_classification_search.best_params_
best_params

In [None]:
# Initialize model with best parameters
svm2_classification = SVC(C = best_params['C'],
           kernel = best_params['kernel'],
           degree = best_params['degree'],
           random_state=0, probability=True)

In [None]:
# Evaluate the tuned SVM model: fit on the SMOTE-resampled training data (x_smote, y_smote)
# and score on the untouched test data (X_test, y_test).
# The function reports precision, recall, accuracy, ROC AUC and F1-score, and plots ROC curves
# and confusion matrices (feature importances are plotted only for models that expose them).
svm2_classification_score = evaluate_classification_model(svm2_classification, x_smote, X_test, y_smote, y_test)


#### **Which hyperparameter optimization technique have you used and why?**

In this approach, Randomized Search is employed. This method is favored because it's often more efficient than exhaustive techniques like grid search. Rather than exploring every conceivable combination of hyperparameters, randomized search randomly selects a subset of the hyperparameter space. This strategy conserves time and computational power while still uncovering effective hyperparameters for the model.
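One detail of the SVM search space is worth noting: in scikit-learn's `SVC`, `degree` only affects the `poly` kernel and is ignored by the others, so many sampled combinations are effectively duplicates. The sketch below counts this on a reduced space. The three `C` values are an illustrative subset, not the grid used in the tuning cell, and `effective_key` is a hypothetical helper.

```python
from itertools import product

# Illustrative subset of C values (the notebook's grid is much finer)
C_values = [0.1, 1.0, 10.0]
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
degrees = [2, 3, 4, 5]

def effective_key(C, kernel, degree):
    """Collapse combinations that differ only in a degree the kernel ignores."""
    return (C, kernel, degree if kernel == 'poly' else None)

# All raw combinations versus the ones that lead to different models
raw = list(product(C_values, kernels, degrees))
distinct = {effective_key(*combo) for combo in raw}

print("raw combinations:", len(raw))
print("effectively distinct models:", len(distinct))
```

With only 10 random draws, this redundancy reduces how much of the useful search space is actually explored.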

#### **Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
# Add the classification scores for SVM model to the 'score_classification' DataFrame.
score_classification['SVM'] = svm_classification_score

# Add the classification scores for tuned SVM model to the 'score_classification' DataFrame.
score_classification['SVM tuned'] = svm2_classification_score

# Display the updated 'score_classification' DataFrame showing scores for both SVM models.
score_classification


The optimized SVM model, resulting from hyperparameter tuning, demonstrates enhanced performance on the test set. It exhibits improved recall, accuracy, and F1 score in comparison to the original, non-optimized SVM model. Nevertheless, precision and ROC-AUC scores experienced a minor decrease after tuning.

## **Fifth ML Model**
## **Xtreme Gradient Boosting**


In [None]:
import xgboost as xgb

# Initialize the model
xgb_classification_model = xgb.XGBClassifier()


### **1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

In [None]:
# Visualizing evaluation Metric Score chart
xgb_classification_score = evaluate_classification_model(xgb_classification_model, x_smote, X_test, y_smote, y_test)

### **2. Cross-Validation & Hyperparameter Tuning**

In [None]:
# Define the hyperparameter grid
param_grid = {'learning_rate': np.arange(0.01, 0.3, 0.01),
              'max_depth': np.arange(3, 15, 1),
              'n_estimators': np.arange(100, 200, 10)}

# Initialize the model
xgb2_model = xgb.XGBClassifier(random_state=0)

# repeated stratified kfold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(xgb2_model, param_grid, n_iter=10, cv=rskf)

# Fit the RandomizedSearchCV to the training data
random_search.fit(x_smote, y_smote)

# Select the best hyperparameters
best_params = random_search.best_params_
best_params

In [None]:
# Initialize model with best parameters
xgb2_classification_model = xgb.XGBClassifier(learning_rate = best_params['learning_rate'],
                                 max_depth = best_params['max_depth'],
                               n_estimators = best_params['n_estimators'],
                                 random_state=0)

In [None]:
# Visualizing evaluation Metric Score chart
xgb2_classification_score = evaluate_classification_model(xgb2_classification_model, x_smote, X_test, y_smote, y_test)

### **Which hyperparameter optimization technique have you used and why?**


In this context, we've utilized Randomized Search to optimize the XGB model.

Randomized Search is favored due to its efficiency compared to exhaustive methods like grid search. Rather than exploring every conceivable combination of hyperparameters, randomized search selects a random subset from the hyperparameter space. This approach conserves time and computational power while still identifying effective hyperparameters for the model.


### **Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
# Adding the 'xgb_classification_score' list to the 'score_classification' DataFrame with the key 'XGB'
score_classification['XGB'] = xgb_classification_score

# Adding the 'xgb2_classification_score' list to the 'score_classification' DataFrame with the key 'XGB tuned'
score_classification['XGB tuned'] = xgb2_classification_score

# Displaying the updated 'score_classification' DataFrame, now containing scores for both 'XGB' and 'XGB tuned' models
score_classification


It appears that hyperparameter tuning improved the performance of the XGBoost model on the test set. The tuned XGBoost model has higher precision, recall, accuracy, and F1 score on the test set compared to the untuned XGBoost model. The ROC-AUC score on the test set also improved slightly after tuning.

## **Sixth ML Model**
## **Naive Bayes**


In [None]:
from sklearn.naive_bayes import GaussianNB

# Initiate model
naive_classification = GaussianNB()


## **1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

In [None]:
# Visualizing evaluation Metric Score chart
naive_classification_score = evaluate_classification_model(naive_classification, x_smote, X_test, y_smote, y_test)

## **2. Cross-Validation & Hyperparameter Tuning**

In [None]:

# Define the hyperparameter grid
param_grid = {'var_smoothing': np.logspace(0,-9, num=100)}
# Initialize the model
naive = GaussianNB()

# repeated stratified kfold
rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=4, random_state=0)

# Initialize GridSearchCV (the variable name 'random_search' is kept from the earlier cells)
random_search = GridSearchCV(naive, param_grid, cv=rskf, n_jobs=-1)

# Fit the GridSearchCV to the training data
random_search.fit(x_smote, y_smote)

# Select the best hyperparameters
best_params = random_search.best_params_
best_params

In [None]:
# Initiate model with best parameters
naive2_classification = GaussianNB(var_smoothing = best_params['var_smoothing'])

In [None]:
# Visualizing evaluation Metric Score chart
naive2_classification_score = evaluate_classification_model(naive2_classification, x_smote, X_test, y_smote, y_test)

### **Which hyperparameter optimization technique have you used and why?**

Here we have used grid search (GridSearchCV) to tune the Naive Bayes model.

Grid search is an exhaustive search method that tries all possible combinations of hyperparameters specified in the hyperparameter grid. This technique can be useful when the number of hyperparameters to tune is small and the range of possible values for each hyperparameter is limited. Grid search can find the best combination of hyperparameters, but it can be computationally expensive for large hyperparameter grids.
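For Naive Bayes the grid has a single dimension, `var_smoothing`, sampled on a log scale. The sketch below inspects that grid and shows, in simplified form, what the parameter does: a fraction of the largest feature variance is added to every variance for numerical stability. The toy matrix `X_toy` is made up, and the calculation ignores the per-class handling inside `GaussianNB`.

```python
import numpy as np

# Same grid as the tuning cell: 100 values from 1 down to 1e-9, evenly spaced in log10
grid = np.logspace(0, -9, num=100)
print("first, last, count:", grid[0], grid[-1], len(grid))

# Toy feature matrix with very different scales per column (illustrative values)
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Per-feature variances and the smoothing term for one candidate value
variances = X_toy.var(axis=0)
var_smoothing = 1e-9
epsilon = var_smoothing * variances.max()
smoothed = variances + epsilon

print("variances:", variances)
print("added to each variance:", epsilon)
```

Larger `var_smoothing` values widen every Gaussian, which is why this search is usually run on a log scale.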

### **Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
# Update the 'Naive Bayes' column in the 'score_classification' DataFrame
# with the evaluation metrics obtained from the 'naive_classification_score'
score_classification['Naive Bayes'] = naive_classification_score

# Update the 'Naive Bayes tuned' column in the 'score_classification' DataFrame
# with the evaluation metrics obtained from the 'naive2_classification_score'
score_classification['Naive Bayes tuned'] = naive2_classification_score

# Display the updated 'score_classification' DataFrame with the added columns
score_classification


## **Seventh ML Model**
## **Neural Network**


In [None]:
from sklearn.neural_network import MLPClassifier

# Initialize the MLPClassifier
neural_classification = MLPClassifier(random_state=0)


### **1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

In [None]:
# Visualizing evaluation Metric Score chart
neural_classification_score = evaluate_classification_model(neural_classification, x_smote, X_test, y_smote, y_test)

### **2. Cross-Validation & Hyperparameter Tuning**

In [None]:
# Define the hyperparameter grid
param_grid = {'hidden_layer_sizes': np.arange(10, 100, 10),
              'alpha': np.arange(0.0001, 0.01, 0.0001)}
# Initialize the model
neural = MLPClassifier(random_state=0)

# repeated stratified kfold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(neural, param_grid, n_iter=10, cv=rskf, n_jobs=-1)

# Fit the RandomizedSearchCV to the training data
random_search.fit(x_smote, y_smote)

# Select the best hyperparameters
best_params = random_search.best_params_
best_params

In [None]:
# Initiate model with best parameters
neural2_classification = MLPClassifier(hidden_layer_sizes = best_params['hidden_layer_sizes'],
                        alpha = best_params['alpha'],
                        random_state = 0)

In [None]:
# Visualizing evaluation Metric Score chart
neural2_classification_score = evaluate_classification_model(neural2_classification, x_smote, X_test, y_smote, y_test)

### **Which hyperparameter optimization technique have you used and why?**

Here we have used Randomized search to tune the Neural Network model.

Randomized search is a popular technique because it can be more efficient than exhaustive search methods like grid search. Instead of trying all possible combinations of hyperparameters, randomized search samples a random subset of the hyperparameter space. This can save time and computational resources while still finding good hyperparameters for the model.

### **Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

In [None]:
# Adding the evaluation scores for the 'Neural Network' model to the 'score_classification' DataFrame
score_classification['Neural Network'] = neural_classification_score

# Adding the evaluation scores for the tuned 'Neural Network' model to the 'score_classification' DataFrame
score_classification['Neural Network tuned'] = neural2_classification_score

# Displaying the updated 'score_classification' DataFrame, now including scores for both the 'Neural Network' and 'Neural Network tuned' models
score_classification


## **Plot of scores for models**

## **Precision**

In [None]:
# Extracting data for visualization
models = list(score_classification.columns)
train_precision = score_classification.iloc[0, :]
test_precision = score_classification.iloc[1, :]

# Setting the positions for bars on X-axis
X_axis = np.arange(len(models))

# Setting figure size and creating a bar plot
plt.figure(figsize=(12, 6))
bars1 = plt.bar(X_axis - 0.2, train_precision, 0.4, label='Train Precision', color='b', alpha=0.7)
bars2 = plt.bar(X_axis + 0.2, test_precision, 0.4, label='Test Precision', color='g', alpha=0.7)

# Setting X-axis labels, rotating them for better visibility
plt.xticks(X_axis, models, rotation=30, ha='right')

# Setting Y-axis label and plot title
plt.ylabel("Precision Score")
plt.title("Model Comparison: Train vs. Test Precision")

# Adding annotations to the bars
for bar in bars1:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom', color='black', fontsize=8)

for bar in bars2:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom', color='black', fontsize=8)

# Adding legend to the plot
plt.legend()

# Displaying the plot
plt.tight_layout()  # Ensures the labels fit well in the plot area
plt.show()


In [None]:
# Define the models, train precision scores, and test precision scores
models = list(score_classification.columns)
train_precision = score_classification.iloc[0, :]
test_precision = score_classification.iloc[1, :]

# Set the positions for the bars on the X-axis
X_axis = np.arange(len(models))

# Set the figure size
plt.figure(figsize=(12, 6))

# Plot bars for train and test precision scores
plt.bar(X_axis - 0.2, train_precision, 0.4, label='Train Precision', color='skyblue', edgecolor='grey', linewidth=0.5)
plt.bar(X_axis + 0.2, test_precision, 0.4, label='Test Precision', color='orange', edgecolor='grey', linewidth=0.5)

# Annotate the bars with their respective values
for i in range(len(models)):
    # .iloc gives positional access; the Series is indexed by model name, not by integer
    plt.text(X_axis[i] - 0.2, train_precision.iloc[i] + 0.01, str(round(train_precision.iloc[i], 2)), ha='center')
    plt.text(X_axis[i] + 0.2, test_precision.iloc[i] + 0.01, str(round(test_precision.iloc[i], 2)), ha='center')

# Set the X-axis labels and rotate them for better visibility
plt.xticks(X_axis, models, rotation=45, ha='right')

# Set labels and title
plt.ylabel("Precision Score")
plt.title("Precision Scores for Different Models")
plt.legend()
plt.tight_layout()

# Show the plot
plt.show()


### **Recall**

In [None]:
# Recall Scores plot

# Data
models = list(score_classification.columns)
train_recall = score_classification.iloc[2, :]  # Assuming score contains relevant data, adjust index if needed
test_recall = score_classification.iloc[3, :]   # Assuming score contains relevant data, adjust index if needed

# Set the positions and width for the bars
X_axis = np.arange(len(models))
bar_width = 0.4

# Create a figure and axis for the plot
plt.figure(figsize=(12, 6))

# Plotting train recall scores
plt.bar(X_axis - bar_width/2, train_recall, bar_width, label='Train Recall', color='skyblue', edgecolor='black')

# Plotting test recall scores
plt.bar(X_axis + bar_width/2, test_recall, bar_width, label='Test Recall', color='orange', edgecolor='black')

# Set x-axis labels and rotate them for better visibility
plt.xticks(X_axis, models, rotation=45, ha='right')

# Adding data values as annotations above the bars
for i in range(len(models)):
    # .iloc gives positional access; the Series is indexed by model name, not by integer
    plt.text(X_axis[i] - 0.2, train_recall.iloc[i] + 0.01, str(round(train_recall.iloc[i], 2)), ha='center', color='b')
    plt.text(X_axis[i] + 0.2, test_recall.iloc[i] + 0.01, str(round(test_recall.iloc[i], 2)), ha='center', color='orange')

# Set y-axis label and plot title
plt.ylabel("Recall Score")
plt.title("Comparison of Train and Test Recall Scores for Different Models")

# Add legend
plt.legend()

# Show the plot
plt.tight_layout()
plt.show()


### **Accuracy**

In [None]:
# Data for visualization (rows 4 and 5 of 'score_classification' hold the train and test accuracy scores)
models = list(score_classification.columns)
train = score_classification.iloc[4, :]
test = score_classification.iloc[5, :]

# Set the positions for the bars on X-axis
X_axis = np.arange(len(models))

# Set the size of the figure
plt.figure(figsize=(12, 6))

# Plotting the bars for train and test accuracy
plt.bar(X_axis - 0.2, train, 0.4, label='Train Accuracy', color='b', alpha=0.7)
plt.bar(X_axis + 0.2, test, 0.4, label='Test Accuracy', color='g', alpha=0.7)

# Annotate the bars with accuracy values
for i in range(len(models)):
    # .iloc gives positional access; the Series is indexed by model name, not by integer
    plt.text(X_axis[i] - 0.2, train.iloc[i] + 0.01, f'{train.iloc[i]:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(X_axis[i] + 0.2, test.iloc[i] + 0.01, f'{test.iloc[i]:.2f}', ha='center', va='bottom', fontsize=10)

# Set X-axis ticks and labels (model names) with rotation for better readability
plt.xticks(X_axis, models, rotation=30)

# Set Y-axis label and plot title
plt.ylabel("Accuracy Score")
plt.title("Accuracy Score Comparison for Different Models")

# Add legend to the plot
plt.legend()

# Show the plot
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()


### **ROC-AUC**

In [None]:
# Define models, train, and test data
models = list(score_classification.columns)
train_scores = score_classification.iloc[6, :]
test_scores = score_classification.iloc[7, :]

# Set the width of the bars and the positions for each model
bar_width = 0.35
r1 = np.arange(len(models))
r2 = [x + bar_width for x in r1]

# Create a figure and axis
plt.figure(figsize=(12, 6))

# Plot train ROC-AUC scores
plt.bar(r1, train_scores, width=bar_width, edgecolor='grey', label='Train ROC-AUC', alpha=0.7)

# Plot test ROC-AUC scores
plt.bar(r2, test_scores, width=bar_width, edgecolor='grey', label='Test ROC-AUC', alpha=0.7)

# Add data labels above the bars
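for i in range(len(models)):
    # .iloc gives positional access; the Series is indexed by model name, not by integer
    plt.text(i, train_scores.iloc[i] + 0.01, f'{train_scores.iloc[i]:.2f}', ha='center')
    plt.text(i + bar_width, test_scores.iloc[i] + 0.01, f'{test_scores.iloc[i]:.2f}', ha='center')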

# Customize the plot
plt.xlabel('Models', fontweight='bold')
plt.xticks([r + bar_width / 2 for r in range(len(models))], models, rotation=45, ha='right')
plt.ylabel('ROC-AUC Score', fontweight='bold')
plt.title('ROC-AUC Scores for Each Model', fontweight='bold')
plt.legend()
plt.tight_layout()

# Show the plot
plt.show()


### **F1 score**

In [None]:
# Data
models = list(score_classification.columns)
train_f1_scores = score_classification.iloc[8, :]
test_f1_scores = score_classification.iloc[9, :]
x_ticks_labels = models

# X-axis positions for the bars
X_axis = np.arange(len(models))

# Width of the bars
bar_width = 0.4

# Create a figure and axis
plt.figure(figsize=(12, 6))

# Plotting the bars for train F1 scores
plt.bar(X_axis - bar_width/2, train_f1_scores, bar_width, label='Train F1 macro', color='skyblue', alpha=0.7)

# Plotting the bars for test F1 scores
plt.bar(X_axis + bar_width/2, test_f1_scores, bar_width, label='Test F1 macro', color='orange', alpha=0.7)

# Setting x-ticks and labels
plt.xticks(X_axis, x_ticks_labels, rotation=45, ha='right')

# Adding annotations to the bars
for i, v in enumerate(train_f1_scores):
    plt.text(i - bar_width/2 - 0.03, v + 0.01, str(round(v, 2)), fontsize=10, ha='center', va='bottom')

for i, v in enumerate(test_f1_scores):
    plt.text(i + bar_width/2 + 0.03, v + 0.01, str(round(v, 2)), fontsize=10, ha='center', va='bottom')

# Adding labels and title
plt.ylabel("F1 macro Score")
plt.title("F1 macro scores for each model (Train vs Test)")
plt.legend()
plt.tight_layout()

# Display the plot
plt.show()


# **Selection of best model**

In [None]:
score_classification

In [None]:
# Removing the likely overfitted models: those whose train recall is 0.95 or higher
score_t = score_classification.transpose()            # transposing so that each row is a model and each column is a metric
remove_models = score_t[score_t['Recall Train']>=0.95].index  # index of the models to drop (train recall >= 0.95)
remove_models

adj = score_t.drop(remove_models)                     #creating a new dataframe with required models
adj

In [None]:
def select_best_model(df, metrics):
    """Return a dict mapping each metric to the model (row label of df) with the highest test score."""
    best_models = {}
    for metric in metrics:
        # idxmax returns the first row label that attains the maximum test score
        best_models[metric] = df[metric + ' Test'].idxmax()
    return best_models

In [None]:
metrics = ['Precision','Recall', 'Accuracy', 'ROC-AUC', 'F1 macro']

best_models = select_best_model(adj, metrics)
print("The best models are:")
for metric, best_model in best_models.items():
    print(f"{metric}: {best_model} - {adj[metric+' Test'][best_model].round(4)}")

### **1. Which Evaluation metrics did you consider for a positive business impact and why?**
After weighing the consequences of both kinds of error, false positives and false negatives, against our business goals, I chose recall as the key metric for our CHD risk prediction model. The focus is on maximizing the number of at-risk individuals the model correctly identifies (true positives) while minimizing the number of at-risk patients it overlooks (false negatives). The objective is to catch as many CHD risk cases as possible, even if this produces some false positives.
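
To make the trade-off concrete, here is a small sketch with made-up labels (an assumption for illustration only, not results from this project) showing how recall and precision are derived from the confusion matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (assumption: illustrative only); 1 = at risk of CHD, 0 = not at risk
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Recall: share of truly at-risk patients that the model flags
recall_demo = tp / (tp + fn)
# Precision: share of flagged patients who are truly at risk
precision_demo = tp / (tp + fp)

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"recall={recall_demo:.2f}, precision={precision_demo:.2f}")
```

In this toy case one at-risk patient is missed (a false negative) and two healthy patients are flagged (false positives); a recall-focused model accepts the second kind of error to reduce the first.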

### **2. Which ML model did you choose from the above created models as your final prediction model and why?**
Having assessed various machine learning models on the Framingham Heart Study dataset, I opted for the SVM as the final prediction model. This choice is based on the model's performance on our main evaluation metric, recall, which measures the share of patients truly at risk of CHD that the model identifies. Among the models we compared, the SVM achieved the highest test recall.

We opted for using recall as our main evaluation measure because accurately recognizing patients at risk of coronary heart disease (CHD) is vital for meeting our business goals. Prioritizing a model with a high recall score means our focus is on correctly identifying as many CHD risk patients as we can, even if it results in some false positives. In summary, we are confident that the SVM stands as the most suitable choice for us, ensuring a positive impact on our business objectives.

### **3. Explain the model which you have used and the feature importance using any model explainability tool?**
#### **SHAP (SHapley Additive exPlanations)**

In [None]:
pip install shap


In [None]:
# importing shap
import shap

In [None]:
# summarize the background dataset using k-means clustering
X_summary = shap.kmeans(X, 100)

# create an explainer object
explainer = shap.KernelExplainer(neural_classification.predict_proba, X_summary)

# compute the SHAP values for all the samples in the test data
shap_values = explainer.shap_values(X_test)

In [None]:
# Summary plot
shap.summary_plot(shap_values, X_test, feature_names=features)

This bar chart shows the most influential features ranked by their mean absolute SHAP values, that is, the average magnitude of each feature's contribution to the model's output. However, it does not indicate whether a feature pushes the predictions up or down.
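
The ranking in such a bar chart can be reproduced by hand. The sketch below uses a tiny hand-written matrix as a stand-in for real SHAP values (the numbers and feature names are hypothetical) and shows why the mean absolute value hides the direction of the effect.

```python
import numpy as np

# Hypothetical feature names and stand-in SHAP values (assumption: not real output).
# Rows are samples, columns are features.
feature_names_demo = ["feature_a", "feature_b"]
shap_demo = np.array([
    [0.4, 0.1],
    [-0.4, 0.1],
    [0.4, 0.1],
    [-0.4, 0.1],
])

# Bar height in the summary bar chart: mean of |SHAP| per feature
mean_abs = np.abs(shap_demo).mean(axis=0)
# Signed mean, for contrast: positive and negative contributions cancel out
signed_mean = shap_demo.mean(axis=0)

# Rank features from most to least influential
for idx in np.argsort(mean_abs)[::-1]:
    print(f"{feature_names_demo[idx]}: mean|SHAP|={mean_abs[idx]:.2f}, signed mean={signed_mean[idx]:.2f}")
```

Here `feature_a` ranks first by magnitude even though its signed mean is zero, which is why a beeswarm-style summary plot is needed to see the direction of each feature's effect.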

# **8.** **Future Work (Optional)**
## **1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.**

In [None]:
# Import pickle
import pickle

# Save the tuned Naive Bayes model (note that the file is named 'neural2.pkl')
with open('neural2.pkl', 'wb') as f:
    pickle.dump(naive2_classification, f)
# Save the scaler
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

## **2. Again Load the saved model file and try to predict unseen data for a sanity check.**

In [None]:
# Load the saved model file so it can predict on unseen data
with open('neural2.pkl', 'rb') as f:
    pickled_model = pickle.load(f)

In [None]:
instance = X_test.loc[54]

In [None]:
instance

In [None]:
# Reshape the selected test row (index label 54) into a 2D array holding a single sample
predict_new = np.array(instance).reshape(1,-1)

# Predicting on this single instance with the reloaded model
pickled_model.predict(predict_new)
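
A slightly stronger sanity check is to confirm that the reloaded objects give the same prediction as the originals. The following is a sketch under stated assumptions: it uses synthetic data, a `GaussianNB` model and a `MinMaxScaler` as stand-ins, and a temporary file named `demo_model.pkl`; none of these names come from the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data and models (assumption: NOT the project's objects)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
scaler_demo = MinMaxScaler().fit(X_demo)
model_demo = GaussianNB().fit(scaler_demo.transform(X_demo), y_demo)

# Save model and scaler together so they cannot get out of sync, then reload them
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump({"model": model_demo, "scaler": scaler_demo}, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Apply the same preprocessing before predicting with both versions
new_row = X_demo[:1]
before = model_demo.predict(scaler_demo.transform(new_row))
after = loaded["model"].predict(loaded["scaler"].transform(new_row))

roundtrip_ok = np.array_equal(before, after)
print("Predictions match after reload:", roundtrip_ok)
```

Storing the scaler alongside the model is a design choice worth considering for deployment, since raw inputs must pass through the same scaling that was fitted during training.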

## **The model has been successfully built and is prepared for deployment on a live server, where it can interact with real users.**



# **Conclusion**
In summary, this project showcased the effectiveness of machine learning methods in accurately foreseeing the 10-year risk of coronary heart disease (CHD) in patients. The study utilized data from an ongoing cardiovascular research initiative. Key findings from this endeavor encompass the following:

1. **Refined data preprocessing and transformation significantly enhanced the performance of machine learning models, leading to more precise predictions.**

2. **Careful selection of features played a vital role in identifying the most pertinent indicators of CHD risk.**

3. **To ensure we catch all instances of heart disease in patients, we need a high Recall value. On the other hand, if we want to minimize the chances of mistakenly diagnosing a patient without heart disease, a high Precision value is necessary.**

4. **Suppose we consider patients who were wrongly identified as having heart disease. In our context, these cases are crucial because they might indicate other health issues. Therefore, we aim to strike a balance between Precision and Recall metrics. Achieving a high F1 score is our goal, emphasizing the importance of correctly identifying these cases while minimizing false positives and false negatives.**

5. **The disparity in performance between the training and test sets can be attributed to the introduction of synthetic data points to address the significant class imbalance in the training set. This discrepancy arises from the dissimilarity in data distribution between the training and test sets. Therefore, the impressive model performance on the training set results from the mismatch in data distribution between the two sets, not from overfitting.**

6. **The best-performing models on the test data, evaluated using metrics specific to class 1, were:**

    1. Precision: Naive Bayes - 0.2773
    2. Recall: SVM - 0.7124
    3. Accuracy: Naive Bayes - 0.7532
    4. ROC-AUC: SVM - 0.6745
    5. F1 macro: SVM - 0.3785

7. **The SVM model was chosen as the final predictive model because it achieved the highest test recall score.**

8. **SMOTE combined with Tomek links undersampling was used to address the imbalanced data, together with MinMax scaling of the features, resulting in improved model accuracy.**

9. **This project serves as a noteworthy illustration of how machine learning techniques can be applied to real-world challenges, delivering positive outcomes for businesses.**

Overall, this study underscores the significance of meticulous data preparation and analysis in machine learning initiatives. By dedicating effort to cleaning and transforming the data, selecting pertinent features, and opting for suitable models, precise predictions can be achieved, facilitating informed decision-making across various domains.



# ***Thank you***