<div align="center"><h1>Heart Attack Risk Prediction</h1></div>

# About

This project aims to predict the likelihood of a heart attack using a dataset of health and lifestyle factors. The prediction relies on identifying the features that contribute most to the assessment of heart attack risk.

## Features

The dataset includes the following features:

- **Patient ID:** Unique identifier for each patient
- **Age:** Age of the patient
- **Sex:** Gender of the patient (Male/Female)
- **Cholesterol:** Cholesterol levels of the patient
- **Blood Pressure:** Blood pressure of the patient (systolic/diastolic)
- **Heart Rate:** Heart rate of the patient
- **Diabetes:** Whether the patient has diabetes (1: Yes, 0: No)
- **Family History:** Family history of heart-related problems (1: Yes, 0: No)
- **Smoking:** Smoking status of the patient (1: Smoker, 0: Non-smoker)
- **Obesity:** Obesity status of the patient (1: Obese, 0: Not obese)
- **Alcohol Consumption:** Alcohol consumption of the patient (stored as a binary 0/1 value in this dataset)
- **Exercise Hours Per Week:** Number of exercise hours per week
- **Diet:** Dietary habits of the patient (Healthy/Average/Unhealthy)
- **Previous Heart Problems:** Previous heart problems of the patient (1: Yes, 0: No)
- **Medication Use:** Medication usage by the patient (1: Yes, 0: No)
- **Stress Level:** Stress level reported by the patient (1-10)
- **Sedentary Hours Per Day:** Hours of sedentary activity per day
- **Income:** Income level of the patient
- **BMI:** Body Mass Index (BMI) of the patient
- **Triglycerides:** Triglyceride levels of the patient
- **Physical Activity Days Per Week:** Days of physical activity per week
- **Sleep Hours Per Day:** Hours of sleep per day
- **Country:** Country of the patient
- **Continent:** Continent where the patient resides
- **Hemisphere:** Hemisphere where the patient resides
- **Heart Attack Risk (Outcome):** Presence of heart attack risk (1: Yes, 0: No)

## Methodology

The predictive model will focus on selecting the most informative features from the dataset to improve the accuracy of heart attack risk predictions. By analyzing and prioritizing key factors, the model aims to provide valuable information to identify individuals at higher risk of having a heart attack.

## Data Cleaning

Before building any models, we first clean the dataset and remove unnecessary features.

In [302]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('dark_background')

file_path = "heart_attack_prediction_dataset.csv"

df = pd.read_csv(file_path)

df.head()

Unnamed: 0,Patient ID,Age,Sex,Cholesterol,Blood Pressure,Heart Rate,Diabetes,Family History,Smoking,Obesity,...,Sedentary Hours Per Day,Income,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Hemisphere,Heart Attack Risk
0,BMW7812,67,Male,208,158/88,72,0,0,1,0,...,6.615001,261404,31.251233,286,0,6,Argentina,South America,Southern Hemisphere,0
1,CZE1114,21,Male,389,165/93,98,1,1,1,1,...,4.963459,285768,27.194973,235,1,7,Canada,North America,Northern Hemisphere,0
2,BNI9906,21,Female,324,174/99,72,1,0,0,0,...,9.463426,235282,28.176571,587,4,4,France,Europe,Northern Hemisphere,0
3,JLN3497,84,Male,383,163/100,73,1,1,1,0,...,7.648981,125640,36.464704,378,3,4,Canada,North America,Northern Hemisphere,0
4,GFO8847,66,Male,318,91/88,93,1,1,1,1,...,1.514821,160555,21.809144,231,1,5,Thailand,Asia,Northern Hemisphere,0


### Check the shape and columns, drop the duplicates

In [303]:
df.shape

(8763, 26)

In [304]:
df.columns

Index(['Patient ID', 'Age', 'Sex', 'Cholesterol', 'Blood Pressure',
       'Heart Rate', 'Diabetes', 'Family History', 'Smoking', 'Obesity',
       'Alcohol Consumption', 'Exercise Hours Per Week', 'Diet',
       'Previous Heart Problems', 'Medication Use', 'Stress Level',
       'Sedentary Hours Per Day', 'Income', 'BMI', 'Triglycerides',
       'Physical Activity Days Per Week', 'Sleep Hours Per Day', 'Country',
       'Continent', 'Hemisphere', 'Heart Attack Risk'],
      dtype='object')

Drop duplicate rows, if any

In [305]:
df.drop_duplicates(inplace=True)
df.shape

(8763, 26)
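
The shape is unchanged after `drop_duplicates`, which suggests there were no duplicate rows. As a minimal sketch, the count can also be checked explicitly before dropping anything. The `toy` frame and its values below are hypothetical and only illustrate the calls; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame: the second and third rows are identical.
toy = pd.DataFrame({
    "Age": [67, 21, 21],
    "Cholesterol": [208, 389, 389],
})

# duplicated() marks every repeat of an earlier row (the first occurrence is not marked).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame unless inplace=True is passed.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated.shape)
```

Counting first makes it clear how many rows were removed, instead of inferring it from the shape.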

In [306]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 26 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Patient ID                       8763 non-null   object 
 1   Age                              8763 non-null   int64  
 2   Sex                              8763 non-null   object 
 3   Cholesterol                      8763 non-null   int64  
 4   Blood Pressure                   8763 non-null   object 
 5   Heart Rate                       8763 non-null   int64  
 6   Diabetes                         8763 non-null   int64  
 7   Family History                   8763 non-null   int64  
 8   Smoking                          8763 non-null   int64  
 9   Obesity                          8763 non-null   int64  
 10  Alcohol Consumption              8763 non-null   int64  
 11  Exercise Hours Per Week          8763 non-null   float64
 12  Diet                

In [307]:
columns_to_drop = ['Hemisphere', 'Patient ID']

df.drop(columns_to_drop, axis=1, inplace=True)

df.head()

Unnamed: 0,Age,Sex,Cholesterol,Blood Pressure,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,...,Stress Level,Sedentary Hours Per Day,Income,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Heart Attack Risk
0,67,Male,208,158/88,72,0,0,1,0,0,...,9,6.615001,261404,31.251233,286,0,6,Argentina,South America,0
1,21,Male,389,165/93,98,1,1,1,1,1,...,1,4.963459,285768,27.194973,235,1,7,Canada,North America,0
2,21,Female,324,174/99,72,1,0,0,0,0,...,9,9.463426,235282,28.176571,587,4,4,France,Europe,0
3,84,Male,383,163/100,73,1,1,1,0,1,...,9,7.648981,125640,36.464704,378,3,4,Canada,North America,0
4,66,Male,318,91/88,93,1,1,1,1,0,...,6,1.514821,160555,21.809144,231,1,5,Thailand,Asia,0


1- Let's start by organizing the 'Age' column

In [308]:
df['Age'].unique()

array([67, 21, 84, 66, 54, 90, 20, 43, 73, 71, 77, 60, 88, 69, 38, 50, 45,
       36, 48, 40, 79, 63, 27, 25, 86, 42, 52, 29, 30, 47, 44, 33, 51, 70,
       85, 31, 56, 24, 74, 72, 55, 26, 53, 46, 57, 22, 35, 39, 80, 65, 83,
       82, 28, 19, 75, 18, 34, 37, 89, 32, 49, 23, 59, 62, 64, 61, 76, 41,
       87, 81, 58, 78, 68], dtype=int64)

In [309]:
df['Age'].isnull().sum()

0

As shown above, the 'Age' column is clean: it has no null values, and all values are integers.

In [310]:
df.head()

Unnamed: 0,Age,Sex,Cholesterol,Blood Pressure,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,...,Stress Level,Sedentary Hours Per Day,Income,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Heart Attack Risk
0,67,Male,208,158/88,72,0,0,1,0,0,...,9,6.615001,261404,31.251233,286,0,6,Argentina,South America,0
1,21,Male,389,165/93,98,1,1,1,1,1,...,1,4.963459,285768,27.194973,235,1,7,Canada,North America,0
2,21,Female,324,174/99,72,1,0,0,0,0,...,9,9.463426,235282,28.176571,587,4,4,France,Europe,0
3,84,Male,383,163/100,73,1,1,1,0,1,...,9,7.648981,125640,36.464704,378,3,4,Canada,North America,0
4,66,Male,318,91/88,93,1,1,1,1,0,...,6,1.514821,160555,21.809144,231,1,5,Thailand,Asia,0


2- Organizing the 'Sex' column

In [311]:
df['Sex'].unique()

array(['Male', 'Female'], dtype=object)

Using one-hot encoding, the 'Sex' column is converted to an 'is_male' column

'Sex' --> 'is_male' (if male: 1, otherwise: 0)

In [312]:
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
df.rename(columns={'Sex_Male': 'is_male'}, inplace=True)

# get_dummies replaces the original 'Sex' column with dummy columns.
# drop_first=True drops the first category ('Sex_Female'), leaving 'Sex_Male',
# which is True if the patient is male and False otherwise.

df['is_male'] = df['is_male'].astype(int)

# Cast to integer (True -> 1, False -> 0) so the column can be used directly in a model.

df.head()

Unnamed: 0,Age,Cholesterol,Blood Pressure,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,...,Sedentary Hours Per Day,Income,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Heart Attack Risk,is_male
0,67,208,158/88,72,0,0,1,0,0,4.168189,...,6.615001,261404,31.251233,286,0,6,Argentina,South America,0,1
1,21,389,165/93,98,1,1,1,1,1,1.813242,...,4.963459,285768,27.194973,235,1,7,Canada,North America,0,1
2,21,324,174/99,72,1,0,0,0,0,2.078353,...,9.463426,235282,28.176571,587,4,4,France,Europe,0,0
3,84,383,163/100,73,1,1,1,0,1,9.82813,...,7.648981,125640,36.464704,378,3,4,Canada,North America,0,1
4,66,318,91/88,93,1,1,1,1,0,5.804299,...,1.514821,160555,21.809144,231,1,5,Thailand,Asia,0,1
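
The encoding step above can also be sketched on a small frame to show what `drop_first=True` actually removes. This is an illustrative sketch, not the notebook's own cell: the `toy` frame is hypothetical, and passing `dtype=int` to `get_dummies` is an alternative to casting afterwards.

```python
import pandas as pd

# Hypothetical toy frame with the two labels found in the 'Sex' column.
toy = pd.DataFrame({"Sex": ["Male", "Female", "Male"]})

# Option 1: get_dummies replaces 'Sex' with dummy columns. With drop_first=True
# the first category in sorted order ('Female') is dropped, so only 'Sex_Male' remains.
encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)
encoded = encoded.rename(columns={"Sex_Male": "is_male"})

# Option 2: for a two-valued column, a direct comparison gives the same result.
direct = (toy["Sex"] == "Male").astype(int)

print(encoded["is_male"].tolist())
print(direct.tolist())
```

Both options yield the same 0/1 column; the comparison form makes the meaning of `is_male` explicit and does not depend on category ordering.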


3- Organizing the Cholesterol Column

In [313]:
df['Cholesterol'].unique()

array([208, 389, 324, 383, 318, 297, 358, 220, 145, 248, 373, 374, 228,
       259, 122, 379, 166, 303, 340, 294, 359, 202, 133, 159, 271, 273,
       328, 154, 135, 197, 321, 375, 360, 263, 201, 347, 129, 229, 251,
       121, 190, 185, 279, 336, 192, 180, 203, 368, 222, 243, 218, 120,
       285, 377, 369, 311, 139, 266, 153, 339, 329, 333, 398, 124, 183,
       163, 362, 390, 200, 396, 255, 209, 247, 250, 227, 246, 223, 330,
       195, 194, 178, 155, 240, 237, 216, 276, 224, 326, 198, 301, 314,
       304, 334, 213, 254, 230, 316, 277, 388, 206, 384, 205, 261, 308,
       338, 382, 291, 168, 171, 378, 253, 245, 226, 281, 123, 173, 231,
       234, 268, 306, 186, 293, 161, 380, 239, 149, 320, 219, 335, 265,
       126, 307, 270, 225, 193, 148, 296, 136, 364, 353, 252, 232, 387,
       299, 357, 214, 370, 345, 351, 344, 152, 150, 131, 272, 302, 337,
       170, 356, 274, 188, 125, 138, 376, 181, 184, 275, 394, 128, 217,
       399, 283, 289, 284, 327, 262, 212, 350, 385, 162, 141, 36

In [314]:
df['Cholesterol'].isnull().sum()

0

The 'Cholesterol' column appears to be clean already, so no changes are needed.

4- Blood Pressure

In [315]:
df['Blood Pressure'].unique()

array(['158/88', '165/93', '174/99', ..., '137/94', '94/76', '119/67'],
      dtype=object)

In [316]:
df['Blood Pressure'].isnull().sum()

0

There are no null values in the 'Blood Pressure' column. However, the values are stored as strings in systolic/diastolic format (for example, '158/88'), which cannot be used directly for modelling.
To fix this, the column is split into two integer columns: systolic_pressure and diastolic_pressure.

In [317]:
def handle_blood_pressure_systolic(value):
    value = str(value)
    value = value.split('/')
    return int(value[0])

def handle_blood_pressure_diastolic(value):
    value = str(value)
    value = value.split('/')
    return int(value[1])


df['systolic_pressure'] = df['Blood Pressure'].apply(handle_blood_pressure_systolic)
df['diastolic_pressure'] = df['Blood Pressure'].apply(handle_blood_pressure_diastolic)

df.drop(columns='Blood Pressure', inplace=True)

df.head()

Unnamed: 0,Age,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Diet,...,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Heart Attack Risk,is_male,systolic_pressure,diastolic_pressure
0,67,208,72,0,0,1,0,0,4.168189,Average,...,31.251233,286,0,6,Argentina,South America,0,1,158,88
1,21,389,98,1,1,1,1,1,1.813242,Unhealthy,...,27.194973,235,1,7,Canada,North America,0,1,165,93
2,21,324,72,1,0,0,0,0,2.078353,Healthy,...,28.176571,587,4,4,France,Europe,0,0,174,99
3,84,383,73,1,1,1,0,1,9.82813,Average,...,36.464704,378,3,4,Canada,North America,0,1,163,100
4,66,318,93,1,1,1,1,0,5.804299,Unhealthy,...,21.809144,231,1,5,Thailand,Asia,0,1,91,88
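
The same split can be written without row-wise `apply`, using pandas string methods. The sketch below is an alternative to the two helper functions, not the notebook's own approach; the `toy` frame is hypothetical, although its three readings are copied from the sample rows shown earlier.

```python
import pandas as pd

# Hypothetical toy frame using three readings from the sample rows.
toy = pd.DataFrame({"Blood Pressure": ["158/88", "165/93", "91/88"]})

# Split each string once on '/', expand the pieces into two columns (0 and 1),
# and cast both to integers.
parts = toy["Blood Pressure"].str.split("/", expand=True).astype(int)

toy["systolic_pressure"] = parts[0]
toy["diastolic_pressure"] = parts[1]

# The original string column is no longer needed.
toy = toy.drop(columns="Blood Pressure")

print(toy)
```

Splitting once and reusing both pieces avoids parsing every string twice, which the two separate `apply` calls do.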


So far we have completed:
* Age
* Sex
* Cholesterol
* Blood Pressure

5- Organizing the Heart Rate Column

In [318]:
df['Heart Rate'].unique()

array([ 72,  98,  73,  93,  48,  84, 107,  68,  55,  97,  70,  85, 102,
        40,  56, 104,  71,  69,  66,  81,  52, 105,  96,  74,  49,  45,
        50,  46,  44, 106,  83,  86,  65, 101,  51,  43,  79,  90,  94,
        78,  92,  54, 109,  61,  64,  82, 110,  42,  63,  41, 100,  76,
        75,  58,  53,  60,  77,  47,  59,  57,  87,  67,  88,  99,  80,
        95, 108,  89,  62, 103,  91], dtype=int64)

In [319]:
df['Heart Rate'].isnull().sum()

0

In [320]:
df['Heart Rate'].isna().sum()

0

The 'Heart Rate' column is already clean: all values are integers, and the column contains no NaN values.
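
In pandas, `isnull()` is an alias of `isna()`, so the two checks above always return the same count. As a minimal sketch, the per-column checks (dtype, missing values, distinct values) can also be collected for every column in one table. The `toy` frame below is hypothetical and deliberately includes one missing value to show how it is reported.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the BMI column contains one missing value on purpose.
toy = pd.DataFrame({
    "Heart Rate": [72, 98, 72],
    "Diabetes": [0, 1, 1],
    "BMI": [31.25, np.nan, 28.18],
})

# One row per column: its dtype, number of missing values, and number of distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),  # NaN is not counted as a distinct value by default
})

print(summary)
```

A table like this gives the same information as calling `unique()` and `isnull().sum()` column by column, and makes it easier to spot the columns that still need work.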

In [321]:
df.head()

Unnamed: 0,Age,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Diet,...,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Heart Attack Risk,is_male,systolic_pressure,diastolic_pressure
0,67,208,72,0,0,1,0,0,4.168189,Average,...,31.251233,286,0,6,Argentina,South America,0,1,158,88
1,21,389,98,1,1,1,1,1,1.813242,Unhealthy,...,27.194973,235,1,7,Canada,North America,0,1,165,93
2,21,324,72,1,0,0,0,0,2.078353,Healthy,...,28.176571,587,4,4,France,Europe,0,0,174,99
3,84,383,73,1,1,1,0,1,9.82813,Average,...,36.464704,378,3,4,Canada,North America,0,1,163,100
4,66,318,93,1,1,1,1,0,5.804299,Unhealthy,...,21.809144,231,1,5,Thailand,Asia,0,1,91,88


6- Organizing the 'Diabetes' Feature

In [322]:
df['Diabetes'].unique()

array([0, 1], dtype=int64)

In [323]:
df['Diabetes'].isnull().sum()

0

In [324]:
df['Diabetes'].isna().sum()

0

Diabetes column is already organized.

In [325]:
df.head()

Unnamed: 0,Age,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Diet,...,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Heart Attack Risk,is_male,systolic_pressure,diastolic_pressure
0,67,208,72,0,0,1,0,0,4.168189,Average,...,31.251233,286,0,6,Argentina,South America,0,1,158,88
1,21,389,98,1,1,1,1,1,1.813242,Unhealthy,...,27.194973,235,1,7,Canada,North America,0,1,165,93
2,21,324,72,1,0,0,0,0,2.078353,Healthy,...,28.176571,587,4,4,France,Europe,0,0,174,99
3,84,383,73,1,1,1,0,1,9.82813,Average,...,36.464704,378,3,4,Canada,North America,0,1,163,100
4,66,318,93,1,1,1,1,0,5.804299,Unhealthy,...,21.809144,231,1,5,Thailand,Asia,0,1,91,88


7- Family History

In [326]:
df['Family History'].unique()

array([0, 1], dtype=int64)

In [327]:
df['Family History'].isnull().sum()

0

In [328]:
df['Family History'].isna().sum()

0

Family history is already a clean feature

8- Smoking

In [329]:
df['Smoking'].unique()

array([1, 0], dtype=int64)

In [330]:
df['Smoking'].isnull().sum()

0

In [331]:
df['Smoking'].isna().sum()

0

Smoking is already a clean feature

9- Obesity 

In [332]:
df['Obesity'].unique()

array([0, 1], dtype=int64)

In [333]:
df['Obesity'].isnull().sum()

0

In [334]:
df['Obesity'].isna().sum()

0

Obesity is already a clean feature

10- Alcohol Consumption

In [335]:
df['Alcohol Consumption'].unique()

array([0, 1], dtype=int64)

In [336]:
df['Alcohol Consumption'].isnull().sum()

0

In [337]:
df['Alcohol Consumption'].isna().sum()

0

The 'Alcohol Consumption' column is already clean: it has no null values, and all values are integers.

In [338]:
df.head()

Unnamed: 0,Age,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Diet,...,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Heart Attack Risk,is_male,systolic_pressure,diastolic_pressure
0,67,208,72,0,0,1,0,0,4.168189,Average,...,31.251233,286,0,6,Argentina,South America,0,1,158,88
1,21,389,98,1,1,1,1,1,1.813242,Unhealthy,...,27.194973,235,1,7,Canada,North America,0,1,165,93
2,21,324,72,1,0,0,0,0,2.078353,Healthy,...,28.176571,587,4,4,France,Europe,0,0,174,99
3,84,383,73,1,1,1,0,1,9.82813,Average,...,36.464704,378,3,4,Canada,North America,0,1,163,100
4,66,318,93,1,1,1,1,0,5.804299,Unhealthy,...,21.809144,231,1,5,Thailand,Asia,0,1,91,88


11- Exercise Hours Per Week

In [339]:
df['Exercise Hours Per Week'].unique()

array([ 4.16818884,  1.81324162,  2.07835299, ...,  3.14843791,
        3.78994983, 18.08174797])

In [340]:
df['Exercise Hours Per Week'].isnull().sum()

0

In [341]:
df['Exercise Hours Per Week'].isna().sum()

0

Exercise Hours Per Week column is also clean

12- Diet Column

In [342]:
df['Diet'].unique()

array(['Average', 'Unhealthy', 'Healthy'], dtype=object)

With the mapping below, 'Unhealthy' is encoded as 0, 'Average' as 1, and 'Healthy' as 2.

In [343]:
def handle_diet(value):
    value = str(value)

    if value == 'Unhealthy':
        return 0
    elif value == 'Average':
        return 1
    elif value == 'Healthy':
        return 2
    else:
        return np.nan
    


df['Diet'] = df['Diet'].apply(handle_diet)
df['Diet']


0       1
1       0
2       2
3       1
4       0
       ..
8758    2
8759    2
8760    1
8761    0
8762    2
Name: Diet, Length: 8763, dtype: int64

The 'Diet' column now has an integer dtype (int64) instead of object.

In [344]:
df['Diet'].unique()

array([1, 0, 2], dtype=int64)
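
The ordinal encoding of 'Diet' can also be expressed as a dictionary lookup with `Series.map`. This is a sketch of an equivalent approach, not the notebook's own cell. The `toy` frame and the `diet_order` name are hypothetical, and the 'Vegan' label is invented here purely to show what happens to an unexpected value.

```python
import pandas as pd

# Hypothetical toy frame; 'Vegan' does not occur in the dataset and is
# included only to demonstrate the fallback for unknown labels.
toy = pd.DataFrame({"Diet": ["Average", "Unhealthy", "Healthy", "Vegan"]})

# Ordered mapping from worst to best diet, matching the encoding described above.
diet_order = {"Unhealthy": 0, "Average": 1, "Healthy": 2}

# Series.map returns NaN for any label missing from the dictionary,
# which mirrors the np.nan branch of handle_diet.
encoded = toy["Diet"].map(diet_order)

print(encoded.tolist())
print(int(encoded.isna().sum()))
```

Note that if any label is unmapped, the result has a float dtype because of the NaN. Checking `encoded.isna().sum()` right after mapping is a cheap way to catch unexpected categories.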

In [346]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 25 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Age                              8763 non-null   int64  
 1   Cholesterol                      8763 non-null   int64  
 2   Heart Rate                       8763 non-null   int64  
 3   Diabetes                         8763 non-null   int64  
 4   Family History                   8763 non-null   int64  
 5   Smoking                          8763 non-null   int64  
 6   Obesity                          8763 non-null   int64  
 7   Alcohol Consumption              8763 non-null   int64  
 8   Exercise Hours Per Week          8763 non-null   float64
 9   Diet                             8763 non-null   int64  
 10  Previous Heart Problems          8763 non-null   int64  
 11  Medication Use                   8763 non-null   int64  
 12  Stress Level        

13- Previous Heart Problems

In [347]:
df['Previous Heart Problems'].unique()

array([0, 1], dtype=int64)

In [348]:
df['Previous Heart Problems'].isnull().sum()

0

In [349]:
df['Previous Heart Problems'].isna().sum()

0

The 'Previous Heart Problems' column is already clean: it has no null values, and all values are integers.

14- Medication Use

In [350]:
df['Medication Use'].unique()

array([0, 1], dtype=int64)

In [351]:
df['Medication Use'].isnull().sum()

0

In [352]:
df['Medication Use'].isna().sum()

0

The 'Medication Use' column is already clean: it has no null values, and all values are integers.

15- Stress Level

In [353]:
df['Stress Level'].unique()

array([ 9,  1,  6,  2,  7,  4,  5,  8, 10,  3], dtype=int64)

In [354]:
df['Stress Level'].isnull().sum()

0

The 'Stress Level' column is already clean: it has no null values, and all values are integers.

16- Sedentary Hours Per Day (Hours that are spent without moving)

In [355]:
df['Sedentary Hours Per Day'].unique()

array([6.61500145, 4.96345884, 9.46342584, ..., 2.37521373, 0.02910426,
       9.00523438])

In [356]:
df['Sedentary Hours Per Day'].isnull().sum()

0

In [357]:
df['Sedentary Hours Per Day'].isna().sum()

0

The 'Sedentary Hours Per Day' column is already clean: it has no null values, and all values are floats.