# Heart Attack Analysis & Prediction Project

The purpose of this project is to analyze the patient features in the input dataset and predict whether a patient will have a heart attack. The dataset has a target column whose value is 1 if the patient had a heart attack and 0 otherwise. I will use a classification model to make this prediction.

## 1. Import the data into the notebook

In [19]:
import os
import pandas as pd
import warnings
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import Image
warnings.filterwarnings("ignore")

# Load a CSV file, given its directory and filename, into a DataFrame
def load_data(file_path, filename):
    csv_path = os.path.join(file_path, filename)
    return pd.read_csv(csv_path)

In [20]:
# This function will be used to write output files to act as checkpoints in the project
def write_csv_data(file_path, filename, df):
    csv_path = os.path.join(file_path, filename)
    df.to_csv(csv_path)
    
    if os.path.exists(csv_path) and os.path.getsize(csv_path) > 0:
        print(filename + " was written to successfully!")
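
One detail worth knowing about checkpoint files: `DataFrame.to_csv` writes the row index as an extra first column by default, and `read_csv` then loads that column under the name `Unnamed: 0`. The sketch below illustrates this round trip on a small, made-up frame (`df_demo` and the file names are hypothetical and not part of the project); whether to pass `index=False` in `write_csv_data` is left as a design choice.

```python
import os
import tempfile

import pandas as pd

# Hypothetical three-row frame standing in for the heart dataset
df_demo = pd.DataFrame({"age": [63, 37, 41], "output": [1, 1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    # Default behaviour: the row index is written as an unnamed first column
    df_demo.to_csv(with_index_path)
    # index=False writes only the data columns
    df_demo.to_csv(without_index_path, index=False)

    reloaded_with_index = pd.read_csv(with_index_path)
    reloaded_without_index = pd.read_csv(without_index_path)

print(list(reloaded_with_index.columns))     # includes 'Unnamed: 0'
print(list(reloaded_without_index.columns))  # only the data columns
```

If a checkpoint is reloaded later with `load_data`, the version written with `index=False` comes back with the same columns it was saved with.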

In [21]:
input_file_path = "Input/"

df_heart_data = load_data(input_file_path, "heart.csv")

## 2. Take a quick look at the imported dataframe
We will use several functions, such as head, info, and describe, to get a feel for the data.

In [22]:
df_heart_data.head()

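| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |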


In [23]:
df_heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


The info function reports the non-null count and dtype of every column. Based on the results above, all 14 columns have 303 non-null entries out of 303 rows, and every column is numeric (13 int64 and one float64), so there are no missing values and no unexpected types.
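
The same checks can also be made explicit in code rather than read off the `info()` output. The following is a minimal sketch on a small, made-up frame (`df_demo` is hypothetical and deliberately contains one missing value so the check has something to find); on `df_heart_data` the missing-value counts would be expected to be zero, given the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, for illustration only
df_demo = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233.0, np.nan, 204.0, 236.0],
})

# Number of missing values in each column
missing_per_column = df_demo.isnull().sum()

# Columns whose dtype is not numeric (an empty list means all are numeric)
non_numeric_columns = df_demo.select_dtypes(exclude="number").columns.tolist()

# Number of fully duplicated rows
duplicate_rows = int(df_demo.duplicated().sum())

print(missing_per_column)
print(non_numeric_columns)
print(duplicate_rows)
```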

In [24]:
df_heart_data.describe()

| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |
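
One thing the describe output already tells us is the class balance: for a 0/1 column, the mean is the fraction of rows equal to 1, so the `output` mean of 0.544554 corresponds to about 165 of the 303 patients. The sketch below shows this relationship on a small, made-up series (`target` is hypothetical data, not taken from the dataset).

```python
import pandas as pd

# Hypothetical binary target with three positives out of five rows
target = pd.Series([1, 1, 0, 1, 0], name="output")

# For a 0/1 column, the mean equals the share of rows labelled 1
positive_share = target.mean()

# value_counts(normalize=True) gives the share of each class directly
class_shares = target.value_counts(normalize=True)

print(positive_share)
print(class_shares)
```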


## 3. Rename columns to be more descriptive
The original column names are terse and hard to interpret, so I rename them to something more descriptive that will be easier to work with later on.

In [25]:
def rename_columns_heart_data(df):
    df = df.rename(columns={"cp":"chest_pain_type", "trtbps":"resting_blood_pressure", "chol":"cholesterol", "fbs": "fasting_blood_sugar", "restecg": "resting_electrocardiographic_results"})
    df = df.rename(columns={"thalachh": "maximum_heart_rate", "exng": "exercise_induced_angina", "oldpeak":"previous_peak", "slp":"slope", "caa":"num_major_vessels", "thall":"thal_rate", "output":"heart_attack_target"})
    
    return df

In [26]:
df_heart_data = rename_columns_heart_data(df_heart_data)

In [27]:
df_heart_data.head()

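| | age | sex | chest_pain_type | resting_blood_pressure | cholesterol | fasting_blood_sugar | resting_electrocardiographic_results | maximum_heart_rate | exercise_induced_angina | previous_peak | slope | num_major_vessels | thal_rate | heart_attack_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |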
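
By default, `DataFrame.rename` silently ignores keys that do not match any existing column, so a typo in the mapping would go unnoticed. As an optional safeguard, the sketch below passes `errors="raise"` so that an unknown key raises a `KeyError`. The frame `df_demo` and the misspelled key are made up for illustration; this is not how the function above is written.

```python
import pandas as pd

# Hypothetical two-column frame using the dataset's original short names
df_demo = pd.DataFrame({"cp": [3, 2], "chol": [233, 250]})

# All keys exist, so the rename succeeds
renamed = df_demo.rename(
    columns={"cp": "chest_pain_type", "chol": "cholesterol"},
    errors="raise",
)

# A misspelled key ("colesterol") is reported instead of silently ignored
try:
    df_demo.rename(columns={"colesterol": "cholesterol"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns))
print(typo_detected)
```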


## 4. Create a test set
We set aside 20% of the data as a test set and do not look at it during the analysis. Keeping it unseen lets us use it later for an unbiased evaluation of the classification model.

In [28]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df_heart_data, test_size=0.2, random_state=42)
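
With only 303 rows, a purely random split can leave the train and test sets with slightly different proportions of the two classes. One possible alternative, shown here as a sketch and not as the approach used in this notebook, is to pass the target column to the `stratify` parameter of `train_test_split` so both sets keep the same class balance. The frame `df_demo` below is made-up data with a 60/40 balance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 100 rows with a 60/40 class balance
df_demo = pd.DataFrame({
    "feature": range(100),
    "heart_attack_target": [1] * 60 + [0] * 40,
})

# stratify keeps the class proportions the same in both splits
train_demo, test_demo = train_test_split(
    df_demo,
    test_size=0.2,
    random_state=42,
    stratify=df_demo["heart_attack_target"],
)

print(len(train_demo), len(test_demo))
print(train_demo["heart_attack_target"].mean())
print(test_demo["heart_attack_target"].mean())
```

Applied to `df_heart_data`, the same call would stratify on `heart_attack_target`; the `random_state` keeps the split reproducible between runs either way.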