**DATA PROCESSING**

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import scipy
from matplotlib import pyplot as plt
print("Imported all libraries successfully...")

**LOAD HEART FAILURE DATA**

https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

In this step, we load the heart failure dataset from Kaggle (linked above) into a pandas DataFrame.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

heart_failure_df = pd.read_csv("heart.csv")

heart_failure_df.info()

In [None]:
heart_failure_df.head()

**REMOVE DUPLICATES**

We checked for and removed duplicates in the dataset.

In [None]:
number_of_duplicates = heart_failure_df.duplicated().sum()
print(f"Number of duplicates before: {number_of_duplicates}")

# Delete duplicate rows. drop_duplicates() returns a new DataFrame,
# so the result must be assigned back to heart_failure_df.
heart_failure_df = heart_failure_df.drop_duplicates()

number_of_duplicates = heart_failure_df.duplicated().sum()
print(f"Number of duplicates after removing: {number_of_duplicates}")
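
`drop_duplicates()` returns a new DataFrame rather than modifying the original, so the result only takes effect if it is assigned back. A minimal sketch on a made-up toy frame (the `toy_df` and `deduped_df` names and their values are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical.
toy_df = pd.DataFrame({"Age": [40, 49, 49], "Sex": ["M", "F", "F"]})

duplicates_before = int(toy_df.duplicated().sum())

# drop_duplicates() leaves toy_df untouched and returns a new frame.
deduped_df = toy_df.drop_duplicates()

duplicates_after = int(deduped_df.duplicated().sum())
print(duplicates_before, duplicates_after, len(toy_df), len(deduped_df))
```

The original frame keeps all of its rows; only the returned frame has the duplicate removed.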

**CONVERT CATEGORICAL COLUMNS TO NUMERICAL**

Several columns are categorical and must be converted to numerical values before the machine learning model can use them. We encoded them as follows:

1. Sex (binary): 0 for male; 1 for female
2. ExerciseAngina (binary): 0 for Yes; 1 for No
3. ChestPainType: 0 for Asymptomatic; 1 for Atypical Angina; 2 for Non-Anginal Pain; 3 for Typical Angina
4. RestingECG: 0 for Normal; 1 for Left Ventricular Hypertrophy; 2 for ST-T wave abnormality
5. ST_Slope: 0 for Flat; 1 for Up; 2 for Down

In [None]:
def featurize(df):
  # Copy the numeric columns so that adding columns below works on an
  # independent frame and does not trigger SettingWithCopyWarning
  x = df[['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak', 'HeartDisease']].copy()

  # Binary columns: 1 for female / no exercise angina, 0 otherwise
  x['Sex'] = [1 if value == 'F' else 0 for value in df['Sex']]
  x['ExerciseAngina'] = [1 if value == 'N' else 0 for value in df['ExerciseAngina']]

  chestpain_groups = {
      'ASY':0,
      'ATA':1,
      'NAP':2,
      'TA':3,
  }

  # .get() returns None for any label missing from the mapping
  x['ChestPainType'] = [chestpain_groups.get(value.strip()) for value in df['ChestPainType']]

  resting_ecg_groups = {
      'Normal':0,
      'LVH':1,
      'ST':2
  }

  x['RestingECG'] = [resting_ecg_groups.get(value.strip()) for value in df['RestingECG']]

  st_slope_groups = {
      'Flat':0,
      'Up':1,
      'Down':2
  }

  x['ST_Slope'] = [st_slope_groups.get(value.strip()) for value in df['ST_Slope']]

  return x

heart_data = featurize(heart_failure_df)
# display(heart_data.head())
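
A lookup-based encoding silently produces missing values for any category that is not in the mapping, so it can help to check for them after encoding. A minimal sketch, assuming a pandas `Series.map` lookup on made-up values (`toy_ecg` and the deliberately unmapped `'Unknown'` label are hypothetical and do not come from the dataset):

```python
import pandas as pd

resting_ecg_groups = {"Normal": 0, "LVH": 1, "ST": 2}

# Hypothetical values; 'Unknown' is deliberately absent from the mapping,
# and 'ST ' carries trailing whitespace.
toy_ecg = pd.Series(["Normal", "ST ", "LVH", "Unknown"])

# Strip stray whitespace, then look each label up; unmapped labels become NaN.
encoded = toy_ecg.str.strip().map(resting_ecg_groups)

# Collect the original labels that failed to map.
unmapped_labels = toy_ecg[encoded.isna()].tolist()
print(unmapped_labels)
```

An empty list here would indicate that every label was covered by the mapping.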

Store the cleaned heart data in a DataFrame named `heart_data_df` and display it.

In [None]:
heart_data_df = pd.DataFrame(heart_data)
display(heart_data_df)

In [None]:
# Save DataFrame as CSV
heart_data_df.to_csv("heart_data_df.csv")

**SPLIT DATA FOR TRAINING**

We split the data into separate training and testing DataFrames: 80% of the rows are designated for training and 20% for testing. A fixed `random_state` makes the split reproducible.

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(heart_data_df, test_size=0.2, random_state=100)
train_df.to_csv("heart_failure_train.csv")
test_df.to_csv("heart_failure_test.csv")
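
To confirm that a split has the intended proportions and that no row lands in both sets, the sizes and indices can be compared. A minimal sketch on a made-up 10-row frame (`toy_df`, `train_part`, and `test_part` are hypothetical names; `train_test_split` is called with the same `test_size` and `random_state` as above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows.
toy_df = pd.DataFrame({"Age": range(30, 40), "HeartDisease": [0, 1] * 5})

train_part, test_part = train_test_split(toy_df, test_size=0.2, random_state=100)

# Rows shared by both sets; this should be empty.
overlap = set(train_part.index) & set(test_part.index)

print(len(train_part), len(test_part), len(overlap))
```

The same checks can be run on `train_df` and `test_df` before they are written to CSV.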