# DSCI 100 Group 2 Project Proposal

## Heart Failure Prediction




### Introduction

Cardiovascular diseases (CVDs) are the leading cause of death globally, accounting for about 31% of all deaths (roughly 17.9 million lives per year). This dataset records a range of heart-related variables that could potentially predict heart disease, including cholesterol level, chest pain type, blood pressure, and blood sugar, along with age, sex, and the presence or absence of heart disease. This project will attempt to answer the following predictive question: which variable(s) most strongly predict the presence of heart disease?

The dataset we will be using, found on Kaggle [here][1], combines five datasets from the UCI Machine Learning Repository’s Heart Disease Data Set, found [here][2]. Its attributes are described as follows:

[1]: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
[2]: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

1. Age: age of patient (years)
2. Sex: sex (M = male, F = female)
3. ChestPainType: chest pain type (TA = Typical Angina, ATA = Atypical Angina, NAP = Non-Anginal Pain, ASY = Asymptomatic)
4. RestingBP: resting blood pressure (mm Hg)
5. Cholesterol: serum cholesterol (mg/dl)
6. FastingBS: fasting blood sugar (1 if FastingBS > 120 mg/dl, otherwise 0)
7. RestingECG: resting electrocardiogram results (Normal = normal, ST = ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH = probable or definite left ventricular hypertrophy by Estes' criteria)
8. MaxHR: maximum heart rate achieved (value between 60 and 202)
9. ExerciseAngina: exercise induced angina (Y = Yes, N = No)
10. Oldpeak: ST depression induced by exercise relative to rest
11. ST_Slope: slope of the peak exercise ST segment (Up = upsloping, Flat = flat, Down = downsloping)
12. HeartDisease: output/prediction (1 = heart disease, 0 = normal)

### Preliminary Exploratory Data Analysis

In [2]:
import altair as alt
import numpy as np
import pandas as pd
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [3]:
url = "https://raw.githubusercontent.com/caojason/dsci-100-group-2/main/data/heart.csv"
heart_data = pd.read_csv(url)
heart_data

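|     | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|-----|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0   | 40  | M   | ATA           | 140       | 289         | 0         | Normal     | 172   | N              | 0.0     | Up       | 0            |
| 1   | 49  | F   | NAP           | 160       | 180         | 0         | Normal     | 156   | N              | 1.0     | Flat     | 1            |
| 2   | 37  | M   | ATA           | 130       | 283         | 0         | ST         | 98    | N              | 0.0     | Up       | 0            |
| 3   | 48  | F   | ASY           | 138       | 214         | 0         | Normal     | 108   | Y              | 1.5     | Flat     | 1            |
| 4   | 54  | M   | NAP           | 150       | 195         | 0         | Normal     | 122   | N              | 0.0     | Up       | 0            |
| ... | ... | ... | ...           | ...       | ...         | ...       | ...        | ...   | ...            | ...     | ...      | ...          |
| 913 | 45  | M   | TA            | 110       | 264         | 0         | Normal     | 132   | N              | 1.2     | Flat     | 1            |
| 914 | 68  | M   | ASY           | 144       | 193         | 1         | Normal     | 141   | N              | 3.4     | Flat     | 1            |
| 915 | 57  | M   | ASY           | 130       | 131         | 0         | Normal     | 115   | Y              | 1.2     | Flat     | 1            |
| 916 | 57  | F   | ATA           | 130       | 236         | 0         | LVH        | 174   | N              | 0.0     | Flat     | 1            |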


In [4]:
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


We can see that the dataset has 918 observations and that every attribute is non-null for all of them, so there are no explicitly missing values.
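Because `info()` only counts nulls, placeholder values would not show up in that summary; for instance, row 294 in the training table records a `Cholesterol` of 0. A minimal sketch of how both checks might be run is given here, using three rows copied from the tables in this notebook rather than the full dataset. The choice of `RestingBP` and `Cholesterol` as the columns where zero is implausible is our assumption.

```python
import pandas as pd

# Three rows copied from the tables in this notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "Age": [40, 49, 32],
        "RestingBP": [140, 160, 95],
        "Cholesterol": [289, 180, 0],
    }
)

# Explicit missing values (NaN) per column, matching what info() reports.
null_counts = sample.isna().sum()

# Zeros in columns where a value of zero is assumed to be a placeholder.
zero_counts = (sample[["RestingBP", "Cholesterol"]] == 0).sum()

print(null_counts)
print(zero_counts)
```

Running the same two lines on `heart_data` would show how many such zeros exist in the full dataset, which could inform whether those rows need special handling.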

In [7]:
heart_data_training, heart_data_testing = train_test_split(
    heart_data, test_size=0.25, random_state=1234
)
heart_data_training

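|     | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|-----|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 736 | 54  | M   | ASY           | 122       | 286         | 0         | LVH        | 116   | Y              | 3.2     | Flat     | 1            |
| 539 | 57  | M   | ASY           | 110       | 197         | 0         | LVH        | 100   | N              | 0.0     | Up       | 0            |
| 895 | 57  | M   | ASY           | 110       | 335         | 0         | Normal     | 143   | Y              | 3.0     | Flat     | 1            |
| 697 | 58  | M   | ASY           | 150       | 270         | 0         | LVH        | 111   | Y              | 0.8     | Up       | 1            |
| 194 | 41  | F   | ATA           | 125       | 184         | 0         | Normal     | 180   | N              | 0.0     | Up       | 0            |
| ... | ... | ... | ...           | ...       | ...         | ...       | ...        | ...   | ...            | ...     | ...      | ...          |
| 204 | 56  | M   | ATA           | 130       | 184         | 0         | Normal     | 100   | N              | 0.0     | Up       | 0            |
| 53  | 41  | F   | ATA           | 130       | 245         | 0         | Normal     | 150   | N              | 0.0     | Up       | 0            |
| 294 | 32  | M   | TA            | 95        | 0           | 1         | Normal     | 127   | N              | 0.7     | Up       | 1            |
| 723 | 59  | M   | ASY           | 140       | 177         | 0         | Normal     | 162   | Y              | 0.0     | Up       | 1            |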
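One way to summarize the training split is to cross-tabulate chest pain type against the outcome. The sketch below does this with `pd.crosstab` on only the nine training rows displayed above, so the counts are illustrative and say nothing about the full training set; the variable names `shown` and `counts` are ours.

```python
import pandas as pd

# The nine training rows displayed above (not the full training split).
shown = pd.DataFrame(
    {
        "ChestPainType": ["ASY", "ASY", "ASY", "ASY", "ATA", "ATA", "ATA", "TA", "ASY"],
        "HeartDisease": [1, 0, 1, 1, 0, 0, 0, 1, 1],
    }
)

# Rows: chest pain type; columns: outcome (0 = normal, 1 = heart disease).
counts = pd.crosstab(shown["ChestPainType"], shown["HeartDisease"])
print(counts)
```

Applying the same call to `heart_data_training` would give the counts for the whole training split.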


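Note that in the original UCI data the target `num` takes values 0-4, with 1-4 indicating the presence of heart disease; in this combined dataset those values are collapsed into the binary HeartDisease column (1 = heart disease, 0 = normal). Likewise, chest pain type 4 in the UCI coding corresponds to ASY (asymptomatic) here. With this in mind, patients with ASY chest pain appear to account for the largest number of heart disease cases. We will therefore include ChestPainType in our classifier, as there could be a connection between chest pain type and the presence of heart disease.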
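The imports at the top of this notebook suggest a K-nearest neighbours classifier with scaled numeric features and one-hot-encoded categorical features. Under that assumption, including ChestPainType could look roughly like the sketch below. The toy rows are copied from the first table in this notebook, the choice of `Age` and `MaxHR` as numeric predictors is hypothetical, and `n_neighbors=3` is an arbitrary placeholder rather than a tuned value.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Six rows copied from the first table in this notebook, for illustration only.
toy = pd.DataFrame(
    {
        "Age": [40, 49, 37, 48, 54, 45],
        "MaxHR": [172, 156, 98, 108, 122, 132],
        "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP", "TA"],
        "HeartDisease": [0, 1, 0, 1, 0, 1],
    }
)

# Assumed predictors; the proposal itself only commits to ChestPainType.
numeric_features = ["Age", "MaxHR"]
categorical_features = ["ChestPainType"]

# Scale numeric columns and one-hot encode the categorical column.
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
)

# n_neighbors=3 is a placeholder, not a tuned hyperparameter.
knn_pipeline = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=3))

X = toy.drop(columns=["HeartDisease"])
y = toy["HeartDisease"]
knn_pipeline.fit(X, y)
predictions = knn_pipeline.predict(X)
print(predictions)
```

In practice the pipeline would be fit on `heart_data_training` only, with the number of neighbours chosen by cross-validation (for example with the already imported `GridSearchCV`).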

### Methods

### Expected Outcomes and Significance