**Decoding Heart Disease**

**Introduction**


Heart disease covers a wide range of conditions, but here it refers to coronary artery disease, in which atherosclerosis reduces blood flow to the heart. Atherosclerotic plaque can build up in the arteries, and plaque near the heart can lead to heart attacks and other complications. The exact causes of plaque formation are not fully understood, but the plaque accumulates gradually and is accompanied by inflammation of the arterial walls, which reduces blood flow (Johns Hopkins Medicine, 2021).


The data set combines data from 4 databases: Cleveland, Hungary, Switzerland, and VA Long Beach (the Veterans Administration Medical Center in Long Beach, California). Only the Cleveland data was processed.


It has 14 usable attributes, half of which are integer variables and half of which are categorical. Our project revolves around answering the question, **what factors contribute to heart disease?**


Heart disease is a growing concern, especially as populations age. It has become the leading cause of death worldwide, accounting for 16% of total deaths globally (WHO’s Global Health Estimates, 2020). It also disproportionately affects wealthier nations and, more importantly for this investigation, correlates with many health metrics and biological markers (WHO’s Global Health Estimates, 2020).


In [None]:
# Import basic tools and install the ucimlrepo package used to fetch the dataset

import pandas as pd
import altair as alt
!pip3 install -U ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo

# Fetch the Heart Disease dataset from the UCI repository
heart_disease = fetch_ucirepo(name='Heart Disease')

In [None]:
# Separate the features (predictors) from the target (diagnosis)
classifiers = heart_disease.data.features
diagnosis = heart_disease.data.targets

# Combine the features and the target into a single dataframe
heart_df = classifiers.assign(diagnosis = diagnosis)

# Keep only the chosen predictors and the diagnosis column
heart_df_filter = heart_df[["age", "sex", "cp", "trestbps", "chol", "thalach", "diagnosis"]]

# Import the relevant scikit-learn tools
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Hold out 25% of the rows as a test set
train_heart, test_heart = train_test_split(heart_df_filter, train_size = 0.75)

# Average each predictor within each diagnosis value to get an initial summary table
train_heart_mean = train_heart.groupby("diagnosis").mean(numeric_only=True).reset_index()

train_heart_mean

Grouping the training data by diagnosis already reveals a few trends. Most notably, a diagnosis of 0 (no heart disease) is associated with younger age, a lower mean sex value (0 being female), a lower chest pain type code (cp, where 1 is typical angina and 4 is asymptomatic), lower resting blood pressure, and a higher maximum heart rate.
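
One way to make a comparison like this easier to read is to collapse the diagnosis into two groups, disease absent versus disease present, before averaging. The sketch below is only an illustration: it runs on a small made-up table whose values are invented and not taken from the Cleveland data, the column name `disease_present` is hypothetical, and it assumes that a diagnosis of 0 means no disease and any positive value means disease.

```python
import pandas as pd

# Hypothetical toy rows that mimic the layout of heart_df_filter.
# The numbers are invented for illustration, not drawn from the dataset.
toy = pd.DataFrame({
    "age": [41, 45, 63, 67, 58, 60],
    "thalach": [172, 168, 150, 108, 131, 140],
    "diagnosis": [0, 0, 0, 2, 1, 3],
})

# Assumption: 0 means no disease, any positive value means disease present.
toy["disease_present"] = (toy["diagnosis"] > 0).astype(int)

# Mean of each predictor within the two groups.
binary_means = (
    toy.drop(columns="diagnosis")
       .groupby("disease_present")
       .mean(numeric_only=True)
)
print(binary_means)
```

The same two steps could be applied to `train_heart` in place of `toy`; because the split above is random, the resulting means may differ slightly from run to run.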


**References**

Johns Hopkins Medicine. (2021). Atherosclerosis. Hopkins Medicine. https://www.hopkinsmedicine.org/health/conditions-and-diseases/atherosclerosis

WHO’s Global Health Estimates. (2020, December 9). The top 10 causes of death. World Health Organization; WHO. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death