**How can we find heart disease?**

Heart disease can refer to a wide range of diseases and pathologies, but here it refers to coronary artery disease, in which atherosclerosis reduces blood flow to the heart. Atherosclerotic plaque can build up in the arteries, and when it builds up in the arteries supplying the heart it can lead to heart attacks and other complications.

We will attempt to classify heart disease based on a set of predictor variables related to heart health.

The heart disease data set contains data from 4 databases: Cleveland, Hungary, Switzerland, and the VA Medical Center in Long Beach (VA stands for Veterans Administration, not Virginia). Only the Cleveland data has been processed, so it is the only one used here.

It has 14 usable attributes, half of which are integer variables and the other half categorical. Using a subset of these variables, we will attempt to build a predictive model that can accurately predict a diagnosis of heart disease.

In [1]:
#imports basic tools and the dataset below

import pandas as pd
import altair as alt
!pip3 install -U ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl.metadata (5.2 kB)
Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


In [2]:
from ucimlrepo import fetch_ucirepo

#imports dataset
heart_disease = fetch_ucirepo(name='Heart Disease')

display(heart_disease)

{'data': {'ids': None,
  'features':      age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
  0     63    1   1       145   233    1        2      150      0      2.3   
  1     67    1   4       160   286    0        2      108      1      1.5   
  2     67    1   4       120   229    0        2      129      1      2.6   
  3     37    1   3       130   250    0        0      187      0      3.5   
  4     41    0   2       130   204    0        2      172      0      1.4   
  ..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
  298   45    1   1       110   264    0        0      132      0      1.2   
  299   68    1   4       144   193    1        0      141      0      3.4   
  300   57    1   4       130   131    0        0      115      1      1.2   
  301   57    0   2       130   236    0        2      174      0      0.0   
  302   38    1   3       138   175    0        0      173      0      0.0   
  
       slope   ca  thal  


The 1st variable we will analyze is age **(age)**, which is positively correlated with the risk of heart disease, with prevalence as high as ~86% in some countries among those aged 80 and up.

The 2nd variable we will analyze is sex **(sex)**, which contributes to the likelihood that one will develop heart disease. Generally, men develop heart disease at a younger age than women and have a higher risk than women. One of the reasons for this is that women are somewhat protected by estrogen and progesterone, which boost blood vessel health. In this dataset 0 = Female, 1 = Male.

The 3rd variable we will include is the type of chest pain (**cp**), which in the data set is coded as:

    1) Typical Angina
    2) Atypical Angina
    3) Non-anginal Pain
    4) Asymptomatic
    
Angina refers to chest pain caused by decreased blood flow to the heart. In theory, we should be able to pick out trends that point towards heart disease, such as differences between people with typical angina and those who are asymptomatic.
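The integer codes for **sex** and **cp** are easier to read once they are mapped to the labels listed above. The following is a minimal sketch on a small made-up dataframe (the rows, the `toy` name, and the `*_label` column names are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Code-to-label mappings taken from the variable descriptions above
SEX_LABELS = {0: "female", 1: "male"}
CP_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

# Hypothetical rows for illustration only
toy = pd.DataFrame({"sex": [1, 1, 0, 1], "cp": [1, 4, 2, 3]})

# Add readable label columns alongside the original integer codes
labelled = toy.assign(
    sex_label=toy["sex"].map(SEX_LABELS),
    cp_label=toy["cp"].map(CP_LABELS),
)
print(labelled)
```

Keeping the original integer columns next to the labels means the numeric codes remain available for modelling.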

The 4th variable is resting blood pressure **(trestbps)**, which refers to how much strain is put on the blood vessel walls with each pump. It is a good indicator of overall heart health, and high blood pressure (hypertension) is linked to many health issues, including heart disease.

The 5th variable is serum cholesterol **(chol)**, which refers to the amount of total cholesterol in a person's blood. Serum cholesterol level can be a predictor for heart disease because, with high cholesterol levels, fatty deposits can develop in blood vessels, making it difficult for enough blood to flow through the arteries.

The 6th and last variable we will look at is maximum heart rate **(thalach)**, where a higher maximum heart rate is generally correlated with excess heart reserve and better heart function. 

These variables were chosen as the most relevant predictors; the others were excluded because they were less relevant or overlapped with variables we had already chosen.
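These predictors sit on very different scales (for example, **sex** is 0 or 1 while **chol** is in the hundreds), which matters for distance-based models. The sketch below shows one way the scikit-learn tools imported in the following cell could be combined: standardize the predictors, then fit a K-nearest-neighbors classifier. It is an assumption about the workflow rather than the notebook's actual model; the data is randomly generated, and the label rule, `n_neighbors=5`, and `random_state=0` are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; NOT the heart disease dataset
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(30, 80, n),
    "trestbps": rng.integers(95, 200, n),
    "chol": rng.integers(130, 400, n),
    "thalach": rng.integers(80, 200, n),
})
# Made-up binary label loosely tied to age, for illustration only
toy["diagnosis"] = (toy["age"] + rng.normal(0, 10, n) > 55).astype(int)

X = toy.drop(columns="diagnosis")
y = toy["diagnosis"]

# 75/25 split; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0, stratify=y
)

# The scaler is fit on the training data only, then applied before KNN
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(f"toy test accuracy: {accuracy:.2f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of the preprocessing step, which matters once cross-validation or a grid search is layered on top.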



In [12]:
#Sets up classifiers and diagnosis
classifiers = heart_disease.data.features 
diagnosis = heart_disease.data.targets 

#Creates dataframe
heart_df = classifiers.assign(diagnosis = diagnosis)

#Drops all columns except for predictors and diagnosis
heart_df_filter = heart_df[["age", "sex", "cp", "trestbps", "chol", "thalach", "diagnosis"]]
display(heart_df_filter)

#Imports relevant sk-learn tools
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

train_heart, test_heart = train_test_split(heart_df_filter, train_size = 0.75)

#Grabs all the different diagnoses to get an initial table
train_heart_mean = train_heart.groupby("diagnosis").mean(numeric_only=True).reset_index()

train_heart_mean 

Unnamed: 0,age,sex,cp,trestbps,chol,thalach,diagnosis
0,63,1,1,145,233,150,0
1,67,1,4,160,286,108,2
2,67,1,4,120,229,129,1
3,37,1,3,130,250,187,0
4,41,0,2,130,204,172,0
...,...,...,...,...,...,...,...
298,45,1,1,110,264,132,1
299,68,1,4,144,193,141,2
300,57,1,4,130,131,115,3
301,57,0,2,130,236,174,1


Unnamed: 0,diagnosis,age,sex,cp,trestbps,chol,thalach
0,0,51.280992,0.570248,2.842975,128.066116,244.809917,160.909091
1,1,54.684211,0.842105,3.368421,130.605263,251.0,144.710526
2,2,58.172414,0.862069,3.724138,133.137931,261.965517,135.793103
3,3,55.222222,0.777778,3.777778,132.814815,243.407407,131.62963
4,4,60.5,0.833333,3.666667,137.833333,254.25,141.666667


A few trends are apparent in the grouped means. First, the average age is lowest for the null diagnosis (0) and generally higher in the groups with more severe diagnoses.

The null diagnosis also has a much lower average sex value, meaning a smaller share of males in that group. This suggests that females are at lower risk of heart disease, which is consistent with the current literature.

The average chest pain code was higher for diagnoses 2, 3, and 4, while 0 and 1 had lower averages, with the null diagnosis having the lowest. Recall that the codes run from 1 (typical angina) to 4 (asymptomatic), so a higher average indicates that more patients in that group fall into the higher-numbered categories, such as asymptomatic.
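Because sex is coded 0 = Female and 1 = Male, the mean of the column within each diagnosis group is simply the fraction of males in that group. A minimal sketch on made-up rows (the `toy` dataframe and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical rows: 4 patients per diagnosis group
toy = pd.DataFrame({
    "diagnosis": [0, 0, 0, 0, 1, 1, 1, 1],
    "sex":       [0, 0, 1, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column = share of 1s (males) in each group
male_share = toy.groupby("diagnosis")["sex"].mean()
print(male_share)
```

In this toy example, group 0 is half male and group 1 is three-quarters male, which is how the group means in the table above can be read.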
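Since **cp** is a categorical code (1 = typical angina through 4 = asymptomatic in the list earlier), its mean is only a rough summary. A per-group breakdown of the categories can be easier to interpret. The sketch below uses `pd.crosstab` on made-up rows; the `toy` dataframe and its values are illustrative assumptions:

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "diagnosis": [0, 0, 0, 0, 1, 1, 1, 1],
    "cp":        [1, 2, 3, 4, 4, 4, 4, 3],
})

# Share of each chest pain type within each diagnosis group (rows sum to 1)
cp_share = pd.crosstab(toy["diagnosis"], toy["cp"], normalize="index")
print(cp_share)
```

Applied to the training set, a table like this would show directly which chest pain types dominate each diagnosis, rather than inferring it from an average of category codes.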

Cholesterol varied throughout all the groups and may or may not present a link to any of the diagnoses. 

Maximum heart rate varied throughout all the groups, but notably those without a digan