# Final Project Part 1 - Proposal

### Target Variables, Features, Goals & Limitations

In this final project I will use the Statlog (Heart) Data Set from the UCI Machine Learning Repository. It contains a heart disease label and 13 attributes that I will use to predict whether or not an individual has heart disease.

The target variable in this data set is the Absence/Presence column, which is coded 1 for the absence and 2 for the presence of heart disease. The feature columns are age, sex, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels (0-3) colored by fluoroscopy, and thal.
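Since the target is coded 1/2 rather than the more common 0/1, it may be convenient to remap it before modeling. The following is a minimal sketch on a small illustrative frame, not the full data set; the column name `has_disease` is hypothetical and chosen only for this example.

```python
import pandas as pd

# Illustrative target values using the data set's coding:
# 1 = absence, 2 = presence of heart disease.
toy = pd.DataFrame({'Absence/Presence': [1, 1, 2, 2, 1]})

# Hypothetical new column with the more common coding:
# 0 = absence, 1 = presence.
toy['has_disease'] = toy['Absence/Presence'].map({1: 0, 2: 1})

print(toy)
```

With a 0/1 target, the column mean can be read directly as the share of individuals with heart disease.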

In the real world, a model could be built on similar (but more extensive) data to predict whether or not an individual has, or is at risk of developing, heart disease. At a minimum, it could be used to find which of the included attributes are key indicators of heart disease.

The main limitation of this project is the small size of the data set (270 rows). When trying to predict something as serious as heart disease in the real world, you want enough data to be confident that your model's accuracy will hold up on new patients.
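To get a rough sense of how much uncertainty 270 rows leave, the standard error of a proportion can be worked out by hand. This is only a sketch: the accuracy of 0.85 is a hypothetical value picked for illustration, not a result from this project, and the calculation assumes all 270 rows are used as independent test cases.

```python
import math

n_rows = 270             # rows in this data set
assumed_accuracy = 0.85  # hypothetical value, not a result from this project

# Standard error of a proportion estimated from n independent cases.
std_error = math.sqrt(assumed_accuracy * (1 - assumed_accuracy) / n_rows)

# Approximate 95% interval half-width (normal approximation).
half_width = 1.96 * std_error

print(f"standard error: {std_error:.3f}, 95% half-width: {half_width:.3f}")
```

Under these assumptions the accuracy estimate would be uncertain by roughly four percentage points in either direction, and by more if only a held-out portion of the rows is used for testing.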

### The Data

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

plt.style.use('fivethirtyeight')

In [3]:
%matplotlib inline

In [26]:
# heart.dat is space-delimited with no header row, so column names are supplied here.
colnames = [
    'age', 'sex', 'chest pain type', 'resting blood pressure',
    'serum cholestoral in mg/dl', 'fasting blood sugar > 120 mg/dl',
    'resting electrocardiographic results', 'maximum heart rate achieved',
    'exercise induced angina',
    'ST depression induced by exercise relative to rest',
    'the slope of the peak exercise ST segment',
    'number of major vessels colored by flourosopy', 'thal',
    'Absence/Presence',
]
heart = pd.read_csv('data/heart.dat', delimiter=' ', index_col=False, names=colnames)
heart.sample(10)

,age,sex,chest pain type,resting blood pressure,serum cholestoral in mg/dl,fasting blood sugar > 120 mg/dl,resting electrocardiographic results,maximum heart rate achieved,exercise induced angina,ST depression induced by exercise relative to rest,the slope of the peak exercise ST segment,number of major vessels colored by flourosopy,thal,Absence/Presence
211,51.0,1.0,3.0,125.0,245.0,1.0,2.0,166.0,0.0,2.4,2.0,0.0,3.0,1
11,53.0,1.0,4.0,142.0,226.0,0.0,2.0,111.0,1.0,0.0,1.0,0.0,7.0,1
75,45.0,1.0,4.0,142.0,309.0,0.0,2.0,147.0,1.0,0.0,2.0,3.0,7.0,2
131,66.0,1.0,4.0,112.0,212.0,0.0,2.0,132.0,1.0,0.1,1.0,1.0,3.0,2
179,50.0,1.0,3.0,129.0,196.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,1
48,66.0,1.0,2.0,160.0,246.0,0.0,0.0,120.0,1.0,0.0,2.0,3.0,6.0,2
243,62.0,0.0,4.0,140.0,268.0,0.0,2.0,160.0,0.0,3.6,3.0,2.0,3.0,2
262,58.0,1.0,2.0,120.0,284.0,0.0,2.0,160.0,0.0,1.8,2.0,0.0,3.0,2
29,71.0,0.0,3.0,110.0,265.0,1.0,2.0,130.0,0.0,0.0,1.0,1.0,3.0,1
256,61.0,1.0,3.0,150.0,243.0,1.0,0.0,137.0,1.0,1.0,2.0,0.0,3.0,1
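The descriptive column names contain spaces and symbols, which makes them awkward to type in later cells. One option is to load the file with shorter names instead. The sketch below reads two rows copied from the sample output through `io.StringIO`, assuming they are laid out space-delimited as in `heart.dat`; the short names in `short_names` are hypothetical and chosen only for this example.

```python
import io
import pandas as pd

# Two rows copied from the sample output above, written space-delimited
# (assumed to match the layout of heart.dat).
raw = (
    "51.0 1.0 3.0 125.0 245.0 1.0 2.0 166.0 0.0 2.4 2.0 0.0 3.0 1\n"
    "45.0 1.0 4.0 142.0 309.0 0.0 2.0 147.0 1.0 0.0 2.0 3.0 7.0 2\n"
)

# Hypothetical short column names, chosen here for illustration only.
short_names = [
    'age', 'sex', 'chest_pain', 'rest_bp', 'chol', 'fbs', 'rest_ecg',
    'max_hr', 'ex_angina', 'st_depression', 'st_slope', 'n_vessels',
    'thal', 'target',
]

# header=None tells pandas the first line is data, not column names.
demo = pd.read_csv(io.StringIO(raw), sep=' ', header=None, names=short_names)

print(demo[['age', 'chol', 'target']])
```

Short, space-free names also allow attribute-style access such as `demo.chol`, at the cost of needing a lookup table back to the full descriptions.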


In [19]:
heart.shape

(270, 14)

In [25]:
heart.describe()

,age,sex,chest pain type,resting blood pressure,serum cholestoral in mg/dl,fasting blood sugar > 120 mg/dl,resting electrocardiographic results,maximum heart rate achieved,exercise induced angina,ST depression induced by exercise relative to rest,the slope of the peak exercise ST segment,number of major vessels colored by flourosopy,thal,Absence/Presence
count,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0
mean,54.433333,0.677778,3.174074,131.344444,249.659259,0.148148,1.022222,149.677778,0.32963,1.05,1.585185,0.67037,4.696296,1.444444
std,9.109067,0.468195,0.95009,17.861608,51.686237,0.355906,0.997891,23.165717,0.470952,1.14521,0.61439,0.943896,1.940659,0.497827
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0,1.0
25%,48.0,0.0,3.0,120.0,213.0,0.0,0.0,133.0,0.0,0.0,1.0,0.0,3.0,1.0
50%,55.0,1.0,3.0,130.0,245.0,0.0,2.0,153.5,0.0,0.8,2.0,0.0,3.0,1.0
75%,61.0,1.0,4.0,140.0,280.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0,2.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0,2.0
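Because the target takes only the values 1 and 2, its mean in the summary above implies the class balance. A short worked sketch, using only the count and mean reported by `describe()` for the Absence/Presence column:

```python
# Values read from the describe() output for the Absence/Presence column.
n_rows = 270
target_mean = 1.444444

# With codes 1 (absence) and 2 (presence), mean = 1 + (share of presence).
presence_share = target_mean - 1
n_presence = round(presence_share * n_rows)
n_absence = n_rows - n_presence

print(n_presence, n_absence)  # 120 150
```

So about 44% of the rows are labeled as presence of heart disease, which means the two classes are reasonably balanced.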


In [30]:
heart.dtypes

age                                                   float64
sex                                                   float64
chest pain type                                       float64
resting blood pressure                                float64
serum cholestoral in mg/dl                            float64
fasting blood sugar > 120 mg/dl                       float64
resting electrocardiographic results                  float64
maximum heart rate achieved                           float64
exercise induced angina                               float64
ST depression induced by exercise relative to rest    float64
the slope of the peak exercise ST segment             float64
number of major vessels colored by flourosopy         float64
thal                                                  float64
Absence/Presence                                        int64
dtype: object
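Several of these `float64` columns hold category codes rather than measurements (for example sex, chest pain type, and thal), so it may make sense to cast them before modeling. The following is a minimal sketch on a few rows taken from the sample output; which columns to treat as categorical is my own assumption here.

```python
import pandas as pd

# Four illustrative rows with values taken from the sample output.
toy = pd.DataFrame({
    'age': [51.0, 53.0, 66.0, 62.0],
    'sex': [1.0, 1.0, 1.0, 0.0],
    'chest pain type': [3.0, 4.0, 2.0, 4.0],
    'thal': [3.0, 7.0, 6.0, 3.0],
})

# Columns treated as categorical; this choice is an assumption for illustration.
categorical_cols = ['sex', 'chest pain type', 'thal']

# Cast the float codes to integers first so the categories read as 3, 6, 7
# rather than 3.0, 6.0, 7.0, then convert them to the category dtype.
toy = toy.astype({col: 'int64' for col in categorical_cols})
toy = toy.astype({col: 'category' for col in categorical_cols})

print(toy.dtypes)
```

Continuous columns such as age are left as `float64`, so summary statistics like the mean remain meaningful only where they apply.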

In [32]:
heart.isna().sum()

age                                                   0
sex                                                   0
chest pain type                                       0
resting blood pressure                                0
serum cholestoral in mg/dl                            0
fasting blood sugar > 120 mg/dl                       0
resting electrocardiographic results                  0
maximum heart rate achieved                           0
exercise induced angina                               0
ST depression induced by exercise relative to rest    0
the slope of the peak exercise ST segment             0
number of major vessels colored by flourosopy         0
thal                                                  0
Absence/Presence                                      0
dtype: int64