# 1. Problem Statement

### 1.1 Overview

In this final project, I use the Statlog (Heart) Data Set from the UCI Machine Learning Repository. It records 13 attributes for each individual, which I will use to predict whether or not that individual has heart disease.

In the real world, a model built on similar (but more extensive) data could predict whether an individual has, or is at risk of developing, heart disease. At a minimum, it could help identify which of the included attributes are key indicators of heart disease.

The main limitation of this project is the small size of the data set (270 records). When predicting something as serious as heart disease in the real world, a much larger sample is needed before a model's accuracy can be trusted.

### 1.2 Target Variable

The target variable in this data set is the `Disease_Presence` column, which takes the value 1 (absence of heart disease) or 2 (presence of heart disease).
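
Many scikit-learn metrics are easier to read when the positive class is coded as 1, so one option is to remap the 1/2 codes to 0/1 before modeling. The sketch below is only an illustration: the `toy` frame and the `has_disease` column name are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the Disease_Presence column (1 = absence, 2 = presence).
toy = pd.DataFrame({"Disease_Presence": [2, 1, 2, 1, 1]})

# Map the original 1/2 codes to 0/1 so that 1 means "disease present".
# "has_disease" is a hypothetical column name chosen for this example.
toy["has_disease"] = toy["Disease_Presence"].map({1: 0, 2: 1})

print(toy["has_disease"].tolist())  # [1, 0, 1, 0, 0]
```

The same `.map` call could be applied to the real `heart` DataFrame once it is loaded.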

### 1.3 Feature Variables

The feature columns are age, sex, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels (0-3) colored by fluoroscopy, and thal.

# 2. Data Preparation

### 2.1 Importing Necessary Libraries and Data

In [9]:
#Importing libraries I may need
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [16]:
# Define the column names as given in 'data/heart.doc'
colnames = [
    'age', 'sex', 'chest_pain', 'resting_bp', 'cholesterol',
    'fasting_bs', 'resting_ecg', 'max_hr', 'exercise_angina', 'oldpeak',
    'slope_of_ST segment', '#_of_major_vessels', 'thal', 'Disease_Presence',
]
# Load the data and assign the column names
heart = pd.read_csv('data/heart.dat', delimiter=' ', index_col=False, names=colnames)
# View the head of the data to make sure it has loaded correctly
heart.head()

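| | age | sex | chest_pain | resting_bp | cholesterol | fasting_bs | resting_ecg | max_hr | exercise_angina | oldpeak | slope_of_ST segment | #_of_major_vessels | thal | Disease_Presence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70.0 | 1.0 | 4.0 | 130.0 | 322.0 | 0.0 | 2.0 | 109.0 | 0.0 | 2.4 | 2.0 | 3.0 | 3.0 | 2 |
| 1 | 67.0 | 0.0 | 3.0 | 115.0 | 564.0 | 0.0 | 2.0 | 160.0 | 0.0 | 1.6 | 2.0 | 0.0 | 7.0 | 1 |
| 2 | 57.0 | 1.0 | 2.0 | 124.0 | 261.0 | 0.0 | 0.0 | 141.0 | 0.0 | 0.3 | 1.0 | 0.0 | 7.0 | 2 |
| 3 | 64.0 | 1.0 | 4.0 | 128.0 | 263.0 | 0.0 | 0.0 | 105.0 | 1.0 | 0.2 | 2.0 | 1.0 | 7.0 | 1 |
| 4 | 74.0 | 0.0 | 2.0 | 120.0 | 269.0 | 0.0 | 2.0 | 121.0 | 1.0 | 0.2 | 1.0 | 1.0 | 3.0 | 1 |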
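
A single-space delimiter works when every field is separated by exactly one space. If a file has runs of spaces or trailing whitespace, a regular-expression separator is more tolerant. The sketch below shows this on an in-memory sample built from the first two rows above; the `sample` string and `demo` name are assumptions made for illustration, and the real file's exact spacing is not verified here.

```python
import io
import pandas as pd

colnames = [
    'age', 'sex', 'chest_pain', 'resting_bp', 'cholesterol',
    'fasting_bs', 'resting_ecg', 'max_hr', 'exercise_angina', 'oldpeak',
    'slope_of_ST segment', '#_of_major_vessels', 'thal', 'Disease_Presence',
]

# Two rows in whitespace-delimited form, with deliberately uneven spacing.
sample = (
    "70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2\n"
    "67.0  0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1 \n"
)

# sep=r"\s+" treats any run of whitespace as one delimiter.
demo = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=colnames)

print(demo.shape)  # (2, 14)
```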


In [17]:
print(heart.shape)

(270, 14)


### 2.2 Data Cleanup

In [13]:
# Check for missing values
heart.isna().sum()
# Since there are none, I can continue with the full set of data

age                    0
sex                    0
chest_pain             0
resting_bp             0
cholesterol            0
fasting_bs             0
resting_ecg            0
max_hr                 0
exercise_angina        0
oldpeak                0
slope_of_ST segment    0
#_of_major_vessels     0
thal                   0
Disease_Presence       0
dtype: int64

In [14]:
heart.describe()

| | age | sex | chest_pain | resting_bp | cholesterol | fasting_bs | resting_ecg | max_hr | exercise_angina | oldpeak | slope_of_ST segment | #_of_major_vessels | thal | Disease_Presence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 |
| mean | 54.433333 | 0.677778 | 3.174074 | 131.344444 | 249.659259 | 0.148148 | 1.022222 | 149.677778 | 0.32963 | 1.05 | 1.585185 | 0.67037 | 4.696296 | 1.444444 |
| std | 9.109067 | 0.468195 | 0.95009 | 17.861608 | 51.686237 | 0.355906 | 0.997891 | 23.165717 | 0.470952 | 1.14521 | 0.61439 | 0.943896 | 1.940659 | 0.497827 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 213.0 | 0.0 | 0.0 | 133.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 |
| 50% | 55.0 | 1.0 | 3.0 | 130.0 | 245.0 | 0.0 | 2.0 | 153.5 | 0.0 | 0.8 | 2.0 | 0.0 | 3.0 | 1.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 280.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 7.0 | 2.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 3.0 | 7.0 | 2.0 |
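
Because `Disease_Presence` only takes the values 1 and 2, its mean in the summary above also reveals the class balance: the fraction of presence cases is the mean minus 1. The arithmetic can be sketched as follows, using only the count and mean reported by `describe()`; the variable names are chosen for this example.

```python
# Values copied from the describe() output for Disease_Presence.
n_rows = 270
target_mean = 1.444444

# With codes 1 (absence) and 2 (presence), mean = 1 + (presence / total).
n_presence = round((target_mean - 1) * n_rows)
n_absence = n_rows - n_presence

print(n_presence, n_absence)          # 120 150
print(round(n_presence / n_rows, 3))  # 0.444
```

So roughly 44% of the records are presence cases, which means the classes are only mildly imbalanced. On the loaded data, `heart['Disease_Presence'].value_counts()` should give the same counts directly.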


# 3. Data Analysis

# 4. Modeling

In [None]:
# Create the feature matrix (DataFrame) and the target vector (Series)
target_col = 'Disease_Presence'
X = heart.drop(target_col, axis='columns')
y = heart[target_col]

In [None]:
# Do train/test split on feature and target columns
X_train, X_test, y_train, y_test = train_test_split(X, y)
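
With only 270 rows, an unseeded split can give noticeably different results from run to run. A minimal sketch of a reproducible, stratified split is shown below on synthetic stand-in data of the same shape. The `random_state=42` and `test_size=0.25` values, the `_demo` names, and the random features are all assumptions for illustration, not choices made in the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 270 rows, 13 features, 150 absence (1) and 120 presence (2).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(270, 13))
y_demo = np.array([1] * 150 + [2] * 120)

# stratify keeps the class proportions similar in both subsets;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)  # (202, 13) (68, 13)
```

Applied to the real data, the same keyword arguments would be passed alongside `X` and `y`.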

# 5. Evaluation