# Predicting Heart Disease with Random Forests

The dataset is taken from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Heart+Disease.

It includes 303 rows and 14 columns. The goal is to predict heart disease. Here are the descriptions of the columns.

1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic)
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2: showing probable or definite left ventricular hypertrophy by Estes' criteria)
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status); the code below selects this column under the name `target`
-- Value 0: < 50% diameter narrowing
-- Value 1: > 50% diameter narrowing

# 1. Download the Data
Download the data to the same folder as your Jupyter notebook if you have not done so already.
See github/coreyjwade/heart_disease.

# 2. Prepare Data for ML
Run the next five cells to import the data, check it, then split it into feature and target columns for ML.

In [None]:
# Import pandas
import pandas as pd

# Import numpy
import numpy as np

# Import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Import seaborn
import seaborn as sns

# Set seaborn darkgrid
sns.set()

In [None]:
# Load heart.csv into a DataFrame
df = pd.read_csv('heart.csv')

In [None]:
# Display first five rows of the DataFrame
df.head()

In [None]:
# Check info
df.info()

In [None]:
# Select target column
y = df['target']

# Select feature columns
X = df.drop(['target'], axis=1)

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

# 3. Building random forests
It's time to build a random forest model!

In [None]:
# Import Random Forest Classifier


# Import accuracy score


# Instantiate Random Forest Classifier


# Fit model to data


# Predict y_values


# Show accuracy
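
One possible way to fill in the steps above is sketched here. It is not the notebook's official solution. Because heart.csv may not be at hand, the sketch uses a synthetic stand-in built with `make_classification`; the names `X_demo`, `y_demo`, `rf_demo` and `demo_accuracy` are illustrative assumptions. In the notebook itself you would fit on `X_train`, `y_train` and score on `X_test`, `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart.csv (303 rows, 13 features); an assumption
# made only so that the sketch runs without the data file.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Split data into train and test sets, keeping the class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Instantiate Random Forest Classifier
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)

# Fit model to data
rf_demo.fit(X_tr, y_tr)

# Predict y_values
y_pred_demo = rf_demo.predict(X_te)

# Show accuracy (true labels first, then predictions)
demo_accuracy = accuracy_score(y_te, y_pred_demo)
print(demo_accuracy)
```

Setting `random_state` makes the result repeatable from run to run; leave it out if you want to see how much the score varies.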


In [None]:
# Define random forest function
def random_forest_classifier(rf=RandomForestClassifier(n_estimators=100, n_jobs=-1)):
    '''
    Take a random forest classifier as input and return the accuracy of the model on the pre-defined
    X_train, y_train, X_test, y_test. RandomForestClassifier and accuracy_score should already be imported.
    '''

    # Fit model to data
    rf.fit(X_train, y_train)

    # Predict y_values
    y_pred = rf.predict(X_test)

    # Compute accuracy (true labels first, then predictions)
    accuracy = accuracy_score(y_test, y_pred)

    # Return accuracy
    return accuracy

In [None]:
# Run random_forest_classifier


In [None]:
# Change n_estimators to 250


In [None]:
# Change n_estimators to 500


In [None]:
# Change n_estimators to 50


1. What happens if you change the hyperparameter max_depth?
2. How can you see all the features that are available?
3. What else can we do to improve the score?
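
As a hint for questions 1 and 2, here is a minimal sketch. It assumes that "features" in question 2 means the hyperparameters of the model. `get_params()` lists every hyperparameter with its current value, and `max_depth` caps how deep each tree may grow. The data is synthetic and the variable names are illustrative, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# List every hyperparameter of a default random forest with its value
rf_default = RandomForestClassifier()
params = rf_default.get_params()
for name in sorted(params):
    print(name, '=', params[name])

# Synthetic data, assumed only for illustration
X_hint, y_hint = make_classification(n_samples=303, n_features=13, random_state=0)

# Limit every tree to at most 2 levels of splits
rf_shallow = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=0)
rf_shallow.fit(X_hint, y_hint)

# Check the depth actually reached by the deepest tree in the forest
deepest = max(tree.get_depth() for tree in rf_shallow.estimators_)
print(deepest)
```

Shallower trees are less able to memorize the training set, so comparing train and test accuracy for several `max_depth` values is one way to explore question 1.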

# 4. feature_importances_
We will now find the most important features.

In [None]:
# Display feature_importances_ of columns


In [None]:
# Zip columns and feature_importances_ into dict
feature_dict = dict(zip(X.columns, rf.feature_importances_))

# Import operator
import operator

# Sort dict by values (as list of tuples)
sorted(feature_dict.items(), key=operator.itemgetter(1), reverse=True)
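
An alternative to the dict-and-sort approach is to put the importances in a pandas Series indexed by column name. The sketch below assumes synthetic data and hypothetical column names (`feature_0` to `feature_4`); in the notebook you would use the fitted `rf` and `X.columns` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical column names and synthetic data, used only for illustration
demo_columns = [f'feature_{i}' for i in range(5)]
X_arr, y_arr = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_frame = pd.DataFrame(X_arr, columns=demo_columns)

# Fit a forest so that feature_importances_ becomes available
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_frame, y_arr)

# Pair each importance with its column name and sort from high to low
importances = pd.Series(forest.feature_importances_, index=X_frame.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances of a fitted forest sum to 1, so each value can be read as a share of the total.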

# 5. Bonus Cells

In [None]:
# Plot histograms of most important features

In [None]:
# Change hyperparameters to try to improve the model

In [1]:
# Run GridSearchCV or RandomizedSearchCV to optimize hyperparameters
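
A minimal sketch of a grid search follows, assuming synthetic data and a deliberately small, arbitrary grid so that it runs quickly. The grid values and the names `X_gs`, `y_gs` and `grid` are illustrative choices, not recommendations. In the notebook you would fit the search on `X_train`, `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (an assumption for illustration)
X_gs, y_gs = make_classification(n_samples=303, n_features=13, random_state=0)

# Small example grid; every combination is tried with 5-fold cross-validation
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4, None],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring='accuracy',
)
grid.fit(X_gs, y_gs)

# Best combination found and its mean cross-validated accuracy
print(grid.best_params_)
print(grid.best_score_)
```

`RandomizedSearchCV` has a similar interface but samples a fixed number of combinations (`n_iter`) instead of trying them all, which helps when the grid is large.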

# 6. Independent Work

1. Go to https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview.
2. Download the data.
3. Read train.csv into a DataFrame.
4. Use RandomForestRegressor to predict the sale price of houses.

Note - RandomForestRegressor is used the same way as RandomForestClassifier: instantiate it, fit it, then predict. The only difference is the scoring. Instead of accuracy_score, compute the root mean squared error from mean_squared_error as follows:

- from sklearn.metrics import mean_squared_error as MSE

- mse_test = MSE(y_test, y_pred)

- rmse_test = mse_test**0.5
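
Put together, the scoring steps above might look like the following sketch. It uses synthetic regression data from `make_regression` rather than the Kaggle files, and the variable names are illustrative assumptions. Note that scikit-learn's random forests expect numeric inputs without missing values, so the Kaggle train.csv will likely need some preprocessing first.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split

# Synthetic regression data, assumed only so that the sketch is runnable
X_reg, y_reg = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, random_state=0
)

# Instantiate and fit the regressor, exactly as with the classifier
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_reg_train, y_reg_train)

# Predict, then score with root mean squared error
y_reg_pred = reg.predict(X_reg_test)
mse_test = MSE(y_reg_test, y_reg_pred)
rmse_test = mse_test ** 0.5
print(rmse_test)
```

The RMSE is in the same units as the target, so on the house price data it can be read as a typical error in the predicted sale price.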