<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Decision Tree and Random Forests 

_Author and Instructor: **Dr. Junaid Qazi, PhD**_


Hi Guys,<br>

Welcome to the Decision Tree and Random Forests lecture using scikit-learn in Python. <br>

After learning the key concepts of Decision Trees and Random Forests in the theory lecture, let's move on and use another famous dataset on [Heart Disease in Cleveland](https://archive.ics.uci.edu/ml/datasets/Heart+Disease). The full original dataset is part of the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) and contains 4 databases: ***Cleveland, Hungary, Switzerland, and the VA Long Beach.*** The dataset was donated to the UCI Repository in 1988.<br>
The original database contains 76 attributes, but all published experiments by machine learning researchers refer to using a subset of 14 of them. <br>
In particular, the **Cleveland database** is the only one that has been used by the Machine Learning researchers to this date. In the original database, the "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).<br>

We are also using the Cleveland database in this section. You can download the original from the UCI website or use the one provided along with this course. I recommend using the one provided in the course material because the missing data has already been cleaned. A new column 'target' has also been added, with N (for 0) and Y (for 1,2,3,4).<br>

If you would like to know more about the databases, please visit the link provided at the beginning.<br>
Information on the 14 attributes that we are going to use is provided below: 

* **age**--in years
* **sex**--(1 = male; 0 = female)
* **cp**--chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic)
* **trestbps**--resting blood pressure
* **chol**--serum cholesterol in mg/dl
* **fbs**--fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
* **restecg**--resting ecg (electrocardiographic) results
* **thalach**--maximum heart rate achieved
* **exang**--exercise induced angina (1 = yes; 0 = no)
* **oldpeak**--ST depression induced by exercise relative to rest
* **slope**--the slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
* **ca**--number of major vessels (0-3) colored by fluoroscopy
* **thal**--3 = normal; 6 = fixed defect; 7 = reversible defect
* **the predicted attribute** (0, 1, 2, 3, 4)--In the processed dataset, this one is added as a new column 'target' with 'N' for 0 and 'Y' for 1, 2, 3 & 4.<br>
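
The mapping from the original 0 to 4 goal values to the 'N'/'Y' labels in 'target' can be sketched in a few lines. This is only an illustration: the helper name `to_target` and the sample values are made up here and are not taken from the course data file.

```python
# Hypothetical helper: collapse the original 0-4 "goal" value into the
# binary labels used by the 'target' column described above.
def to_target(goal):
    if goal not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected goal value: {goal!r}")
    # 0 means no heart disease; 1, 2, 3 and 4 all count as presence
    return "N" if goal == 0 else "Y"

# Made-up goal values, for illustration only (not taken from the dataset)
goals = [0, 2, 1, 0, 4, 3]
targets = [to_target(g) for g in goals]
print(targets)
```

With a DataFrame, the same function could be applied to a column with `Series.map`.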

Let's move on to the Jupyter notebook and learn by doing.

#### Let's import the libraries and learn by doing!

```Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

In [None]:
# Code here please

### Reading the data file in df

```Python
df = pd.read_csv("HD_Cleveland_Data_Clean.csv")
```

In [None]:
# Code here please

```Python
df.head()
```

In [None]:
# Code here please

```Python
df.info()
```

In [None]:
# Code here please

```Python
#some statistics you might be interested in!
df.describe()
```

In [None]:
# Code here please

### Exploratory Data Analysis (EDA) -- only a few plots

```Python
df['target'].value_counts()
```

In [None]:
# Code here please

```Python
# A pair plot is another option; it is big, so it is left commented out here.
#sns.pairplot(data=df[['age','sex','chol','fbs','target']], hue='target')
```

```Python
sns.countplot(x='target',data=df, hue='sex', palette='coolwarm')
```

In [None]:
# Code here please

So, in the collected data, most of the men were diagnosed with heart disease!

```Python
#thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
sns.countplot(x='thal',data=df, palette='coolwarm')#, hue='sex')
```

In [None]:
# Code here please

```Python
# Let's see how age is related to cholesterol, chol.
sns.jointplot(x='age', y='chol', data=df)
```

In [None]:
# Code here please

```Python
#thalach is the maximum heart rate achieved
sns.jointplot(x='age', y='thalach', data=df)
```

In [None]:
# Code here please

### Machine Learning Section
Our focus is Machine Learning, so let's split the data and move on. **If you want, you can do more EDA to get an even better understanding of your dataset.** <br>
We will start by training a single decision tree and then compare the results with a Random Forest, but first we need to do a train test split!
#### Train Test Split

Let's split up the data into a training set and a test set!

```Python
from sklearn.model_selection import train_test_split
```

In [None]:
# Code here please

```Python
# Let's have a quick look at the column names!
df.columns
```

In [None]:
# Code here please

```Python
X = df.drop('target',axis=1)
y = df['target']

# the code below does the same as the code above
#X = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 
#        'thalach','exang', 'oldpeak', 'slope', 'ca', 'thal']]
#y = df['target']
```

In [None]:
# Code here please

```Python
# shift+tab and simply copy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
```

In [None]:
# Code here please
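
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on made-up data. The arrays `X_demo` and `y_demo` are synthetic stand-ins, not the heart disease data, and the `stratify` argument is an optional extra that the split above does not use; it keeps the N/Y proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features, 60 'N' and 40 'Y' labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array(["N"] * 60 + ["Y"] * 40)

# test_size=0.30 holds out 30% of the rows; random_state makes the shuffle
# reproducible; stratify=y_demo preserves the class proportions in each part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print("share of 'Y' in train:", (y_tr == "Y").mean())
print("share of 'Y' in test: ", (y_te == "Y").mean())
```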

#### Decision Trees

We'll start with training a single decision tree!

```Python
# importing decision tree classifier
from sklearn.tree import DecisionTreeClassifier
```

In [None]:
# Code here please

```Python
#Creating instance "dtree" of the classifier 
dtree = DecisionTreeClassifier()
```

In [None]:
# Code here please

```Python
#fitting to the training data, the default parameters are fine at the moment!
dtree.fit(X_train,y_train)
```

In [None]:
# Code here please
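
The default parameters include `criterion='gini'`, so the tree picks splits that reduce Gini impurity. As a rough reminder of what that quantity measures, here is a minimal sketch; the function name `gini_impurity` and the labels are made up for illustration and are not part of scikit-learn.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Made-up labels for illustration (not taken from the dataset)
pure_node = ["Y", "Y", "Y", "Y"]    # only one class present
mixed_node = ["Y", "N", "Y", "N"]   # two classes, evenly mixed

print(gini_impurity(pure_node))
print(gini_impurity(mixed_node))
```

A node containing a single class has impurity 0, and an evenly mixed two-class node has the two-class maximum of 0.5.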

### Prediction and Evaluation 

Evaluation is important to see how well the model works!

```Python
# doing predictions 
predictions = dtree.predict(X_test)
```

In [None]:
# Code here please

```Python
from sklearn.metrics import classification_report,confusion_matrix
```

In [None]:
# Code here please

```Python
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))
```

In [None]:
# Code here please
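
The numbers in the classification report can be recomputed by hand from the confusion matrix. The sketch below uses a made-up 2x2 matrix (not the output of this notebook) and assumes scikit-learn's layout, where rows are the true labels and columns are the predicted labels in sorted order, here 'N' then 'Y'.

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows = true labels, columns = predicted labels, label order ['N', 'Y'].
cm = [[40, 8],
      [6, 37]]

tn, fp = cm[0]   # true 'N': predicted 'N' (tn), predicted 'Y' (fp)
fn, tp = cm[1]   # true 'Y': predicted 'N' (fn), predicted 'Y' (tp)

# Metrics for the positive class 'Y'
precision = tp / (tp + fp)   # of the predicted 'Y', how many were really 'Y'
recall = tp / (tp + fn)      # of the real 'Y', how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```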

It looks like the decision tree is doing quite well, although the model still mislabels some samples! <br><br><br>
### Random Forests
Let's try a Random Forests model on the data and compare our results with the decision tree model. Random Forests lives in the `ensemble` module of sklearn.


```Python
from sklearn.ensemble import RandomForestClassifier
```

In [None]:
# Code here please

As you may remember from the theory lecture, we set the number of trees in the forest with n_estimators. The default was 10 in older scikit-learn releases and has been 100 since version 0.22. <br>
Let's pass 100 explicitly and fit the model to the training dataset.<br>&#9758; *You can play with n_estimators by trying different numbers!*

```Python
# Creating instance and fitting the model
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
```

In [None]:
# Code here please
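
One way to play with `n_estimators` is to loop over a few values and compare test accuracy. The sketch below runs on synthetic data from `make_classification` as a stand-in for the heart disease data, so the scores it prints say nothing about this dataset; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 features, like our feature matrix
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42)

# Fit one forest per setting and record its accuracy on the held-out part
scores = {}
for n in (10, 50, 100, 200):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_tr, y_tr)
    scores[n] = model.score(X_te, y_te)

for n, acc in scores.items():
    print(f"n_estimators={n:>3}: test accuracy = {acc:.3f}")
```

In the notebook, the same loop could be run with `X_train`, `X_test`, `y_train` and `y_test` from the split above.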

**Note:** Follow [this link](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) if you want to learn more about the parameters in RandomForestClassifier. We have only passed n_estimators = 100 here, and this is the one we change most frequently. All other parameters keep their default values.<br>


```Python
# doing predictions
rfc_pred = rfc.predict(X_test)
```

In [None]:
# Code here please

```Python
# Evaluation
print(classification_report(y_test,rfc_pred))
print(confusion_matrix(y_test,rfc_pred))
```

In [None]:
# Code here please

It looks like the random forest gave improved results over a single tree for the dataset we have used. We got better precision, recall and f1-score using Random Forest, and fewer mislabeled samples!<br>
As the dataset gets larger, Random Forests will usually do better than a single decision tree, because averaging many trees reduces the variance of the predictions. In the current situation the dataset is not very large, but the Random Forests model still works better than the single decision tree, and its advantage tends to grow with larger datasets. 
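
A comparison based on a single train/test split can swing with the random seed, so cross-validation gives a steadier picture. The sketch below is an optional extra, run on synthetic stand-in data rather than the heart disease file, and it uses `cross_val_score`, which this notebook does not otherwise cover.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart disease dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_demo, y_demo, cv=5)

print(f"single tree  : mean={tree_scores.mean():.3f} std={tree_scores.std():.3f}")
print(f"random forest: mean={forest_scores.mean():.3f} std={forest_scores.std():.3f}")
```
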
## Excellent work! 
So far, we have done great. Let's move on and do the project and practice our skills.

## Let's look at the feature importance for both tree and random forests.

```Python
df_feature_importance= pd.DataFrame(X_test.columns, columns = ['features'])
df_feature_importance['feature_imp_Tree']=dtree.feature_importances_
df_feature_importance['feature_imp_RF']=rfc.feature_importances_
df_feature_importance.sort_values(by = ['feature_imp_Tree'], ascending = False, inplace = True)
df_feature_importance.head(2)
# if you want, you can set the 'features' column as the index
```

In [None]:
# Code here please

```Python
df_feature_importance.plot(x='features', kind = 'bar', figsize = (16, 4))
plt.ylabel('Importance')
```

In [None]:
# Code here please

### Optional
### Tree Visualization 
scikit-learn has built-in decision tree visualization capabilities. However, we don't use them most of the time because we often work with Random Forests, which are much harder to visualize: they contain hundreds of trees rather than a single decision tree. We need the pydot library for the tree visualization.<br>
Install pydot with `pip install pydot` in a terminal.<br>
You also need Graphviz, which is not actually a Python library but a separate program that renders the image. Install both of these and copy-paste the code from the reference notebook to see what your tree looks like!

```Python
from io import StringIO  # sklearn.externals.six was removed in scikit-learn 0.23
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydot

# Feature names, in the same order as the columns used for training
features = list(X.columns)

dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,
                feature_names=features, filled=True, rounded=True)

# graph_from_dot_data returns a list of graphs; we render the first one
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())
```
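
If installing pydot and Graphviz is not an option, scikit-learn (version 0.21 and later) can also print a tree as plain text with `export_text`. The sketch below fits a small tree on the built-in iris data purely as a stand-in; `demo_tree` is an illustrative name, not part of this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in iris data as a stand-in for the heart disease data
iris = load_iris()

# A shallow tree keeps the printed output short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# One line per node: split rules on internal nodes, classes on the leaves
tree_text = export_text(demo_tree, feature_names=list(iris.feature_names))
print(tree_text)
```

For the tree trained in this notebook, the equivalent call would be `export_text(dtree, feature_names=list(X.columns))`.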

In [None]:
import pickle

# save the model to disk
filename = 'final_model.sav'

# rfc is our model, see above!
# the file will be stored in the current working directory
with open(filename, 'wb') as f:  # 'wb' opens the file for writing in binary mode
    pickle.dump(rfc, f)

# load the model from disk
with open(filename, 'rb') as f:  # 'rb' opens the file for reading in binary mode
    loaded_model = pickle.load(f)

# let's do predictions using the stored model after loading
predictions = loaded_model.predict(X_test)

# Let's pass y_test and predictions to get the confusion_matrix
print(confusion_matrix(y_test, predictions))