# Decision Trees

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style("darkgrid")

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

### Data loading

We are going to use the churn dataset for a binary classification task. You can load it from the data folder in Week 5.

In [None]:
df = pd.read_csv("PATH_TO_CHURN_DATA")
df.head()

Similar to what we did in assignment 3, we need to find the data points that don't have a `TotalCharges` value and remove them.

In [None]:
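# True for rows whose TotalCharges string is a valid non-negative number
z = df["TotalCharges"].map(lambda x: x.replace('.', '', 1).isdigit())
# Keep the valid rows; drop=True discards the old index instead of adding it as an "index" column
df = df[z].reset_index(drop=True)
df.shape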

In [None]:
df["TotalCharges"] = df["TotalCharges"].astype(float)
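The filtering and conversion steps above can also be written with `pd.to_numeric`. The following is a minimal sketch on a made-up three-row table (the `toy` frame and its values are invented for illustration and are not taken from the churn data); `errors="coerce"` turns strings that cannot be parsed into `NaN`, which are then dropped.

```python
import pandas as pd

# Toy stand-in for the churn table; the blank string mimics a missing TotalCharges value.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# Strings that cannot be parsed as numbers become NaN instead of raising an error.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Drop the rows where the conversion failed and renumber the index from 0.
toy = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(toy)
```

Unlike the `isdigit` check, this approach would also accept negative numbers and scientific notation, so which one is preferable depends on what counts as a valid value.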

We will use the following features:

In [None]:
X = df[["tenure", "MonthlyCharges", "TotalCharges"]]
y = df["Churn"]

Encoding the categorical features with __Label Encoding__:

In [None]:
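le = LabelEncoder()
for col in ["gender", "PhoneService", "TechSupport", "StreamingTV", "PaperlessBilling"]:
    # Map each category to an integer code and wrap the result in a one-column DataFrame
    encoded_col = pd.DataFrame(le.fit_transform(df[col]), columns=[col])
    # concat aligns rows on the index, so this relies on df having been re-indexed from 0
    X = pd.concat((X, encoded_col), axis=1)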
X.head()
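To make the encoding step concrete, here is a small pure-Python sketch of what label encoding does: the distinct values are sorted and each one is replaced by its position in that sorted list. The helper `label_encode` is hypothetical and only mimics the behaviour of `LabelEncoder` on a single column.

```python
def label_encode(values):
    """Replace each value by the index of that value among the sorted distinct values."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], classes

codes, classes = label_encode(["Yes", "No", "No", "Yes"])
print(codes)    # integer codes, one per input value
print(classes)  # the category that each code stands for
```

For binary columns such as the ones used here the codes are simply 0 and 1. For columns with more than two categories, the integer codes impose an arbitrary order, although a decision tree can still isolate any single category with two threshold splits.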

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [None]:
clf = DecisionTreeClassifier(criterion='entropy')

In [None]:
clf.fit(X_train, y_train)

In [None]:
# test accuracy
clf.score(X_test,y_test)

In [None]:
# depth of the decision tree
clf.get_depth()

### Tuning the depth of the tree:

The `max_depth` parameter sets the maximum depth of the tree. By default it is `None`, which means that nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples (2 by default). With these default parameters, the decision tree can therefore grow very deep and __over-fit__ the training data.
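The effect of leaving `max_depth` unset can be seen on synthetic data. This is a sketch under stated assumptions: the data come from `make_classification` with 10% label noise and stand in for the churn features, and the sample sizes and seeds are arbitrary choices, so the printed numbers will not match the churn results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 10% of the labels flipped at random (flip_y).
X_toy, y_toy = make_classification(n_samples=1000, n_features=8, n_informative=4,
                                   flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

results = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (actual depth, training accuracy, test accuracy) for each setting.
    results[depth] = (tree.get_depth(), tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, results[depth])
```

The unrestricted tree keeps splitting until every leaf is pure, so it fits the noisy training labels perfectly; comparing its test accuracy with that of the shallow tree shows how much of that fit generalizes.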

#### Exercise:
Tune the `max_depth` parameter and find the best value for it. What are the precision and recall of the best decision-tree classifier?
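As a reminder of the two metrics the exercise asks for, precision is the fraction of predicted positives that are truly positive, and recall is the fraction of true positives that were found. The sketch below computes both from scratch on made-up labels; the helper `precision_recall` and the choice of `"Yes"` as the positive class are assumptions for illustration.

```python
def precision_recall(y_true, y_pred, positive="Yes"):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels, not taken from the churn data.
y_true = ["Yes", "No", "Yes", "Yes", "No", "No"]
y_pred = ["Yes", "No", "No", "Yes", "Yes", "No"]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

In practice, `sklearn.metrics.precision_score` and `recall_score` compute the same quantities; with string labels they need `pos_label="Yes"`.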

In [None]:
scores = []
for d in range(1, 21):
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

In [None]:
# x-axis uses the actual depths (1 to 20) rather than the list positions (0 to 19)
plt.plot(range(1, 21), scores)
plt.ylabel('accuracy', fontsize=15)
plt.xlabel('depth', fontsize=15)

In [None]:
# best depth: argmax returns a zero-based position, while the depths start at 1
np.argmax(scores) + 1
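The loop above scores each depth on the test set, so the accuracy reported for the chosen depth is somewhat optimistic. One common alternative is to choose the depth by cross-validation on the training data only. The following sketch uses synthetic data from `make_classification` as a stand-in for the churn features; the 5 folds and the depth range 1 to 10 are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set.
X_toy, y_toy = make_classification(n_samples=600, n_features=8, n_informative=4,
                                   flip_y=0.1, random_state=0)

depths = list(range(1, 11))
cv_scores = []
for d in depths:
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=d, random_state=0)
    # Mean accuracy over 5 folds; the held-out test set is never touched here.
    cv_scores.append(cross_val_score(tree, X_toy, y_toy, cv=5).mean())

# Map the zero-based argmax position back to the corresponding depth.
best_depth = depths[int(np.argmax(cv_scores))]
print(best_depth, max(cv_scores))
```

With the depth fixed this way, the test set can be used once at the end for an unbiased estimate of accuracy, precision and recall.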

### Feature importance
The importance of a feature is computed as the (normalized) total reduction of the criterion (which is entropy in this case) brought by that feature.
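To illustrate the quantity being summed, the sketch below computes the entropy of a node and the reduction in entropy (information gain) produced by one split. The labels are made up, and the helpers `entropy` and `information_gain` are hypothetical names; a feature's importance then adds up such reductions, weighted by the share of samples reaching each node, over every node that splits on that feature.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# A node with 4 "Yes" and 4 "No" samples, split into two groups of 4.
parent = ["Yes"] * 4 + ["No"] * 4
left = ["Yes"] * 3 + ["No"]
right = ["Yes"] + ["No"] * 3
gain = information_gain(parent, left, right)
print(entropy(parent), gain)
```

A perfectly balanced binary node has an entropy of 1 bit, and a split that leaves both children as mixed as the parent has zero gain.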

In [None]:
# feature importances for best classifier
clf = DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(X_train, y_train)
clf.feature_importances_

In [None]:
sorted(zip(X_train.columns, clf.feature_importances_), key=lambda x: x[1], reverse=True)

What is the most important feature in this classification task? Does it make sense?

### Visualizing the decision tree

Let's visualize the decision tree with the optimum `max_depth` parameter.

In [None]:
!pip install pydotplus
!pip install graphviz

In [None]:
from IPython.display import SVG, display
from sklearn.tree import export_graphviz
from graphviz import Source

# With out_file=None, export_graphviz returns the tree as a string in DOT format
graph = Source(export_graphviz(clf, out_file=None,
                               feature_names=list(X_train.columns),
                               class_names=['No', 'Yes'],
                               filled=True))
display(SVG(graph.pipe(format='svg')))

## Exercise: decision tree regressor

In this exercise we will use decision trees for a regression problem, using the Boston Housing dataset. This dataset contains information about houses in the suburbs of Boston. There are 506 samples and 14 attributes. For simplicity and visualization purposes, we will use only two of them: LSTAT (percentage of lower-status population) as the feature and MEDV (median value of owner-occupied homes in $1000s) as the target.

In [None]:
from sklearn import datasets

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()            # Load Boston dataset
df = pd.DataFrame(boston.data[:, 12])      # Create DataFrame using only the LSTAT feature
df.columns = ['LSTAT']
df['MEDV'] = boston.target                 # Create new column with the target MEDV
df.head()

### Train-test split

In [None]:
# TO DO

### Decision tree regressor
Now it is time to find a model which fits this data. Note that as we have a regression problem here, we need a criterion which is suitable for a continuous output. Tune the maximum depth of the tree and find the best value. What is the test error for this regressor?
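For a continuous target, a common criterion is the mean squared error (MSE) around the mean of each child node. The sketch below searches for the single threshold on one feature that minimizes the size-weighted MSE of the two children, which is roughly what a depth-1 regression tree does. The helpers `mse` and `best_split` are hypothetical, and the six data points are invented to resemble an LSTAT/MEDV relationship; they are not taken from the Boston data.

```python
def mse(values):
    """Mean squared deviation of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(x, y):
    """Return (threshold, weighted child MSE) of the best single split on x."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        cost = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if cost < best[1]:
            best = (threshold, cost)
    return best

# Invented points: low feature values go with high targets and vice versa.
x = [2.0, 4.0, 6.0, 20.0, 25.0, 30.0]
y = [35.0, 33.0, 34.0, 12.0, 10.0, 11.0]
threshold, cost = best_split(x, y)
print(threshold, cost)
```

Each leaf then predicts the mean target of its samples, which is why a regression tree on a single feature produces a step-shaped fit.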

In [None]:
# TO DO: training the decision tree regressor
from sklearn.tree import DecisionTreeRegressor
# an example of decision tree regressor with depth=3
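# 'squared_error' is the criterion's name since scikit-learn 1.0 (earlier versions call it 'mse')
tree = DecisionTreeRegressor(criterion='squared_error', max_depth=3)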

In [None]:
# TO DO: tune the max_depth parameter

In [None]:
# TO DO: test error for the best decision tree regressor

 
Plot the data points together with the fitted regression tree line to see how well the model fits the data.

In [None]:
# TO DO
plt.figure(figsize=(16, 8))
plt.scatter(.., .., c='steelblue', edgecolor='white', s=70)   # Plot actual target against features
plt.plot(.., .., color='black', lw=2)               # Plot predicted target against features
plt.xlabel('% lower status of the population [LSTAT]')
plt.ylabel('Price in $1000s [MEDV]')
plt.show()