<h1>Classification Trees</h1>

```
Carseats.csv
A data frame containing observations on sales of child car seats at 400 different stores and the following 11 variables:

Sales        Unit sales (in thousands) at each location
CompPrice    Price charged by competitor at each location
Income       Community income level (in thousands of dollars)
Advertising  Local advertising budget for company at each location (in thousands of dollars)
Population   Population size in region (in thousands)
Price        Price company charges for car seats at each site
ShelveLoc    A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
Age          Average age of the local population
Education    Education level at each location
Urban        A factor with levels No and Yes to indicate whether the store is in an urban or rural location
US           A factor with levels No and Yes to indicate whether the store is in the US or not
```

In [None]:
import pydot
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz

# Render a fitted tree as a pydot graph (requires Graphviz to be installed)
def print_tree(estimator, features, class_names=None, filled=True):
    dot_data = StringIO()
    export_graphviz(estimator, out_file=dot_data, feature_names=list(features),
                    class_names=class_names, filled=filled)
    (graph,) = pydot.graph_from_dot_data(dot_data.getvalue())
    return graph

In [None]:
import pandas as pd
df = pd.read_csv('https://r-data.pmagunia.com/system/files/datasets/dataset-11424.csv')
df.head()

In [None]:
# Pre-processing the data

# Binary response: 1 if unit sales exceed 8 (thousand), else 0
df['High'] = (df.Sales > 8).astype(int)
# Encode the categorical predictors as integers
df.ShelveLoc = pd.factorize(df.ShelveLoc)[0]
df.Urban = df.Urban.map({'No': 0, 'Yes': 1})
df.US = df.US.map({'No': 0, 'Yes': 1})
df.head()

In [None]:
X = df.drop(['Sales', 'High'], axis=1)
y = df.High

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

<h2>The Basics</h2>

A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one. Where a regression tree predicts the mean response of the training observations in a region, a classification tree predicts that each observation belongs to the <em>most commonly occurring class</em> of training
observations in the region to which it belongs.

The task of growing a classification tree is quite similar to the task of growing a regression tree. Just as in the regression setting, we use recursive binary splitting to grow a classification tree. However, in the classification setting the response is categorical, so $RSS$ cannot be used as the criterion for making the binary splits. Three alternative criteria are described below.

<ins>__Option 1__</ins>: Classification Error Rate

Since we plan to assign an observation in a given region to the most commonly occurring class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class:

$$E=1-\max _{k}\left(\hat{p}_{m k}\right)$$

Here $\hat{p}_{m k}$ represents the proportion of training observations in the $m$th region that are from the $k$th class.

In [None]:
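import numpy as np  # used below; otherwise only imported in a later cell

def classification_error(p):  # two-class case: p is one class's proportion
    return 1 - np.max([p, 1 - p])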
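
The function above covers only the two-class case. As a minimal sketch of the general formula $E=1-\max_k(\hat{p}_{mk})$, the following computes the error rate directly from the class labels in a node. The helper name `classification_error_from_labels` and the toy node are assumptions made for illustration; they are not part of the original notebook.

```python
from collections import Counter

def classification_error_from_labels(labels):
    """Fraction of observations that are not in the node's most common class."""
    counts = Counter(labels)                  # observations per class
    most_common_count = max(counts.values())  # size of the majority class
    return 1 - most_common_count / len(labels)

# Hypothetical node with three classes: 6 x 'a', 3 x 'b', 1 x 'c'
node = ['a'] * 6 + ['b'] * 3 + ['c']
print(classification_error_from_labels(node))
```

Here the majority class `'a'` holds 6 of the 10 observations, so the error rate is the remaining share of 4 out of 10.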

<ins>__Option 2__</ins>: Gini Index

The Gini index is defined by
$$
G=\sum_{k=1}^{K} \hat{p}_{m k}\left(1-\hat{p}_{m k}\right)
$$
which measures the total variance across the $K$ classes. The Gini index takes on a small value if all of the $\hat{p}_{m k}$ are close to zero or one. For this reason it is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.

In [None]:
# Two-class case: p*(1 - p) + (1 - p)*p = 2*p*(1 - p)
def gini(p):
    return 2 * p * (1 - p)
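
The two-class shortcut above is a special case of the sum over $K$ classes. As a hedged sketch, the function below evaluates $G=\sum_k \hat{p}_{mk}(1-\hat{p}_{mk})$ for any list of class proportions and checks that it agrees with the two-class form. The names `gini_from_proportions` and `gini_two_class` are hypothetical and chosen only for this example.

```python
def gini_from_proportions(proportions):
    """Gini index G = sum_k p_k * (1 - p_k) for the class proportions of one node."""
    return sum(p * (1 - p) for p in proportions)

def gini_two_class(p):
    """Two-class shortcut, same formula as the notebook's gini(p)."""
    return 2 * p * (1 - p)

# The general sum and the shortcut agree when there are two classes
print(gini_from_proportions([0.3, 0.7]), gini_two_class(0.3))

# A pure node has zero impurity; an evenly mixed three-class node does not
print(gini_from_proportions([1.0, 0.0, 0.0]))
print(gini_from_proportions([1/3, 1/3, 1/3]))
```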

<ins>__Option 3__</ins>: Cross-Entropy

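The cross-entropy, rescaled here by a constant factor, is given by
$$
D=-\frac{1}{2\log(2)}\sum_{k=1}^{K} \hat{p}_{m k} \log \hat{p}_{m k}
$$
The factor $\frac{1}{2\log(2)}$ is not part of the usual definition of cross-entropy. It only rescales $D$ so that, with two classes, it peaks at $0.5$ when both proportions equal $0.5$, the same maximum as the Gini index.

Since $0 \leq \hat{p}_{m k} \leq 1$, it follows that $0 \leq-\hat{p}_{m k} \log \hat{p}_{m k}$. One can show that the entropy will take on a value near zero if the $\hat{p}_{m k}$ are all near zero or near one. Therefore, like the Gini index, the entropy will take on a small value if the $m$th node is pure.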

👉🏼 In fact, it turns out that the Gini index and the entropy are quite similar numerically.

In [None]:
# Two-class cross-entropy, scaled by 1/(2*log(2)); undefined at p = 0 and p = 1
def entropy(p):
    return -(p*np.log(p) + (1 - p)*np.log(1 - p)) / (2*np.log(2))
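
To make the claim that the Gini index and the entropy are numerically similar more concrete, the sketch below evaluates both on a grid of two-class proportions and reports the largest gap. It assumes the same $\frac{1}{2\log(2)}$ scaling as the function above; `scaled_entropy` and `gini_from_proportions` are hypothetical helper names, and terms with a zero proportion are skipped because $p\log p \to 0$ as $p \to 0$.

```python
import math

def scaled_entropy(proportions):
    """Cross-entropy -sum_k p_k*log(p_k), divided by 2*log(2)."""
    # Skip p == 0: the limit of p*log(p) is 0, but math.log(0) would raise
    total = sum(-p * math.log(p) for p in proportions if p > 0)
    return total / (2 * math.log(2))

def gini_from_proportions(proportions):
    """Gini index sum_k p_k*(1 - p_k)."""
    return sum(p * (1 - p) for p in proportions)

# Compare the two measures for p = 0.01, 0.02, ..., 0.99
grid = [i / 100 for i in range(1, 100)]
max_gap = max(
    abs(scaled_entropy([p, 1 - p]) - gini_from_proportions([p, 1 - p]))
    for p in grid
)
print(round(max_gap, 3))
```

Both measures are zero for a pure node and equal $0.5$ at $p=0.5$, so any gap between them arises only at intermediate proportions.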

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.0, 1.0, 0.01)
class_error_vals = [classification_error(i) for i in x]
gini_vals = gini(x)
# Entropy is undefined at p = 0, so leave a gap there
entropy_vals = [entropy(i) if i != 0 else np.nan for i in x]

fig, ax = plt.subplots()

for vals, lab, c in zip(
    [class_error_vals, gini_vals, entropy_vals],
    ['Class. Error Rate', 'Gini Index', 'Cross-entropy'],
    ['red', 'blue', 'green']):
    ax.plot(x, vals, label=lab, linestyle='-', lw=3, color=c)

ax.legend(loc='lower center', fancybox=True, shadow=False)

plt.ylim([0, 0.52])
plt.xlabel('p')
plt.ylabel('Impurity Index: E, G, D')
plt.show()
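
To tie the impurity measures back to recursive binary splitting, here is a minimal sketch of a single split search on one numeric feature: every midpoint between consecutive distinct values is tried as a threshold, and the one with the lowest size-weighted Gini index of the two child nodes is kept. This is an illustration under simplifying assumptions (one feature, 0/1 labels), not how `sklearn` implements its trees; the names `gini_from_labels` and `best_split` and the toy data are hypothetical.

```python
def gini_from_labels(labels):
    """Two-class Gini index 2*p*(1 - p) for a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # share of class 1 in the node
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float('inf')
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold separates identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Impurity of the two children, weighted by their sizes
        score = (len(left) * gini_from_labels(left)
                 + len(right) * gini_from_labels(right)) / n
        if score < best_score:
            best_threshold, best_score = (low + high) / 2, score
    return best_threshold, best_score

# Toy data (made up): a price and a 0/1 "High sales" indicator
prices = [80, 90, 100, 110, 120, 130]
high = [1, 1, 1, 0, 0, 0]
threshold, impurity = best_split(prices, high)
print(threshold, impurity)  # → 105.0 0.0
```

Growing a full tree would repeat this search over every feature and then recurse into each child node until a stopping rule such as a maximum depth is reached.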

<h2>Example</h2>

The [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from `sklearn` does not offer the classification error rate as a splitting criterion. The two criteria compared here are the `gini` and `entropy` impurity indexes:

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=6)
clf_gini.fit(X_train, y_train)
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=6)
clf_entropy.fit(X_train, y_train)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train, clf_gini.predict(X_train)))
print(classification_report(y_train, clf_entropy.predict(X_train)))

In [None]:
graph = print_tree(clf_gini, features=X_train.columns, class_names=['No', 'Yes'])
Image(graph.create_png())

In [None]:
graph = print_tree(clf_entropy, features=X_train.columns, class_names=['No', 'Yes'])
Image(graph.create_png())

In [None]:
pred_gini = clf_gini.predict(X_test)
pred_entropy = clf_entropy.predict(X_test)

In [None]:
# Test-set precision was 75% in the original run; it can vary slightly
# between runs because the tree is fit without a fixed random_state
print(classification_report(y_test, pred_gini))

In [None]:
# Test-set precision was 73% in the original run; it can vary slightly
# between runs because the tree is fit without a fixed random_state
print(classification_report(y_test, pred_entropy))
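
The reports above list precision for each class. As a hedged illustration of what that number means, the sketch below computes precision for the positive class by hand as $TP/(TP+FP)$, the share of positive predictions that are correct. The function name `precision_from_predictions` and the toy labels are assumptions made for this example, and returning 0 when there are no positive predictions is a convention chosen for the sketch.

```python
def precision_from_predictions(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions; convention chosen for this sketch
    return tp / (tp + fp)

# Toy labels (made up): three positive predictions, two of them correct
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0]
toy_precision = precision_from_predictions(y_true_toy, y_pred_toy)
print(toy_precision)
```

With two true positives and one false positive, the toy precision is 2 out of 3.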

<h2>Choosing Hyperparameters</h2>

In [None]:
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_score

# Specify the cross-validation generator: stratified 5-fold CV repeated 10 times
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

param_grid = [
    {'criterion': ['gini','entropy'], 'max_depth': range(2, 10)}
]

scoring = make_scorer(precision_score, greater_is_better=True)
g_cv = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid=param_grid,
                    scoring=scoring, cv=cv, n_jobs=-1)

g_cv.fit(X_train,y_train)

In [None]:
results_df = pd.DataFrame(g_cv.cv_results_)
results_df = results_df.sort_values(by=['rank_test_score'])
results_df = (
    results_df
    .set_index(results_df["params"].apply(
        lambda x: "_".join(str(val) for val in x.values()))
    )
    .rename_axis('params')
)
results_df[
    ['params', 'rank_test_score', 'mean_test_score', 'std_test_score']
]

In [None]:
precision_score(y_test, g_cv.best_estimator_.predict(X_test))