<h1>Decision Trees - Learning Nonlinearities using Rules</h1>

<b>Outline</b>
<ul>
    <li>Decision Trees: Concept</li>
    <li>Decision Trees: Examining nonlinearity learning</li>
    <li>Decision Trees: Improving generalisation via pruning</li>
</ul>

<h2>1. Introduction </h2>

Decision trees are universal approximators that use recursive partitioning to divide a dataset into homogeneous subgroups.

<img src="../Regression/media/decision_trees_.png" width="400px"/>

The <b>top node</b> is referred to as the <b>root node</b> and is the starting decision node (e.g., is Gender Male or Female?). A <b>branch</b> is a subset of the dataset obtained as an outcome of a test. <b>Internal nodes</b> are decision nodes from which subsequent branches are obtained. The <b>depth</b> of a node is the number of decisions it takes to reach it from the root node. The <b>leaf nodes</b> sit at the end of the last branches of the tree and determine the output (class label or regression value).
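The terminology above can be made concrete with a small data structure. The following is a minimal sketch, not a library implementation: the `Node` class, the `predict` helper, and the feature names and thresholds in the example tree are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Node:
    """A tree node: internal nodes carry a test, leaf nodes carry a prediction."""
    feature: Optional[str] = None        # attribute tested at an internal node
    threshold: Optional[float] = None    # go left if value <= threshold
    left: Optional["Node"] = None        # branch where the test is true
    right: Optional["Node"] = None       # branch where the test is false
    prediction: Union[int, float, None] = None  # set only on leaf nodes

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node: Node, record: dict, depth: int = 0):
    """Follow the branches from `node` down to a leaf.

    Returns the leaf's prediction and the depth of that leaf.
    """
    if node.is_leaf():
        return node.prediction, depth
    branch = node.left if record[node.feature] <= node.threshold else node.right
    return predict(branch, record, depth + 1)


# Hypothetical tree: the root node tests Glucose, one internal node tests BMI.
root = Node(
    feature="Glucose", threshold=120.0,
    left=Node(prediction=0),
    right=Node(feature="BMI", threshold=30.0,
               left=Node(prediction=0), right=Node(prediction=1)),
)

label, depth = predict(root, {"Glucose": 148, "BMI": 33.6})
print(label, depth)  # 1 2
```

The record fails both tests, so it travels down two branches and ends in a leaf at depth 2 that predicts class 1.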

<h2>2. Building a decision tree</h2>

Given a dataset of <b>n features and m records</b>, a rule-based graph is formed <b>iteratively by recursive partitioning</b> until the dataset is split into homogeneous groups that represent the <b>same target class</b> in a classification problem or <b>share close target values</b> in a regression problem.

1. From the root node (i.e. with all the m records), the most informative attribute is identified using a feature importance score. The <b>Gini index</b> is one of the most commonly used feature importance scores; entropy (information gain) is a common alternative.

$$ Gini(f) = \sum_{i=1}^{N_c}P(class=i|f)(1-P(class=i|f)) = 1 - \sum_{i=1}^{N_c}P(class=i|f)^2 $$

Overall Gini index of a split on feature f into two subsets S_i and S_j, weighted by subset size:

$$
Gini_{split}(f) = \frac{n_{S_i}}{n_{S_i}+n_{S_j}}Gini(f_{S_i}) + \frac{n_{S_j}}{n_{S_i}+n_{S_j}}Gini(f_{S_j})
$$

<b>The feature with the lowest overall Gini index is selected.</b>
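The two Gini formulas above can be checked on a small example. This is a minimal sketch in plain Python: the helper names `gini` and `weighted_gini` and the toy labels are assumptions made for illustration, not part of the original notebook.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted Gini impurity of a split into two subsets."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy labels (made up for illustration): 1 = has diabetes, 0 = no diabetes.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
split_a = ([1, 1, 1, 0], [0, 0, 0, 0])   # subsets produced by candidate feature A
split_b = ([1, 1, 0, 0], [1, 0, 0, 0])   # subsets produced by candidate feature B

print(gini(parent))              # 0.46875
print(weighted_gini(*split_a))   # 0.1875
print(weighted_gini(*split_b))   # 0.4375
```

Feature A yields the lower overall Gini index (0.1875 against 0.4375), so under this criterion it would be chosen for the split.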

For a regression problem, the quality of the split is typically measured using the mean squared error:

$$
\bar{y}_{S_i} = \frac{1}{n_{S_i}}\sum_{y\in S_i}y
$$

$$
MSE(S_i) = \frac{1}{n_{S_i}}\sum_{y\in S_i}(y-\bar{y}_{S_i})^2
$$
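As a rough illustration of how the MSE can drive a regression split, the sketch below scans candidate thresholds on a single feature. Combining the two subsets with a size-weighted average is an assumption that mirrors the overall Gini formula; the helper names `mse` and `best_threshold` and the data are hypothetical.

```python
def mse(values):
    """Mean squared error of a subset around its own mean."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def best_threshold(x, y):
    """Try the midpoints between consecutive distinct x values.

    Returns (threshold, size-weighted MSE of the two resulting subsets).
    """
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates identical feature values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [target for _, target in pairs[:k]]
        right = [target for _, target in pairs[k:]]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up data with two clearly separated groups of target values.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, score = best_threshold(x, y)
print(threshold, round(score, 4))  # 6.5 0.6667
```

The chosen threshold falls between the two groups, where each subset shares close target values.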

2. Given an appropriate feature importance score, the decision tree is then built by recursive partitioning as follows.

<b>Decision Tree Pseudo-code</b>

Step 1: Start from a dataset of m records with n attributes and a target variable y<br/>
Step 2: Rank the features according to the chosen feature importance score<br/>
Step 3: Split the dataset on the feature with the best importance score<br/>
Step 4: Repeat Steps 2 and 3 on each new subset until a stopping criterion is met
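The steps above can be sketched as a short recursive function. This is an illustrative toy, not how scikit-learn builds its trees: it assumes numeric features, binary threshold splits scored with the weighted Gini index, and a `max_depth` stopping criterion. All names (`best_split`, `build_tree`, `predict`) and the data are hypothetical.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Steps 2-3: score every (feature, threshold) pair, keep the lowest weighted Gini."""
    best = None  # (score, feature index, threshold)
    for j in range(len(rows[0])):
        # Dropping the largest value guarantees that both subsets are non-empty.
        for threshold in sorted({row[j] for row in rows})[:-1]:
            left = [lab for row, lab in zip(rows, labels) if row[j] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Step 4: recurse on each subset until it is pure, unsplittable, or too deep."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:  # all rows identical: nothing left to split on
        return {"leaf": majority}
    _, j, threshold = split
    left_idx = [i for i, row in enumerate(rows) if row[j] <= threshold]
    right_idx = [i for i, row in enumerate(rows) if row[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left_idx],
                           [labels[i] for i in left_idx], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right_idx],
                            [labels[i] for i in right_idx], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]


# Step 1: toy data (made up); columns are [glucose, bmi], y is the class label.
rows = [[90, 22], [100, 35], [180, 24], [160, 36], [170, 40], [95, 38]]
labels = [0, 0, 0, 1, 1, 0]

tree = build_tree(rows, labels)
print(tree)
print([predict(tree, r) for r in rows] == labels)
```

With no duplicate records carrying conflicting labels, the recursion keeps splitting until every leaf is pure, which is why the tree reproduces the training labels.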

<h2>3. Pruning a decision tree</h2>

A decision tree can reach 100% accuracy on the training set, because it can keep splitting the data until each leaf holds a single record (i.e. guaranteed homogeneity), provided no two identical records carry different labels. However, this comes with the risk of overfitting: the tree may lose its ability to generalise to unseen data. A pruning phase post-processes the decision tree, removing some rules and allowing some heterogeneity in the data subgroups in order to improve generalisation on unseen data.
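One concrete form of post-pruning available in scikit-learn is minimal cost-complexity pruning, controlled by the `ccp_alpha` parameter. The text above does not name a specific pruning method, so this is only one possible choice. The sketch below uses a synthetic dataset (an assumption, not the diabetes data) to show how larger `ccp_alpha` values shrink the tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, used only for illustration.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fully grown tree: keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate pruning strengths computed from the training data.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::5]:
    alpha = max(alpha, 0.0)  # guard against tiny negative values from rounding
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves():3d}  "
          f"train={pruned.score(X_train, y_train):.3f}  "
          f"test={pruned.score(X_test, y_test):.3f}")
```

In practice a suitable `ccp_alpha` would be chosen by cross-validation, for example by adding it to a `GridSearchCV` parameter grid.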

<h2>Python Implementation</h2>

In [18]:
import pandas as pd

df = pd.read_csv('../../datasets/Healthcare-Diabetes.csv')
df.head()

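<table>
    <tr><th></th><th>Id</th><th>Pregnancies</th><th>Glucose</th><th>BloodPressure</th><th>SkinThickness</th><th>Insulin</th><th>BMI</th><th>DiabetesPedigreeFunction</th><th>Age</th><th>Outcome</th></tr>
    <tr><td>0</td><td>1</td><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td><td>1</td></tr>
    <tr><td>1</td><td>2</td><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td><td>0</td></tr>
    <tr><td>2</td><td>3</td><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td><td>1</td></tr>
    <tr><td>3</td><td>4</td><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td><td>0</td></tr>
    <tr><td>4</td><td>5</td><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td><td>1</td></tr>
</table>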


In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

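X = df.iloc[:, 1:9]  # features: Pregnancies ... Age (column 0 is the record Id)
y = df.iloc[:, -1]   # target variable: Outcome

In [20]:
# Train/test split first, so that the scaler never sees the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# Data scaling, fitted on the training set only to avoid data leakage.
# Tree splits are insensitive to monotonic feature scaling, so this step is optional here.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)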

In [21]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

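# Hyperparameter grid: maximum tree depth and minimum number of samples per leaf
dt_parameters = {'max_depth': [None, 8, 10], 'min_samples_leaf': [1, 2, 3]}

dt = DecisionTreeClassifier()
clf = GridSearchCV(dt, dt_parameters, cv=5)  # 5-fold cross-validated grid search
clf.fit(X_train, y_train)
clf.best_params_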

{'max_depth': None, 'min_samples_leaf': 1}

In [22]:
from sklearn import metrics
from sklearn.metrics import classification_report

y_pred_test = clf.predict(X_test)  # predictions from the best estimator found by the grid search
class_acc_dt = metrics.accuracy_score(y_test, y_pred_test)  # overall test accuracy

targets = ['no-diabetes', 'has-diabetes']
print("Decision Trees - Test Performance: \n", classification_report(y_test, y_pred_test, target_names=targets))

Decision Trees - Test Performance: 
               precision    recall  f1-score   support

 no-diabetes       0.99      1.00      0.99       349
has-diabetes       1.00      0.98      0.99       205

    accuracy                           0.99       554
   macro avg       0.99      0.99      0.99       554
weighted avg       0.99      0.99      0.99       554

