<h1>Decision Trees - Universal Approximators</h1>

<h2>1. Introduction </h2>

Decision trees are universal approximators that use recursive partitioning to divide a dataset into homogeneous subgroups.

<img src="media/decision_trees_.png" width="400px"/>

The <b>top node</b> is referred to as the <b>root node</b> and is the first decision node (e.g., is Gender Male or Female?). A <b>branch</b> is a subset of the dataset obtained as the outcome of a test. <b>Internal nodes</b> are decision nodes from which subsequent branches are obtained. The <b>depth</b> of a node is the number of decisions it takes to reach it from the root node. The <b>leaf nodes</b> sit at the end of the last branches of the tree and determine the output (class label or regression value).
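To make the terminology concrete, the sketch below represents a tiny tree as nested Python objects and lists the depth of each leaf. The `Node` class, the `leaf_depths` helper, and the example tests and outputs are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A decision node (a test plus two branches) or a leaf (an output)."""
    test: Optional[str] = None      # question asked at a decision node
    yes: Optional["Node"] = None    # branch followed when the test is true
    no: Optional["Node"] = None     # branch followed when the test is false
    output: Optional[str] = None    # prediction stored at a leaf node


def leaf_depths(node, depth=0):
    """Return (output, depth) pairs for every leaf below `node`."""
    if node.output is not None:     # leaf: no further decisions
        return [(node.output, depth)]
    return leaf_depths(node.yes, depth + 1) + leaf_depths(node.no, depth + 1)


# Hypothetical tree: the root tests gender, one internal node tests age.
root = Node(
    test="gender == 'male'",
    yes=Node(test="age > 50", yes=Node(output="high risk"), no=Node(output="low risk")),
    no=Node(output="low risk"),
)

depths = leaf_depths(root)
print(depths)
```

The root has depth 0, so a leaf reached after two tests has depth 2.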

<h2>2. Building a decision tree</h2>

Given a dataset of <b>n features and m records</b>, a rule-based graph is formed <b>iteratively by recursive partitioning</b> until the dataset is split into homogeneous groups that represent the <b>same target class</b> in a classification problem or <b>share close target values</b> in a regression problem.

1. From the root node (i.e., with all m records), the most informative attribute is identified using a feature importance score. The <b>Gini index</b> is the most commonly used score; alternatives include entropy and information gain.

$$ Gini(f) = \sum_{i=1}^{N_c}P(class=i|f)(1-P(class=i|f)) = 1 - \sum_{i=1}^{N_c}P(class=i|f)^2 $$
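As a minimal sketch of the formula above, the function below estimates each class probability from label counts and returns one minus the sum of their squares. The function name `gini` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


pure = gini(["a", "a", "a", "a"])    # a single class: fully homogeneous
mixed = gini(["a", "a", "b", "b"])   # two equally likely classes
print(pure, mixed)
```

A homogeneous group scores 0, and the score grows as the classes become more evenly mixed.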

Overall Gini index of a split on feature $f$ into two subsets $S_i$ and $S_j$, weighted by subset size:

$$
Gini(f) = \frac{n_{S_i}}{n_{S_i}+n_{S_j}}Gini(f_{S_i}) + \frac{n_{S_j}}{n_{S_i}+n_{S_j}}Gini(f_{S_j})
$$

<b>The feature with the lowest Gini index is selected.</b>
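The selection rule can be sketched as follows, assuming binary features stored as 0/1 values in a list of dictionaries. The helper names (`weighted_gini`, `best_feature`) and the toy records are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels (0.0 for an empty one)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def weighted_gini(left, right):
    """Size-weighted Gini index of a split into two subsets."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_feature(rows, target):
    """Return the binary feature with the lowest weighted Gini index, plus all scores."""
    scores = {}
    for feature in rows[0]:
        if feature == target:
            continue
        left = [r[target] for r in rows if r[feature]]
        right = [r[target] for r in rows if not r[feature]]
        scores[feature] = weighted_gini(left, right)
    return min(scores, key=scores.get), scores


# Hypothetical toy data: two binary features and a binary target.
rows = [
    {"smoker": 1, "active": 0, "sick": 1},
    {"smoker": 1, "active": 1, "sick": 1},
    {"smoker": 0, "active": 0, "sick": 0},
    {"smoker": 0, "active": 1, "sick": 0},
]
feature, scores = best_feature(rows, target="sick")
print(feature, scores)
```

In this toy data `smoker` separates the classes perfectly, so it receives the lowest score and is selected.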

For a regression problem, the quality of a split is typically measured using the mean squared error, where $\bar{y}$ is the mean target value in subset $S_i$:

$$
\bar{y} = \frac{1}{n_{S_i}}\sum_{y\in S_i}y
$$

$$
MSE(S_i) = \frac{1}{n_{S_i}}\sum_{y\in S_i}(\bar{y}-y)^2
$$
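The regression criterion can be sketched in the same way. The `mse` function follows the two formulas above; combining the two subsets with a size-weighted average in `split_mse` is an assumption that mirrors the Gini case, and the target values are made up.

```python
def mse(values):
    """Mean squared error of a subset around its own mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((mean - y) ** 2 for y in values) / n


def split_mse(left, right):
    """Size-weighted MSE of a split (the weighting is an assumption)."""
    n = len(left) + len(right)
    return len(left) / n * mse(left) + len(right) / n * mse(right)


# Hypothetical targets forming two well-separated groups.
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
before = mse(targets)
after = split_mse(targets[:3], targets[3:])
print(before, after)
```

Splitting between the two groups lowers the error sharply, which is what makes that split attractive.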

2. Given an appropriate feature importance criterion, the decision tree is built by recursive partitioning as follows.

<b>Decision Tree Pseudo-code</b>

Step 1: Start with a dataset of m records with n attributes and a target variable y<br/>
Step 2: Rank the features by the chosen feature importance score<br/>
Step 3: Split the dataset on the feature with the best score<br/>
Step 4: Repeat Steps 2 and 3 on each new subset until a stopping criterion is met
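The steps above can be sketched as a short recursive function. This is a minimal illustration and not a production implementation: it assumes binary 0/1 features, uses the weighted Gini index as the score, and stops on pure nodes, exhausted features, or a depth limit. The names `build_tree`, `predict`, and `max_depth`, and the toy records, are hypothetical.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, target, features, max_depth=3, depth=0):
    """Recursively partition `rows` (a list of dicts) on binary features."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure node, no features left, or depth limit reached.
    if len(set(labels)) == 1 or not features or depth >= max_depth:
        return {"leaf": majority}

    # Steps 2 and 3: score every candidate feature, keep the lowest weighted Gini.
    best, best_score = None, None
    for f in features:
        left = [r[target] for r in rows if r[f]]
        right = [r[target] for r in rows if not r[f]]
        if not left or not right:   # this feature does not separate anything
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best_score is None or score < best_score:
            best, best_score = f, score
    if best is None:
        return {"leaf": majority}

    # Step 4: repeat on each new subset.
    rest = [f for f in features if f != best]
    yes_rows = [r for r in rows if r[best]]
    no_rows = [r for r in rows if not r[best]]
    return {
        "feature": best,
        "yes": build_tree(yes_rows, target, rest, max_depth, depth + 1),
        "no": build_tree(no_rows, target, rest, max_depth, depth + 1),
    }


def predict(tree, row):
    """Follow the branches until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["yes"] if row[tree["feature"]] else tree["no"]
    return tree["leaf"]


# Hypothetical toy data.
rows = [
    {"smoker": 1, "active": 0, "sick": 1},
    {"smoker": 1, "active": 1, "sick": 1},
    {"smoker": 0, "active": 0, "sick": 1},
    {"smoker": 0, "active": 1, "sick": 0},
]
tree = build_tree(rows, target="sick", features=["smoker", "active"])
print(tree)
```

Ties between features are broken by their order in `features`, which is an arbitrary choice made here for simplicity.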

<h2>3. Pruning a decision tree</h2>

A decision tree can reach 100% accuracy on the training set, since it can keep splitting the data until each subgroup holds a single record (i.e., guaranteed homogeneity). However, this comes with the risk of overfitting: the tree may lose its ability to generalise to unseen data. A pruning phase can post-process the decision tree, removing some rules and allowing some heterogeneity in the data subgroups to preserve generalisation on unseen data.
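One possible post-processing strategy is reduced-error pruning, sketched below. It is only one of several pruning methods and is not necessarily the one intended here. The sketch assumes a tree stored as nested dictionaries in which every decision node records the majority class seen during training (`majority`), and it collapses a subtree into a leaf whenever that does not increase the error on held-out records. All names and data are hypothetical.

```python
def predict(tree, row):
    """Follow the branches until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["yes"] if row[tree["feature"]] else tree["no"]
    return tree["leaf"]


def errors(tree, rows, target):
    """Number of misclassified records."""
    return sum(predict(tree, r) != r[target] for r in rows)


def prune(tree, rows, target):
    """Bottom-up reduced-error pruning against the held-out `rows`."""
    if "leaf" in tree:
        return tree
    yes_rows = [r for r in rows if r[tree["feature"]]]
    no_rows = [r for r in rows if not r[tree["feature"]]]
    pruned = {
        "feature": tree["feature"],
        "majority": tree["majority"],
        "yes": prune(tree["yes"], yes_rows, target),
        "no": prune(tree["no"], no_rows, target),
    }
    leaf = {"leaf": tree["majority"]}
    # Collapse the subtree when a single leaf does at least as well on held-out
    # data (this also collapses subtrees that no held-out record reaches).
    if errors(leaf, rows, target) <= errors(pruned, rows, target):
        return leaf
    return pruned


# Hypothetical overfitted tree: the split on "active" memorised a noisy record.
tree = {
    "feature": "smoker", "majority": 1,
    "yes": {"leaf": 1},
    "no": {"feature": "active", "majority": 0,
           "yes": {"leaf": 0}, "no": {"leaf": 1}},
}
# Hypothetical held-out records.
validation = [
    {"smoker": 1, "active": 0, "sick": 1},
    {"smoker": 1, "active": 1, "sick": 1},
    {"smoker": 0, "active": 0, "sick": 0},
    {"smoker": 0, "active": 1, "sick": 0},
]
pruned_tree = prune(tree, validation, target="sick")
print(pruned_tree)
```

The split on `active` is removed because a single leaf predicts the held-out non-smokers better, while the root split is kept because it still reduces the error.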

<h2>Python Implementation</h2>

```python
import pandas as pd

# Load the healthcare diabetes dataset
df = pd.read_csv('../datasets/Healthcare-Diabetes.csv')
```

