# Decision Trees
Decision Trees are machine-learning algorithms that recursively split a data set into smaller groups based on the value of a descriptive feature, until each group is small or homogeneous enough to be described by a single label.

### Main DT Algorithms
####  1. CHAID - Chi-squared Automatic Interaction Detection
When building **classification trees**, CHAID relies on chi-squared tests to find the best split at each step. In other words, it chooses the independent variable that has the strongest association with the dependent variable. For **regression trees**, CHAID relies on F-tests to check whether the group means of the dependent variable differ.
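As a rough illustration of the chi-squared idea, the sketch below computes the Pearson chi-squared statistic for a contingency table of predictor values against class labels. The function name `chi_squared_statistic` and the counts are made up for this example. Full CHAID also merges categories and compares adjusted p-values, which this sketch leaves out.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count if predictor and class were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = values of a candidate predictor, columns = class labels
strong_predictor = [[30, 10], [10, 30]]
weak_predictor = [[20, 20], [20, 20]]

print(chi_squared_statistic(strong_predictor))  # larger statistic: stronger association
print(chi_squared_statistic(weak_predictor))    # zero: no association in these counts
```

A predictor with a larger statistic (for the same table shape) is the more attractive split candidate.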

#### 2. CART - Classification And Regression Trees
For **Classification Trees**, the CART algorithm uses a metric called **Gini Impurity** to choose decision points. Gini Impurity measures how mixed the classes are in the groups produced by a split: the lower the impurity, the better the split. For **Regression Trees**, the CART algorithm evaluates all candidate splits and chooses the one that minimizes the **Least Square Deviation (LSD)**. The LSD metric (sometimes referred to as "variance reduction") is the sum of the squared distances (or deviations) between the observed values and the predicted values.
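The regression criterion can be sketched for a single numeric feature as follows. This is a minimal illustration, assuming each side of a split predicts the mean of its targets. The names `sse` and `best_threshold` and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_threshold(xs, ys):
    """Return (threshold, cost) minimizing SSE(left) + SSE(right) for the rule x < threshold."""
    best_t, best_cost = None, float('inf')
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue  # skip thresholds that leave one side empty
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy data with an obvious jump between x = 3 and x = 10
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, cost = best_threshold(xs, ys)
print(threshold, cost)
```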

#### 3. ID3 - Iterative Dichotomiser 3
It is mostly used for classification tasks. ID3 splits data attributes (dichotomizes) to find the most dominant features, performing this process iteratively to select the DT nodes in a top-down approach. For the splitting process, ID3 uses the **Information Gain** metric to select the most useful attributes for classification. Information Gain is directly linked to the concept of **Entropy**, which is the measure of the amount of uncertainty or randomness in the data.
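To make entropy and Information Gain concrete, here is a small sketch for categorical attributes. The helper names and the four-row toy data are assumptions made for this example only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after grouping by feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

play = ['no', 'no', 'yes', 'yes']
outlook = ['sunny', 'sunny', 'rain', 'rain']   # separates the classes perfectly
coin = ['a', 'b', 'a', 'b']                    # carries no information about the class

gain_outlook = information_gain(outlook, play)
gain_coin = information_gain(coin, play)
print(gain_outlook, gain_coin)
```

ID3 would pick `outlook` here because it has the larger gain.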

#### 4. C4.5
It is the successor of ID3. C4.5 can handle both continuous and categorical attributes when building Classification Trees. Additionally, it can deal with missing values by ignoring instances that include non-existing data. Unlike ID3, C4.5 uses **Gain Ratio** for its splitting process. Gain Ratio is a modification of the Information Gain concept that reduces its bias toward attributes with many distinct values. Another capability of C4.5 is that it can prune DTs.
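The effect of Gain Ratio can be sketched by dividing the Information Gain by the entropy of the attribute itself (often called split information). The helper names and the toy data below are hypothetical. The example compares a binary attribute with a row-identifier attribute that has one distinct value per row.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information Gain divided by the split information of the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0

labels = ['no', 'no', 'yes', 'yes']
binary_attr = ['a', 'a', 'b', 'b']   # two values, perfectly informative
row_id = [1, 2, 3, 4]                # unique per row, also "perfect" by plain Information Gain

ratio_binary = gain_ratio(binary_attr, labels)
ratio_row_id = gain_ratio(row_id, labels)
print(ratio_binary, ratio_row_id)
```

Both attributes have the same Information Gain on this data, but the many-valued `row_id` gets the lower Gain Ratio.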

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Implementing the CART algorithm for classification
#### Important points
1. The representation of the CART model is a binary tree.
2. For regression, the cost function that is minimized to choose split points is the sum of squared errors across all training samples that fall within the rectangle.
3. For classification, the Gini cost function is used. It indicates how pure the nodes are, where node purity refers to how mixed the classes of the training data assigned to each node are.

### 1. Gini Index
* Name of the cost function used to evaluate splits in the dataset.
* Performs only binary splits
* The lower the Gini value, the more homogeneous (purer) the groups. A perfect separation results in a Gini score of 0, whereas the worst-case split for a two-class problem, with a 50/50 class mix in each group, results in a Gini score of 0.5.

**Steps to calculate Gini**:
1. Calculate Gini for each sub-node as one minus the sum of the squared class probabilities.
2. Calculate Gini for the split as the size-weighted sum of the Gini scores of its sub-nodes.

$$ \text{Gini for subgroup} = 1 - \sum_k p_k^2 $$
$$ \text{where } p_k = \text{probability of class } k \text{ in the subgroup} $$


$$ \text{Gini for split} = \sum_{\text{subgroups}} \text{Gini for subgroup} \times \frac{\text{subgroup size}}{\text{total size}} $$
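The two formulas can be checked on a few hand-made splits. This is a minimal sketch in plain Python, separate from the class implemented later. The function names and the label lists are assumptions made for illustration.

```python
from collections import Counter

def gini_subgroup(labels):
    """1 minus the sum of squared class proportions in one subgroup."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(groups):
    """Size-weighted sum of the subgroup Gini scores (empty groups are skipped)."""
    total = sum(len(g) for g in groups)
    return sum(gini_subgroup(g) * len(g) / total for g in groups if g)

perfect = gini_split([[0, 0], [1, 1]])        # each group holds a single class
worst = gini_split([[0, 1], [0, 1]])          # 50/50 mix in each group
uneven = gini_split([[0, 0, 0, 1], [1, 1]])   # left: 1 - (9/16 + 1/16) = 0.375, weight 4/6

print(perfect, worst, uneven)
```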



### 2. Terminal Node
Deciding when to stop growing the tree

**Maximum Tree Depth** : This is the maximum number of decision nodes along any path from the root node of the tree. Once the maximum depth of the tree is reached, we must stop splitting and adding new nodes. Deeper trees are more complex and are more likely to overfit the training data.

**Minimum Node Records** :  This is the minimum number of training patterns that a given node is responsible for. Once at or below this minimum, we must stop splitting and adding new nodes. Nodes that account for too few training patterns are expected to be too specific and are likely to overfit the training data.

There is one more condition. It is possible to choose a split in which all rows belong to one group. In this case, we will be unable to continue splitting and adding child nodes as we will have no records to split on one side or another.

Now we have some ideas of when to stop growing the tree. When we do stop growing at a given point, that node is called a terminal node and is used to make a final prediction.

This is done by taking the group of rows assigned to that node and selecting the most common class value in the group. This will be used to make predictions.
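The stopping rules and the majority vote can be sketched in a few lines. The names `is_terminal` and `majority_class` are hypothetical, and the "at or below the minimum" comparison follows the description above.

```python
from collections import Counter

def is_terminal(n_rows, depth, max_depth, min_records):
    """True when a node should stop splitting: depth limit reached or too few rows."""
    return depth >= max_depth or n_rows <= min_records

def majority_class(labels):
    """Most common class value among the rows assigned to a terminal node."""
    return Counter(labels).most_common(1)[0][0]

print(is_terminal(n_rows=2, depth=1, max_depth=5, min_records=2))    # too few rows
print(is_terminal(n_rows=50, depth=5, max_depth=5, min_records=2))   # depth limit reached
print(is_terminal(n_rows=50, depth=2, max_depth=5, min_records=2))   # keep splitting
print(majority_class(['a', 'b', 'a']))
```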

In [2]:
class DecisionTreeClassifier():
    def __init__(self):
        self.root = None
        self.tree_depth = 0
        self.X = None
        self.y = None
    
    def gini_index(self,groups, y):
        n_instances = len(groups[0])+len(groups[1])  # count of all samples
        gini = 0.0 # sum weighted Gini index for each group
        for indexes in groups:
            size = len(indexes)
            if size == 0: continue # avoid divide by zero
            score = 0.0
            # score the group based on the score for each class
            for class_val in np.unique(y):
                p = (y[indexes]==class_val).sum()/size 
                score += p*p
            # weight the group score by its relative size
            gini +=  (1-score) * (size / n_instances)
        return gini
    
    def get_split(self,X,y):
        b_index, b_value, b_score, b_groups = float('inf'), float('inf'), float('inf'), None
        for col_ind in range(X.shape[1]): #for each features
            for val in np.unique(X[:,col_ind]): #for each unique value of that feature

                #left_index: rows whose feature value is lower than val; right_index: rows whose value is greater than or equal to val
                left_index, right_index = np.argwhere(X[:,col_ind]<val), np.argwhere(X[:,col_ind]>=val)
                
                #remove redundant axis
                left_index = np.squeeze(left_index,axis=1) if len(left_index.shape)>1 else left_index
                right_index = np.squeeze(right_index,axis=1) if len(right_index.shape)>1 else right_index
                
                #find gini index
                gini = self.gini_index((left_index,right_index), y)
                if gini < b_score:
                    b_index, b_value, b_score, b_groups = col_ind, val, gini, (left_index, right_index)

        return {'index':b_index, 'value':b_value, 'groups':b_groups}
    
    
    def to_terminal(self,classes):
        # Create a terminal node value
        cls,cnt = np.unique(classes,return_counts=True) 
        return cls[np.argmax(cnt)]
    
    def split(self, node, X, y, max_depth, min_samples_split, depth):
        self.tree_depth = max(depth,self.tree_depth)
        left, right = node['groups']
        del node['groups']
        
        # check for a no split
        if len(left)==0 or len(right)==0:
            node['left'] = node['right'] = self.to_terminal(y[np.append(left,right)])
            return
        
        # check for max depth
        if depth >= max_depth:
            node['left'], node['right'] = self.to_terminal(y[left]), self.to_terminal(y[right])
            return
        
        # process left child
        if len(left) <= min_samples_split:
            node['left'] = self.to_terminal(y[left])
        else:
            node['left'] = self.get_split(X[left],y[left])
            self.split(node['left'], X[left], y[left], max_depth, min_samples_split, depth+1)
        
        # process right child
        if len(right) <= min_samples_split:
            node['right'] = self.to_terminal(y[right])
        else:
            node['right'] = self.get_split(X[right],y[right])
            self.split(node['right'],X[right],y[right], max_depth, min_samples_split, depth+1)
                
    def fit(self,X,y, max_depth=None, min_samples_split=2):
        self.X, self.y, max_depth = X, y, float('inf') if max_depth is None else max_depth
        self.root = self.get_split(X,y)
        self.split(self.root, X, y, max_depth, min_samples_split,1)
        
    def predict(self,rows):
        return np.array([ self.predict_row(row,self.root) for row in rows ])
        
    def predict_row(self,row,node):
        if row[node['index']] < node['value']:
            if isinstance(node['left'], dict):  return self.predict_row(row,node['left'])
            else: return node['left']
        else:
            if isinstance(node['right'], dict): return self.predict_row(row,node['right'])
            else: return node['right']
            
    def score(self,X,y): 
        return (y==self.predict(X)).sum()/len(y)
    
    @property 
    def max_depth(self): return self.tree_depth
    
    @property 
    def tree_(self): return self.root

### Case Study : Banknote

In [3]:
data = pd.read_csv('data_banknote.txt',header=None,
                   names=[ 'Variance of Wavelet Transformed Image', 'Skewnes of Wavelet Transformed Image',
                            'Curtosis of Wavelet Transformed Image','Entropy of Image','Class'])
data = data.sample(frac=1)
data.head()

| | Variance of Wavelet Transformed Image | Skewnes of Wavelet Transformed Image | Curtosis of Wavelet Transformed Image | Entropy of Image | Class |
|---|---|---|---|---|---|
| 1133 | -2.0046 | -0.49457 | 1.333 | 1.6543 | 1 |
| 67 | 2.4235 | 9.5332 | -3.0789 | -2.7746 | 0 |
| 468 | 4.5707 | 7.2094 | -3.2794 | -1.4944 | 0 |
| 308 | 4.616 | 10.1788 | -4.2185 | -4.4245 | 0 |
| 828 | -2.5912 | -0.10554 | 1.2798 | 1.0414 | 1 |


In [4]:
def train_test_split(X,y,test_size=0.3):
    indexes = np.random.choice( [False,True], len(y), p=[test_size,1-test_size], replace=True )
    return X[indexes],X[~indexes],y[indexes],y[~indexes]

X = data.drop('Class',axis=1).values
y = data.Class.values

X_train,X_val,Y_train,Y_val = train_test_split(X,y)
X_train.shape,X_val.shape,Y_train.shape,Y_val.shape

((961, 4), (411, 4), (961,), (411,))
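The `train_test_split` helper above draws each row independently, so the validation share is only approximately `test_size` and changes from run to run. A possible alternative, sketched here with a hypothetical `split_exact` function and toy arrays, shuffles the row indices once and cuts them at a fixed position.

```python
import numpy as np

def split_exact(X, y, test_size=0.3, seed=0):
    """Shuffle the row indices once and cut them so the validation set has a fixed size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_val = int(round(test_size * len(y)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# Toy data: 20 rows, 2 features (first column is unique per row)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2
X_tr, X_va, y_tr, y_va = split_exact(X_demo, y_demo)
print(X_tr.shape, X_va.shape)
```

The `seed` argument is an addition for reproducibility and is not part of the original helper.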

In [5]:
dt = DecisionTreeClassifier()
dt.fit(X_train,Y_train)

In [6]:
dt.score(X_val,Y_val),dt.max_depth

(0.975669099756691, 9)

### Case Study : Iris dataset

In [7]:
data = pd.read_csv('iris.data',header=None,names=['Sepal length','Sepal width', 'Petal length', 'Petal Width', 'Class'])
data['Class'] = data['Class'].map({'Iris-setosa':0,'Iris-versicolor':1,'Iris-virginica':2})
data = data.sample(frac=1)
data.head()

| | Sepal length | Sepal width | Petal length | Petal Width | Class |
|---|---|---|---|---|---|
| 48 | 5.3 | 3.7 | 1.5 | 0.2 | 0 |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | 2 |
| 84 | 5.4 | 3.0 | 4.5 | 1.5 | 1 |
| 108 | 6.7 | 2.5 | 5.8 | 1.8 | 2 |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 | 0 |


In [8]:
def train_test_split(X,y,test_size=0.3):
    indexes = np.random.choice( [False,True], len(y), p=[test_size,1-test_size], replace=True )
    return X[indexes],X[~indexes],y[indexes],y[~indexes]

X = data.drop('Class',axis=1).values
y = data.Class.values

X_train,X_val,Y_train,Y_val = train_test_split(X,y)
X_train.shape,X_val.shape,Y_train.shape,Y_val.shape

((108, 4), (42, 4), (108,), (42,))

In [9]:
np.unique(Y_val,return_counts=True)

(array([0, 1, 2]), array([14, 14, 14]))

In [10]:
dt = DecisionTreeClassifier()
dt.fit(X_train,Y_train,max_depth=4)

In [11]:
# the class exposes the fitted depth through the max_depth property; it has no depth attribute
dt.score(X_val,Y_val),dt.max_depth