# Decision Tree

## Where To Split?
The goal of every split is to reduce impurity. We keep splitting until the impurity is zero, i.e. every node contains samples of a single class.

In [93]:
import numpy as np


class Student(object):
    def __init__(self):
        self.is_cricket_player = False
        self.gender = 'female'
        self.height = 5.4
        self.grade = 'A'


np.random.seed(314)

students = []
for i in range(30):
    students.append(Student())

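# Randomly pick 15 indices to play cricket. np.random.choice samples with
# replacement by default, so fewer than 15 distinct students may be selected.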
for i in np.random.choice(len(students), size=15):
    students[i].is_cricket_player = True
    students[i].gender = 'male'
    
for i in np.random.choice(len(students), size=15):
    students[i].gender = 'male'

for i in np.random.choice(len(students), size=15):
    students[i].height = 5.8
    
for i in np.random.choice(len(students), size=15):
    students[i].grade = 'B'
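
The setup above relies on `np.random.choice`, which draws with replacement unless `replace=False` is passed. The following sketch only illustrates that difference. The variable names are made up for this example, and it is not meant as a replacement for the cell above, whose printed results depend on the original sampling.

```python
import numpy as np

# A separate generator so the global seed used above is left untouched.
rng = np.random.RandomState(314)

# Default behaviour: indices may repeat, so there can be fewer than 15 distinct ones.
with_replacement = rng.choice(30, size=15)

# replace=False guarantees 15 distinct indices.
without_replacement = rng.choice(30, size=15, replace=False)

print(len(set(with_replacement.tolist())))     # at most 15
print(len(set(without_replacement.tolist())))  # exactly 15
```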

### Gini Index
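If we select two items at random from a pure population, the probability that they belong to the same class is 1. The score used here, $P(S)^{2} + P(F)^{2}$, is that probability for a node, so the higher the value, the more homogeneous the node. (Many texts define Gini impurity as one minus this quantity, in which case lower is better.) As used here, the Gini index supports only binary splits.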

1. Calculate Gini index for sub-nodes, using formula $P(S)^{2} + P(F)^{2}$.
2. Calculate Gini for split using weighted Gini score on each sub-node.
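
The two steps above can be sketched on hand-picked counts. The helper names `gini_score` and `weighted_gini` and the counts themselves are hypothetical, chosen only to show the two extremes of the score.

```python
def gini_score(pos, neg):
    """Return p^2 + q^2 for a node with `pos` successes and `neg` failures."""
    total = pos + neg
    if total == 0:
        return 0.0  # an empty node contributes nothing to the weighted score
    p = pos / float(total)
    q = neg / float(total)
    return p ** 2 + q ** 2


def weighted_gini(left, right):
    """Weight each sub-node's score by its share of the samples."""
    n_left, n_right = sum(left), sum(right)
    n = float(n_left + n_right)
    return gini_score(*left) * n_left / n + gini_score(*right) * n_right / n


# Hypothetical counts: (cricket players, non-players) in each sub-node.
pure_split = weighted_gini((10, 0), (0, 10))   # both sub-nodes are pure
mixed_split = weighted_gini((5, 5), (5, 5))    # both sub-nodes are 50/50

print(pure_split)   # 1.0
print(mixed_split)  # 0.5
```

A perfectly separating split reaches the maximum of 1, while a split that leaves both sub-nodes evenly mixed scores 0.5, the minimum for two classes.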

In [80]:
def gini_index(students, splitter):
    # c1 and c2 stand for class 1 and class 2 respectively, e.g. male or female
    c1_pos_num, c1_neg_num = 0, 0
    c2_pos_num, c2_neg_num = 0, 0
    
    for student in students:
        if splitter(student):
            if student.is_cricket_player:
                c1_pos_num += 1
            else:
                c1_neg_num += 1
        else:
            if student.is_cricket_player:
                c2_pos_num += 1
            else:
                c2_neg_num += 1
    
    # Calculating Gini for sub-node c1
    c1_total = (c1_pos_num + c1_neg_num)
    c1_gini = (float(c1_pos_num) / c1_total)**2 + (float(c1_neg_num) / c1_total)**2
    
    # Calculating Gini for sub-node c2
    c2_total = (c2_pos_num + c2_neg_num)
    c2_gini = (float(c2_pos_num) / c2_total)**2 + (float(c2_neg_num) / c2_total)**2
    
    # Return weighted score
    return c1_gini*c1_total/len(students) + c2_gini*c2_total/len(students)

In [51]:
print('Split on gender %f' % gini_index(students, lambda student: student.gender == 'male'))
print('Split on grade %f' % gini_index(students, lambda student: student.grade == 'A'))
print('Split on height %f' % gini_index(students, lambda student: student.height > 5.5))

Split on gender 0.733333
Split on grade 0.520362
Split on height 0.523445


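The gender split has the highest score, which tells us that being male is strongly correlated with playing cricket. This is by construction: every student selected as a cricket player was also marked as male.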

### Chi-Square
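Chi-square measures the statistical significance of the difference between the class distribution of the sub-nodes and that of the parent node. The higher the value, the more significant the difference.

1. Calculate $\chi^{2}$ for each sub-node by summing $(\text{actual} - \text{expected})^{2} / \text{expected}$ over success and failure.
2. Calculate $\chi^{2}$ for the split as the sum of the $\chi^{2}$ values of all sub-nodes.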
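
As a rough illustration of these two steps, the sketch below computes the statistic for one hypothetical split. The function name `chi_square_node`, the counts, and the 50% parent rate are assumptions made for the example.

```python
def chi_square_node(pos, neg, parent_pos_rate):
    """Sum of (actual - expected)^2 / expected over success and failure.

    Assumes the node is non-empty and 0 < parent_pos_rate < 1, otherwise
    an expected count would be zero and the division would fail.
    """
    total = pos + neg
    expected_pos = parent_pos_rate * total
    expected_neg = (1.0 - parent_pos_rate) * total
    return ((pos - expected_pos) ** 2 / expected_pos
            + (neg - expected_neg) ** 2 / expected_neg)


# Hypothetical split of 20 samples whose parent node is 50% positive.
left = chi_square_node(8, 2, 0.5)    # expected 5 and 5: 9/5 + 9/5
right = chi_square_node(2, 8, 0.5)   # same deviation, mirrored
split = left + right

print(left, right, split)
```

Each sub-node deviates from its expected counts by 3, giving roughly 3.6 per node and 7.2 for the split. A split whose sub-nodes match the parent distribution exactly would score 0.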

In [81]:
def chi_square(students, splitter):
    # c1 and c2 stand for class 1 and class 2 respectively, e.g. male or female
    c1_pos_num, c1_neg_num = float(0), float(0)
    c2_pos_num, c2_neg_num = float(0), float(0)
    
    for student in students:
        if splitter(student):
            if student.is_cricket_player:
                c1_pos_num += 1
            else:
                c1_neg_num += 1
        else:
            if student.is_cricket_player:
                c2_pos_num += 1
            else:
                c2_neg_num += 1
    
    # Assume a 50/50 split between positive and negative, on the premise that about half
    # of the 30 students are cricket players. (The sampling above draws with replacement,
    # so the true share may be lower.) We expect each child node to inherit this distribution.
    expected_c1_pos_num = 0.50 * (c1_pos_num + c1_neg_num)
    expected_c1_neg_num = 0.50 * (c1_pos_num + c1_neg_num)

    expected_c2_pos_num = 0.50 * (c2_pos_num + c2_neg_num)
    expected_c2_neg_num = 0.50 * (c2_pos_num + c2_neg_num)
    
    return (c1_pos_num - expected_c1_pos_num)**2 / expected_c1_pos_num \
        + (c1_neg_num - expected_c1_neg_num)**2 / expected_c1_neg_num \
        + (c2_pos_num - expected_c2_pos_num)**2 / expected_c2_pos_num \
        + (c2_neg_num - expected_c2_neg_num)**2 / expected_c2_neg_num

In [46]:
print('Split on gender %f' % chi_square(students, lambda student: student.gender == 'male'))
print('Split on grade %f' % chi_square(students, lambda student: student.grade == 'A'))
print('Split on height %f' % chi_square(students, lambda student: student.height > 5.5))

Split on gender 14.000000
Split on grade 1.221719
Split on height 1.406699


Once again, `chi_square` agrees with `gini_index`: gender is the strongest predictor of whether a student is a cricket player.

### Information Gain
The purer a node is, the less information we need to describe it; an impure node requires more. We can therefore use entropy to measure how homogeneous a sample is. An entropy of zero means the sample is completely pure. An entropy of one (with base-2 logarithms) means the sample is evenly divided between two classes.

$$
E = -P(S)\,\log_{2}P(S) - P(F)\,\log_{2}P(F)
$$

$S$ stands for success and $F$ stands for failure.
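
A minimal sketch of the formula, using base-2 logarithms so that the result is in bits. The name `binary_entropy` is hypothetical, and the convention that a class with zero probability contributes zero is an assumption of this example.

```python
import math


def binary_entropy(pos, neg):
    """Entropy in bits of a non-empty node with `pos` successes and `neg` failures."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:  # treat 0 * log(0) as 0
            p = count / float(total)
            result -= p * math.log(p, 2)
    return result


print(binary_entropy(10, 0))   # pure node: zero entropy
print(binary_entropy(5, 5))    # evenly divided node: close to 1 bit
print(binary_entropy(9, 1))    # mostly pure node: between 0 and 1
```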

In [96]:
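def entropy(pos_count, neg_count):
    # Note: np.log is the natural logarithm, so this returns entropy in nats,
    # i.e. the base-2 formula above scaled by ln(2). A 50/50 node scores about
    # 0.693 instead of 1; the ranking of splits is unaffected.
    total = pos_count + neg_count
    
    if pos_count > 0 and neg_count > 0:
        return -1*(pos_count/total)*np.log(pos_count/total) \
            - (neg_count/total)*np.log(neg_count/total)
    elif pos_count > 0 and neg_count == 0:
        # Pure node: log(1) is 0, so the entropy is 0
        return -1*(pos_count/total)*np.log(pos_count/total)
    elif pos_count == 0 and neg_count > 0:
        return -1*(neg_count/total)*np.log(neg_count/total)
    
    # Empty node: entropy is undefined
    return None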


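def information_gain(students, splitter):
    # Despite the name, this returns the weighted entropy of the two sub-nodes
    # (lower is better), not the gain relative to the parent node.
    # c1 and c2 stand for class 1 and class 2 respectively, e.g. male or female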
    c1_pos_num, c1_neg_num = float(0), float(0)
    c2_pos_num, c2_neg_num = float(0), float(0)
    
    for student in students:
        if splitter(student):
            if student.is_cricket_player:
                c1_pos_num += 1
            else:
                c1_neg_num += 1
        else:
            if student.is_cricket_player:
                c2_pos_num += 1
            else:
                c2_neg_num += 1
    
    c1_total = c1_pos_num + c1_neg_num
    c1_entropy = entropy(c1_pos_num, c1_neg_num)
        
    c2_total = c2_pos_num + c2_neg_num
    c2_entropy = entropy(c2_pos_num, c2_neg_num)
    
    return (c1_total/len(students))*c1_entropy + (c2_total/len(students))*c2_entropy

In [97]:
print('Split on gender %f' % information_gain(students, lambda student: student.gender == 'male'))
print('Split on grade %f' % information_gain(students, lambda student: student.grade == 'A'))
print('Split on height %f' % information_gain(students, lambda student: student.height > 5.5))

Split on gender 0.381909
Split on grade 0.672634
Split on height 0.669440


Splitting on gender gives the lowest weighted entropy, so it is once again the best split.
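
Information gain is commonly defined as the parent node's entropy minus the weighted entropy of the sub-nodes, so the split with the lowest weighted entropy is also the one with the highest gain. The sketch below shows that relationship on hypothetical counts. The names `entropy_bits` and `info_gain` are made up for this example, and it uses base-2 logarithms, unlike the natural logarithm in the cell above.

```python
import math


def entropy_bits(pos, neg):
    """Entropy in bits of a non-empty node; a zero count contributes zero."""
    total = float(pos + neg)
    result = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            result -= p * math.log(p, 2)
    return result


def info_gain(parent, left, right):
    """Parent entropy minus the sample-weighted entropy of the two sub-nodes."""
    n = float(sum(parent))
    weighted = (sum(left) / n) * entropy_bits(*left) \
        + (sum(right) / n) * entropy_bits(*right)
    return entropy_bits(*parent) - weighted


# Hypothetical counts: (cricket players, non-players).
gain_pure = info_gain((10, 10), (10, 0), (0, 10))   # split separates the classes
gain_none = info_gain((10, 10), (5, 5), (5, 5))     # split changes nothing

print(gain_pure)  # close to 1 bit
print(gain_none)  # close to 0
```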

## CART (Classification and Regression Trees)