<a href="https://colab.research.google.com/github/cpaniaguam/CSC104/blob/main/trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Trees and decision trees
[Trees](https://en.wikipedia.org/wiki/Tree_(data_structure)) are hierarchical [data structures](https://en.wikipedia.org/wiki/List_of_data_structures): each node holds a value and links to zero or more child nodes.

![picture](https://runestone.academy/runestone/books/published/pythonds/_images/treerecs.png)

Data representations for data structures are an important part of a traditional second course in computer programming (see [CSC-300](http://catalog.salve.edu/preview_course_nopop.php?catoid=18&coid=19057) in the Salve catalog). An accessible reference is *Problem Solving with Algorithms and Data Structures using Python* by Miller and Ranum, [available for free at Runestone](https://runestone.academy/runestone/books/published/pythonds/index.html).
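As a rough illustration of the idea, here is a minimal sketch of a general tree in Python. The class name `TreeNode` and its methods are made up for this example; they are not taken from the reference above or from the rest of this notebook.

```python
class TreeNode:
    """A minimal general tree node: a value plus a list of child nodes."""

    def __init__(self, value):
        self.value = value
        self.children = []

    def add_child(self, value):
        """Create a child node, attach it to this node, and return it."""
        child = TreeNode(value)
        self.children.append(child)
        return child

    def size(self):
        """Number of nodes in the subtree rooted at this node."""
        return 1 + sum(child.size() for child in self.children)

    def height(self):
        """Number of edges on the longest path from this node down to a leaf."""
        if not self.children:
            return 0
        return 1 + max(child.height() for child in self.children)


# Build a small tree: root -> (left -> left-left), right
tree = TreeNode("root")
left = tree.add_child("left")
tree.add_child("right")
left.add_child("left-left")

tree_size = tree.size()
tree_height = tree.height()
print(tree_size, tree_height)
```

The decision tree later in this notebook follows the same pattern: each node keeps a list of child nodes, plus the data that reached it.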



## An implementation of Decision Trees

Here we show the first steps in implementing a decision tree. The remaining details are left for you to complete in mini-project 5. Implementing a decision tree in Python requires choosing a suitable representation. This notebook shows one possible approach, using pandas DataFrames to store the data and a Python class to define the tree structure.

In [None]:
import pandas as pd
import numpy as np
from math import log2

Below are all the functions we are going to use.

In [None]:
def entropy(df,column,target=None):
    '''Compute the entropy (in bits) of the values in column.
    target is unused; it is kept so that existing calls keep working.'''
    proportions = df[column].value_counts(normalize=True)
    return -sum(p*log2(p) for p in proportions)

#entropy of split
def entropy_split(df,column,target):
    '''Split df on column and return the weighted average entropy of target
    over the resulting groups.'''
    weights = df[column].value_counts(normalize=True)
    entropies = np.array([-df[df[column]==i][target]
                .value_counts(normalize=True).apply(lambda x: x*log2(x)).sum()
                for i in weights.index])
    return sum(weights*entropies)

def gains(df,columns,target):
    '''Compute the information gain on target obtained by splitting df at
    each column in columns, and return the name of the column with the
    highest gain.'''
    # entropy_split expects the splitting column first, then the target
    entro = np.array([entropy_split(df,col,target) for col in columns])
    d = dict(zip(columns,entropy(df,target)-entro))
    # print(sorted(d, key=d.get,reverse=True))
    return max(d, key=d.get)

def purity_check(df,target,threshold = 0):
    '''Return True if the entropy of target in df is at most threshold.'''
    return bool(entropy(df,target) <= threshold)

def split(df,column):
    splt = []
    for val in df[column].unique():
        splt.append(df[df[column]==val])
    return splt
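
The `entropy` function above computes the Shannon entropy, H = -Σ p·log2(p), over the proportions of each value in a column. The following is a small pandas-free sketch of the same calculation. The helper name `entropy_of_labels` is made up for this example; the 9 Yes / 6 No label counts match the toy dataset used later in this notebook.

```python
from collections import Counter
from math import log2


def entropy_of_labels(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Same class balance as the toy dataset below: 9 'Yes' and 6 'No'
labels = ["Yes"] * 9 + ["No"] * 6
h = entropy_of_labels(labels)
print(h)
```

A 50/50 mix of two classes gives the maximum of 1 bit, and a column with a single value gives 0, which is what `purity_check` tests for.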


Here is the code for the tree data structure. It is implemented as a class whose instances are the nodes of the tree.

In [None]:
# Tree class
class Node():
    def __init__(self, df=None, pure = None, split_col = None):
        self.branch = []
        self.data = df
        self.pure = pure # purity of all branches
        self.split_col = split_col # column used for splitting data

    def __str__(self):
        print(self.data)
        return ''

    def __repr__(self):
        print(self.data)
        return ''

    def _split(self,column):
        self.split_col = column
        splt = split(self.data,column)    
        for i in splt: 
            self.branch.append( Node(i) ) #create nodes with corresponding data
        print(self.branch)    
    
    # check purity of branch
    def _check_purity(self,target = None):
        '''Check whether this node's data is pure with respect to target.
        If target is not given, the rightmost column is assumed.'''
        if target is None: target = self.data.columns[-1]
        result = purity_check(self.data,target)
        self.pure = result #update purity
        return result

    #check purity of all branches
    def _pure_branches(self):
        for branch in self.branch:
            branch._check_purity()
        self.pure = all(i.pure for i in self.branch)
        return self.pure

    def get_impure_nodes(self):
        '''Return the indices of the branches that are not pure.'''
        # update the purity of each branch (not of this node)
        for node in self.branch:
            node._check_purity()
        return [i for i in range(len(self.branch)) if not self.branch[i].pure]
       

## Toy implementation
Below is a toy dataset we can manipulate by hand.

In [None]:
# some toy data for experimentation
df = pd.DataFrame({'age':list('y'*5+'m'*5+'o'*5),
                         'has_job':list('ffttffftffffttf'),
                         'own_house':list('ffftffftttttfff'),
                         'credit_rating':list('fggfffggeeeggef'),
                         'class_':list('nnyynnnyyyyyyyn')})
df.head()

  age has_job own_house credit_rating class_
0   y       f         f             f      n
1   y       f         f             g      n
2   y       t         f             g      y
3   y       t         t             f      y
4   y       f         f             f      n


In [None]:
# let us give the variable levels friendlier names
# (assign the result back to each column; calling replace with inplace=True
# on a single column may not update df in recent pandas versions)
df['age'] = df['age'].map({'y':'young','m':'middle','o':'old'})
df['has_job'] = df['has_job'].map({'f':False,'t':True})
df['own_house'] = df['own_house'].map({'f':False,'t':True})
df['credit_rating'] = df['credit_rating'].map({'f':'fair','g':'good','e':'excellent'})
df['class_'] = df['class_'].map({'n':'No','y':'Yes'})
df.head()

     age  has_job  own_house credit_rating class_
0  young    False      False          fair     No
1  young    False      False          good     No
2  young     True      False          good    Yes
3  young     True       True          fair    Yes
4  young    False      False          fair     No


In [None]:
# create the root node and pass in data in df above
root = Node(df)
root # displaying the node prints its data (see Node.__repr__)

       age  has_job  own_house credit_rating class_
0    young    False      False          fair     No
1    young    False      False          good     No
2    young     True      False          good    Yes
3    young     True       True          fair    Yes
4    young    False      False          fair     No
5   middle    False      False          fair     No
6   middle    False      False          good     No
7   middle     True       True          good    Yes
8   middle    False       True     excellent    Yes
9   middle    False       True     excellent    Yes
10     old    False       True     excellent    Yes
11     old    False       True          good    Yes
12     old     True      False          good    Yes
13     old     True      False     excellent    Yes
14     old    False      False          fair     No

In [None]:
# take a look at the data in the root node
root.data.head()

     age  has_job  own_house credit_rating class_
0  young    False      False          fair     No
1  young    False      False          good     No
2  young     True      False          good    Yes
3  young     True       True          fair    Yes
4  young    False      False          fair     No


In [None]:
# take a look at the branches
# there are none as no splits have occurred
root.branch

[]

In [None]:
root.pure #nothing yet

In [None]:
# Before moving on, let us test our functions and class methods
entropy(root.data,'class_')

0.9709505944546686

In [None]:
split(root.data,'age')

[     age  has_job  own_house credit_rating class_
 0  young    False      False          fair     No
 1  young    False      False          good     No
 2  young     True      False          good    Yes
 3  young     True       True          fair    Yes
 4  young    False      False          fair     No,
       age  has_job  own_house credit_rating class_
 5  middle    False      False          fair     No
 6  middle    False      False          good     No
 7  middle     True       True          good    Yes
 8  middle    False       True     excellent    Yes
 9  middle    False       True     excellent    Yes,
     age  has_job  own_house credit_rating class_
 10  old    False       True     excellent    Yes
 11  old    False       True          good    Yes
 12  old     True      False          good    Yes
 13  old     True      False     excellent    Yes
 14  old    False      False          fair     No]

In [None]:
entropy_split(df,'age','class_')

0.8879430945988998

In [None]:
# Check the previous calculation by hand
-(2/3*(2/5*log2(2/5)+3/5*log2(3/5))+1/3*(4/5*log2(4/5)+1/5*log2(1/5)))

0.8879430945988998

In [None]:
# Everything seems in order. Let us now build our model.
# Which variable should we use for the first split?
# We choose the one that yields the maximum information gain
vartosplit = gains(root.data,root.data.columns[:-1],root.data.columns[-1])
vartosplit

'own_house'
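
To see why this column is chosen, it can help to look at the information gain of every candidate column, not only the best one. The following is a pandas-free sketch that recomputes the gains from the toy data, using the original one-letter encoding. The helper names `entropy_of` and `split_entropy` are made up for this example and are not part of the notebook's code.

```python
from collections import Counter, defaultdict
from math import log2

# Toy data from this notebook, one character per row (original encoding)
columns = {
    "age": "yyyyymmmmmooooo",
    "has_job": "ffttffftffffttf",
    "own_house": "ffftffftttttfff",
    "credit_rating": "fggfffggeeeggef",
}
target = "nnyynnnyyyyyyyn"  # the class_ column


def entropy_of(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def split_entropy(feature, labels):
    """Weighted average entropy of labels after grouping rows by feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    return sum(len(g) / len(labels) * entropy_of(g) for g in groups.values())


base = entropy_of(target)
info_gain = {name: base - split_entropy(col, target) for name, col in columns.items()}

# Print the columns from highest to lowest gain
for name, gain in sorted(info_gain.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {gain:.3f}")

best = max(info_gain, key=info_gain.get)
```

The gain for `age` is the entropy of `class_` minus the split entropy computed by hand above, and `own_house` has the largest gain of the four columns.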

In [None]:
#let us split the data in root node at vartosplit
root._split(vartosplit)

       age  has_job  own_house credit_rating class_
0    young    False      False          fair     No
1    young    False      False          good     No
2    young     True      False          good    Yes
4    young    False      False          fair     No
5   middle    False      False          fair     No
6   middle    False      False          good     No
12     old     True      False          good    Yes
13     old     True      False     excellent    Yes
14     old    False      False          fair     No
       age  has_job  own_house credit_rating class_
3    young     True       True          fair    Yes
7   middle     True       True          good    Yes
8   middle    False       True     excellent    Yes
9   middle    False       True     excellent    Yes
10     old    False       True     excellent    Yes
11     old    False       True          good    Yes
[, ]


In [None]:
# Take a look at the branches
root.branch

       age  has_job  own_house credit_rating class_
0    young    False      False          fair     No
1    young    False      False          good     No
2    young     True      False          good    Yes
4    young    False      False          fair     No
5   middle    False      False          fair     No
6   middle    False      False          good     No
12     old     True      False          good    Yes
13     old     True      False     excellent    Yes
14     old    False      False          fair     No
       age  has_job  own_house credit_rating class_
3    young     True       True          fair    Yes
7   middle     True       True          good    Yes
8   middle    False       True     excellent    Yes
9   middle    False       True     excellent    Yes
10     old    False       True     excellent    Yes
11     old    False       True          good    Yes


[, ]

In [None]:
# Is the first branch pure? It contains both Yes and No rows, so it is not,
# and another split is required.
root.branch[0]._check_purity('class_')

False

In [None]:
# is the other branch pure? (You can clearly see that it is)
root.branch[1]._check_purity('class_')

True

In [None]:
# root.pure has not been updated yet (it is still None);
# calling root._pure_branches() would update it
root.pure

In [None]:
# Get impure nodes (indices here) to continue to split
imp_node = root.get_impure_nodes()
imp_node

[0]

In [None]:
# Where to split the impure node in the first branch?
vartosplit = gains(root.branch[0].data,root.branch[0].data.columns[:-1],'class_')
vartosplit

'has_job'

In [None]:
# Here is the impure node again
root.branch[0]

       age  has_job  own_house credit_rating class_
0    young    False      False          fair     No
1    young    False      False          good     No
2    young     True      False          good    Yes
4    young    False      False          fair     No
5   middle    False      False          fair     No
6   middle    False      False          good     No
12     old     True      False          good    Yes
13     old     True      False     excellent    Yes
14     old    False      False          fair     No




In [None]:
# Let us now split it using vartosplit
root.branch[0]._split(vartosplit)

       age  has_job  own_house credit_rating class_
0    young    False      False          fair     No
1    young    False      False          good     No
4    young    False      False          fair     No
5   middle    False      False          fair     No
6   middle    False      False          good     No
14     old    False      False          fair     No
      age  has_job  own_house credit_rating class_
2   young     True      False          good    Yes
12    old     True      False          good    Yes
13    old     True      False     excellent    Yes
[, ]


In [None]:
#Look at its branches after split
root.branch[0].branch

       age  has_job  own_house credit_rating class_
0    young    False      False          fair     No
1    young    False      False          good     No
4    young    False      False          fair     No
5   middle    False      False          fair     No
6   middle    False      False          good     No
14     old    False      False          fair     No
      age  has_job  own_house credit_rating class_
2   young     True      False          good    Yes
12    old     True      False          good    Yes
13    old     True      False     excellent    Yes


[, ]

In [None]:
# As you can see above, all nodes after split are pure,
# but let us verify this using a purity check on this node
root.branch[0]._pure_branches()

True

We now have a model, a decision tree. In mini-project 5 you will be asked to automate this process.
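
To illustrate how such a model is used, here is a minimal sketch of classifying a new row with the tree built above. The nested-dict encoding and the `predict` function are assumptions made for this example; they are not part of the `Node` class, and mini-project 5 may call for a different design.

```python
# The tree built above, written as nested dicts (a hypothetical encoding):
# an internal node maps a column name to {value: subtree}; a leaf is a label.
tree = {
    "own_house": {
        True: "Yes",
        False: {"has_job": {True: "Yes", False: "No"}},
    }
}


def predict(node, row):
    """Follow the branches matching the row's values until a leaf is reached."""
    while isinstance(node, dict):
        column, branches = next(iter(node.items()))
        node = branches[row[column]]
    return node


# Same values as row 2 of the toy dataset
applicant = {"age": "young", "has_job": True, "own_house": False, "credit_rating": "good"}
prediction = predict(tree, applicant)
print(prediction)
```

Only the columns that were used for splitting matter for the prediction; `age` and `credit_rating` are never consulted by this tree.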

One last thing: in what order were the splits made?

In [None]:
root.split_col

'own_house'

In [None]:
root.branch[1].split_col

In [None]:
root.branch[0].split_col

'has_job'

In [None]:
root.branch[0].branch[1]

      age  has_job  own_house credit_rating class_
2   young     True      False          good    Yes
12    old     True      False          good    Yes
13    old     True      False     excellent    Yes




In [None]:
root.branch[0].branch[1].split_col
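
Instead of inspecting `split_col` one node at a time, the whole tree can be walked recursively. The sketch below assumes only that each node has `split_col` and `branch` attributes, as `Node` does. The function `list_splits` and the stand-in objects are made up for this example so that it runs on its own.

```python
from types import SimpleNamespace


def list_splits(node, depth=0):
    """Return (depth, split_col) pairs in depth-first order.

    Leaves, which were never split, have split_col None.
    """
    rows = [(depth, node.split_col)]
    for child in node.branch:
        rows.extend(list_splits(child, depth + 1))
    return rows


def make_leaf():
    """Stand-in for a node that was never split."""
    return SimpleNamespace(split_col=None, branch=[])


# Stand-in objects with the same shape as the tree built above (data omitted)
demo_root = SimpleNamespace(
    split_col="own_house",
    branch=[
        SimpleNamespace(split_col="has_job", branch=[make_leaf(), make_leaf()]),
        make_leaf(),
    ],
)

splits = list_splits(demo_root)
for depth, col in splits:
    print("  " * depth + str(col))
```

With the real tree, `list_splits(root)` should give the same structure, since `Node` objects expose the same two attributes.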