<a href="https://colab.research.google.com/github/cpaniaguam/CSC104/blob/main/trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Trees and decision trees
[Trees](https://en.wikipedia.org/wiki/Tree_(data_structure)) are hierarchical [data structures](https://en.wikipedia.org/wiki/List_of_data_structures): a root node connects to child nodes, which may in turn have children of their own.

![picture](https://runestone.academy/runestone/books/published/pythonds/_images/treerecs.png)

Representing data structures in a programming language is a core topic of a traditional second course in computer programming; see [CSC-300](http://catalog.salve.edu/preview_course_nopop.php?catoid=18&coid=19057) in the Salve catalog. An accessible reference is *Problem Solving with Algorithms and Data Structures using Python* by Miller and Ranum, [available for free at Runestone](https://runestone.academy/runestone/books/published/pythonds/index.html).
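
To make the idea of a tree concrete, here is a minimal sketch in plain Python. The names `TreeNode`, `add_child`, `height`, and `count_leaves` are hypothetical helpers chosen for this illustration; they are not part of the decision tree code later in this notebook.

```python
class TreeNode:
    """A node that stores a value and a list of child nodes."""
    def __init__(self, value):
        self.value = value
        self.children = []

    def add_child(self, value):
        # create a child node, attach it, and return it for further use
        child = TreeNode(value)
        self.children.append(child)
        return child

def height(node):
    """Number of edges on the longest path from `node` down to a leaf."""
    if not node.children:
        return 0
    return 1 + max(height(c) for c in node.children)

def count_leaves(node):
    """Number of nodes below (or at) `node` that have no children."""
    if not node.children:
        return 1
    return sum(count_leaves(c) for c in node.children)

tree_root = TreeNode('root')
left = tree_root.add_child('left')
right = tree_root.add_child('right')
left.add_child('left-left')

print(height(tree_root))        # 2
print(count_leaves(tree_root))  # 2
```

Both functions are recursive because a tree is itself a recursive structure: every child is the root of a smaller tree.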



## An implementation of Decision Trees

Here we show the first steps in implementing a decision tree; the remaining details are left for you to complete in mini-project 5. Implementing decision trees in Python requires choosing a suitable representation. This notebook shows one possible approach that uses pandas DataFrames to store the data and Python classes to define the tree structure.

In [None]:
import pandas as pd
import numpy as np
from math import log2

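Below are the helper functions we are going to use: `entropy` and `entropy_split` measure impurity, `gains` picks the column with the highest information gain, `purity_check` tests whether a subset contains a single class, and `split` partitions a DataFrame by the values of a column.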

In [None]:
def gains(df, columns, target):
    """Return the column in `columns` with the highest information gain."""
    # weighted entropy of the target after splitting on each candidate column
    entro = np.array([entropy_split(df, target, col) for col in columns])
    # information gain = entropy before the split - entropy after the split
    d = dict(zip(columns, entropy(df, target) - entro))
    return max(d, key=d.get)

def entropy_split(df, target, column):
    """Weighted average entropy of `target` after splitting `df` on `column`."""
    weights = df[column].value_counts(normalize=True)
    entropies = np.array([entropy(df[df[column] == i], target)
                          for i in weights.index])
    return (weights.values * entropies).sum()

def entropy(df, column, target=None):
    """Shannon entropy (in bits) of the values in `column`.

    `target` is unused; it is kept so that existing calls keep working.
    """
    proportions = df[column].value_counts(normalize=True)
    return -sum(p * log2(p) for p in proportions)

def purity_check(df, target):
    """True if every row of `df` has the same `target` value."""
    return entropy(df, target) == 0

def split(df, column):
    """Return one sub-DataFrame per distinct value of `column`."""
    return [df[df[column] == val] for val in df[column].unique()]
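
As a check on the arithmetic behind these functions, the following self-contained sketch recomputes the entropy $H = -\sum_i p_i \log_2 p_i$ and the information gain by hand, without pandas. It uses the raw `own_house` and `class_` letters of the toy dataset defined later in this notebook. The helper names `entropy_of` and `information_gain` are assumptions made for this illustration and are not used elsewhere.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of `labels` minus the weighted entropy after grouping by `feature`."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    after = sum(len(g) / n * entropy_of(g) for g in groups.values())
    return entropy_of(labels) - after

# raw letters of two columns of the toy dataset (15 rows each)
own_house = list('ffftffftttttfff')
class_ = list('nnyynnnyyyyyyyn')

print(round(entropy_of(class_), 3))                   # 0.971
print(round(information_gain(own_house, class_), 3))  # 0.42
```

The target has 9 `y` and 6 `n` values, giving an entropy of about 0.971 bits. Splitting on `own_house` leaves one pure group (6 rows, entropy 0) and one group of 9 rows with entropy of about 0.918, so the weighted entropy after the split is about 0.551 and the gain is about 0.420.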


Here is the code for the tree data structure. Each node is an instance of the `Node` class, which stores its data and a list of child nodes (`branch`).

In [None]:
# Tree class
class Node():
    def __init__(self, df=None, pure = None, split_col = None):
        self.branch = []
        self.data = df
        self.pure = pure #purity of all branches
        self.split_col = split_col # column used for splitting data

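    def __str__(self):
        # return (rather than print) the text shown by print(node)
        return str(self.data)

    def __repr__(self):
        # return the text shown when a node is displayed in a notebook cell
        return repr(self.data)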

    def _split(self,column):
        self.split_col = column
        splt = split(self.data,column)
        for i in splt:
            self.branch.append( Node(i) ) # create nodes with corresponding data
        print(self.branch)
    
    # check purity of branch
    def _check_purity(self,target = None):
        if target is None: target = self.data.columns[-1] # default: last column
        result = purity_check(self.data,target)
        self.pure = result #update purity
        return result

    #check purity of all branches
    def _pure_branches(self):
        for branch in self.branch:
            branch._check_purity()
        res = all(i.pure for i in self.branch)
        self.pure = res
        return self.pure

    def get_impure_nodes(self):
        # update the purity flag of every child node
        for node in self.branch:
            node._check_purity()
        return [i for i in range(len(self.branch)) if not self.branch[i].pure]

## Toy implementation
Below is a toy dataset we can manipulate by hand.

In [None]:
# some toy data for experimentation
df = pd.DataFrame({'age':list('y'*5+'m'*5+'o'*5),
                         'has_job':list('ffttffftffffttf'),
                         'own_house':list('ffftffftttttfff'),
                         'credit_rating':list('fggfffggeeeggef'),
                         'class_':list('nnyynnnyyyyyyyn')})
df

In [None]:
# let us give the variable levels friendlier names
# (assign the result back instead of calling replace in place on a column)
df['age'] = df['age'].replace({'y':'young','m':'middle','o':'old'})
df['has_job'] = df['has_job'].replace({'f':False,'t':True})
df['own_house'] = df['own_house'].replace({'f':False,'t':True})
df['credit_rating'] = df['credit_rating'].replace({'f':'fair','g':'good','e':'excellent'})
df['class_'] = df['class_'].replace({'n':'No','y':'Yes'})
df

In [None]:
# create the root node and pass in data
root=Node(df)
root

In [None]:
# take a look at the data in the root node
root.data

In [None]:
# take a look at the branches
# there are none as no splits have occurred
root.branch

In [None]:
# purity has not been checked yet, so this is still None
root.pure

In [None]:
# What variable to use for split?
vartosplit = gains(root.data,root.data.columns[:-1],root.data.columns[-1])
vartosplit

In [None]:
#let us split the data in root node at vartosplit
root._split(vartosplit)

In [None]:
# Take a look at the branches
root.branch

In [None]:
# are the branches pure?
root.branch[0]._check_purity('class_')


In [None]:
# is the other branch pure? (You can clearly see that it is)
root.branch[1]._check_purity('class_')

In [None]:
# Get impure nodes (indices here) to continue to split
imp_node = root.get_impure_nodes()
imp_node

In [None]:
# Where to split?
vartosplit = gains(root.branch[0].data,root.branch[0].data.columns[:-1],'class_')
vartosplit

In [None]:
root.branch[0]._split(vartosplit)

In [None]:
root.branch[0].branch
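
The manual steps above (pick the best column, split, check purity, repeat on the impure branches) can be sketched as a single recursive function. This is only one possible approach, written with plain dictionaries instead of the `Node` class, and the helper names `entropy_of`, `best_feature`, and `grow` are assumptions made for this illustration. It uses the raw letters of the toy dataset.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_feature(rows, features, target):
    """Feature whose split leaves the lowest weighted entropy (highest gain)."""
    def split_entropy(f):
        groups = {}
        for r in rows:
            groups.setdefault(r[f], []).append(r[target])
        return sum(len(g) / len(rows) * entropy_of(g) for g in groups.values())
    return min(features, key=split_entropy)

def grow(rows, features, target):
    """Return a class label for a leaf, or {feature: {value: subtree}}."""
    labels = [r[target] for r in rows]
    # stop on a pure node, or fall back to the majority class
    # when no features remain to split on
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, features, target)
    rest = [g for g in features if g != f]
    children = {}
    for value in sorted({r[f] for r in rows}):
        subset = [r for r in rows if r[f] == value]
        children[value] = grow(subset, rest, target)
    return {f: children}

# raw letters of the toy dataset, one string per column
cols = {'age': 'y'*5 + 'm'*5 + 'o'*5,
        'has_job': 'ffttffftffffttf',
        'own_house': 'ffftffftttttfff',
        'credit_rating': 'fggfffggeeeggef',
        'class_': 'nnyynnnyyyyyyyn'}
rows = [dict(zip(cols, vals)) for vals in zip(*cols.values())]

tree = grow(rows, ['age', 'has_job', 'own_house', 'credit_rating'], 'class_')
print(tree)
# {'own_house': {'f': {'has_job': {'f': 'n', 't': 'y'}}, 't': 'y'}}
```

The result mirrors the walk-through: the first split is on `own_house`, the house owners form a pure branch, and the remaining rows become pure after one more split on `has_job`.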