**Author:** Felipe Lodur

# Trees

Implementing and experimenting with Decision Tree, Bagging Trees, and Random Forest.

### Data Format
- The last column of the data frame must contain the label and must be named "label".
- The data frame must not contain missing values.
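
The two rules above can be checked before fitting. The helper below is a minimal sketch, not part of the notebook's modules: `check_data_format` and the toy frame are hypothetical names introduced here for illustration.

```python
import pandas as pd


def check_data_format(df: pd.DataFrame) -> None:
    """Raise ValueError if df breaks either of the two format rules."""
    # Rule 1: the last column holds the label and is named "label".
    if len(df.columns) == 0 or df.columns[-1] != "label":
        raise ValueError("the last column must be named 'label'")
    # Rule 2: no missing values anywhere in the frame.
    n_missing = int(df.isna().sum().sum())
    if n_missing > 0:
        raise ValueError(f"data frame contains {n_missing} missing value(s)")


# Hypothetical toy frame that satisfies both rules.
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7],
    "label": ["Iris-setosa", "Iris-versicolor"],
})
check_data_format(toy)  # passes silently
```

Calling it right after loading the CSV would surface a misnamed label column or stray missing values early, instead of as an error inside `fit`.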

In [5]:
import pandas as pd

df = pd.read_csv("Iris.csv")
df = df.drop("Id", axis=1)
df = df.rename(columns={"species": "label"})
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


Train-test split:

In [6]:
import random

def train_test_split(df, test_size):
    '''Split df into a train set and a test set.

    test_size can be given as:
        float: fraction of the data to use as the test set
        int: number of instances to use as the test set
    '''
    if isinstance(test_size, float):
        test_size = round(test_size * len(df))

    indices = df.index.tolist()
    test_indices = random.sample(population=indices, k=test_size)

    test_df = df.loc[test_indices]
    train_df = df.drop(test_indices)
    
    return train_df, test_df

random.seed(42) # as always :p
train_df, test_df = train_test_split(df, test_size=20)

## 1. Decision Tree

In [7]:
from DecisionTree import DecisionTree

dt = DecisionTree(max_depth=2, min_samples=5, impurity='entropy')
dt.fit(train_df)
dt.predict(test_df)

| | pred | proba |
|---|---|---|
| 28 | Iris-setosa | {'Iris-setosa': 1.0} |
| 6 | Iris-setosa | {'Iris-setosa': 1.0} |
| 70 | Iris-virginica | {'Iris-virginica': 1.0} |
| 62 | Iris-versicolor | {'Iris-versicolor': 0.9148936170212766, 'Iris-... |
| 57 | Iris-versicolor | {'Iris-versicolor': 0.9148936170212766, 'Iris-... |
| 35 | Iris-setosa | {'Iris-setosa': 1.0} |
| 26 | Iris-setosa | {'Iris-setosa': 1.0} |
| 139 | Iris-virginica | {'Iris-virginica': 1.0} |
| 22 | Iris-setosa | {'Iris-setosa': 1.0} |
| 108 | Iris-virginica | {'Iris-virginica': 1.0} |
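
The `impurity='entropy'` argument above presumably selects Shannon entropy as the split criterion. The `DecisionTree` module itself is not shown here, so the following is only a sketch of how such a criterion is commonly computed. The function names `entropy` and `information_gain` are illustrative, not the module's API.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # p * log2(1 / p) is the same as -p * log2(p) and avoids a negative zero.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy labels (not taken from the notebook's data).
left = ["Iris-versicolor"] * 5
right = ["Iris-virginica"] * 5
parent = left + right

print(entropy(left))                          # pure node
print(entropy(parent))                        # evenly mixed two-class node
print(information_gain(parent, left, right))  # gain of a perfect split
```

A pure node has zero entropy and an evenly mixed two-class node has one bit, so a split that separates the two classes perfectly recovers the full bit as gain.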


## 2. Bagging Trees

In [8]:
from BaggingTrees import BaggingTrees

bt = BaggingTrees(n_estimators=10, bootstrap_value=1.0,  # bagging params
                  max_depth=2, min_samples=5, impurity='entropy')  # dt params
bt.fit(train_df)
bt.predict(test_df)

| | pred | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|---|
| 28 | Iris-setosa | 1.0 | 0.0 | 0.0 |
| 6 | Iris-setosa | 1.0 | 0.0 | 0.0 |
| 70 | Iris-virginica | 0.0 | 0.183239 | 0.816761 |
| 62 | Iris-versicolor | 0.0 | 0.942537 | 0.057463 |
| 57 | Iris-versicolor | 0.0 | 0.942537 | 0.057463 |
| 35 | Iris-setosa | 1.0 | 0.0 | 0.0 |
| 26 | Iris-setosa | 1.0 | 0.0 | 0.0 |
| 139 | Iris-virginica | 0.0 | 0.0 | 1.0 |
| 22 | Iris-setosa | 1.0 | 0.0 | 0.0 |
| 108 | Iris-virginica | 0.0 | 0.0 | 1.0 |
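
The per-class columns above look like probabilities averaged over the ten trees. The `BaggingTrees` source is not shown, so the sketch below only illustrates the general bagging idea under two assumptions: that `bootstrap_value` is the fraction of the training set drawn with replacement for each tree, and that the final probabilities are a plain mean of the per-tree ones. All function names are hypothetical.

```python
import random


def bootstrap_sample(rows, fraction=1.0, rng=random):
    """Draw round(fraction * len(rows)) items from rows, with replacement."""
    k = round(fraction * len(rows))
    return [rng.choice(rows) for _ in range(k)]


def average_probas(per_tree, classes):
    """Average per-tree probability dicts; a class missing from a dict counts as 0."""
    n = len(per_tree)
    return {c: sum(p.get(c, 0.0) for p in per_tree) / n for c in classes}


def predict_from_probas(avg):
    """Return the class with the highest averaged probability."""
    return max(avg, key=avg.get)


rng = random.Random(42)
rows = list(range(10))  # stand-in for training row indices
sample = bootstrap_sample(rows, fraction=1.0, rng=rng)

classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# Made-up outputs of two trees for a single test instance.
per_tree = [
    {"Iris-versicolor": 1.0},
    {"Iris-versicolor": 0.75, "Iris-virginica": 0.25},
]
avg = average_probas(per_tree, classes)
print(avg, predict_from_probas(avg))
```

Because each tree sees a different resample, instances near a class boundary (such as row 70 above) can receive fractional probabilities even when a single tree reports 1.0.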
