# Comparing the Performance of a Decision Tree Classifier
## Criterion: Entropy and Gini Index
## Performance Metrics: Mean Absolute Error (MAE) and Mean Squared Error (MSE)

When building a decision tree classifier to model a data set (such as the iris data), the goal at each split is to maximize the information gained when a node is split into two child nodes. Information gain can be calculated using two different criteria: (1) entropy and (2) the Gini index. Both criteria are supported by the DecisionTreeClassifier class in sklearn.

\begin{gather*}
entropy = -\sum_n p_n \log_2 (p_n)\\
gini \: index = 1 - \sum_n p_n^2\\
p_n = probability \: of \: class \: n
\end{gather*}
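
As a minimal sketch of the two impurity formulas, the functions below evaluate entropy and the Gini index for a list of class probabilities. The function names `entropy` and `gini_index` and the two example distributions are illustrative assumptions, not part of sklearn; terms with zero probability are skipped, following the usual convention that they contribute nothing to the entropy.

```python
import math

def entropy(probs):
    """Entropy in bits: -sum(p * log2(p)); zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini_index(probs):
    """Gini index: 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in probs)

# Hypothetical class distributions for a node with three classes.
pure = [1.0, 0.0, 0.0]       # every sample belongs to one class
uniform = [1/3, 1/3, 1/3]    # samples spread evenly over the classes

for name, probs in [("pure", pure), ("uniform", uniform)]:
    print(name, "entropy:", entropy(probs), "gini:", gini_index(probs))
```

Both measures are zero for the pure node and largest for the evenly mixed node (about 1.585 bits of entropy and a Gini index of about 0.667 for three classes), which is why a split that lowers either one is treated as a gain in information.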


Once a model has been fit to the training data using a specific criterion, it needs to be evaluated based on how close its predictions are to the validation data. "Closeness" can be measured using two different metrics: (1) mean absolute error and (2) mean squared error.


\begin{gather*}
MAE=\frac{\sum_{i=1}^N|y_i - x_i|}{N}\\
MSE=\frac{\sum_{i=1}^N(y_i - x_i)^2}{N}\\
x_i = predicted \; value \\
y_i = true \; value \\
N = total \: number \: of \: data \: points
\end{gather*}
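
To make the two formulas concrete, here is a small sketch that evaluates them by hand in plain Python. The helper names `mae_manual` and `mse_manual` and the four-point example are assumptions made for illustration; in the cells below, sklearn's `mean_absolute_error` and `mean_squared_error` are used instead.

```python
def mae_manual(y_true, y_pred):
    """Mean absolute error: sum(|y_i - x_i|) / N."""
    return sum(abs(y - x) for y, x in zip(y_true, y_pred)) / len(y_true)

def mse_manual(y_true, y_pred):
    """Mean squared error: sum((y_i - x_i)^2) / N."""
    return sum((y - x) ** 2 for y, x in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: one of the four predictions is off by two classes.
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 0, 2]

print("MAE:", mae_manual(y_true, y_pred))  # (0 + 0 + 2 + 0) / 4 = 0.5
print("MSE:", mse_manual(y_true, y_pred))  # (0 + 0 + 4 + 0) / 4 = 1.0
```

Because the error is squared, MSE penalizes the single two-class miss more heavily than MAE does.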


In [2]:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
X, y = iris.data, iris.target
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)

# Train the model on a portion of the data.
# To see how, follow the Intro to ML Kaggle tutorial

Alternatively, the iris dataset can be imported into a DataFrame from a CSV file using pandas.

In [3]:
#import pandas as pd
#import numpy as np

#col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'type']
#data = pd.read_csv('GitHub_Iris.csv')
#data = pd.read_csv('Kaggle_Iris.csv')

#data.head()

#data.iloc[:, -1].value_counts()


#X = data.iloc[:, :-1]
#y = data.iloc[:, -1:]

## Split into Training and Validation Datasets

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error


train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=42, test_size=0.33)

## criterion='gini'

In [9]:
model_gini = tree.DecisionTreeClassifier(criterion='gini', random_state=1)
model_gini.fit(train_X, train_y)

val_predictions = model_gini.predict(val_X)

gini_MAE = mean_absolute_error(val_y, val_predictions)
gini_MSE = mean_squared_error(val_y, val_predictions)

print('Mean absolute error is:', gini_MAE)
print('Mean squared error is:', gini_MSE)

Mean absolute error is: 0.02
Mean squared error is: 0.02


## criterion='entropy'

In [10]:
model_entropy = tree.DecisionTreeClassifier(criterion='entropy', random_state=1)
model_entropy.fit(train_X, train_y)

val_predictions = model_entropy.predict(val_X)

entropy_MAE = mean_absolute_error(val_y, val_predictions)
entropy_MSE = mean_squared_error(val_y, val_predictions)

print('Mean absolute error is:', entropy_MAE)
print('Mean squared error is:', entropy_MSE)

Mean absolute error is: 0.02
Mean squared error is: 0.02


## Mean Squared Error

In [11]:
entropy_MSE == gini_MSE

True
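
One possible explanation for MAE and MSE both coming out at 0.02 is sketched below. It assumes a validation set of 50 samples with a single prediction that is off by exactly one class label; the notebook output does not confirm this, and the label arrays here are made up for illustration. When every error is 0 or 1, squaring leaves it unchanged, so the two metrics coincide.

```python
# Hypothetical validation labels: 50 samples over three classes.
n_val = 50
y_true = [0] * 17 + [1] * 17 + [2] * 16

# Assume every prediction is correct except one adjacent-class mistake.
y_pred = list(y_true)
y_pred[20] = 2  # true class 1 predicted as class 2

errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = sum(errors) / n_val
mse = sum(e ** 2 for e in errors) / n_val

print("MAE:", mae)
print("MSE:", mse)
print("Equal:", mae == mse)
```

Under these assumptions both metrics reduce to the misclassification rate, 1/50 = 0.02. They would diverge only if a class-0 sample were predicted as class 2 or vice versa, since that error of 2 is squared to 4 in the MSE.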

# Summary

Write a summary ...

Unlikely scenario