<a href="https://colab.research.google.com/github/alexjohnson21/ubiquitous-sniffle/blob/master/cse450_prove04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 04 Prove - Decision Tree Classifier

## Setup and imports

In [0]:
!pip install decision-tree-id3

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from id3 import Id3Estimator
from id3 import export_graphviz

In [0]:
from google.colab import files
uploaded = files.upload()

In [103]:
iris = pd.read_csv("iris.data")
iris.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
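The `iris.data` file loaded here evidently has a header row, since the columns come out named. If your copy has no header (as is commonly the case for the raw UCI download, which is an assumption about your file, not something this notebook shows), a minimal sketch like the following supplies the names explicitly. The in-memory sample and the `columns` list are illustrative stand-ins.

```python
import io
import pandas as pd

# Two sample rows in a header-less layout (assumed file format)
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

# Column names chosen to match the ones used in this notebook
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None stops pandas from treating the first data row as a header
iris_demo = pd.read_csv(raw, header=None, names=columns)
print(iris_demo.shape)
```

Without `header=None`, pandas would consume the first flower as column names and silently drop one row.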


## Preprocessing

### Check for missing values

In [104]:
# Count missing values in each column (axis=0 sums down the rows)
iris.isna().sum(axis=0)

# Result: no missing values

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
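The `axis` argument is easy to mix up, so here is a small sketch on a made-up frame (the `demo` data is purely illustrative) showing that `axis=0` yields one count per column while `axis=1` yields one count per row.

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# axis=0 collapses the rows: one total per column
missing_per_column = demo.isna().sum(axis=0)

# axis=1 collapses the columns: one total per row
missing_per_row = demo.isna().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```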

### Check types of data present

In [105]:
iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

### Split the dataset into features and targets (X and Y)

In [106]:
# Integer-based location - all rows, all columns except the last
X = iris.iloc[:, :-1]
X_names = X.columns

# All rows, only last column
Y = iris.iloc[:, -1]

# scikit-learn-style estimators work with NumPy arrays
X = X.to_numpy()
Y = Y.to_numpy()

X_names = X_names.to_numpy()
X_names

array(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
      dtype=object)

## Create training and testing sets

In [0]:
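# Hold out 20% of the rows for testing. No random_state is set, so the split
# (and the accuracies below) will differ from run to run.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)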
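If you want the split to be repeatable, one option is to pass `random_state`, and optionally `stratify` to keep class proportions equal in both sets. This is a sketch of that variation, not what the cell above does; the synthetic `X_demo`/`y_demo` arrays and the seed value 42 are arbitrary stand-ins shaped like Iris (150 rows, 3 balanced classes).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 150 rows, 4 features, 3 balanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = np.repeat(["a", "b", "c"], 50)

# random_state fixes the shuffle; stratify keeps class ratios in both sets
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many test rows each class received
classes, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes, test_counts)))
```

With a fixed seed, the three trees later in the notebook would be compared on an identical test set every time the notebook is rerun.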

## Create decision tree and experiment with it
*ID3-based decision tree written by svaante. [View on GitHub](https://github.com/svaante/decision-tree-id3)*
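For intuition, ID3 chooses each split by maximizing information gain, the drop in label entropy after the split. The sketch below is a from-scratch illustration of that criterion on a numeric threshold, not the library's actual code; the helper names and the four toy measurements are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy petal lengths: a threshold of 2.5 separates the two classes perfectly
petal_length = [1.4, 1.3, 4.7, 4.5]
species = ["setosa", "setosa", "versicolor", "versicolor"]
gain = information_gain(petal_length, species, threshold=2.5)
print(gain)
```

A perfect split removes all uncertainty, so the gain equals the full entropy of the parent node.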

### Default ID3 tree - no specified parameters

In [108]:
dec_tree = Id3Estimator()
dec_tree.fit(x_train, y_train)

Id3Estimator(gain_ratio=False, is_repeating=False, max_depth=None,
             min_entropy_decrease=0.0, min_samples_split=2, prune=False)

![Default ID3 tree structure](https://drive.google.com/uc?export=view&id=160H6SkdZkK5qmqiIkXtxQmGsBMB3RXTG)

In [109]:
y_pred = dec_tree.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)

accuracy

0.8666666666666667

### ID3 tree with pruning

In [110]:
dec_tree = Id3Estimator(prune=True)
dec_tree.fit(x_train, y_train)

Id3Estimator(gain_ratio=False, is_repeating=False, max_depth=None,
             min_entropy_decrease=0.0, min_samples_split=2, prune=True)

![Pruned ID3 tree structure](https://drive.google.com/uc?export=view&id=1-1ycnS8FQYWWNbENIGdQJQFCyikCnPuY)

In [111]:
y_pred = dec_tree.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)

accuracy

0.9

### ID3 tree with depth limit

In [112]:
dec_tree = Id3Estimator(max_depth=4)
dec_tree.fit(x_train, y_train)

Id3Estimator(gain_ratio=False, is_repeating=False, max_depth=4,
             min_entropy_decrease=0.0, min_samples_split=2, prune=False)

![Limited ID3 tree structure](https://drive.google.com/uc?export=view&id=1-ChWl7s4Kq9mwk_vJsqtzsPrHLNuPLUN)

In [113]:
y_pred = dec_tree.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)

accuracy

0.8666666666666667
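To put the three scores in perspective, it can help to convert them back into counts of correct predictions. This sketch assumes the test set holds 30 rows (20% of the standard 150-row Iris data, which the notebook does not print, so treat `n_test` as an assumption).

```python
# Assumed test-set size: 20% of 150 rows
n_test = 30

# Accuracies reported by the three cells above
scores = {
    "default": 0.8666666666666667,
    "pruned": 0.9,
    "max_depth=4": 0.8666666666666667,
}

# Convert each accuracy into a number of correctly classified rows
correct = {name: round(acc * n_test) for name, acc in scores.items()}
print(correct)
```

Under that assumption the pruned tree gets just one more test row right than the other two, so a single unseeded split is weak evidence that pruning helps.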

## A script to automate DOT file creation and PNG export
I'm kind of proud of this.

In [0]:
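# Initialize the export counter the first time this cell runs
try:
    times_exported
except NameError:
    times_exported = 0

output_name = "dec_tree_" + str(times_exported)
dot_name = output_name + ".dot"
image_name = output_name + ".png"

# Write the fitted tree as a Graphviz DOT file
export_graphviz(dec_tree.tree_, dot_name, feature_names=X_names)
times_exported += 1

# Render the DOT file to PNG, then delete the intermediate DOT file
!dot -T png $dot_name -o $image_name
!rm $dot_name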
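The `!dot` and `!rm` lines only work inside a notebook. Outside one, a rough equivalent can be sketched with `subprocess`; the helper names `build_dot_command` and `render_png` are hypothetical, and the sketch assumes the Graphviz `dot` executable is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path

def build_dot_command(dot_path, png_path):
    """Return the argument list for rendering a DOT file to PNG."""
    return ["dot", "-Tpng", str(dot_path), "-o", str(png_path)]

def render_png(dot_path, png_path):
    """Render dot_path to png_path and delete the DOT file.

    Returns False without doing anything if Graphviz is not installed.
    """
    if shutil.which("dot") is None:
        return False
    subprocess.run(build_dot_command(dot_path, png_path), check=True)
    Path(dot_path).unlink()
    return True

print(build_dot_command("dec_tree_0.dot", "dec_tree_0.png"))
```

Passing the command as a list avoids shell quoting issues if a file name ever contains spaces.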