# Supervised Machine Learning: Decision Tree

A decision tree is a supervised learning algorithm that can be used for both classification and regression problems. Its input features can be continuous or categorical. A decision tree makes predictions by applying a sequence of if-then rules. A few assumptions underlie how a tree is built:

1. Initially, the entire training set is treated as the root node.
2. Feature values are preferred to be categorical; continuous values are discretized before the tree is built. (Implementations such as scikit-learn's `DecisionTreeClassifier` instead split continuous features directly on a threshold.)
3. Records are distributed recursively according to their attribute values.
4. A statistical criterion determines which attribute is placed at the root node or at each internal node.
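
To make the if-then idea concrete, a fitted tree can be read as a set of nested conditionals. The sketch below is hand-written for illustration only: the function name `classify_iris` and both thresholds are assumptions, not values learned from the data.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Hand-written stand-in for a tiny decision tree (thresholds are illustrative)."""
    # Root node: test one feature against a threshold.
    if petal_length_cm < 2.5:
        return "setosa"
    # Internal node: test a second feature on the remaining records.
    if petal_width_cm < 1.8:
        return "versicolor"
    # Everything that fails both tests falls into the last leaf.
    return "virginica"


print(classify_iris(1.4, 0.2))  # setosa
```

Each `if` corresponds to an internal node and each `return` to a leaf; training a tree amounts to choosing the features and thresholds automatically.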

In [23]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz
from IPython.display import Image 
from pydot import graph_from_dot_data
import pandas as pd
import numpy as np
from io import StringIO
from sklearn import tree
import matplotlib

## 1. Load Data

In [24]:
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)

In [25]:
X.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [26]:
y = pd.get_dummies(y)
y.head()

|   | setosa | versicolor | virginica |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |


In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## 2. Decision Tree

In [28]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [29]:
#dot_data = StringIO()
#export_graphviz(dt, out_file=dot_data, feature_names=iris.feature_names)
#(graph, ) = graph_from_dot_data(dot_data.getvalue())
#Image(graph.create_png())
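
If Graphviz is not available, scikit-learn's `export_text` prints a fitted tree as indented if-then rules. The sketch below refits a tree on the integer class labels in `iris.target` rather than the one-hot `y` used above, an assumption made to keep the example simple. The names `dt_text`, `X_tr`, `y_tr` and the `random_state` values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Assumption: integer labels (0, 1, 2) instead of the one-hot target used above.
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=1)

# Fixing random_state makes the fitted tree reproducible between runs.
dt_text = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Render the tree as text: one line per test, with leaves shown as "class: <label>".
rules = export_text(dt_text, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one test on a feature, so the printout reads directly as the chain of if-then rules the tree applies.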

## 3. Key concepts

Several metrics determine how the nodes of a decision tree are split.

* **Entropy**: a measure of the uncertainty (impurity) in the dataset. Entropy is 0 when all records belong to a single class and reaches its maximum when the classes are equally represented. With two classes that maximum is 1; with k classes it is log2(k), which is about 1.585 for the three-class iris data.
* **Information Gain**: the decrease in entropy after a dataset is split on an attribute. At each node, a decision tree looks for the attribute that yields the highest information gain.
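
Both quantities are short to compute by hand. The following is a minimal sketch in plain Python; the function names and the toy `yes`/`no` labels are made up for illustration and are not part of the notebook's data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted


# Hypothetical toy labels: a perfectly balanced two-class parent node.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

print(entropy(parent))                          # 1.0
print(information_gain(parent, [left, right]))  # about 0.278
```

A split that separated the two classes perfectly would have a gain of 1.0 here, the full entropy of the parent.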


There are a few algorithms for choosing the best split at each node:

* **ID3 Algorithm** (Iterative Dichotomiser 3): This algorithm uses entropy and information gain to build the tree. The attribute with the highest information gain becomes the root node, and the same procedure is applied recursively to each branch. A node becomes a leaf when its entropy is zero.
    1. Compute the entropy of the dataset.
    2. For every attribute:
        * Calculate the entropy of the records for each of its categorical values.
        * Take the weighted average of those entropies, weighting each value by its share of the records.
        * Calculate the information gain for the attribute (dataset entropy minus the weighted average).
    3. Pick the attribute with the highest information gain.
    4. Repeat on each branch until every node is a leaf or no attributes remain.
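
The ID3 steps above can be sketched in plain Python for categorical data. This is a simplified illustration, not a production implementation: the helper names, the nested-dict tree format, and the toy weather records are all assumptions introduced here.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    # Group the labels by the value each row takes for this attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def id3(rows, labels, attributes):
    # Leaf: all labels agree (entropy is zero) or no attributes are left.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 3: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Step 4: recurse on the subset of records for each value of that attribute.
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = id3([r for r, _ in subset], [l for _, l in subset], remaining)
    return tree


def predict(tree, row):
    # Follow branches until a leaf (a plain label) is reached.
    # An attribute value never seen during training raises KeyError.
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][row[attribute]]
    return tree


# Hypothetical toy records, not taken from the iris data.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play", "play"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))  # stay
```

On this toy data `outlook` has the higher information gain, so it becomes the root; only the `rainy` branch is still impure and is split again on `windy`.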
    

* **CART Algorithm** (Classification and Regression trees):