# June 3: Classification With Decision Trees

## Data Loading

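The [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set), also known as Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. It contains 150 samples, 50 from each of three Iris species, with four features measured per sample.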

In [0]:
from sklearn import datasets

iris = datasets.load_iris()
X = iris['data']
y = iris['target']

## Data Exploration
#### Features

In [0]:
iris.feature_names

<a href="https://imgbb.com/"><img src="https://i.ibb.co/mDC1KSt/petal-sepal.png" alt="petal-sepal" border="0"></a>

#### Target Labels

In [0]:
iris.target_names

<a href="https://ibb.co/8sJhx0R"><img src="https://i.ibb.co/h9kqdyn/iris.png" alt="iris" border="0"></a>

#### Dataset Size

In [0]:
len(iris.data)

#### Visualizations

In [0]:
from matplotlib import pyplot as plt
%matplotlib inline

colors = ['r', 'y', 'b']
fig, ax = plt.subplots(1, 4, figsize=(20,4))
cols = [[0, 1], [2, 3], [0, 2], [1, 3]]

for n in range(len(ax)):
    for t in range(len(iris.target_names)):
        mask = iris.target == t  # rows that belong to class t
        ax[n].scatter(iris.data[mask, cols[n][0]], iris.data[mask, cols[n][1]], c=colors[t])
    ax[n].set_title('{} Vs {}'.format(iris.feature_names[cols[n][0]], iris.feature_names[cols[n][1]]))
    ax[n].set_xlabel(iris.feature_names[cols[n][0]])
    ax[n].set_ylabel(iris.feature_names[cols[n][1]])

fig.legend(labels=['setosa', 'versicolor', 'virginica'], ncol=1, loc='right', fontsize=15)
plt.show()

## Modeling A Classifier

### See [sklearn.tree.DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) docs for parameter list, attributes etc. 

The most "relevant" parameters:
* **criterion** : _The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Default is "gini"._

* **splitter** : _The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split. Default is "best"._

* **max_depth** : _The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Default is `None`._

* **min_samples_split** : _The minimum number of samples required to split an internal node. Default = `2`._

* **min_samples_leaf** : _The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least `min_samples_leaf` training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. Default is `1`._
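
To make the effect of these parameters concrete, here is a minimal sketch that fits two trees on the full Iris data (no train/test split, so it is not a solution to the exercises below) and compares their size. The variable names `iris_demo`, `full_tree` and `small_tree` are illustrative, and `random_state=0` is an assumption added only to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative copy of the data, kept separate from the notebook's X and y.
iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

# Default settings: nodes keep splitting until no further split is allowed.
full_tree = DecisionTreeClassifier(criterion='gini', splitter='best',
                                   max_depth=None, min_samples_split=2,
                                   min_samples_leaf=1, random_state=0)
full_tree.fit(X_demo, y_demo)

# Constrained settings: at most two levels and at least five samples per leaf.
small_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2,
                                    min_samples_leaf=5, random_state=0)
small_tree.fit(X_demo, y_demo)

for name, fitted in [('default', full_tree), ('constrained', small_tree)]:
    print(name, 'depth =', fitted.get_depth(), 'leaves =', fitted.get_n_leaves())
```

Limiting `max_depth` or raising `min_samples_leaf` yields a smaller tree, which is usually easier to read and less prone to overfitting.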

## TO DO: Split The Data Into Train/Test Sets

In [0]:
# TO DO: Split the data into test (20%) and train sets.
############ Your code goes here ############


## TO DO: Training

In [0]:
# TO DO: Train a DecisionTreeClassifier model on train set.

# Parameters
criterion = 'gini'
splitter = 'best'
max_depth = None
min_samples_split = 2
min_samples_leaf = 1

############ Your code goes here ############


## TO DO: Testing

In [0]:
# TO DO: Evaluate the trained model on the test set. Print both the accuracy score and the classification report.
############ Your code goes here ############

## Visualizing The Decision Tree Learnt

In [0]:
import graphviz
from sklearn import tree

model = None  # TO DO: replace None with your trained model variable
# Export the fitted tree in Graphviz DOT format
dot_data = tree.export_graphviz(model, out_file=None,
                      feature_names=iris.feature_names,
                      class_names=iris.target_names,
                      filled=True, rounded=True,
                      special_characters=True)
graph = graphviz.Source(dot_data)
graph

## TO DO: Decision Boundary Visualization

Write the training code that is missing in the cell below.

In [0]:
import numpy as np
import matplotlib.pyplot as plt

# `iris` is the dataset loaded in the Data Loading cell

# Hyper-parameters
criterion = 'gini'
splitter = 'best'
max_depth = None
min_samples_split = 2
min_samples_leaf = 1

# Parameters
n_classes = 3
plot_colors = "ryb"
plot_step = 0.02
plt.figure(figsize=(16, 4))

for pairidx, pair in enumerate([[0, 1], [2, 3], [0, 2], [1, 3]]):
    # Considers only pairs of features for 2D plotting
    X = iris.data[:, pair]
    y = iris.target
    
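    # Training Code: fit a DecisionTreeClassifier on X, y with the hyper-parameters
    # above and store it in `clf` (the plotting code below calls clf.predict)
    ############ Your code goes here ############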
    
    
    # Plot the decision boundary
    plt.subplot(1, 4, pairidx + 1)
    

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    edgecolor='black', s=15)

# Show the class labels set above and render the figure
plt.legend(loc='lower right')
plt.show()

## TO DO: Implement Gini-Index From Scratch

In this exercise, you will get your hands dirty and write a method that outputs the Gini index for a given split of a toy dataset.

### Visualizing The Toy-Data

In [0]:
import pandas as pd

toy_data = [[4.484812507,3.4983517180,0],
            [3.442139098,2.883329202,0],
            [5.391887635,4.526381359,0],
            [5.674611146,4.333518109,0],
            [4.712776711,3.922582001,0],
            [9.211113656,4.876521335,1],
            [10.71577104,5.052614977,1],
            [9.158110115,2.190251164,1],
            [11.83850681,4.948118771,1],
            [8.355855140,5.033551554,1]]

toy_data_pd = pd.DataFrame(toy_data, columns=['x1', 'x2', 'class_id'])
plt.scatter(toy_data_pd.x1[toy_data_pd.class_id == 0], toy_data_pd.x2[toy_data_pd.class_id == 0], c='r')
plt.scatter(toy_data_pd.x1[toy_data_pd.class_id == 1], toy_data_pd.x2[toy_data_pd.class_id == 1], c='b')
plt.xlabel('X1')
plt.ylabel('X2')
plt.legend(labels=[0,1])

### 1. Method `data_split` splits the dataset on the feature at `index`: rows whose value is less than `value` go to the left group, all other rows go to the right group.

### 2. Method `get_best_split` is a driver method that calculates `gini_index` for every possible split.

### 3. (TO DO) Method `gini_index` takes the `groups` produced by a split and the list of `classes`, and calculates the Gini index.

If a target is a classification outcome taking on values $0, 1, \ldots, K-1$, for node $m$, representing a region $R_m$ with $N_m$ observations, let

>$p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$
  
be the proportion of class $k$ observations in node $m$.
Common measures of impurity are:

* Gini
>$H(X_m) = \sum_k p_{mk} (1 - p_{mk})$

* Entropy
>$H(X_m) = - \sum_k p_{mk} \log(p_{mk})$
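
The two impurity measures above can be sketched in plain Python for a single node. This is only an illustration of the formulas, not the `gini_index(groups, classes)` method of the exercise; the function names are hypothetical, and the natural logarithm is an assumption, since the formula does not fix the base.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Return the proportions p_mk of each class among one node's labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_impurity(labels):
    """Gini impurity: sum over classes of p * (1 - p)."""
    return sum(p * (1.0 - p) for p in class_proportions(labels))

def entropy_impurity(labels):
    """Entropy: minus the sum over classes of p * log(p), natural log assumed."""
    return -sum(p * math.log(p) for p in class_proportions(labels))

mixed_node = [0, 0, 1, 1]   # two samples of each class
pure_node = [1, 1, 1]       # a single class only

print(gini_impurity(mixed_node))     # 0.5, the maximum for two classes
print(gini_impurity(pure_node))      # 0.0, a pure node has no impurity
print(entropy_impurity(mixed_node))  # equals log(2) for an even two-class mix
```

Both measures are zero for a pure node and largest when the classes are evenly mixed.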


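A candidate split $Q$ of node $m$ sends $n_{left}$ of its $N_m$ samples to the left child $Q_{left}$ and $n_{right}$ to the right child $Q_{right}$. The split is scored by the impurity of the two children, weighted by their share of the samples; the best split is the one with the lowest score:
>$G(Q) = \frac{n_{left}}{N_m} H(Q_{left}) + \frac{n_{right}}{N_m} H(Q_{right})$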
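
As a worked example, computed by hand from the toy data above with the Gini impurity, consider the candidate split $x_1 < 5.392$. Three rows fall to the left, all of class 0, and seven rows fall to the right, two of class 0 and five of class 1:

>$H(Q_{left}) = 1 \cdot (1 - 1) + 0 \cdot (1 - 0) = 0$

>$H(Q_{right}) = \frac{2}{7} \cdot \frac{5}{7} + \frac{5}{7} \cdot \frac{2}{7} = \frac{20}{49} \approx 0.408$

>$G(Q) = \frac{3}{10} \cdot 0 + \frac{7}{10} \cdot \frac{20}{49} = \frac{2}{7} \approx 0.286$

A split that separates the two classes perfectly, such as $x_1 < 8.356$, gives $G(Q) = 0$.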

In [0]:
# A method that does the array split
def data_split(index, value, data):
  left, right = list(), list()
  for row in data:
    if row[index] < value:
      left.append(row)
    else:
      right.append(row)
  return left, right

# Output the best split point for the data
def get_best_split(data):
  class_values = list(set(row[-1] for row in data))
  # Start from sentinel values; any real Gini score is lower than 999
  best_index, best_value, best_score, best_groups = 999, 999, 999, None
  # Try every feature column (the last column holds the class label)
  for index in range(len(data[0])-1):
    for row in data:
      groups = data_split(index, row[index], data)
      gini = gini_index(groups, class_values)
      print('X%d < %.3f Gini=%.3f' % ((index+1), row[index], gini))
      if gini < best_score:
        best_index, best_value, best_score, best_groups = index, row[index], gini, groups
  return {'index':best_index, 'value':best_value, 'groups':best_groups}

# (TO DO) Calculate the Gini index for a split dataset
def gini_index(groups, classes):
  ############ Your code goes here ############
  
  return gini_value



In [0]:
split = get_best_split(toy_data)
print('Best Split: [X%d < %.3f]' % ((split['index']+1), split['value']))

In [0]:
plt.scatter(toy_data_pd.x1[toy_data_pd.class_id == 0], toy_data_pd.x2[toy_data_pd.class_id == 0], c='r')
plt.scatter(toy_data_pd.x1[toy_data_pd.class_id == 1], toy_data_pd.x2[toy_data_pd.class_id == 1], c='b')
# Draw the split line on the axis of the feature that was chosen
if split['index'] == 0:
  plt.axvline(x=split['value'])  # best split is on X1
else:
  plt.axhline(y=split['value'])  # best split is on X2
plt.show()