## IRIS Dataset
Picture and text published from http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/
![title](iris.png)

Three Iris varieties were used in the Iris flower data set outlined by Ronald Fisher in his famous 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. Since then his paper has been cited over 2000 times and the data set has been used by almost every data science beginner.

The data set consists of:

- 150 samples
- 3 labels: species of Iris (Iris setosa, Iris virginica and Iris versicolor)
- 4 features: the length and the width of the sepals and petals, in centimetres

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
random_state = 42



### Load Dataset


The sklearn library comes with the Iris data pre-loaded, which we convert into a pandas DataFrame.


In [None]:
from sklearn import datasets
def get_iris_dataset():
    iris = datasets.load_iris()
    iris_df = pd.DataFrame(iris.data, columns= iris.feature_names)
    iris_df['target'] = iris.target
    print('shape', iris_df.shape)
    # A bare head() call shows nothing inside a function, so print the preview
    print(iris_df.head())
    return iris_df, iris.target_names



### Target Values

The target field in the data represents the species of flower, mapped as
<br>setosa: 0,
<br>versicolor: 1,
<br>virginica: 2.

We can see that there are 50 samples of each species.
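
The class balance can be checked directly. The following is a minimal sketch that rebuilds the DataFrame the same way `get_iris_dataset` does; the name `class_counts` is introduced here for illustration only.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# Count the rows per class label (0, 1, 2), ordered by label
class_counts = iris_df['target'].value_counts().sort_index()
print(class_counts)
```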

### Prepare Data for Training
We will use only two features, petal length and petal width, so that the data can be plotted on a two-dimensional graph. We will keep all three classes of flowers, which makes this a multiclass classification problem because we are predicting more than two classes.
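
One way to select the two petal features is sketched below. The column names are the ones `load_iris` provides; the variable names `columns`, `X` and `y` are assumptions made for this example.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# Keep only the two petal measurements as model inputs
columns = ['petal length (cm)', 'petal width (cm)']
X = iris_df[columns]
y = iris_df['target']
print(X.shape, y.nunique())
```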

#### Split Data into training and test set
Split the data into a training set and a test set so that the number of samples in the two sets is in an 80:20 ratio. The model will be trained on the training set and then evaluated on the test set.
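
A possible way to make the 80:20 split is scikit-learn's `train_test_split`. This sketch assumes the two petal features as inputs and reuses the `random_state = 42` seed from the first cell. Whether the original notebook also stratified the split is not stated, so no stratification is applied here.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

random_state = 42  # same seed as the notebook's first cell

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = iris_df[['petal length (cm)', 'petal width (cm)']]
y = pd.Series(iris.target, name='target')

# test_size=0.2 holds out 20% of the 150 samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=random_state)
print(X_train.shape, X_test.shape)
```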

### Plot Training Data
There are three classes of flower. Each class is shown in a different color, with the two chosen features plotted on the x and y axes.

In [None]:
def display_classes(X_train, y_train, columns):
    train_df = X_train.copy()
    train_df['target'] = y_train
    plt.figure(figsize = (10, 6))
#     ax = sns.scatterplot(x= columns[0], y= columns[1], hue="target", 
#                          style ='target', palette="Set1", data= train_df)
    train_0 = train_df[train_df['target'] == 0]
    train_1 = train_df[train_df['target'] == 1]
    train_2 = train_df[train_df['target'] == 2]
    plt.scatter(x =train_0[columns[0]], y=  train_0[columns[1]], label = '0_setosa' )
    plt.scatter(x =train_1[columns[0]], y=  train_1[columns[1]], label = '1_versicolor' )
    plt.scatter(x =train_2[columns[0]], y=  train_2[columns[1]], label = '2_virginica' )
    plt.legend( )
    plt.show()



## Train the Model 
Train the model on the training set data by calling the fit method.

The model documentation can be accessed at
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
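
The training step might look like the sketch below. It assumes the petal-feature split described earlier and default hyperparameters. The variable name `clf` matches the argument name used by the helper functions in this notebook, but the exact cell is an assumption.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

random_state = 42

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = iris_df[['petal length (cm)', 'petal width (cm)']]
y = pd.Series(iris.target, name='target')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=random_state)

# Default hyperparameters: gini criterion and no limit on tree depth
clf = DecisionTreeClassifier(random_state=random_state)
clf.fit(X_train, y_train)
print('tree depth:', clf.get_depth(), 'leaves:', clf.get_n_leaves())
```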

#### Display the tree splits
The decision to split on a feature can be visualized in this graph. At each node, all features are scanned and the feature that gives the best split, as measured by the Gini score, is chosen. The lower the Gini score, the better the classification. A Gini score of 0 means perfect classification: all samples in the node belong to a single class. The decision tree grows its nodes to achieve the minimum Gini score after each split.
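
To make the Gini score concrete, here is a small sketch that computes it by hand as one minus the sum of squared class proportions. The helper names `gini_impurity` and `weighted_split_gini` are hypothetical and are not part of scikit-learn.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

print(gini_impurity([0, 0, 0, 0]))           # pure node
print(gini_impurity([0, 0, 1, 1]))           # evenly mixed, two classes
print(weighted_split_gini([0, 0], [1, 1]))   # split that separates the classes
```

A candidate split is better when its weighted impurity is lower, which is the quantity a tree compares across features and thresholds.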


In [None]:
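from io import StringIO  # replaces sklearn.externals.six, which newer scikit-learn releases no longer ship
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

def get_tree_graph(clf):
    # Write the fitted tree in Graphviz DOT format to an in-memory buffer
    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data,
                    filled=True, rounded=True,
                    special_characters=True)
    # Parse the DOT text into a pydotplus graph that can be rendered as an image
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return graph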
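
If Graphviz or pydotplus is not installed, the same splits can be inspected as plain text. The sketch below swaps in scikit-learn's `export_text` as an alternative to the graph above. It is not the method this notebook uses, and the setup lines are assumptions that mirror the earlier steps.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

random_state = 42

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
columns = ['petal length (cm)', 'petal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(
    iris_df[columns], iris.target, test_size=0.2, random_state=random_state)

clf = DecisionTreeClassifier(random_state=random_state).fit(X_train, y_train)

# One line per node: the feature threshold tested, or the class predicted at a leaf
rules = export_text(clf, feature_names=columns)
print(rules)
```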




#### Plot Tree Decision Boundaries on training set

In [None]:
def plot_train_decision_boundary(X_train, y_train, model):
    n_classes = 3
    plot_colors = "ryb"
    plot_step = 0.02

    X = X_train.values
    y = y_train.values
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))


    plt.figure(figsize=(8,6))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)
    
    features = X_train.columns.tolist()

    plt.xlabel(features[0])
    plt.ylabel(features[1])

    # Plot the training points, one color per class
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)[0]
        # cmap is omitted because it is ignored when c is a single named color
        plt.scatter(X[idx, 0], X[idx, 1], c=color,
                    edgecolor='black', s=15)

    plt.show()    



#### Training Accuracy


#### Plot Tree Decision Boundaries on Test set

In [None]:
def plot_test_decision_boundary(X_train, y_train, X_test, y_test, clf):
    n_classes = 3
    plot_colors = "ryb"
    plot_step = 0.02

    X = X_train.values
    y = y_train.values
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))


    plt.figure(figsize=(8,6))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)
    
    features = X_train.columns.tolist()

    plt.xlabel(features[0])
    plt.ylabel(features[1])

    # Plot the test points, one color per class
    X = X_test.values
    y = y_test.values
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)[0]
        # cmap is omitted because it is ignored when c is a single named color
        plt.scatter(X[idx, 0], X[idx, 1], c=color,
                    edgecolor='black', s=15)

    plt.show()



#### Test Set Prediction and Accuracy
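
A minimal sketch of predicting on the test set and scoring the result, assuming the default tree and the 80:20 petal-feature split from the earlier steps. `accuracy_score` is scikit-learn's standard accuracy metric; the names `y_pred` and `test_acc` are illustrative.

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

random_state = 42

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = iris_df[['petal length (cm)', 'petal width (cm)']]
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.2, random_state=random_state)

clf = DecisionTreeClassifier(random_state=random_state).fit(X_train, y_train)

# Predict labels for the held-out samples and compare them with the true labels
y_pred = clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f'test accuracy: {test_acc:.3f}')
```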

## Train with New Hyperparameters
Let's try to improve accuracy on the test set. The previous model had good training accuracy but slightly worse accuracy on the test set, which suggests that it overfitted the training data. We will now train the same model with different hyperparameters to reduce overfitting, so that it generalizes better to unseen data (the test data).

#### Display the tree splits

#### Training Accuracy

#### Plot Tree Decision Boundaries on training set


#### Test Accuracy
By limiting the freedom of the model, here by reducing the maximum depth to 3, we are able to get better accuracy. By default the depth of the tree is unlimited (max_depth=None), as was the case in our previous model. You can achieve similar results by tweaking other hyperparameters such as min_samples_leaf or min_samples_split. Welcome to the world of hyperparameter tuning, which will play an important role going forward.
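
The retraining step could be sketched as follows, assuming the same petal-feature split as before. `max_depth=3` comes from the text above. The `min_samples_leaf=5` value is purely illustrative and is not taken from the notebook.

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

random_state = 42

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = iris_df[['petal length (cm)', 'petal width (cm)']]
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.2, random_state=random_state)

# Cap the tree at three levels of splits
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=random_state)
clf_depth3.fit(X_train, y_train)
train_acc = accuracy_score(y_train, clf_depth3.predict(X_train))
test_acc = accuracy_score(y_test, clf_depth3.predict(X_test))
print(f'depth={clf_depth3.get_depth()} train={train_acc:.3f} test={test_acc:.3f}')

# Alternative constraint: require at least 5 training samples in every leaf
clf_leaf = DecisionTreeClassifier(min_samples_leaf=5, random_state=random_state)
clf_leaf.fit(X_train, y_train)
print('leaves with min_samples_leaf=5:', clf_leaf.get_n_leaves())
```

Comparing the printed training and test accuracy against the unconstrained tree shows whether the constraint narrowed the gap on a given split.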

#### Plot Tree Decision Boundaries on Test Set