# <font color='red'>Project IRIS: with sklearn</font>

In this tutorial you will discover **how to use sklearn to run a decision tree classifier in machine learning**. 

Goals:
* Load data from a CSV file and make it available for analysis
* Create a model and run a classification right away

# Description of the input data

The iris flower dataset is a standard ML dataset, widely used as a benchmark.

### Dataset availability:

Available almost everywhere, e.g.:
   * [UCI Machine Learning repository](http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)

More info:
   * [UCI Machine Learning Repository page](https://archive.ics.uci.edu/ml/datasets/Iris).

Alternatively:
   * get it from [https://github.com/bonacor/CorsoSwComp](https://github.com/bonacor/CorsoSwComp) by importing it into Google Colab
      * direct URL to the dataset: [https://github.com/bonacor/CorsoSwComp/blob/master/iris.data.csv](https://github.com/bonacor/CorsoSwComp/blob/master/iris.data.csv)

### Dataset description:

* This is a good example to practice on a multiclass classification problem.
* Each instance describes the measurements of one observed iris flower
* All 4 input variables are numeric and share the same unit (cm)
   * Sepal length in centimeters 
   * Sepal width in centimeters 
   * Petal length in centimeters 
   * Petal width in centimeters
* The output variable is a specific iris species (3 possibilities)
   * the "class": "Iris-setosa", "Iris-versicolor" or "Iris-virginica"

### What the dataset looks like:


    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    (...)
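
The file has no header row, so the column names must be supplied when it is loaded. The following is a minimal sketch, not the notebook's own loading code: the five sample rows above are inlined so that it runs without a download, and the column names are hypothetical labels chosen for illustration.

```python
from io import StringIO

import pandas as pd

# The five sample rows shown above, inlined so the sketch needs no download.
sample_csv = StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
    "4.6,3.1,1.5,0.2,Iris-setosa\n"
    "5.0,3.6,1.4,0.2,Iris-setosa\n"
)

# Hypothetical column names: the file itself does not define any.
columns = [
    "sepal_length_cm",
    "sepal_width_cm",
    "petal_length_cm",
    "petal_width_cm",
    "species",
]

# header=None tells pandas that the first line is data, not a header.
sample = pd.read_csv(sample_csv, header=None, names=columns)

print(sample.dtypes)               # four float columns and one object column
print(sample["species"].unique())  # the class labels present in the sample
```

Naming the columns makes later selections such as `sample[["petal_length_cm", "petal_width_cm"]]` easier to read than positional slices.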


### Additional input from best practitioners:

The iris flower dataset is a well-studied problem, so we can expect to achieve a model accuracy in the range of 95% to 97%. Use this as a target to aim for when developing your model(s).
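
To compare a model against this target, its accuracy has to be measured on data it was not trained on. The sketch below is one possible way to do that with 5-fold cross-validation. It makes several assumptions that are not part of this notebook: it uses the copy of the iris data bundled with scikit-learn (`load_iris`) instead of the CSV file, and the `max_depth=2`, `random_state=0` and `cv=5` settings are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the CSV file.
iris = load_iris()

# Assumption: a shallow tree with a fixed seed, chosen only for illustration.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)

# 5-fold cross-validation: each fold is held out once for scoring
# while the model is trained on the remaining four folds.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy: %.3f" % scores.mean())
```

The mean of the per-fold scores is the number to compare with the 95% to 97% range; a single score computed on the training data itself would be too optimistic.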

# Set-up

## Import Classes and Functions

Start by importing all classes and functions you will need:
* data loading functionalities from **Pandas**
* data preparation and model evaluation from **scikit-learn**

In [1]:
# pandas
from pandas import read_csv
# sklearn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Data import

In [2]:
# download the dataset, for example from here:
!wget https://raw.githubusercontent.com/bonacor/CorsoSwComp/master/iris.data.csv

--2019-05-23 16:33:00--  https://raw.githubusercontent.com/bonacor/CorsoSwComp/master/iris.data.csv
Resolving raw.githubusercontent.com... 151.101.240.133
Connecting to raw.githubusercontent.com|151.101.240.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4551 (4.4K) [text/plain]
Saving to: 'iris.data.csv.1'


2019-05-23 16:33:01 (20.2 MB/s) - 'iris.data.csv.1' saved [4551/4551]



In [3]:
!ls -trl iris.data.csv

-rw-r--r--  1 bonacor  staff  4551 May 23 11:03 iris.data.csv


In [4]:
!head -5 iris.data.csv

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa


In [52]:
# load dataset
dataframe = read_csv("iris.data.csv", header=None)
dataset = dataframe.values
X = dataset[:,2:4].astype(float)   # 3rd and 4th columns (petal length, petal width) into X; use [:,0:4] for all 4 features
Y = dataset[:,4]                   # 5th column (the class label) into Y

In [53]:
len(X)

150

In [54]:
len(Y)

150

In [55]:
X

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.7, 0.4],
       [1.4, 0.3],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.5, 0.1],
       [1.5, 0.2],
       [1.6, 0.2],
       [1.4, 0.1],
       [1.1, 0.1],
       [1.2, 0.2],
       [1.5, 0.4],
       [1.3, 0.4],
       [1.4, 0.3],
       [1.7, 0.3],
       [1.5, 0.3],
       [1.7, 0.2],
       [1.5, 0.4],
       [1. , 0.2],
       [1.7, 0.5],
       [1.9, 0.2],
       [1.6, 0.2],
       [1.6, 0.4],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.6, 0.2],
       [1.6, 0.2],
       [1.5, 0.4],
       [1.5, 0.1],
       [1.4, 0.2],
       [1.5, 0.1],
       [1.2, 0.2],
       [1.3, 0.2],
       [1.5, 0.1],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.3, 0.3],
       [1.3, 0.3],
       [1.3, 0.2],
       [1.6, 0.6],
       [1.9, 0.4],
       [1.4, 0.3],
       [1.6, 0.2],
       [1.4, 0.2],
       [1.5, 0.2],
       [1.4, 0.2],
       [4.7, 1.4],
       [4.5, 1.5],
       [4.9,

In [48]:
Y

array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versic

# Model creation

In [56]:
#iris = load_iris()
#X = iris.data[:, 2:] # petal length and width
#y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, Y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
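
A fitted tree can be printed as a set of if/else rules, which helps to see what `max_depth=2` produces. The sketch below is an illustration under stated assumptions rather than part of the original notebook: it uses `export_text`, which needs scikit-learn 0.21 or later, it trains on the bundled `load_iris` data instead of the CSV file, and the feature names and `random_state=0` are hypothetical choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: scikit-learn's bundled iris data stands in for the CSV file.
iris = load_iris()
X_petal = iris.data[:, 2:4]   # petal length and petal width only

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_petal, iris.target)

# Hypothetical feature names, used only to label the printed rules.
rules = export_text(
    tree,
    feature_names=["petal length (cm)", "petal width (cm)"],
)
print(rules)
```

Each indented level in the printed text is one split, and each `class:` line is a leaf of the tree.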

# Exercise 1: Run model prediction (difficulty: easy)

Use the sklearn documentation to write simple commands that:
   * tell you the class probabilities for this observation: 8.0, 3.0, 6.0, 2.0
   * tell you the predicted class 

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [11]:
## INSERT YOUR CODE HERE

In [50]:
tree_clf.predict_proba([[8,3,6,2]])

array([[0.        , 0.02173913, 0.97826087]])

In [51]:
tree_clf.predict([[8,3,6,2]])

array(['Iris-virginica'], dtype=object)
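
The columns returned by `predict_proba` follow the order of the classifier's `classes_` attribute, so pairing the two makes the output easier to read. The sketch below assumes the bundled `load_iris` data in place of the CSV file and rebuilds string labels in the style of the file ("Iris-setosa" and so on); the `random_state=0` setting is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the CSV file.
iris = load_iris()

# Rebuild string labels in the same style as the CSV file.
names = np.array(["Iris-" + name for name in iris.target_names])
labels = names[iris.target]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, labels)   # all 4 features, as in Exercise 1

observation = [[8.0, 3.0, 6.0, 2.0]]
proba = clf.predict_proba(observation)[0]

# classes_ lists the labels in the same order as the probability columns.
for label, p in zip(clf.classes_, proba):
    print(label, round(p, 3))

print("predicted class:", clf.predict(observation)[0])
```

`predict` returns the label whose probability is highest, so the two calls always agree.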

# Exercise 2: Repeat with fewer features (difficulty: moderate)

Modify this notebook to run Exercise 1 with an input dataset that contains only petal length and petal width (i.e. the last 2 of the 4 feature columns you have been given), instead of all 4 available features.

In [57]:
tree_clf.predict_proba([[6,2]])

array([[0.        , 0.02173913, 0.97826087]])

In [58]:
tree_clf.predict([[6,2]])

array(['Iris-virginica'], dtype=object)
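
Selecting the right columns is the key step in this exercise. As a small illustration, the sketch below applies NumPy slicing to three of the sample rows shown earlier; the variable names are hypothetical.

```python
import numpy as np

# Three rows taken from the sample shown earlier in this notebook
# (sepal length, sepal width, petal length, petal width).
rows = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
])

# Slices are zero-based and exclude the end index.
first_two = rows[:, 0:2]   # sepal length, sepal width
last_two = rows[:, 2:4]    # petal length, petal width

print(last_two)
```

A model trained on `last_two`-style input expects exactly two values per observation, which is why the prediction calls above pass `[[6, 2]]` rather than four numbers.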