# <font color='red'>Project IRIS: with sklearn</font>

In this tutorial you will discover **how to use sklearn to run a decision tree classifier in machine learning**. 

Goals:
* Load data from a CSV file and make it available for analysis
* Create a model right away and run a classification

# Description of the input data

The iris flowers dataset is a standard ML dataset, widely used as a benchmark.

### Dataset availability:

Available almost everywhere, e.g.
   * [UCI Machine Learning repository](http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)

More info:
   * [UCI Machine Learning Repository page](https://archive.ics.uci.edu/ml/datasets/Iris).

Alternatively:
   * get it from [https://github.com/bonacor/CorsoSwComp](https://github.com/bonacor/CorsoSwComp) by importing it into Google Colab
      * direct URL to the dataset: [https://github.com/bonacor/CorsoSwComp/blob/master/iris.data.csv](https://github.com/bonacor/CorsoSwComp/blob/master/iris.data.csv)

### Dataset description:

* This is a good example to practice on a multiclass classification problem.
* Each instance describes the measurements of one observed iris flower
* All 4 input variables are numeric and share the same unit (cm)
   * Sepal length in centimeters 
   * Sepal width in centimeters 
   * Petal length in centimeters 
   * Petal width in centimeters
* The output variable is a specific iris species (3 possibilities)
   * the "class", one of "Iris-setosa", "Iris-versicolor" or "Iris-virginica"
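
The description above can be cross-checked against the copy of the dataset that ships with scikit-learn. This is only a sketch: it uses `load_iris()` rather than the CSV file of this tutorial, and the bundled copy labels the classes `setosa`, `versicolor` and `virginica` (without the `Iris-` prefix used in the CSV).

```python
from collections import Counter

from sklearn.datasets import load_iris

# Bundled copy of the iris dataset (not the CSV file used in this notebook)
iris = load_iris()

n_samples, n_features = iris.data.shape
feature_names = list(iris.feature_names)
class_names = [str(name) for name in iris.target_names]

# Number of flowers per class, keyed by class name
class_counts = Counter(class_names[int(t)] for t in iris.target)

print(n_samples, n_features)
print(feature_names)
print(class_counts)
```

The printout should show 4 numeric features in cm and 3 equally represented classes, matching the list above.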

### What the dataset looks like:


    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    (...)


### Additional input from best practitioners:

The iris flower dataset is a well-studied problem, so we can expect to achieve a model accuracy in the range of 95% to 97%. Use this as the target to aim for when developing your model(s).

# Set-up

## Import Classes and Functions

Start by importing all classes and functions you will need:
* data loading functionality from **Pandas**
* the bundled iris loader and the decision tree classifier from **scikit-learn**

In [None]:
# pandas
from pandas import read_csv
# sklearn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Data import

In [None]:
# download the dataset, for example from this repository:
!wget https://raw.githubusercontent.com/bonacor/CorsoSwComp/master/iris.data.csv

In [None]:
!ls -trl iris.data.csv

In [None]:
!head -5 iris.data.csv

In [None]:
# load dataset
dataframe = read_csv("iris.data.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:4].astype(float)   # columns 1 to 4 (the features) into X
Y = dataset[:,4]                   # column 5 (the class label) into Y

In [None]:
len(X)

In [None]:
len(Y)

In [None]:
X

In [None]:
Y
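
As an alternative way to inspect the loaded data, the sketch below gives the columns names and checks shapes and class counts. It is self-contained: it reads the five sample rows shown earlier from an in-memory string instead of `iris.data.csv`, and the column names are hypothetical (the CSV file has no header row).

```python
from io import StringIO

import pandas as pd

# First five rows of the dataset, copied from the sample shown earlier
sample_csv = StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
    "4.6,3.1,1.5,0.2,Iris-setosa\n"
    "5.0,3.6,1.4,0.2,Iris-setosa\n"
)

# Hypothetical column names, chosen here for readability only
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = pd.read_csv(sample_csv, header=None, names=columns)

# Same split as in the notebook: 4 numeric features, 1 label column
X_sample = df[columns[:4]].to_numpy(dtype=float)
y_sample = df["species"].to_numpy()

print(df.dtypes)                     # feature columns should be float64
print(X_sample.shape, y_sample.shape)
print(df["species"].value_counts())  # rows per class
```

To run the same checks on the full file, replace `sample_csv` with the path `"iris.data.csv"`.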

# Model creation

In [None]:
#iris = load_iris()
#X = iris.data[:, 2:] # petal length and width
#y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=3)
tree_clf.fit(X, Y)
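
The model above is fitted on all available rows, so scoring it on those same rows would overstate its accuracy. One possible way to compare against the 95% to 97% target mentioned earlier is k-fold cross-validation. The sketch below is an assumption-laden illustration, not part of the original notebook: it uses the dataset bundled with scikit-learn, 5 folds, and a fixed `random_state`, all chosen here for convenience.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of the iris dataset, so the example needs no CSV file
iris = load_iris()

# Same tree depth as in the notebook; random_state fixed for repeatability
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation: each fold is held out once for scoring
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

With the arrays from this notebook, the same call would be `cross_val_score(clf, X, Y, cv=5)`.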

# Exercise 1: Run model prediction (difficulty: easy)

Using the sklearn documentation, write simple commands that:
   * tell you the class probabilities for this observation: 8.0, 3.0, 6.0, 2.0
   * tell you the predicted class

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [None]:
## INSERT YOUR CODE HERE

# Exercise 2: Repeat with fewer features (difficulty: moderate)

Modify this notebook to run Exercise 1 on an input dataset that has only petal length and petal width (i.e. the last 2 of the feature columns you have been given), instead of all 4 available features.