# Project IRIS - Phase 1: sklearn


In this tutorial you will discover **how to use sklearn to run a decision tree classifier in machine learning**. 

Goals:
* Load data from a CSV file and make it available for modelling
* Go straight to creating a model and running a classification

# <font color='blue'>A. Description of the input data

In [0]:
!wget http://bonacor.web.cern.ch/bonacor/SC_AA1920/images/iris.png
from IPython.display import Image
Image(filename='/content/iris.png')

The iris flowers dataset is a standard ML dataset, widely used as a benchmark.

### Dataset availability:

It is available almost everywhere, e.g.:
   * [UCI Machine Learning Repository page](https://archive.ics.uci.edu/ml/datasets/Iris)

More conveniently, get it from the GitHub repository of these hands-on sessions, at this direct URL:
   * [https://raw.githubusercontent.com/dbonacorsi/SC_AA1920/master/datasets/iris.data.csv](https://raw.githubusercontent.com/dbonacorsi/SC_AA1920/master/datasets/iris.data.csv)

### Dataset description:

* This is a good example to practice on a multiclass classification problem.
* Each instance describes the measurements of one observed iris flower
* All 4 input variables are numeric and share the same unit (cm)
   * Sepal length in centimeters 
   * Sepal width in centimeters 
   * Petal length in centimeters 
   * Petal width in centimeters
* The output variable is a specific iris species (3 possibilities)
   * the "class", e.g. "Iris-setosa", "Iris-versicolor" or "Iris-virginica"

### What the dataset looks like:


    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    (...)


### Additional input from best practitioners:

The iris flower dataset is a well-studied problem, so we can expect to achieve a model accuracy in the range of 95% to 97%. Use this as the target to aim for when developing your model(s).
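
Accuracy here means the fraction of observations whose predicted class matches the true one. A minimal sketch of how it can be computed with sklearn, using made-up labels (the two lists below are hypothetical, not the output of any model in this notebook):

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, for illustration only
y_true = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginica", "Iris-setosa"]
y_pred = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

# Fraction of positions where prediction and truth agree (4 out of 5 here)
acc = accuracy_score(y_true, y_pred)
print(acc)
```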

# <font color='blue'>B. Data import (+ quick data exploration) + data preparation

Start by importing all classes and functions you will need:
* data loading functionalities from **Pandas** (learn more [here](https://pandas.pydata.org/))
* data preparation and model evaluation from **Scikit-learn** - referred to as `sklearn` in the following (learn more [here](https://scikit-learn.org/stable/))

In [0]:
# pandas
from pandas import read_csv

# sklearn
# ... later ...

Download and import the data. You can do it in various ways. 

In [0]:
# you can directly download the data and inspect it

# --- download
#!wget https://raw.githubusercontent.com/dbonacorsi/SC_AA1920/master/datasets/iris.data.csv
#!ls -trl iris.data.csv
#!head -5 iris.data.csv
# --- import
#dataframe = read_csv("iris.data.csv", header=None)
#dataset = dataframe.values

# but we do something slightly more sophisticated - see below.

The output variable contains strings, so the easiest approach is to load the data with **pandas** into a DataFrame.

In [0]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/SC_AA1920/master/datasets/iris.data.csv'

names = ['sepal-l', 'sepal-w', 'petal-l', 'petal-w', 'class']
dataset = pd.read_csv(url, names=names)
dataset

In [0]:
shape = dataset.shape
shape

In [0]:
class_counts = dataset.groupby('class').size()
print(class_counts)

In [0]:
from pandas import set_option

set_option('display.width', 200)
set_option('display.max_rows', 500)
set_option('display.max_columns', 500)
set_option('display.precision', 3)   # full option name; the bare 'precision' is rejected by recent pandas

description = dataset.describe()
print(description)

In [0]:
from matplotlib import pyplot

# set the figure size before plotting, otherwise it does not apply to this figure
pyplot.rcParams["figure.figsize"] = [8, 8]
dataset.hist()
pyplot.show()

Now, split the attributes (i.e. columns) into input variables (the matrix of **features X**) and the output variable (the vector of **labels Y**).

In [0]:
data = dataset.values
data

In [0]:
X = data[:,0:4].astype(float)   # columns from 1st to 4th into X
Y = data[:,4]                   # column 5th into Y

In [0]:
len(X)

In [0]:
len(Y)

In [0]:
X

In [0]:
Y
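
The same split can also be done on the DataFrame itself, selecting columns by name instead of slicing the NumPy array. The sketch below is one possible variant, not the notebook's method: it rebuilds a tiny DataFrame from the five sample rows shown earlier so that it runs without a download, and the names `df`, `X_df` and `y_sr` are illustrative.

```python
from io import StringIO
import pandas as pd

# First five rows of the dataset, copied from the sample shown earlier
sample = StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
    "4.6,3.1,1.5,0.2,Iris-setosa\n"
    "5.0,3.6,1.4,0.2,Iris-setosa\n"
)
names = ['sepal-l', 'sepal-w', 'petal-l', 'petal-w', 'class']
df = pd.read_csv(sample, names=names)

X_df = df[names[:4]]   # DataFrame with the four numeric feature columns
y_sr = df['class']     # Series with the string labels

print(X_df.dtypes)
print(y_sr.unique())
```

sklearn estimators accept DataFrames and Series as input, so `X_df` and `y_sr` could be passed to `fit` in place of the arrays.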

# <font color='blue'>C. Model creation, training and inference

In a real ML project, you will spend **plenty** of time on **data exploration** and **data preparation** before getting to this point: both are key to success. In this example, however, the data is relatively simple, already clean and ready to use, so you can go straight to creating an ML model.

In [0]:
#iris = load_iris()
#X = iris.data[:, 2:] # petal length and width
#y = iris.target

from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, Y)

Learn more in the sklearn official documentation for [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
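
To see what a fitted tree has learned, its decision rules can be printed as text. This is a minimal sketch under two assumptions: it uses the copy of iris bundled with sklearn as a stand-in for the CSV loaded above, and the names `X_demo`, `y_demo` and `demo_tree` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: sklearn's bundled iris data stands in for the CSV used in this notebook
X_demo, y_demo = load_iris(return_X_y=True)

# Same depth limit as the model above; random_state only fixes tie-breaking
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# One line per split or leaf, using the column names of this notebook
rules = export_text(demo_tree, feature_names=['sepal-l', 'sepal-w', 'petal-l', 'petal-w'])
print(rules)
```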

Note that we used sklearn's `fit` paradigm here, not the `fit_transform` one. This is discussed in the virtual lab.
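
As a rough illustration of the difference: transformers (objects that modify the data, such as a scaler) offer `fit_transform`, while classifiers learn from the data with `fit` and are then queried with `predict`. The sketch below assumes sklearn's bundled iris data as a stand-in for the CSV, and `StandardScaler` is chosen only as an example of a transformer; it is not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled iris data stands in for the CSV used in this notebook
X_demo, y_demo = load_iris(return_X_y=True)

# Transformer: fit() learns per-column mean and std, transform() returns rescaled data;
# fit_transform() does both in one call
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# Classifier: fit() learns the model; it exposes predict(), not fit_transform()
clf_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
clf_demo.fit(X_demo, y_demo)

print(X_scaled.shape)
print(hasattr(scaler, "fit_transform"), hasattr(clf_demo, "fit_transform"))
```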

# <font color='red'>Exercise 1: Run model prediction (difficulty: easy)

Explore the sklearn documentation to find a way to:
   * extract the class probabilities for the following observation: 6.3, 2.5, 4.9, 1.5
   * extract the predicted class 

## <font color='green'> Solution 1

In [0]:
## INSERT YOUR CODE HERE

# <font color='red'>Exercise 2: Repeat with fewer features (difficulty: moderate)

1.   Modify this notebook to run Exercise 1 on an input dataset that only has **sepal-l** and **sepal-w** (i.e. the first 2 feature columns), instead of all 4 available features. Then redo the modelling and prediction. Do you see any change? Why?
2.   Same as above, but using only the last 2 features (**petal-l** and **petal-w**). Do you see any change? Why?




## <font color='green'>Solution 2

In [0]:
## INSERT YOUR CODE HERE

# <font color='red'>Exercise 3: Import data differently (difficulty: easy)

The iris dataset is so famous, and sklearn so widespread, that you can import the iris dataset directly from sklearn. Just try the following and compare with what we did before.
```
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
Y = iris.target
```


## <font color='green'>Solution 3

In [0]:
## INSERT YOUR CODE HERE

# <font color='red'>Exercise 4: Perform train/test splitting (difficulty: moderate)

## <font color='green'>Solution 4

In [0]:
## INSERT YOUR CODE HERE

# <font color='red'>Exercise 5: Try a different model

Is a Decision Tree the best choice? Try a different model, e.g. a `KNeighborsClassifier`: check the sklearn documentation to find out how to use it!

## <font color='green'>Solution 5

In [0]:
## INSERT YOUR CODE HERE

# <font color='red'>Exercise 6: Hyperparameter tuning (difficulty: moderate to high)



## <font color='green'>Solution 6

In [0]:
## INSERT YOUR CODE HERE

In [0]:
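from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# fixed random_state makes the tree construction reproducible
clf = DecisionTreeClassifier(random_state=123)
# hyperparameter values to try (every combination is evaluated)
grid_values = {'max_depth': [2, 5, 10, 20, 30], 'min_samples_split': [2, 3, 4, 5], 'min_samples_leaf': [2, 3, 4, 5]}
# cross-validated search over the grid, scored on accuracy
grid = GridSearchCV(clf, param_grid=grid_values, scoring='accuracy')
grid_result = grid.fit(X, Y)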

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
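
Once the search has finished, the tuned model can be reused directly for prediction. A minimal sketch, assuming sklearn's bundled iris data as a stand-in for the `X`, `Y` built earlier and a deliberately smaller grid to keep the run short; the names `search` and `best_tree` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled iris data stands in for the X, Y built earlier
X_demo, y_demo = load_iris(return_X_y=True)

# Smaller, illustrative grid (not the one used above)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=123),
    param_grid={'max_depth': [2, 5], 'min_samples_leaf': [2, 5]},
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is retrained on all the data
# using the best hyperparameter combination found
best_tree = search.best_estimator_
print(search.best_params_)

# Predict the class of the first sample row shown at the top of this notebook
pred = best_tree.predict([[5.1, 3.5, 1.4, 0.2]])
print(pred)
```

With `load_iris` the classes are integer codes, so the prediction is an index into `load_iris().target_names` rather than a string label.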