# Ref Book: 

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits: A practical guide to implementing supervised and unsupervised machine learning algorithms in Python

Tarek Amr - 2020

## Chapter 2 - Iris Dataset (Scikit-learn)
 - Classification problem
 - Three species are covered: Setosa, Versicolor, and Virginica. 
 - FEATURES: the lengths and widths of the sepal and petal of each plant
 - TARGET: the species (Setosa, Versicolor, or Virginica)
 
Our task is to be able to identify the species of a plant given its sepal and petal dimensions.

In [1]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load the Iris dataset into the iris variable
iris = load_iris()

In [2]:
# Using dir(), we can see which methods and attributes the dataset provides:
dir(iris)

['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']
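
As a minimal sketch of what these attributes hold, the snippet below inspects the shapes of `data` and `target` and lists the feature names. It reloads the dataset so that it can run on its own.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# data is a 2-D NumPy array: one row per flower, one column per feature
print(iris.data.shape)

# target is a 1-D array of integer class IDs, one per row of data
print(iris.target.shape)

# feature_names lists the column names in the same order as the columns of data
print(iris.feature_names)
```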

In [4]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

## This description holds some useful information:
- The data consists of 150 rows (or 150 samples), which makes it a small dataset.
- We have to think about how to deal with this fact when evaluating our model.
- Some classification algorithms can only deal with two class labels; we call them binary classifiers.
- The decision tree algorithm can deal with more than two classes, so we have no problem this time.
- The data is balanced; there are 50 samples for each class.
- There are four numeric features: sepal length, sepal width, petal length, and petal width.
- There are no missing attribute values.
- The petal dimensions correlate with the class values more strongly than the sepal dimensions do. Understanding the data is useful, but this correlation is calculated over the entire dataset; ideally, we would calculate it on our training data only. Let's ignore this information for now and use it only as a sanity check later on.
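
The points above can be checked directly against the arrays. The following is only a sketch: it counts the samples per class, looks for missing values, and computes the Pearson correlation between each feature and the numeric class ID. Like the figures in the description, these correlations are computed over the entire dataset, so the same caveat applies, and the numbers may not match the description exactly.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count the samples per class ID (0, 1, 2)
class_counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, class_counts)))

# Count missing (NaN) feature values
n_missing = int(np.isnan(iris.data).sum())
print('Missing values:', n_missing)

# Pearson correlation between each feature column and the numeric class ID
correlations = {
    name: float(np.corrcoef(iris.data[:, i], iris.target)[0, 1])
    for i, name in enumerate(iris.feature_names)
}
for name, corr in correlations.items():
    print(f'{name}: {corr:.4f}')
```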

In [7]:
# Class names:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

## Opening the Data into a DataFrame:


In [8]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = pd.Series(iris.target)
df

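| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |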


## The `target` column holds the class IDs
For more clarity, we can also create a new column called `target_names`, where we map the numerical target values to the class names:


In [13]:
df['target_names'] = df['target'].apply(lambda y: iris.target_names[y])

# A random sample of 6 rows from the DataFrame; a fixed random_state seed
# produces the same sample every time
df.sample(n=6, random_state=42)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_names |
|---|---|---|---|---|---|---|
| 73 | 6.1 | 2.8 | 4.7 | 1.2 | 1 | versicolor |
| 18 | 5.7 | 3.8 | 1.7 | 0.3 | 0 | setosa |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | 2 | virginica |
| 78 | 6.0 | 2.9 | 4.5 | 1.5 | 1 | versicolor |
| 76 | 6.8 | 2.8 | 4.8 | 1.4 | 1 | versicolor |
| 31 | 5.4 | 3.4 | 1.5 | 0.4 | 0 | setosa |


## Splitting the data
Let's split the DataFrame into two parts:
- 70% of the records go into the training set
- 30% go into the testing set
- The choice of 70/30 is arbitrary

We will use scikit-learn's `train_test_split()` function.

In [15]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.3)

In [18]:
# checking how many rows in the test and train DataFrames
print('Nº Rows in TEST:  ', df_test.shape[0])
print('Nº Rows in TRAIN:', df_train.shape[0])

Nº Rows in TEST:   45
Nº Rows in TRAIN: 105
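
The split above is random, so the rows that land in each set change from run to run. As a hedged sketch of one way to make it repeatable, the snippet below passes `random_state` and `stratify` to `train_test_split()`. Both arguments, and the seed value 42, are assumptions added here for illustration; the original notebook uses neither.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# random_state fixes the shuffle (42 is an arbitrary seed chosen for this sketch);
# stratify keeps the class proportions the same in both parts
df_train, df_test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df['target']
)

print(df_train.shape[0], df_test.shape[0])
print(df_test['target'].value_counts().sort_index().to_dict())
```

With stratification, each class contributes the same share of rows to the test set, which matters more on a small dataset like this one.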


## Creating our `x` and `y` sets
The `feature_names` attribute of `iris` contains a list of the column names that correspond to our features. We use it to create our `x` and `y` sets, as follows:


In [19]:
x_train = df_train[iris.feature_names]
x_test = df_test[iris.feature_names]

y_train = df_train['target']
y_test = df_test['target']

In [21]:
x_train.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 32 | 5.2 | 4.1 | 1.5 | 0.1 |
| 94 | 5.6 | 2.7 | 4.2 | 1.3 |
| 112 | 6.8 | 3.0 | 5.5 | 2.1 |
| 56 | 6.3 | 3.3 | 4.7 | 1.6 |
| 11 | 4.8 | 3.4 | 1.6 | 0.2 |


In [28]:
y_train

32     0
94     1
112    2
56     1
11     0
      ..
73     1
119    2
92     1
143    2
130    2
Name: target, Length: 105, dtype: int64

## Training a DecisionTreeClassifier model and using it for prediction

To get a feel for how everything works, we will train our algorithm using its default
configuration for now. 


In [29]:
from sklearn.tree import DecisionTreeClassifier

# It is common to call the classifier instance clf
clf = DecisionTreeClassifier()

In [30]:
# training the model using fit()
clf.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [31]:
# After calling fit(), the clf instance is trained and ready to be used for
# predictions, so we call the predict() method on x_test

y_test_predicted = clf.predict(x_test)
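
`predict()` returns integer class IDs. As a small sketch, the snippet below turns those IDs back into species names by indexing `iris.target_names`. It rebuilds the split and the model so that it is self-contained; the `random_state=42` values and the `predicted_names` variable are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42  # arbitrary seed
)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(x_train, y_train)
y_test_predicted = clf.predict(x_test)

# Index the array of class names with the predicted IDs to get readable labels
predicted_names = iris.target_names[y_test_predicted]
print(y_test_predicted[:5])
print(predicted_names[:5])
```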

## Evaluating our predictions
Now that we have `y_test_predicted`, all we need to do is compare it to `y_test` to check how
good our predictions are.

- Metrics for evaluating a classifier: precision, recall, and accuracy.

The Iris dataset is a balanced dataset; it has the same number of instances for each class. Therefore, it is apt to use the accuracy metric here.

In [32]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_test_predicted)
accuracy

0.9555555555555556
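
Accuracy is a single number, so it does not show which classes get confused with each other. As a hedged sketch, the snippet below also prints the confusion matrix and the per-class precision and recall mentioned above. It rebuilds the split and the model with an arbitrary `random_state=42` (an assumption, not part of the original notebook), so its numbers will not necessarily match the accuracy shown above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42  # arbitrary seed
)
clf = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)
y_test_predicted = clf.predict(x_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_test_predicted, labels=[0, 1, 2])
print(cm)

# Per-class precision, recall, and F1 score
print(classification_report(
    y_test, y_test_predicted, labels=[0, 1, 2], target_names=iris.target_names
))

# Accuracy is the share of samples on the diagonal of the confusion matrix
accuracy = accuracy_score(y_test, y_test_predicted)
print(accuracy, cm.trace() / cm.sum())
```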

In [39]:
from sklearn.metrics import r2_score

# Residual sum of squares (RSS).
# Note: RSS and R² are regression metrics; here they treat the class IDs as plain numbers.
def calc_rss(y, predicted):
    return float(((predicted - y) ** 2).sum())

rss = calc_rss(y_test, y_test_predicted)
print(rss)

# R² score (goodness of fit)
r2 = r2_score(y_test, y_test_predicted)
print(r2)

2.0
0.928343949044586
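
Because the dataset has only 150 rows, a score measured on a single 45-row test set can vary noticeably from one split to another. One common way to deal with this, which the original notebook does not use, is cross-validation. The sketch below runs 5-fold cross-validation with `cross_val_score()`; the choice of 5 folds and the seed are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)  # arbitrary seed for repeatability

# With a classifier and an integer cv, scikit-learn uses stratified folds,
# so each fold keeps roughly the same class proportions
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='accuracy')

print(scores)
print('Mean accuracy:', scores.mean())
print('Standard deviation:', scores.std())
```

The spread of the five scores gives a rough idea of how much a single-split accuracy could move around.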


## Which features were more important?
We may now ask ourselves: which features did the model find most useful in deciding the iris species? Luckily, `DecisionTreeClassifier` has an attribute called `feature_importances_`, which is computed when the classifier is fitted and scores how important each feature is to the model's decisions. In the following code snippet, we create a DataFrame that puts the features' names and their importances together, and then sort the features by importance:

In [34]:
df_importances = pd.DataFrame(
    {
    'feature_names': iris.feature_names,  
     'feature_importances': clf.feature_importances_
    }
).sort_values('feature_importances', ascending=False).set_index('feature_names')

df_importances

| feature_names | feature_importances |
|---|---|
| petal width (cm) | 0.55918 |
| petal length (cm) | 0.411264 |
| sepal length (cm) | 0.015255 |
| sepal width (cm) | 0.014301 |
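
The importances are normalized, so they can be read as shares of a total of 1. The sketch below checks this and picks out the top feature. It refits a tree with an arbitrary `random_state=42` (an assumption), so the individual values may differ from the table above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42  # arbitrary seed
)
clf = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)

importances = clf.feature_importances_

# The scores are non-negative and add up to 1 for a tree with at least one split
print('Sum of importances:', importances.sum())

# Name of the feature with the highest importance
top_feature = iris.feature_names[int(np.argmax(importances))]
print('Most important feature:', top_feature)
```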


## Displaying the internal tree decisions
We can also print the internal structure of the learned tree using the following code
snippet:

In [37]:
from sklearn.tree import export_text

print(export_text(clf, feature_names=iris.feature_names, spacing=3, decimals=1))

|--- petal width (cm) <= 0.8
|   |--- class: 0
|--- petal width (cm) >  0.8
|   |--- petal length (cm) <= 4.8
|   |   |--- class: 1
|   |--- petal length (cm) >  4.8
|   |   |--- petal width (cm) <= 1.8
|   |   |   |--- petal length (cm) <= 5.3
|   |   |   |   |--- sepal length (cm) <= 6.5
|   |   |   |   |   |--- petal width (cm) <= 1.6
|   |   |   |   |   |   |--- class: 2
|   |   |   |   |   |--- petal width (cm) >  1.6
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |--- sepal length (cm) >  6.5
|   |   |   |   |   |--- class: 1
|   |   |   |--- petal length (cm) >  5.3
|   |   |   |   |--- class: 2
|   |   |--- petal width (cm) >  1.8
|   |   |   |--- petal length (cm) <= 4.9
|   |   |   |   |--- sepal width (cm) <= 3.1
|   |   |   |   |   |--- class: 2
|   |   |   |   |--- sepal width (cm) >  3.1
|   |   |   |   |   |--- class: 1
|   |   |   |--- petal length (cm) >  4.9
|   |   |   |   |--- class: 2



If you print the complete dataset description, you will notice that toward the end, it says
the following:

> One class is linearly separable from the other two; the latter are NOT linearly separable
> from each other.

This means that one class is easier to separate from the other two, while the other two are
harder to separate from each other. Now, look at the internal tree's structure. You may
notice that in the first step, it decided that anything with a petal width below or equal to
0.8 belongs to class 0 (Setosa). Then, for petal widths above 0.8, the tree kept on
branching, trying to differentiate between classes 1 and 2 (Versicolor and Virginica).
Generally, the harder it is to separate classes, the deeper the branching goes.
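
To put a number on how deep the branching goes, the sketch below reads the depth and leaf count of a fitted tree, then fits a second tree limited to two levels for comparison. The `max_depth=2` limit, the `shallow` name, and the `random_state=42` seeds are assumptions made for illustration; the original notebook uses the default configuration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42  # arbitrary seed
)

# Default configuration: the tree grows until its leaves are pure
clf = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)
print('Depth:', clf.get_depth(), 'Leaves:', clf.get_n_leaves())

# A tree restricted to at most two levels of splits
shallow = DecisionTreeClassifier(max_depth=2, random_state=42).fit(x_train, y_train)
print('Depth:', shallow.get_depth(), 'Leaves:', shallow.get_n_leaves())
print(export_text(shallow, feature_names=iris.feature_names))

# Compare test accuracy of the two trees
print('Full tree accuracy:   ', clf.score(x_test, y_test))
print('Shallow tree accuracy:', shallow.score(x_test, y_test))
```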