# Supervised Machine Learning - scikit-learn

The example uses the Iris dataset. (The Iris dataset section is adapted from an example from Analytics Vidhya.)

[https://en.wikipedia.org/wiki/Iris_flower_data_set](https://en.wikipedia.org/wiki/Iris_flower_data_set "Iris flower data set")

In [30]:
import numpy as np
import matplotlib as mp
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

In [31]:
# Load the sample data set from the datasets module
dataset = datasets.load_iris()

In [32]:
# Display the contents of the sample dataset
dataset

 'data': array([[ 5.1,  3.5,  1.4,  0.2],
        [ 4.9,  3. ,  1.4,  0.2],
        [ 4.7,  3.2,  1.3,  0.2],
        [ 4.6,  3.1,  1.5,  0.2],
        [ 5. ,  3.6,  1.4,  0.2],
        [ 5.4,  3.9,  1.7,  0.4],
        [ 4.6,  3.4,  1.4,  0.3],
        [ 5. ,  3.4,  1.5,  0.2],
        [ 4.4,  2.9,  1.4,  0.2],
        [ 4.9,  3.1,  1.5,  0.1],
        [ 5.4,  3.7,  1.5,  0.2],
        [ 4.8,  3.4,  1.6,  0.2],
        [ 4.8,  3. ,  1.4,  0.1],
        [ 4.3,  3. ,  1.1,  0.1],
        [ 5.8,  4. ,  1.2,  0.2],
        [ 5.7,  4.4,  1.5,  0.4],
        [ 5.4,  3.9,  1.3,  0.4],
        [ 5.1,  3.5,  1.4,  0.3],
        [ 5.7,  3.8,  1.7,  0.3],
        [ 5.1,  3.8,  1.5,  0.3],
        [ 5.4,  3.4,  1.7,  0.2],
        [ 5.1,  3.7,  1.5,  0.4],
        [ 4.6,  3.6,  1. ,  0.2],
        [ 5.1,  3.3,  1.7,  0.5],
        [ 4.8,  3.4,  1.9,  0.2],
        [ 5. ,  3. ,  1.6,  0.2],
        [ 5. ,  3.4,  1.6,  0.4],
        [ 5.2,  3.5,  1.5,  0.2],
        [ 5.2,  3.4,  1.4,  0.2],
      

In [33]:
# Species of Iris in the dataset
dataset['target_names']

array(['setosa', 'versicolor', 'virginica'], 
      dtype='|S10')

# Iris Setosa
![Kosaciec_szczecinkowaty_Iris_setosa.jpg](../figures/Kosaciec_szczecinkowaty_Iris_setosa.jpg)
# Iris Versicolor
![220px-Iris_versicolor_3.jpg](../figures/220px-Iris_versicolor_3.jpg)
# Iris Virginica
![220px-Iris_virginica.jpg](../figures/220px-Iris_virginica.jpg)

In [34]:
# Names of the type of information recorded about an Iris - called features
dataset['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [35]:
# First 10 sets of Iris data
dataset['data'][:10]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1]])

In [36]:
# The classification of each of the first 10 sets of Iris data - the target
dataset['target'][:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Here 0 equates to setosa, the first entry in the 'target_names' array.
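The mapping from integer codes to species names can be made explicit with a short sketch. It assumes `target_names` is a NumPy array (as returned by `load_iris`), so it can be indexed directly with the target codes; the variable name `first_ten_names` is just for illustration.

```python
from sklearn import datasets

dataset = datasets.load_iris()

# target_names is a NumPy array, so indexing it with the integer codes
# translates each code into its species name.
first_ten_names = dataset.target_names[dataset.target[:10]]
print(first_ten_names)

# Count how many samples carry each code.
for code, name in enumerate(dataset.target_names):
    count = int((dataset.target == code).sum())
    print(code, name, count)
```

The first ten rows all belong to setosa because the samples in this dataset are stored grouped by species.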

In [37]:
# Now we create our model
model = LogisticRegression()
# We train it by passing in the feature data and the known classifications (the target)
model.fit(dataset.data, dataset.target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [38]:
# We use the model to create predictions for the same data it was trained on
expected = dataset.target
predicted = model.predict(dataset.data)
# Using the metrics module we see the results of the model
metrics.accuracy_score(expected, predicted, normalize=True, sample_weight=None)

0.95999999999999996
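The score above is measured on the same samples the model was trained on, so it says little about how the model handles unseen flowers. A common alternative, sketched here under stated assumptions, is to hold back part of the data for evaluation. The 30% test size, `random_state=0` and `max_iter=1000` are arbitrary illustrative choices, not values from the original example.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()

# Hold back 30% of the samples for evaluation. stratify keeps the three
# species equally represented in both parts. (Illustrative settings.)
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    test_size=0.3, random_state=0, stratify=dataset.target)

# max_iter is raised so the solver has room to converge on unscaled features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on samples the model has never seen during training.
test_accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
print(test_accuracy)
```

The exact figure depends on the split and on the scikit-learn version, so no expected output is shown.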

## Digging deeper using metrics
### Accuracy score, Classification report & Confusion matrix

Here we will use a simple example to show metrics you can use: accuracy, classification reports and confusion matrices.

- y_true is the list of true (actual) labels
- y_pred is the list of predicted labels

In [39]:
y_true = ["cat", "ant", "cat", "cat", "ant", "bird", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat", "bird"]

In [40]:
metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

0.7142857142857143

There are 5 correct predictions out of 7 values, giving 71% accuracy.
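The same figure can be checked by hand. This is a minimal sketch in plain Python that counts matching positions in the two lists; it is not how scikit-learn computes the score internally.

```python
y_true = ["cat", "ant", "cat", "cat", "ant", "bird", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat", "bird"]

# A prediction is correct when it matches the true label at the same position.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(correct, "of", len(y_true), "correct")
print(round(accuracy, 4))
```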

In [41]:
print(metrics.classification_report(y_true, y_pred,
    target_names=["ant", "bird", "cat"]))

             precision    recall  f1-score   support

        ant       0.67      1.00      0.80         2
       bird       1.00      0.50      0.67         2
        cat       0.67      0.67      0.67         3

avg / total       0.76      0.71      0.70         7



Here we can see that, for ant:
- precision is 0.67: ant was predicted 3 times, but only 2 of those predictions were correct.
- recall is 1.00: both of the actual ants were predicted as ant, so none were missed.
- f1-score is the harmonic mean of precision and recall: 2 × 0.67 × 1.00 / (0.67 + 1.00) = 0.80.
- support is 2: there were 2 ants in the true data.

[http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html)
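To make the definitions concrete, the per-class figures in the report can be recomputed by hand. The helper `precision_recall_f1` below is a hypothetical name used only for this sketch; it follows the standard definitions rather than reproducing scikit-learn's implementation.

```python
y_true = ["cat", "ant", "cat", "cat", "ant", "bird", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat", "bird"]


def precision_recall_f1(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one label."""
    # True positives: predicted as `label` and actually `label`.
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # times the label was predicted
    support = sum(t == label for t in y_true)    # times the label really occurs
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, support


for label in ["ant", "bird", "cat"]:
    p, r, f1, support = precision_recall_f1(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f1, 2), support)
```

Rounded to two decimal places, each line should agree with the corresponding row of the classification report.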

In [43]:
metrics.confusion_matrix(y_true, y_pred)

array([[2, 0, 0],
       [0, 1, 1],
       [1, 0, 2]])

In the confusion matrix each row is a true label and each column is a predicted label. Both follow the sorted order of the labels: ant, bird, cat.

- ant was correctly categorised twice and was never miscategorised
- bird was correctly categorised once and was categorised as cat once
- cat was correctly categorised twice and was categorised as an ant once
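The matrix can be rebuilt by hand to see where each count comes from. This is a plain-Python sketch that assumes the labels are taken in sorted order, with rows indexed by the true label and columns by the predicted label.

```python
y_true = ["cat", "ant", "cat", "cat", "ant", "bird", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat", "bird"]

# Sorted labels fix the order of both rows and columns.
labels = sorted(set(y_true) | set(y_pred))
index = {label: i for i, label in enumerate(labels)}

# matrix[i][j] counts samples whose true label is labels[i]
# and whose predicted label is labels[j].
matrix = [[0] * len(labels) for _ in labels]
for t, p in zip(y_true, y_pred):
    matrix[index[t]][index[p]] += 1

print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions; every off-diagonal entry is a misclassification.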

## Back to Iris predictions

In [42]:
print(metrics.classification_report(expected, predicted, target_names=dataset['target_names']))

             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        50
 versicolor       0.98      0.90      0.94        50
  virginica       0.91      0.98      0.94        50

avg / total       0.96      0.96      0.96       150



In [44]:
print(metrics.confusion_matrix(expected, predicted))

[[50  0  0]
 [ 0 45  5]
 [ 0  1 49]]


Each row is a true species and each column is a predicted species, in the order setosa, versicolor, virginica.

- setosa was correctly categorised 50 times and was never miscategorised
- versicolor was correctly categorised 45 times and was categorised as virginica 5 times
- virginica was correctly categorised 49 times and was categorised as versicolor once
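The Iris classification report and the overall accuracy can both be derived from this confusion matrix. The sketch below hard-codes the matrix printed above and uses plain Python; the variable names are illustrative only.

```python
# Confusion matrix copied from the output above:
# rows are true species, columns are predicted species.
cm = [[50, 0, 0],
      [0, 45, 5],
      [0, 1, 49]]
names = ["setosa", "versicolor", "virginica"]

row_totals = [sum(row) for row in cm]        # true samples per species
col_totals = [sum(col) for col in zip(*cm)]  # predictions per species

# Recall divides the diagonal by the row total, precision by the column total.
recalls = [cm[i][i] / row_totals[i] for i in range(len(names))]
precisions = [cm[i][i] / col_totals[i] for i in range(len(names))]

for name, p, r in zip(names, precisions, recalls):
    print(name, round(p, 2), round(r, 2))

# Accuracy is the sum of the diagonal divided by the total number of samples.
accuracy = sum(cm[i][i] for i in range(len(names))) / sum(row_totals)
print(round(accuracy, 2))
```

The 144 correct predictions out of 150 give the 0.96 accuracy reported earlier.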