*Contents*
===
- [Scikit-learn: machine learning with Python](#Scikit-learn:-machine-learning-with-Python)
    - [The *iris* dataset](#The-iris-dataset)
        - [*Exercise 1*](#Exercise-1)
    - [Feature scaling](#Feature-scaling)
        - [*Exercise 2*](#Exercise-2)
    - [Train-test split](#Train-test-split) 
    - [A model for classification](#A-model-for-classification)
        - [Training](#Training)
        - [Prediction](#Prediction)
        - [Evaluation](#Evaluation)
    - [From data to prediction](#From-data-to-prediction)
        - [*Exercise 3*](#Exercise-3)        
    - [Validation](#Validation)
    - [Decision trees](#Decision-trees)
        - [*Exercise 4*](#Exercise-4)

Scikit-learn: machine learning with Python
===

*Scikit-learn* (or *sklearn*) is a Python library and one of the most widely used machine learning packages.

Its [website](http://scikit-learn.org/stable/) is a good place to take your first steps in machine learning: there you can find models, utilities, usage examples and theoretical references.

The *iris* dataset
---

Scikit-learn ships with a few *toy datasets* to play with.

In [18]:
from sklearn import datasets

iris = datasets.load_iris()

The *load_iris* function loads the whole dataset. The returned object has a dictionary-like structure (see the corresponding lesson), that is, a set of key-value pairs.

Features and labels are stored under the *data* and *target* keys, respectively.

In [19]:
X = iris['data']
y = iris['target']

print(X.shape, y.shape)

(150, 4) (150,)
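
To make the dictionary-like structure mentioned above more concrete, here is a minimal sketch that lists the available keys and prints the names of the features and classes. The variable names (*keys*, *feature_names*, *target_names*) are only illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object can be queried like a dictionary
keys = sorted(iris.keys())
print(keys)

# Human-readable names for the 4 feature columns and the 3 classes
feature_names = list(iris['feature_names'])
target_names = list(iris['target_names'])
print(feature_names)
print(target_names)
```

The integer labels in *target* index into *target_names*, so label 0 corresponds to the first class name.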


The dataset object contains other useful information. For example, a description of the dataset is stored under the *DESCR* key.

In [20]:
print(iris['DESCR'])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

Let's have a look at the samples and their corresponding labels.

In [21]:
for i, x in enumerate(X[:10]):
    print('feature: {} label: {}'.format(x, y[i]))

feature: [5.1 3.5 1.4 0.2] label: 0
feature: [4.9 3.  1.4 0.2] label: 0
feature: [4.7 3.2 1.3 0.2] label: 0
feature: [4.6 3.1 1.5 0.2] label: 0
feature: [5.  3.6 1.4 0.2] label: 0
feature: [5.4 3.9 1.7 0.4] label: 0
feature: [4.6 3.4 1.4 0.3] label: 0
feature: [5.  3.4 1.5 0.2] label: 0
feature: [4.4 2.9 1.4 0.2] label: 0
feature: [4.9 3.1 1.5 0.1] label: 0


### *Exercise 1*
Try to explore some other toy dataset.

Feature scaling
---
From the description, we can see that the features have different scales.

### *Exercise 2*
For each feature of the iris dataset, extract the minimum and maximum observed value.

In [22]:
import numpy as np

#FILL ME

Scikit-learn has many preprocessing tools, including scaling utilities. We can use *MinMaxScaler* to scale all of the features between 0 and 1.

In [23]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.feature_range

(0, 1)

In [24]:
X = scaler.fit_transform(X)

print(np.min(X, axis=0))
print(np.max(X, axis=0))

[0. 0. 0. 0.]
[1. 1. 1. 1.]
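
To see what this scaling does, here is a small sketch that applies the min-max formula, (x - min) / (max - min), column by column and compares the result with *MinMaxScaler*. The array *A* holds made-up values chosen for illustration; it is not the iris data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny illustrative array (made-up values, not the iris data)
A = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [4.0, 20.0]])

# Min-max scaling by hand: each column is mapped onto [0, 1]
col_min = A.min(axis=0)
col_max = A.max(axis=0)
manual = (A - col_min) / (col_max - col_min)

# The same transformation with the default feature_range=(0, 1)
scaled = MinMaxScaler().fit_transform(A)

print(np.allclose(manual, scaled))
```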


Train-test split
---
Before training a model, we have to split our data into a training set and a test set.

How are the samples arranged within the dataset?

In [25]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

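Since the samples are sorted by class, we have to shuffle them before splitting. The shuffle must be coherent: features and labels have to be permuted in the same way, so that each sample keeps its own label.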

In [26]:
from sklearn.utils import shuffle

X, y = shuffle(X, y, random_state=0)  # a fixed "seed" makes the shuffle reproducible

for i, x in enumerate(X[:10]):
    print('feature: {} label: {}'.format(x, y[i]))

feature: [0.41666667 0.33333333 0.69491525 0.95833333] label: 2
feature: [0.47222222 0.08333333 0.50847458 0.375     ] label: 1
feature: [0.33333333 0.91666667 0.06779661 0.04166667] label: 0
feature: [0.83333333 0.375      0.89830508 0.70833333] label: 2
feature: [0.19444444 0.58333333 0.08474576 0.04166667] label: 0
feature: [0.55555556 0.54166667 0.84745763 1.        ] label: 2
feature: [0.19444444 0.625      0.05084746 0.08333333] label: 0
feature: [0.66666667 0.45833333 0.62711864 0.58333333] label: 1
feature: [0.69444444 0.33333333 0.6440678  0.54166667] label: 1
feature: [0.5        0.33333333 0.50847458 0.5       ] label: 1
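
A coherent shuffle amounts to applying one and the same random permutation to both arrays. The sketch below illustrates the idea with plain NumPy on made-up arrays (*X_demo*, *y_demo*); it is not necessarily how *shuffle* is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: row i is [2*i, 2*i + 1] and carries label i
X_demo = np.arange(10).reshape(5, 2)
y_demo = np.array([0, 1, 2, 3, 4])

# One permutation of the row indices, applied to both arrays
perm = rng.permutation(len(y_demo))
X_shuf, y_shuf = X_demo[perm], y_demo[perm]

for x, label in zip(X_shuf, y_shuf):
    print('feature: {} label: {}'.format(x, label))
```

Each printed row still sits next to its original label, which is exactly what would be lost by shuffling the two arrays independently.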


Let's use the first two-thirds of our samples (100) for training and the rest (50) for testing.

In [27]:
X_train = X[:100]#the first 100 samples (=rows)
X_test = X[100:]#the rest

print(X_train.shape, X_test.shape)

(100, 4) (50, 4)


Same for the labels.

In [28]:
y_train = y[:100]
y_test = y[100:]

print(y_train.shape, y_test.shape)

(100,) (50,)
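
As an alternative to shuffling and slicing by hand, Scikit-learn offers *train_test_split* in *sklearn.model_selection*. The sketch below reproduces a 100/50 split in one call; the *stratify=y* option, which keeps the class proportions similar in both sets, is an addition not used in the manual split above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# test_size=50 asks for exactly 50 test samples; the split is shuffled by default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # number of test samples per class
```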


A model for classification
---
We are ready for training. Let's create a $K$*-nearest neighbors* (KNN) model for classifying the flowers of the iris dataset.

In [30]:
from sklearn.neighbors import KNeighborsClassifier as KNN

model = KNN()

model

KNeighborsClassifier()

A [Scikit-learn model](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) has several parameters. Among them is the number of neighbors inspected to make a prediction, namely $K$.

$K$ has a default value, which we can override at creation time.

In [31]:
model.n_neighbors

5

In [32]:
model = KNN(n_neighbors=3)

model.n_neighbors

3

### Training

We can train our model on the training data using the *fit* method.

In [33]:
model.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)

### Prediction

Once trained, we can use our model to predict the class of the test samples.

In [34]:
predictions = model.predict(X_test)

predictions.shape

(50,)

In [35]:
predictions

array([0, 0, 1, 2, 2, 0, 0, 0, 1, 1, 0, 0, 1, 0, 2, 1, 2, 1, 0, 2, 0, 2,
       0, 0, 2, 0, 2, 1, 1, 1, 2, 2, 2, 1, 0, 1, 2, 2, 0, 1, 1, 2, 1, 0,
       0, 0, 2, 1, 2, 0])
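
To get an intuition for what *predict* computes, here is a minimal sketch of the KNN rule: find the $K$ training samples closest to the query and take a majority vote among their labels. It assumes Euclidean distance (the default metric of *KNeighborsClassifier*) and ignores ties. The helper *knn_predict_one* and the toy arrays are hypothetical, written only for this illustration.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_predict_one(X_train, y_train, x, k=3):
    """Hypothetical helper: majority vote among the k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances to x
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up data: two well-separated groups of points
X_small = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
                    [1.0, 1.0], [0.9, 1.0], [1.0, 0.8]])
y_small = np.array([0, 0, 0, 1, 1, 1])
query = np.array([0.2, 0.1])

manual = knn_predict_one(X_small, y_small, query, k=3)
reference = KNeighborsClassifier(n_neighbors=3).fit(X_small, y_small).predict([query])[0]
print(manual, reference)
```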

### Evaluation

We can now assess the performance of the model, that is, check its predictions against the *true* labels.

In [36]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, predictions)
print('Model accuracy on test set:', acc)

Model accuracy on test set: 0.96
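
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels and also builds a confusion matrix, which shows which classes get confused with which (rows are true classes, columns are predicted classes).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: one mistake out of six
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

manual_acc = np.mean(y_true == y_pred)       # fraction of correct predictions
library_acc = accuracy_score(y_true, y_pred)
print(manual_acc, library_acc)

cm = confusion_matrix(y_true, y_pred)
print(cm)
```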


From data to prediction
---

We have all we need to load a dataset, build a model, train and test it.

In [38]:
#import libraries
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import accuracy_score

#load data
iris = datasets.load_iris()
X, y = iris['data'], iris['target']

#scale features between 0 and 1
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

#split data into train and test
X, y = shuffle(X, y, random_state=0)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

#train the model
model = KNN(n_neighbors=3).fit(X_train, y_train)

#get predictions and assess model performance
predictions = model.predict(X_test)
print('Model accuracy on test set:', accuracy_score(y_test, predictions))

Model accuracy on test set: 0.96
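
The same steps can also be chained with a Scikit-learn *Pipeline*. The sketch below is one possible variant, not the workflow used in this lesson: it splits the data first (with *train_test_split*) and lets the pipeline fit the scaler on the training samples only, so no information from the test set enters the scaling.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=0)

# fit() scales the training data and trains KNN on it;
# score() applies the same scaling to the test data before predicting
pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print('Pipeline accuracy on test set:', test_accuracy)
```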


### *Exercise 3*
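Try these steps with the *boston* dataset. Use KNN for regression (*KNeighborsRegressor*) and *mean absolute error* as the evaluation metric. Note that *load_boston* was removed in Scikit-learn 1.2; on recent versions, pick another regression dataset, such as the one returned by *load_diabetes*.

In [39]:
from sklearn.neighbors import KNeighborsRegressor as KNN
from sklearn.metrics import mean_absolute_error

boston = datasets.load_boston()  # removed in scikit-learn 1.2; e.g. use datasets.load_diabetes() there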

#FILL ME
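
As a hint on the metric only, the mean absolute error is the average of the absolute differences between true and predicted values. The sketch below checks this on made-up numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

manual_mae = np.mean(np.abs(y_true - y_pred))  # (1 + 0 + 2) / 3
library_mae = mean_absolute_error(y_true, y_pred)
print(manual_mae, library_mae)
```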

Validation
---
Let's go back to iris classification and tune the value of $K$. Since picking $K$ on the test set would make the test accuracy an optimistic estimate, we need a separate *validation set*.

In [40]:
from sklearn.neighbors import KNeighborsClassifier as KNN

#load data
iris = datasets.load_iris()
X, y = iris['data'], iris['target']

#scale features between 0 and 1
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

#split data into train, validation and test
X, y = shuffle(X, y, random_state=123)
X_tr, X_val, X_test = X[:75], X[75:100], X[100:]
y_tr, y_val, y_test = y[:75], y[75:100], y[100:]

print(X_tr.shape, X_val.shape, X_test.shape)

(75, 4) (25, 4) (50, 4)


We try a few values of $K$ and pick the one that obtains the best performance on the validation set.

In [41]:
for k in [1,2,3,5,10]:
    model = KNN(n_neighbors=k)
    model.fit(X_tr, y_tr)
    predictions = model.predict(X_val)
    validation_accuracy = accuracy_score(y_val, predictions)
    print('Validation accuracy with k {}: {:.2f}'.format(k, validation_accuracy))                  

Validation accuracy with k 1: 0.92
Validation accuracy with k 2: 0.96
Validation accuracy with k 3: 0.92
Validation accuracy with k 5: 0.92
Validation accuracy with k 10: 0.92
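
Rather than reading the best value off the printed list, the selection can be done in code. The sketch below rebuilds the same split and stores the validation accuracies in a dictionary (the names *scores* and *best_k* are illustrative). In case of a tie, *max* returns the first candidate reaching the highest score.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import shuffle

# Same preparation as above: scale, shuffle, split
X, y = datasets.load_iris(return_X_y=True)
X = MinMaxScaler().fit_transform(X)
X, y = shuffle(X, y, random_state=123)
X_tr, X_val = X[:75], X[75:100]
y_tr, y_val = y[:75], y[75:100]

# Validation accuracy for each candidate value of K
scores = {}
for k in [1, 2, 3, 5, 10]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_val, model.predict(X_val))

best_k = max(scores, key=scores.get)
print('Best k:', best_k, 'with validation accuracy', scores[best_k])
```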


Then we build a model with the best $K$ (here, $K=2$) and get the test predictions.

In [42]:
best_model = KNN(n_neighbors=2)
best_model.fit(X_tr, y_tr)
predictions = best_model.predict(X_test)

print('Test accuracy:', accuracy_score(y_test, predictions))

Test accuracy: 0.96
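
With only 25 validation samples, a single misclassified flower changes the validation accuracy by 0.04, so the ranking of the $K$ values is rather noisy. A common alternative, not covered in this lesson, is *cross-validation*, which averages the score over several splits. The sketch below uses *cross_val_score* with 5 folds on the whole dataset for brevity; in practice it would be run on the training portion only, keeping the test set aside.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = datasets.load_iris(return_X_y=True)

cv_means = {}
for k in [1, 2, 3, 5, 10]:
    # Scaling is part of the pipeline, so it is refit inside each fold
    pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    fold_scores = cross_val_score(pipeline, X, y, cv=5)
    cv_means[k] = fold_scores.mean()
    print('k = {}: mean CV accuracy {:.2f}'.format(k, cv_means[k]))
```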


Decision trees
---
Let's create a *decision tree* (DT) and use it for classification.

In [44]:
from sklearn.tree import DecisionTreeClassifier as DT

model = DT()

model

DecisionTreeClassifier()

In [45]:
predictions = model.fit(X_train, y_train).predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print('Accuracy on test set: {:.2f}'.format(accuracy))

Accuracy on test set: 1.00
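
A fitted decision tree can be inspected as a set of if-then rules. The sketch below uses *export_text* from *sklearn.tree* on a deliberately shallow tree; the choice of *max_depth=2* and of fitting on the whole (unscaled) dataset is only meant to keep the printout short and readable.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()

# A shallow tree, so that the printed rules stay short
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris['data'], iris['target'])

# Each line of the output is a threshold test on one feature or a predicted class
rules = export_text(tree, feature_names=list(iris['feature_names']))
print(rules)
```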


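A word of caution on the perfect score of the decision tree: *X_train* and *y_train* still come from the first shuffle (*random_state=0*), while *X_test* and *y_test* were redefined in the validation section (*random_state=123*). The two sets therefore very likely share samples, and the accuracy is probably optimistic; fitting on *X_tr* and *y_tr* gives a fairer estimate.

Note how all Scikit-learn models share the same basic methods (*fit*, *predict*).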

### *Exercise 4*

Repeat the validation procedure we used for KNN, this time tuning the *depth* of a decision tree (the *max_depth* parameter of DecisionTreeClassifier).
