<a href="https://colab.research.google.com/github/albertomanfreda/intensive_school_ml/blob/master/LessonScikitLearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scikit-learn

Scikit-learn is an open source machine learning library, which provides a variety of built-in machine learning algorithms and models, called **estimators**. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

The basic workflow with scikit-learn is that you define an **estimator**, then fit it to some data using its **fit** method (which corresponds to the **training** of the model), and finally use it to predict an unknown quantity for new data, which is done with the **predict** method.
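
As a rough sketch of this define/fit/predict pattern, the snippet below trains a decision tree classifier on four made-up one-dimensional points. The numbers and the variable names are invented purely for illustration and are not part of the lesson's data.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data (an assumption for illustration only):
# one feature per entry, two classes labelled 0 and 1
toy_features = [[0.0], [1.0], [2.0], [3.0]]
toy_targets = [0, 0, 1, 1]

# 1. Define the estimator
toy_model = DecisionTreeClassifier()
# 2. Fit it to the data (training)
toy_model.fit(toy_features, toy_targets)
# 3. Predict the class of new, unseen entries
toy_predictions = toy_model.predict([[0.5], [2.5]])
print(toy_predictions)
```

Every scikit-learn estimator follows this same three-step interface, so swapping one model for another usually means changing only the line that defines it.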

Unless you have previous experience with ML, you may not know what these models are - that is what the next days of the school are for!
However, I will show a very basic example, just so you can get a feeling for how it works.

We will use a **Decision Tree** applied to a classic classification problem: trying to predict the species of an Iris based on some characteristics of its petals and sepals. This will also give us the opportunity to see how to load datasets and split them into training and test data.

## Datasets

The **sklearn.datasets** module contains a number of classic toy datasets, which are very useful both for learning ML and for performing checks on your models.

For this example we will use the Iris plants dataset, which contains 150 entries, each corresponding to an Iris. The Iris are of three different kinds, which are the **classes** we want to predict:
 
  * Iris-Setosa
  * Iris-Versicolour
  * Iris-Virginica

For each entry, we have 4 **features**, which we will use to try and predict the class:

  * sepal length in cm
  * sepal width in cm
  * petal length in cm
  * petal width in cm  

and a **target** (or **label**) which tells us the class for that entry.
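
In code, the features and the targets are usually kept in two separate arrays with one row per entry. A minimal sketch, assuming the `return_X_y=True` option of `load_iris` (the variable names are just illustrative choices):

```python
from sklearn.datasets import load_iris

# Ask directly for the feature matrix and the target vector
features, targets = load_iris(return_X_y=True)

print(features.shape)  # one row per entry, one column per feature
print(targets.shape)   # one integer label per entry
print(features[0], targets[0])  # features and label of the first entry
```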

In [None]:
import sklearn
import sklearn.datasets

# Load the dataset
iris_dataset = sklearn.datasets.load_iris()

""" The dataset has 6 main attributes: data, target, target_names,
feature_names, DESCR and filename.

- data is a NumPy ndarray of shape n_samples x n_features (in this case
150 x 4) which contains the numerical values of the features for each entry.

- target is a 1d ndarray of size n_samples, containing the class for each entry
expressed as an integer number: 0, 1, or 2

- target_names is an ndarray of strings, which tells us the name of each class.

- feature_names similarly contains the names of the features

- DESCR is a string describing the dataset

- filename is a string identifying the file where the data are stored
(a full path or just the file name, depending on the scikit-learn version)
"""

print(iris_dataset.data.shape)
print(iris_dataset.target.shape)
print(iris_dataset.target_names)
print(iris_dataset.feature_names)
print(iris_dataset.DESCR)


In [None]:
%matplotlib inline

# Let's inspect the data, starting with the classes
import numpy as np
from matplotlib import pyplot as plt

classes_dict = {}
for i, name in enumerate(iris_dataset.target_names):
    mask = iris_dataset.target == i
    iris_class = iris_dataset.target[mask]
    classes_dict[name] = len(iris_class)
print(classes_dict)
# Convert the dict views to lists, since plt.pie expects array-like input
plt.pie(list(classes_dict.values()), labels=list(classes_dict.keys()))

In [None]:
# Now the features

# Print the first 10 entries
print(iris_dataset.data[:10])

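# Look at one of the features - for example the first (sepal length)
sepal_length = iris_dataset.data[:, 0]
# Calculate its mean and standard deviation (RMS spread around the mean)
mean_sepal_length = sepal_length.mean()
rms_sepal_length = sepal_length.std()
print('mean = {:.2f} cm, rms = {:.2f} cm'.format(
    mean_sepal_length, rms_sepal_length))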
bins = np.linspace(min(sepal_length), max(sepal_length), 21)
plt.figure('sepal_length')
plt.hist(sepal_length, bins=bins)
plt.xlabel('{} [cm]'.format(iris_dataset.feature_names[0]))
plt.ylabel('entries/bin')

In [None]:
import itertools
# Let's do some correlation plots
# Iterate over all the possible unique pairs of numbers from 0, 1, 2, 3
# There are 6 of them: (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)
fig, ax = plt.subplots(2, 3, figsize=(18, 12))
for i, (a, b) in enumerate(itertools.combinations(range(4), 2)):
    x = iris_dataset.data[:, a]
    y = iris_dataset.data[:, b]
    _ax = fig.axes[i]
    _ax.scatter(x, y)
    _ax.set_xlabel('{} [cm]'.format(iris_dataset.feature_names[a]))
    _ax.set_ylabel('{} [cm]'.format(iris_dataset.feature_names[b]))
plt.show()

In [None]:
# Finally, let's re-do the last correlation plot, petal-length vs petal-width,
# using different colors for different classes 

plt.figure('scatter_coloured', figsize=(8, 6))
colors = ['red', 'green', 'blue']
for i, name in enumerate(iris_dataset.target_names):
    mask = iris_dataset.target == i
    x = iris_dataset.data[mask][:, 2] # petal length
    y = iris_dataset.data[mask][:, 3] # petal width
    plt.scatter(x, y, color=colors[i], label=name)
plt.xlabel(iris_dataset.feature_names[2])
plt.ylabel(iris_dataset.feature_names[3])
plt.legend()
plt.show()

Now let's split our data into a training sample and a test sample. We will train our model on the training sample and evaluate it on the test sample. The split is random (it is always a good idea to shuffle the data before training).

scikit-learn offers a handy tool for randomly splitting the sample into train and test: the **train_test_split** function.
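
To see why shuffling matters for this dataset, here is a small sketch. The Iris entries are stored sorted by class, so if the shuffling is switched off (the `shuffle=False` option of `train_test_split`) the last 20% of the entries, which become the test sample, all belong to the same class. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Split WITHOUT shuffling: the test sample is simply the last 20% of the entries
_, _, _, unshuffled_test_targets = train_test_split(
    iris.data, iris.target, test_size=0.2, shuffle=False)

# Classes present in the unshuffled test sample
print(np.unique(unshuffled_test_targets))
```

A model evaluated on such a test sample would tell us nothing about the other two classes, which is why the default behaviour is to shuffle.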

In [None]:
import sklearn.model_selection
""" Split the dataset into training and testing. We will use 80% of it
for training and 20% for testing.
Note: we use the 'random_state' argument to ensure repeatability of the
results (you can use any number instead of 42, its purpose is just to
initialize the pseudo-random number generator)."""
train_features, test_features, train_targets, test_targets = \
    sklearn.model_selection.train_test_split(iris_dataset.data,
                                             iris_dataset.target,
                                             test_size=0.2,
                                             random_state=42)
print(train_features.shape, test_features.shape, train_targets.shape, 
      test_targets.shape)

Now that we have explored the dataset and split it into training and test samples, let's build our model.

In [None]:
from sklearn import tree
model = tree.DecisionTreeClassifier(
    criterion                = 'gini',
    splitter                 = 'best',
    max_depth                = 3,
    min_samples_split        = 2,
    min_samples_leaf         = 1,
    min_weight_fraction_leaf = 0.05
)

# Train it
model = model.fit(train_features, train_targets)


In [None]:
# Let's draw the tree
plt.figure(figsize=(12, 12))
tree.plot_tree(model, feature_names=iris_dataset.feature_names)

Let's see how good our model is by predicting the classes from the test features and comparing the predictions with the test targets.

In [None]:
predicted_targets = model.predict(test_features)
right_pred = predicted_targets == test_targets
# Since True == 1 and False == 0, the sum() of the elements of the mask is
# equal to the number of right predictions
num_right_pred = right_pred.sum()
print('Our model got {:d} right predictions out of {:d}'.format(
    num_right_pred, len(predicted_targets)))

That is pretty awesome!
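
The same result is often quoted as a fraction, the accuracy. As a sketch, the block below repeats the steps above in compact form and compares the manual count with `sklearn.metrics.accuracy_score` and with the classifier's own `score` method. Fixing `random_state` on the tree is an extra assumption made here only so that repeated runs give the same tree, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
train_x, test_x, train_y, test_y = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# random_state=0 is an assumption, used only for reproducibility
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(train_x, train_y)
predicted = clf.predict(test_x)

# Three equivalent ways of computing the fraction of right predictions
manual_accuracy = (predicted == test_y).mean()
metric_accuracy = accuracy_score(test_y, predicted)
score_accuracy = clf.score(test_x, test_y)
print(manual_accuracy, metric_accuracy, score_accuracy)
```

Keep in mind that with only 30 test entries each single prediction shifts the accuracy by about 3%, so small differences between models are not significant on a sample of this size.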