In [1]:
# Load libraries
import numpy as np

# Intro to Scikit-Learn

This notebook is a brief introduction to scikit-learn, one of the most popular and probably the most complete Python libraries for Machine Learning. For more information visit http://scikit-learn.org.

## About Scikit-Learn

Scikit-Learn is a Python package designed to give access to popular machine learning algorithms through a clean, well-thought-out API. It was built by hundreds of collaborators around the world and is used throughout industry and academia.

Scikit-Learn is based on Python's NumPy (Numerical Python) and SciPy (Scientific Python) libraries, which enable efficient numerical and scientific computing through Python.



## Datasets in Scikit-Learn

Most of the machine learning algorithms implemented in scikit-learn expect the data to be stored in a two-dimensional array, normally a NumPy array. The shape of the array is expected to be [n_samples, n_features], where:

- **n_samples:** the number of samples. Each sample is an instance to process. A sample can be a document, an image, a sound, a video, an astronomical object, a row in the database, or anything you can describe with a fixed set of quantitative or categorical traits.
    
- **n_features:** the number of different characteristics or traits that can be used to describe each instance. Features are generally real-valued, but can be Boolean or discrete for some models.
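To make the [n_samples, n_features] convention concrete, here is a minimal sketch with made-up numbers. The names `X`, `y` and `single` are hypothetical and only used for illustration.

```python
import numpy as np

# Hypothetical toy data: 3 samples (rows), each described by 2 features (columns).
X = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [4.7, 3.2],
])
# One label per sample (only needed for supervised models).
y = np.array([0, 0, 1])

n_samples, n_features = X.shape
print(n_samples, n_features)  # 3 2

# A single sample still has to be two-dimensional: shape (1, n_features).
single = X[0].reshape(1, -1)
print(single.shape)  # (1, 2)
```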

### Example: Iris dataset

Scikit-Learn has several datasets pre-loaded ready to be used: https://scikit-learn.org/stable/datasets/toy_dataset.html

One of the best known is the iris dataset. Let's import it.

In [2]:
from sklearn.datasets import load_iris

iris = load_iris()
dat = iris.data                    # feature matrix
target = iris.target               # numerical class codes
target_names = iris.target_names   # class names

Let's check how many examples of each class the target contains:

In [3]:
# Count how many times each class code appears in the target
unique, counts = np.unique(target, return_counts=True)
print(np.asarray((unique, counts)).T)

[[ 0 50]
 [ 1 50]
 [ 2 50]]


There are 50 examples of each type of flower. Each numerical code in the target vector stands for a type of iris flower, and the names are stored in the same order in `target_names`:

In [4]:
target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
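Since the codes are integers starting at 0, they can be used to index the array of names directly. A minimal sketch, reloading the dataset so that the cell is self-contained (the variable `labels` is introduced here only for illustration):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Integer-array indexing: each code selects the name stored at that position.
labels = iris.target_names[iris.target]

# The dataset is ordered by class: rows 0-49, 50-99 and 100-149.
print(labels[0], labels[50], labels[100])  # setosa versicolor virginica
```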

Iris Setosa
<img src="figures/iris_setosa.jpg" width="50%">

Iris Versicolor
<img src="figures/iris_versicolor.jpg" width="50%">

Iris Virginica
<img src="figures/iris_virginica.jpg" width="50%">

Now let's check the dataset dimensions:

In [5]:
dat.shape

(150, 4)

We see that it contains 150 instances and 4 columns or variables. Let's look at the content of the first 5 rows.

In [6]:
dat[0:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

We observe that we have 4 variables for each flower. These correspond to:

- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
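The object returned by `load_iris` also stores these names in its `feature_names` attribute, in the same order as the columns. A short sketch that pairs each name with the value measured for the first flower:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Column names, in the same order as the columns of iris.data.
print(iris.feature_names)

# Pair each name with the corresponding measurement of the first flower.
for name, value in zip(iris.feature_names, iris.data[0]):
    print(f"{name}: {value}")
```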

We represent each flower example as a row in our data matrix, and the columns (characteristics) hold the flower's measurements in centimeters. For example, we can represent the *Iris* dataset, consisting of 150 examples and 4 characteristics, as a two-dimensional array or matrix in $ \mathbb {R} ^ {150 \times 4} $ in the following format:

$$\mathbf{X} = \begin{bmatrix}
    x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)} & x_{4}^{(1)} \\
    x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)} & x_{4}^{(2)} \\
    \vdots & \vdots & \vdots & \vdots \\
    x_{1}^{(150)} & x_{2}^{(150)} & x_{3}^{(150)} & x_{4}^{(150)}
\end{bmatrix}.
$$

(In $ x_{j}^{(i)} $, the superscript denotes the $ i $th row and the subscript denotes the $ j $th characteristic.)
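In NumPy the same element is reached by indexing, keeping in mind that the notation above counts from 1 while Python counts from 0. A small sketch (the names `X`, `i`, `j` and `x_ij` are chosen here only to mirror the notation):

```python
from sklearn.datasets import load_iris

X = load_iris().data

# x_j^(i) with i = 1 (first flower) and j = 3 (petal length).
i, j = 1, 3
x_ij = X[i - 1, j - 1]  # shift both indices because NumPy is 0-based
print(x_ij)  # 1.4

# A whole row is one flower; a whole column is one characteristic.
first_flower = X[0, :]
petal_length = X[:, 2]
print(first_flower.shape, petal_length.shape)  # (4,) (150,)
```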

For information on all available datasets, visit this url: https://scikit-learn.org/stable/datasets.html

## Models

Scikit-Learn has an extensive battery of Machine Learning models available, which is one of the great advantages of using this library.

Another reason for its popularity is that it provides a general-purpose framework that works the same way for all types of models. It basically consists of 6 steps:

1. Import the model that you want to use.

      - See list of supervised models at: https://scikit-learn.org/stable/supervised_learning.html
      - See list of unsupervised models at: https://scikit-learn.org/stable/unsupervised_learning.html
      

2. Import the metric to use. See list of available metrics at https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

3. Define the model.

4. Call the fit method to train the model.

5. Call the predict method to generate the predictions.

6. Calculate the metric using the predictions obtained in the previous step.
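Because every estimator exposes the same `fit` / `predict` interface, the six steps read the same whichever model is chosen. As a compact sketch, here they are in a single cell with `DecisionTreeClassifier`, an arbitrary choice made only for illustration, evaluated on the iris data it was trained on:

```python
# 1. Import the model (an arbitrary choice for this sketch).
from sklearn.tree import DecisionTreeClassifier
# 2. Import the metric.
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 3. Define the model.
model = DecisionTreeClassifier(random_state=0)
# 4. Call fit to train the model.
model.fit(X, y)
# 5. Call predict to generate predictions (on the training data, for illustration only).
pred = model.predict(X)
# 6. Compute the metric from the predictions.
score = accuracy_score(y, pred)
print(score)
```

Swapping in another classifier only changes steps 1 and 3.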

Let's walk through all these steps with a simple example.

### Example: KNN

In a k-nearest neighbors (KNN) classification model, the prediction assigned to a point is the most common class among its k closest points (or most similar ones, according to some distance or similarity measure). If k = 1, each sample is simply assigned the class of its closest instance.

<img src="figures/knn.png" width="50%">
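The voting rule is short enough to write by hand. The following is a minimal sketch, not scikit-learn's implementation: the function `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample (assumed metric).
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points of class 0 and three points of class 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=1))  # 0
print(knn_predict(X_train, y_train, np.array([1.0, 0.9]), k=3))  # 1
```

Ties between classes are resolved here by whichever label was counted first, which is a simplification.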

Applied to regression, the procedure is the same, except that instead of the most common class, each instance is assigned the mean of the target values of its k closest neighbors.
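A small sketch of the regression case using `KNeighborsRegressor` with its default uniform weighting. The one-dimensional data below is made up so that the averaging is easy to check by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: one feature, target equal to 10 times the feature.
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

# The two nearest neighbors of 2.4 are x=2 and x=3, so the prediction is mean(20, 30).
prediction = reg.predict(np.array([[2.4]]))
print(prediction)  # [25.]
```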

Let's now train a KNN classification model for the iris dataset following the 6 steps above (**note**: we will not split the data into train / validation / test sets as this is a purely illustrative example):

1) Import model.

In [7]:
from sklearn.neighbors import KNeighborsClassifier

2) Import metric.

In [8]:
from sklearn.metrics import accuracy_score as metric

3) Define model.

In [13]:
KNeighborsClassifier?

In [9]:
model = KNeighborsClassifier(n_neighbors=1)

4) Call fit method to train the model.

In [14]:
model.fit(dat, target)

KNeighborsClassifier(n_neighbors=1)

5) Call predict method to generate predictions.

In [15]:
pred = model.predict(dat)
pred[0:5]

array([0, 0, 0, 0, 0])

6) Compute metric using predictions from the previous step.

In [16]:
metric(target, pred)

1.0
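The perfect score is to be expected: with `n_neighbors=1` and predictions made on the training data itself, the closest instance to every flower is that same flower. Holding out part of the data gives a more realistic estimate. The sketch below uses `train_test_split` with a 70/30 stratified split and `random_state=0`, all of which are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Keep 30% of the flowers aside; stratify so every class stays equally represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(train_acc, test_acc)
```

The accuracy on the held-out flowers is the number to look at when comparing models or values of k.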