# Before starting...

This jupyter notebook constitutes the session.

You have two choices:
1. **Open** this notebook and follow along with the presentation
    * Open terminal
    * Type `jupyter-notebook Introduction_Machine_Learning.ipynb`
2. **Create** your own notebook and reproduce the different steps (warning!)
    * Open terminal
    * Type `jupyter-notebook My_ML_Notebook.ipynb`

We start by **importing** the basic libraries:

In [None]:
%pylab inline

___
# Machine learning

### Definition
> ML is a set of methods that can **automatically detect patterns** in data, and then use the uncovered patterns to **predict future data** (*from Machine Learning: A Probabilistic Perspective (Murphy 2012)*)

### Phases

1. **Training** an algorithm in machine learning means detecting patterns in a dataset
1. **Testing** an algorithm means predicting future data, that is, generalizing the learned patterns to new datasets

### General workflow

1. Load **datasets**:
    * A **training dataset** 
        * it will be denoted `X_train`
        * in a supervised setting, the dataset also includes a list of labels `y_train` associated with the samples in `X_train`
        
    * A **testing dataset** 
        * it will be denoted `X_test`
        * in a supervised setting, the list of labels is `y_test`

1. Train the algorithm on the training dataset

1. Test the trained algorithm on the testing dataset
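
The three steps above can be sketched in a few lines. This is only an illustrative outline, not the code used later in the session: it assumes the Iris dataset (introduced in the next section), a linear SVM, and a 70/30 split made with `train_test_split`, none of which are prescribed by the workflow itself.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# 1. Load a dataset and split it into a training and a testing part
#    (the 70/30 split and random_state are arbitrary choices for this sketch)
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# 2. Train the algorithm on the training dataset
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, y_train)

# 3. Test the trained algorithm on the testing dataset
accuracy = classifier.score(X_test, y_test)
print('Test accuracy: %.2f' % accuracy)
```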

### Machine learning in python: scikit-learn

We will demonstrate machine learning methods with scikit-learn (sklearn), one of the most widely used machine learning libraries in Python. If you don't have the library installed, you can refer to the [Instructions to workshop participants](https://github.com/florisvanvugt/workshop4june2017).

In [None]:
# Test if sklearn is installed:
import sklearn

**Sklearn API**: http://scikit-learn.org/stable/modules/classes.html

___
# Toy Example: Iris Dataset

### Description

This dataset consists of 3 different types of irises (**Setosa**, **Versicolour**, and **Virginica**), each described by four features:
* Sepal Length
* Sepal Width
* Petal Length
* Petal Width

The classes are encoded as integers: 
* Setosa = 0
* Versicolour = 1
* Virginica = 2

Rows are the samples and the columns the feature dimensions (Sepal Length, Sepal Width, Petal Length and Petal Width).

### Iris dataset in sklearn

In [None]:
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
features = iris.data
labels = iris.target

In [None]:
print('Number of observations:', len(features), ' | Dimension:', len(features[0]))

In [None]:
features[:5,:]

In [None]:
labels

### Visualizing the dataset

**Problem:** Feature dimension is **4**, which makes it hard to visualize in a simple 2-d plot. 

_Choice 1:_ we select only the first two dimensions and visualize them in a scatter plot.

In [None]:
data_x = features[:,0] # sepal length
data_y = features[:,1] # sepal width

In [None]:
scatter(data_x, data_y, c=labels)

In [None]:
# explicit color for labels
color_table = [[1.,0.,0.], [0.,1.,0.], [0.,0.,1.]]
label_colors = [ color_table[l] for l in labels ]
scatter(data_x, data_y, c=label_colors)

In [None]:
data_x = features[:,2] # petal length
data_y = features[:,3] # petal width

In [None]:
scatter(data_x, data_y, c=label_colors)

___
# Training a Classifier on the Dataset

As a use case, we will explore classifier training with the **Support Vector Machine (SVM)**.

The support vector machine, in its simplest version, is a **linear discriminant model**. 

Some useful readings to learn more about SVMs:
* C Cortes, V Vapnik. Support-vector networks. _Machine learning_ 20 (3), 273-297, 1995
* B Schölkopf, AJ Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. *MIT press*, 2002

### SVM in sklearn

In [None]:
from sklearn import svm

Initialize a new SVM instance called `classifier`:

In [None]:
classifier = svm.SVC()

In [None]:
classifier

We use the simpler, linear version of the support vector machine:

In [None]:
classifier = svm.SVC(kernel='linear')

In [None]:
classifier

### Training an SVM

Classification is a supervised learning task, meaning that it learns the function mapping feature samples to known labels. 

In [None]:
for n in range(len(features)):
    print(features[n,:], '\t==> ', labels[n])

Training on the full dataset:

In [None]:
classifier.fit(features, labels);

Training on the two dimensions visualized previously:

In [None]:
classifier.fit(features[:,2:], labels);

**NOTES**
* `fit` is the generic function to train any method in sklearn
* for supervised methods, `fit` accepts two arguments: the feature data and their labels, that is `fit(X_train, y_train)`
* for unsupervised methods, `fit` accepts only one argument: the feature data, that is `fit(X_train)`
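
As a small illustration of these two calling conventions, the sketch below fits one supervised and one unsupervised estimator on the Iris features. The choice of `KMeans` as the unsupervised method (and its `n_clusters`, `n_init` and `random_state` values) is an assumption made for this example; it is not used elsewhere in the session.

```python
from sklearn import datasets, svm
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X_train, y_train = iris.data, iris.target

# supervised: fit receives the features AND their labels
supervised = svm.SVC(kernel='linear')
supervised.fit(X_train, y_train)

# unsupervised: fit receives the features only
unsupervised = KMeans(n_clusters=3, n_init=10, random_state=0)
unsupervised.fit(X_train)

print(supervised.predict(X_train[:3]))   # predicted class labels
print(unsupervised.labels_[:3])          # cluster indices (not class labels)
```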

### Understanding training in SVM

Understanding the training procedure in machine learning starts with understanding the **decision boundary**, which is the set of borders delimiting the regions of the feature space associated with each label. 

Let's take the last two dimensions of the iris data:

In [None]:
scatter(features[:,2], features[:,3], c=label_colors)

Let's consider only two classes given by the <span style="color:#DD0000;">**RED**</span> and <span style="color:#00DD00;">**GREEN**</span> colours (class 0 and 1 respectively)

In [None]:
np.where( (labels == 0) | (labels == 1) )

In [None]:
class_indexes = np.where( (labels == 0) | (labels == 1) )[0]

In [None]:
X_train = features[class_indexes, 2:]
y_train = labels[class_indexes]

In [None]:
label_colors_2classes = np.array(label_colors)[class_indexes]

In [None]:
scatter(X_train[:,0], X_train[:,1], c=label_colors_2classes)

**Question: what is the best decision boundary between classes 0 and 1?**

Linear models, such as SVM, consider linear decision boundaries, which means here a **line**!

A line can be defined by 2 parameters, a slope and an intercept:

In [None]:
slope = -0.1
intercept = 1.2

We generate the corresponding line:

In [None]:
boundary_x = np.linspace(1,5)
boundary_y = slope * boundary_x + intercept

In [None]:
scatter(X_train[:,0], X_train[:,1], c=label_colors_2classes)
plot(boundary_x, boundary_y, '-k')

**Is that good enough?**

In [None]:
scatter(X_train[:,0], X_train[:,1], c=label_colors_2classes)
plot(boundary_x, boundary_y, '-k')
scatter(3.5, 0.7, c='#444444', s=400)

Trying with other parameters:

In [None]:
slope = -1.0
intercept = 3.2

In [None]:
boundary_x = np.linspace(1,4)
boundary_y = slope * boundary_x + intercept

scatter(X_train[:,0], X_train[:,1], c=label_colors_2classes)
plot(boundary_x, boundary_y, '-k')

=> Looks better....

<span style="color:#AA1111; font-size: 16px;">TRAINING</span>
* means finding the best parameters with respect to the set of samples
* can often be understood as an OPTIMIZATION problem (i.e. finding the decision boundary that minimizes a certain **cost function**)
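
One way to make "looks better" measurable is the **margin**: the distance between the line and the closest sample. A linear SVM looks for the line with the largest margin. The sketch below compares the two hand-picked lines from above on the petal features of classes 0 and 1; the helper `margin` is a hypothetical function written for this illustration, not part of sklearn.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
mask = iris.target < 2                 # keep classes 0 and 1 only
X = iris.data[mask][:, 2:]             # petal length, petal width
y = iris.target[mask]

def margin(slope, intercept, X, y):
    """Smallest signed distance from the samples to the line y = slope*x + intercept.

    The value is positive only if every class-0 sample lies below the line
    and every class-1 sample lies above it.
    """
    signed = (X[:, 1] - (slope * X[:, 0] + intercept)) / np.sqrt(slope**2 + 1.0)
    side = np.where(y == 1, 1.0, -1.0)
    return np.min(side * signed)

margin_first = margin(-0.1, 1.2, X, y)    # first candidate line
margin_second = margin(-1.0, 3.2, X, y)   # second candidate line
print('margin of first line :', margin_first)
print('margin of second line:', margin_second)
```

Both lines separate the two classes, but the second one keeps a larger distance to the closest samples, which is why it is less likely to misclassify a new point near the boundary.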

#### Inspecting the result given by **SVM**

Train classifier with the sub-dataset comprised of only 2 classes:

In [None]:
classifier.fit(X_train, y_train);

The result of training is given by the `coef_` and `intercept_` attributes:

In [None]:
coefs = classifier.coef_[0]

In [None]:
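# the boundary is the set of points where coefs[0]*x + coefs[1]*y + intercept_ = 0
slope = -coefs[0] / coefs[1]
# note: this value is minus the y-intercept of the line, hence the subtraction when plotting
intercept = classifier.intercept_[0] / coefs[1]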

In [None]:
slope

In [None]:
intercept

In [None]:
boundary_x = np.linspace(1.5,3.5)
boundary_y = slope * boundary_x - intercept

In [None]:
scatter(X_train[:,0], X_train[:,1], c=label_colors_2classes)
plot(boundary_x, boundary_y, '-k')
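
To check that the slope and intercept derived from `coef_` and `intercept_` really describe the decision boundary, one can evaluate the classifier's `decision_function` on points of the line: it should be (numerically) zero there. The sketch below redoes the two-class training in a self-contained way; the names `w` and `b` are shorthand introduced for this example.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
mask = iris.target < 2                   # classes 0 and 1 only
X_train = iris.data[mask][:, 2:]         # petal length, petal width
y_train = iris.target[mask]

classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, y_train)

w = classifier.coef_[0]                  # weights of the linear model
b = classifier.intercept_[0]             # bias term

# w[0]*x + w[1]*y + b = 0   <=>   y = -(w[0]/w[1])*x - b/w[1]
boundary_x = np.linspace(1.5, 3.5)
boundary_y = -(w[0] / w[1]) * boundary_x - b / w[1]

# the decision function should vanish on every point of the line
values = classifier.decision_function(np.column_stack([boundary_x, boundary_y]))
print('largest |decision value| on the line:', np.abs(values).max())
```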

### Dealing with more than two classes

In [None]:
X_train = features[:,2:]
y_train = labels

In [None]:
classifier.fit(X_train, y_train);

When dealing with more than two classes, SVM finds a decision boundary between each pair of classes (Class 1, 2 and 3 stand for labels 0, 1 and 2):
* Class 1 vs. Class 2
* Class 1 vs. Class 3
* Class 2 vs. Class 3
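
For a linear `SVC`, each of these pairwise boundaries is stored as one row of `coef_` (with the matching entry in `intercept_`). The sketch below prints that layout for the three iris classes; the pair ordering shown (0 vs 1, 0 vs 2, 1 vs 2) follows the convention described in the sklearn documentation, and the `pairs` list is only a helper for this illustration.

```python
from itertools import combinations
from sklearn import datasets, svm

iris = datasets.load_iris()
X_train = iris.data[:, 2:]               # petal length, petal width
y_train = iris.target

classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, y_train)

# one row of coefficients per pair of classes: 3 classes -> 3 pairs
pairs = list(combinations(range(3), 2))
print('coef_ shape:', classifier.coef_.shape)
for row, (a, b) in enumerate(pairs):
    print('row', row, ': label', a, 'vs. label', b,
          '| coefs:', classifier.coef_[row],
          '| intercept:', classifier.intercept_[row])
```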

#### Plotting decision boundary between Class 1 [red] and Class 3 [blue]

In [None]:
case = 1

coefs = classifier.coef_[case]

slope = -coefs[0] / coefs[1]
intercept = classifier.intercept_[case] / coefs[1]

boundary_x = np.linspace(0,10)
boundary_y = slope * boundary_x - intercept

scatter(X_train[:,0], X_train[:,1], c=label_colors)
plot(boundary_x, boundary_y, '-k')

xlim([0.5,7.5])
ylim([-0.5,3.0])

#### Plotting decision boundary between Class 2 [green] and Class 3 [blue]

In [None]:
case = 2

coefs = classifier.coef_[case]

slope = -coefs[0] / coefs[1]
intercept = classifier.intercept_[case] / coefs[1]

boundary_x = np.linspace(0,10)
boundary_y = slope * boundary_x - intercept

scatter(X_train[:,0], X_train[:,1], c=label_colors)
plot(boundary_x, boundary_y, '-k')

xlim([0.5,7.5])
ylim([-0.5,3.0])

#### Visualizing the partitions of the underlying vector space

In [None]:
xx = np.linspace(0.5, 7.5, 200)
yy = np.linspace(-0.5, 3.0, 200)

In [None]:
zz = np.zeros((xx.shape[0],yy.shape[0]))
for i in range(len(xx)):
    for j in range(len(yy)):
        # predict returns an array with one element; [0] extracts the label
        zz[i,j] = classifier.predict( np.array([xx[i],yy[j]]).reshape(1,-1) )[0]

In [None]:
pcolormesh(xx, yy, -zz.T, cmap=plt.cm.RdBu, alpha=0.1)
scatter(X_train[:,0], X_train[:,1], c=y_train, edgecolors='k', cmap=plt.cm.RdBu_r)
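
The double loop above calls `predict` once per grid point (40,000 calls), which is slow. A possible alternative, sketched here under the same grid limits, builds all grid points with `np.meshgrid` and classifies them in a single `predict` call. The variable names `grid_x`, `grid_y` and `grid_points` are introduced for this example.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X_train = iris.data[:, 2:]               # petal length, petal width
y_train = iris.target
classifier = svm.SVC(kernel='linear').fit(X_train, y_train)

xx = np.linspace(0.5, 7.5, 200)
yy = np.linspace(-0.5, 3.0, 200)

# indexing='ij' gives grid_x[i, j] == xx[i] and grid_y[i, j] == yy[j],
# so zz has the same layout as in the loop version
grid_x, grid_y = np.meshgrid(xx, yy, indexing='ij')
grid_points = np.column_stack([grid_x.ravel(), grid_y.ravel()])

# one call to predict for all grid points
zz = classifier.predict(grid_points).reshape(grid_x.shape)
print(zz.shape)
```

The resulting `zz` can be passed to `pcolormesh` exactly as before.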

___
# Real-World Dataset

We will use the _musical genre_ dataset: https://github.com/florisvanvugt/workshop4june2017/tree/master/datasets/features.

Description:
* 5 classes that are the musical genres
    * **Ambient**
    * **Country**
    * **Metal**
    * **Rock n' Roll**
    * **Symphonic**
* There are 5 excerpts per class
* Data features are the MFCC (Mel-Frequency Cepstral Coefficients) computed on each excerpt

### Load dataset

In [None]:
classes = ['ambient', 'country', 'metal', 'rocknroll', 'symphonic']

In [None]:
excerpts = [0, 1, 2, 3, 4]

Load data features

In [None]:
features = []
for c in classes:
    for e in excerpts:
        data = np.loadtxt('datasets/features/%s_%03i.mfcc'%(c,e), delimiter=',')
        for d_vect in data:
            features.append(list(d_vect))

In [None]:
len(features)

In [None]:
features = np.array( features )

In [None]:
features.shape

This means that we have **1625** samples and each sample has **48** dimensions.

Load the labels associated with each sample:

In [None]:
labels = []
for c in classes:
    for e in excerpts:
        data = np.loadtxt('datasets/features/%s_%03i.mfcc'%(c,e), delimiter=',')
        for d_vect in data:
            labels.append( classes.index(c) )

In [None]:
labels = np.array( labels )

In [None]:
labels.shape

In [None]:
labels[ [2, 100, 387, 1209, 1500] ]

Each class is encoded as an integer:

* 'ambient' = 0
* 'country' = 1
* 'metal' = 2
* 'rocknroll' = 3
* 'symphonic' = 4

### Visualizing the dataset

In [None]:
scatter(features[:,1], features[:,2], c=labels)

In [None]:
# explicit color for labels?
color_table = [[1.,0.,0.], [0.,1.,0.], [0.,0.,1.], [0.,1.,1.], [1.,1.,0.]]
label_colors = [ color_table[l] for l in labels ]

In [None]:
for c in range(len(color_table)):
    scatter([0,0], [c,c], c=color_table[c], s=200)
    ylabel('Class label')

In [None]:
scatter(features[:,1], features[:,2], c=label_colors)

We can also try with two other feature dimensions

In [None]:
scatter(features[:,10], features[:,27], c=label_colors)

### Training SVM on this Dataset

Let's train on the two dimensions visualized previously:

In [None]:
classifier.fit(features[:,1:3], labels);

Like before, let's consider only two classes given by the <span style="color:#DD0000;">**RED**</span> and <span style="color:#DDDD00;">**YELLOW**</span> colours (class 0 and 4 respectively)

In [None]:
np.where( (labels == 0) | (labels == 4) )

In [None]:
class_indexes = np.where( (labels == 0) | (labels == 4) )[0]

In [None]:
X_train = features[class_indexes, 1:3]
y_train = labels[class_indexes]

In [None]:
label_colors_2classes = np.array(label_colors)[class_indexes]

In [None]:
scatter(X_train[:,0], X_train[:,1], c=label_colors_2classes)

Train classifier with the sub-dataset comprised of only 2 classes:

In [None]:
classifier.fit(X_train, y_train);

The result of training is given by the `coef_` and `intercept_` attributes:

In [None]:
coefs = classifier.coef_[0]

In [None]:
slope = -coefs[0] / coefs[1]
intercept = classifier.intercept_[0] / coefs[1]

In [None]:
slope

In [None]:
intercept

In [None]:
boundary_x = np.linspace(200,500)
boundary_y = slope * boundary_x - intercept

In [None]:
scatter(X_train[:,0], X_train[:,1], c=label_colors_2classes)
plot(boundary_x, boundary_y, '-k')

### Extend to multi-class training

In [None]:
X_train = features[:,1:3]
y_train = labels

In [None]:
classifier.fit(X_train, y_train);

In [None]:
xx = np.linspace(100, 500, 200)
yy = np.linspace(-150, 150, 200)

In [None]:
zz = np.zeros((xx.shape[0],yy.shape[0]))
for i in range(len(xx)):
    for j in range(len(yy)):
        # predict returns an array with one element; [0] extracts the label
        zz[i,j] = classifier.predict( np.array([xx[i],yy[j]]).reshape(1,-1) )[0]

In [None]:
pcolormesh(xx, yy, -zz.T, cmap=plt.cm.RdBu, alpha=0.1)
scatter(X_train[:,0], X_train[:,1], c=y_train, edgecolors='k', cmap=plt.cm.RdBu_r)

___
# Testing SVM Classification

> _Reminder:_ An ML method is trained on a **training dataset** and its generalizability is evaluated on a **testing dataset**.

Our goal here: **splitting** the dataset into training and testing sub-datasets.

### Splitting datasets in scikit-learn

From the API: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

Function | Description
--- | ---
`model_selection.KFold([n_splits, shuffle, ...])` | K-Folds cross-validator
`model_selection.GroupKFold([n_splits])`	| K-fold iterator variant with non-overlapping groups.
`model_selection.StratifiedKFold([n_splits, ...])`	| Stratified K-Folds cross-validator
`model_selection.LeaveOneGroupOut()`	| Leave One Group Out cross-validator
`model_selection.LeavePGroupsOut(n_groups)`	| Leave P Group(s) Out cross-validator
`model_selection.LeaveOneOut()`	| Leave-One-Out cross-validator
`model_selection.LeavePOut(p)`	| Leave-P-Out cross-validator
`model_selection.ShuffleSplit([n_splits, ...])`	| Random permutation cross-validator
`model_selection.GroupShuffleSplit([...])`	| Shuffle-Group(s)-Out cross-validation iterator
`model_selection.StratifiedShuffleSplit([...])`	| Stratified ShuffleSplit cross-validator
`model_selection.PredefinedSplit(test_fold)`	| Predefined split cross-validator
`model_selection.TimeSeriesSplit([n_splits])`	| Time Series cross-validator

Example with **stratified k-fold**

In [None]:
from sklearn.model_selection import StratifiedKFold

In [None]:
splitter = StratifiedKFold( n_splits=3 )

We use the two dimensions plotted previously

In [None]:
X = np.array( features[:,1:3] )
y = np.array( labels )

In [None]:
splitter.split(X,y)

In [None]:
for train_index, test_index in splitter.split(X,y):
    print("training:", labels[train_index])
    print("testing:", labels[test_index])
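
"Stratified" means that each fold keeps roughly the same class proportions as the full dataset. The sketch below illustrates this on made-up labels (5 classes with 30 samples each, an assumption chosen so the counts are easy to read) by counting the labels in each testing fold with `np.bincount`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# made-up labels for illustration: 5 classes, 30 samples each
y_demo = np.repeat(np.arange(5), 30)
X_demo = np.zeros((len(y_demo), 2))      # the feature values do not matter here

splitter = StratifiedKFold(n_splits=3)

fold_counts = []
for train_index, test_index in splitter.split(X_demo, y_demo):
    # number of samples of each class in the testing fold
    counts = np.bincount(y_demo[test_index], minlength=5)
    fold_counts.append(counts)
    print('samples per class in testing fold:', counts)
```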

### Testing SVM on the split datasets

In [None]:
splitter = StratifiedKFold(n_splits=3)

In [None]:
train_index, test_index = next(splitter.split(X,y))

In [None]:
train_index

In [None]:
test_index

In [None]:
y[test_index]

In [None]:
# training dataset
X_train = X[train_index]
y_train = y[train_index]

In [None]:
# testing dataset
X_test = X[test_index]
y_test = y[test_index]

In [None]:
# init SVM classifier
classifier = svm.SVC(kernel='linear')

# train SVM classifier
classifier.fit(X_train, y_train)   
    
# test SVM classifier and store output
y_pred = classifier.predict(X_test)

In [None]:
y_pred

Counting the number of errors of our prediction:

In [None]:
num_errors = 0
for i,yi in enumerate(y_pred):
    if (yi != y_test[i]):
        num_errors += 1

In [None]:
1.0 - num_errors/len(y_pred)

We can actually compute this score directly in sklearn with `score()`:

In [None]:
# sklearn function
score = classifier.score(X_test, y_test)

In [None]:
score
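
For classifiers, `score()` returns the mean accuracy, that is, the fraction of correctly predicted samples. The sketch below checks, on the Iris dataset and with an arbitrary train/test split (both assumptions of this example), that the manual count, `sklearn.metrics.accuracy_score` and `score()` agree.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

classifier = svm.SVC(kernel='linear').fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# three equivalent ways of computing the accuracy
accuracy_manual = np.mean(y_pred == y_test)
accuracy_metric = accuracy_score(y_test, y_pred)
accuracy_method = classifier.score(X_test, y_test)

print(accuracy_manual, accuracy_metric, accuracy_method)
```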

**BUT** we tested on only one split. What if this split happens to give a training dataset whose classes are particularly well discriminated, but a testing dataset whose classes are not? Or the contrary?

We have to consider more than one split:

In [None]:
splitter = StratifiedKFold(n_splits=10)

all_scores = []

for train_index, test_index in splitter.split(X,y):

    # select training and testing datasets
    X_train = X[train_index]
    y_train = y[train_index]
    X_test  = X[test_index]
    y_test  = y[test_index]   
    
    # declare classifier
    classifier = svm.SVC(kernel='linear')
    classifier.fit(X_train,y_train)   
    
    # compute score on testing dataset and store it
    score = classifier.score(X_test,y_test)
    all_scores.append(score)  
    
    # print score
    print('score: %.2f%%'%(score*100))

In [None]:
print(np.mean(all_scores)*100)

In [None]:
print(np.std(all_scores)*100)

The process of evaluating a model on various splits within a bigger dataset is called **CROSS-VALIDATION**. 
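
sklearn also provides `cross_val_score`, which runs the split/train/test loop for us. The sketch below is a compact equivalent of the loop above, shown on the Iris dataset so that it is self-contained (the music features are not reloaded here).

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()

# one accuracy score per split; the classifier is re-trained for each split
scores = cross_val_score(
    svm.SVC(kernel='linear'),
    iris.data, iris.target,
    cv=StratifiedKFold(n_splits=10),
)

print('mean score: %.2f%%' % (np.mean(scores) * 100))
print('std of scores: %.2f%%' % (np.std(scores) * 100))
```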

___
# Comparing Different Classifiers

In machine learning, we usually compare various models in order to pick the best one for a particular application. Model comparison can be done through cross-validation.

As an illustration, we compare classification accuracy for three classifiers:
* Linear SVM
    *  `svm.SVC(kernel='linear')`
* Non-linear SVM
    * `svm.SVC(kernel='rbf')`
* k-Nearest Neighbour
    * `neighbors.KNeighborsClassifier()`

First we load the kNN classifier from the sklearn library:

In [None]:
from sklearn import neighbors

Then we write a for loop over the dataset splits:

In [None]:
classifiers = ['SVM-linear', 'SVM-nonlinear', 'kNN']
all_scores = {'SVM-linear': [], 'SVM-nonlinear': [], 'kNN': []}

splitter = StratifiedKFold(n_splits=10)

for train_index, test_index in splitter.split(X,y):

    # select training and testing datasets
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]   
    
    
    for clf in classifiers:
        
        # declare classifier
        if (clf=='SVM-linear'):
            classifier = svm.SVC(kernel='linear')
        elif (clf=='SVM-nonlinear'):
            classifier = svm.SVC(kernel='rbf')
        elif (clf=='kNN'):
            classifier = neighbors.KNeighborsClassifier()
            
        # train classifier
        classifier.fit(X_train,y_train)

        # compute score on testing dataset and store it
        score = classifier.score(X_test,y_test)
        all_scores[clf].append(score)  
    
        # print score
        print(clf, 'score: %.2f%%'%(score*100)) 


In [None]:
for clf in classifiers:
    print(clf, 'mean score: %.2f%%'%(np.mean(all_scores[clf])*100)) 
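
The `if`/`elif` chain can be avoided by storing the estimators themselves in a dictionary. The sketch below shows this variant together with `cross_val_score`, on the Iris dataset so that it runs on its own; the dictionary name `candidates` is introduced for this example.

```python
from sklearn import datasets, neighbors, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# name -> untrained estimator
candidates = {
    'SVM-linear': svm.SVC(kernel='linear'),
    'SVM-nonlinear': svm.SVC(kernel='rbf'),
    'kNN': neighbors.KNeighborsClassifier(),
}

splitter = StratifiedKFold(n_splits=10)

mean_scores = {}
for name, estimator in candidates.items():
    # cross_val_score trains a fresh copy of the estimator on each split
    scores = cross_val_score(estimator, X, y, cv=splitter)
    mean_scores[name] = scores.mean()
    print(name, 'mean score: %.2f%%' % (mean_scores[name] * 100))
```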

### Visualizing decision boundaries

In [None]:
def partitions(classifier_, X_train_, y_train_):
    xx = np.linspace( np.min(X_train_[:,0]), np.max(X_train_[:,0]), 200 )
    yy = np.linspace( np.min(X_train_[:,1]), np.max(X_train_[:,1]), 200 )
    zz = np.zeros( (xx.shape[0],yy.shape[0]) )
    for i in range(len(xx)):
        for j in range(len(yy)):
            # predict returns an array with one element; [0] extracts the label
            zz[i,j] = classifier_.predict( np.array([xx[i],yy[j]]).reshape(1,-1) )[0]
    scatter(X_train_[:,0], X_train_[:,1], c=y_train_, edgecolors='k', cmap=plt.cm.RdBu_r)
    pcolormesh(xx, yy, -zz.T, cmap=plt.cm.RdBu, alpha=0.1)

In [None]:
figure(figsize=(16,5))

for i,clf in enumerate(['SVM-linear', 'kNN']):
    
    subplot(1,2,i+1)

    if (clf=='SVM-linear'):
        classifier = svm.SVC(kernel='linear')

    elif (clf=='kNN'):
        classifier = neighbors.KNeighborsClassifier()

    classifier.fit(X_train, y_train)
    partitions(classifier, X_train, y_train)

### Parametric vs. Non-Parametric methods

### Comparing classifiers on the original vector space

In [None]:
X = np.array( features )
y = np.array( labels )

In [None]:
X.shape

In [None]:
classifiers = ['SVM-linear', 'SVM-nonlinear', 'kNN']
all_scores = {'SVM-linear': [], 'SVM-nonlinear': [], 'kNN': []}

splitter = StratifiedKFold(n_splits=10)

for train_index, test_index in splitter.split(X,y):

    # select training and testing datasets
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]   
    
    
    for clf in classifiers:
        
        # declare classifier
        if (clf=='SVM-linear'):
            classifier = svm.SVC(kernel='linear')
        elif (clf=='SVM-nonlinear'):
            classifier = svm.SVC(kernel='rbf')
        elif (clf=='kNN'):
            classifier = neighbors.KNeighborsClassifier()
            
        # train classifier
        classifier.fit(X_train,y_train)

        # compute score on testing dataset and store it
        score = classifier.score(X_test,y_test)
        all_scores[clf].append(score)  
    
        # print score
        print(clf, 'score: %.2f%%'%(score*100)) 

In [None]:
for clf in classifiers:
    print('-', clf, 'mean score: %.2f%%'%(np.mean(all_scores[clf])*100))

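The non-linear SVM returns a classification accuracy at **chance level** (20%, since there are 5 classes).

=> Some techniques are sensitive to the range of the input features. Below, each feature is standardized to zero mean and unit variance before training and testing.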

In [None]:
classifiers = ['SVM-linear', 'SVM-nonlinear', 'kNN']
all_scores = {'SVM-linear': [], 'SVM-nonlinear': [], 'kNN': []}

splitter = StratifiedKFold(n_splits=10)

for train_index, test_index in splitter.split(X,y):

    # select training and testing datasets
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

    # standardize both datasets, once per split, using the mean and
    # standard deviation of the TRAINING dataset only
    mean_train = np.mean(X_train, axis=0)
    std_train = np.std(X_train, axis=0)
    X_train = np.divide( np.subtract(X_train, mean_train), std_train )
    X_test = np.divide( np.subtract(X_test, mean_train), std_train )

    for clf in classifiers:
        
        # declare classifier
        if (clf=='SVM-linear'):
            classifier = svm.SVC(kernel='linear')
        elif (clf=='SVM-nonlinear'):
            classifier = svm.SVC(kernel='rbf')
        elif (clf=='kNN'):
            classifier = neighbors.KNeighborsClassifier()
            
        # train classifier
        classifier.fit(X_train,y_train)

        # compute score on testing dataset and store it
        score = classifier.score(X_test,y_test)
        all_scores[clf].append(score)  
    
    
for clf in classifiers:
    print('-', clf, 'mean score: %.2f%%'%(np.mean(all_scores[clf])*100))

**NOTE**
* Sklearn has several functions to pre-process data before feeding them to some classifiers or regressors
* More: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
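
As an example of these pre-processing tools, the sketch below chains a `StandardScaler` and a non-linear SVM with `make_pipeline`, so that the scaling parameters are estimated on the training part of each split only. It uses sklearn's wine dataset as a stand-in with features of very different ranges; this dataset is an assumption of the example and is not the music dataset used above.

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in dataset whose features have very different ranges
wine = datasets.load_wine()
X, y = wine.data, wine.target

splitter = StratifiedKFold(n_splits=10)

# non-linear SVM on the raw features
raw_scores = cross_val_score(svm.SVC(kernel='rbf'), X, y, cv=splitter)

# same classifier, preceded by standardization (fitted on each training split)
scaled_classifier = make_pipeline(StandardScaler(), svm.SVC(kernel='rbf'))
scaled_scores = cross_val_score(scaled_classifier, X, y, cv=splitter)

print('raw features   : %.2f%%' % (raw_scores.mean() * 100))
print('scaled features: %.2f%%' % (scaled_scores.mean() * 100))
```

Wrapping the scaler and the classifier in one pipeline keeps the testing data out of the scaling step, which is easy to get wrong when standardizing by hand.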

___
About this material: copyright Baptiste Caramiaux (write me for any questions or use of this material [email](mailto:baptiste.caramiaux@ircam.fr))
___