## Introduction to the scikit-learn API for Machine Learning

This notebook introduces the reader to the scikit-learn API for Machine Learning. We will look at a simple example to understand the main pipeline followed in the manuscript. The interested reader can explore these concepts in more depth by reading the book.

In [1]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
import pickle

We create a toy example. It is represented by two variables, holding the feature vectors and the target labels used for training.

In [2]:
X_train = [{'airport': 'CDG', 'month': 'October'},
           {'airport': 'MPX', 'month': 'June'},
           {'airport': 'VLC', 'month': 'February'},
           {'airport': 'CDG', 'month': 'December'}]

y_train = ['rain', 'rain', 'sun', 'rain']

Since the features are categorical, they must be converted to numbers before a classifier can use them, so we chain both steps with the pipeline object from scikit-learn. We combine a vectorizer (which converts the categorical features into numerical vectors) and a classifier (here the Linear Support Vector Classifier, `LinearSVC`). Then we call `fit` to train all the components in the pipeline.

In [3]:
pipeline = make_pipeline(DictVectorizer(), LinearSVC())
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('dictvectorizer',
                 DictVectorizer(dtype=<class 'numpy.float64'>, separator='=',
                                sort=True, sparse=True)),
                ('linearsvc',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)
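To make the vectorizer step more concrete, the sketch below runs `DictVectorizer` on its own over the same training dictionaries. It is only an illustration: `sparse=False` is passed so the result prints as a dense array (the pipeline above keeps the default sparse output), and the names `vectorizer`, `X_encoded` and `feature_names` are chosen here and are not part of the notebook.

```python
from sklearn.feature_extraction import DictVectorizer

X_train = [{'airport': 'CDG', 'month': 'October'},
           {'airport': 'MPX', 'month': 'June'},
           {'airport': 'VLC', 'month': 'February'},
           {'airport': 'CDG', 'month': 'December'}]

# sparse=False is an assumption made for readability only
vectorizer = DictVectorizer(sparse=False)
X_encoded = vectorizer.fit_transform(X_train)

# One column per distinct "key=value" pair seen during fit
feature_names = vectorizer.feature_names_
print(feature_names)
print(X_encoded)
```

Each row should contain exactly two ones: one for the airport column and one for the month column of that sample.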

We apply our classifier to a test set to make predictions on unseen data.

In [4]:
X_test = [{'airport': 'JFK', 'month': 'June'},
          {'airport': 'CDG', 'month': 'November'},
          {'airport': 'MPX', 'month': 'June'},
          {'airport': 'MPX', 'month': 'November'}]

y_test = ['rain', 'rain', 'sun', 'rain']


In [5]:
y_pred = pipeline.predict(X_test)
y_pred

array(['rain', 'rain', 'rain', 'rain'], dtype='<U4')
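Note that the test set contains values (`'JFK'`, `'November'`) that never appear in the training data. As a hedged illustration of what happens to them, the sketch below fits a standalone `DictVectorizer` on the training dictionaries and transforms two hand-picked samples. `DictVectorizer` silently ignores feature values it did not see during `fit`, so they contribute no non-zero column. The variable names and `sparse=False` are assumptions made for this example.

```python
from sklearn.feature_extraction import DictVectorizer

X_train = [{'airport': 'CDG', 'month': 'October'},
           {'airport': 'MPX', 'month': 'June'},
           {'airport': 'VLC', 'month': 'February'},
           {'airport': 'CDG', 'month': 'December'}]

# Dense output is used only so the rows are easy to inspect
vectorizer = DictVectorizer(sparse=False).fit(X_train)

# Both values are unseen: the encoded row contains only zeros
unseen = vectorizer.transform([{'airport': 'JFK', 'month': 'November'}])

# Only the month was seen during fit: a single column is set
partially_seen = vectorizer.transform([{'airport': 'JFK', 'month': 'June'}])

print(unseen)
print(partially_seen)
```

For such samples the classifier has little or no information to work with, which is worth keeping in mind when reading the predictions above.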

Finally, we evaluate the model by comparing its predictions with the true test labels.

In [6]:
accuracy_score(y_test, y_pred)

0.75
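The score can be checked by hand: accuracy is the fraction of test samples whose predicted label equals the true label. The sketch below recomputes it from the `y_test` and `y_pred` values shown above; `matches` and `manual_accuracy` are names introduced for this illustration.

```python
from sklearn.metrics import accuracy_score

y_test = ['rain', 'rain', 'sun', 'rain']
y_pred = ['rain', 'rain', 'rain', 'rain']  # predictions copied from the output above

# Count positions where prediction and true label agree
matches = sum(true == pred for true, pred in zip(y_test, y_pred))
manual_accuracy = matches / len(y_test)

print(manual_accuracy)
print(accuracy_score(y_test, y_pred))
```

Three of the four labels match (only the `'sun'` sample is misclassified), which gives 3/4 = 0.75.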