In [1]:
%matplotlib inline
import pandas
import sklearn

# Machine Learning with Scikit-learn

Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Rather than following strictly static program instructions, such algorithms build a model from sample inputs and use it to make data-driven predictions or decisions. [Wikipedia](https://en.wikipedia.org/wiki/Machine_learning#Types_of_problems_and_tasks)

## Types of Learning

1. [Supervised learning](https://en.wikipedia.org/wiki/Supervised_learning): the machine learning algorithm is presented with inputs ("observations") and their respective outputs ("targets" or "labels") and learns how to map the inputs to the outputs.
2. [Unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning): the algorithm is only presented with inputs ("observations") without outputs ("targets"). The goal is merely to find structure in the input data.
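
To make the unsupervised case concrete, here is a minimal sketch that clusters the iris observations without ever looking at their labels. The choice of `KMeans`, the value `n_clusters=3`, and the names `kmeans` and `cluster_ids` are assumptions made for illustration; they are not part of the original walkthrough.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

# Fit on the observations only; no targets are passed to the model.
# n_clusters=3 is an illustrative assumption, not something the data dictates.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(iris.data)

print(cluster_ids.shape)              # one cluster id per observation
print(kmeans.cluster_centers_.shape)  # one centre per cluster, in feature space
```

The cluster ids are arbitrary integers; nothing ties cluster 0 to any particular species.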


Supervised learning problems are further divided by the type of the target variable:

1. Classification: the target variable is categorical, i.e. the target belongs to a finite set of classes, e.g. {dog, cat, horse}, or {0, 1}.
    1. Binary classification: the target only belongs to one of two classes, e.g. {0, 1}.
    2. Multi-class classification: the target belongs to any of N classes, e.g. {0, 1, ..., N-1}.
2. Regression: the target variable is continuous, e.g. any real number.


## Load Data

Import a toy dataset:

In [2]:
from sklearn.datasets import load_iris
iris_data = load_iris()

Find the number of rows (observations) and columns (features) in the dataset:

In [3]:
print(iris_data.data.shape)

(150, 4)


Each of the 150 observations has 4 features.

Get the names of the features:

In [4]:
iris_data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Get the names of the classes:

In [5]:
iris_data.target_names

array(['setosa', 'versicolor', 'virginica'], 
      dtype='|S10')
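
The class names above are paired with integer codes stored in `iris_data.target`. As a small sketch (assuming Python 3 and NumPy, with `class_counts` a name introduced here for illustration), the following inspects those codes and counts the observations per class:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_data = load_iris()

# Targets are stored as integer codes, one per observation.
print(iris_data.target.shape)
print(np.unique(iris_data.target))

# Count the observations in each class and pair each count with its name.
class_counts = np.bincount(iris_data.target)
for name, count in zip(iris_data.target_names, class_counts):
    print(name, count)
```

The code `k` in `iris_data.target` corresponds to the name `iris_data.target_names[k]`.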

## Classification

Training a classification model in scikit-learn takes three steps:

1. Initialize a model.
2. Train, i.e. learn, using the `fit` method.
3. Test, i.e. use the model, using the `predict` method.

First, initialize a model. There are many classifiers to choose from. Consult the [documentation](http://scikit-learn.org/stable/supervised_learning.html) for more info. Here is a support vector machine:

In [6]:
import sklearn.svm
model = sklearn.svm.SVC()

Second, train the model.

In [7]:
model.fit(iris_data.data, iris_data.target)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Finally, test the model by applying it to new observations. Here, we feed it two new observations and retrieve their predicted class labels:

In [8]:
model.predict([[7.2, 2.8, 6.6, 2], [6.2, 2.5, 3.6, 2]])

array([2, 1])
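
The predictions are integer codes. A possible way to turn them into readable class names is to index `target_names` with the predicted codes, as sketched below. The names `new_observations`, `predicted_codes`, and `predicted_names` are introduced here for illustration, and the exact predictions may depend on the installed scikit-learn version and its default `SVC` parameters.

```python
import numpy as np
import sklearn.svm
from sklearn.datasets import load_iris

iris_data = load_iris()
model = sklearn.svm.SVC()
model.fit(iris_data.data, iris_data.target)

# The same two observations as above, as a 2-D array (rows = observations).
new_observations = np.array([[7.2, 2.8, 6.6, 2.0], [6.2, 2.5, 3.6, 2.0]])
predicted_codes = model.predict(new_observations)

# Index the array of class names with the integer codes.
predicted_names = iris_data.target_names[predicted_codes]
print(predicted_names)
```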

We can also make predictions on all of the training data at once:

In [9]:
model.predict(iris_data.data)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
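
A long array of labels is hard to read by eye. One way to summarize it, sketched here under the assumption that `sklearn.metrics.confusion_matrix` is available in the installed version, is to tabulate true classes against predicted classes:

```python
import sklearn.svm
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix

iris_data = load_iris()
model = sklearn.svm.SVC()
model.fit(iris_data.data, iris_data.target)

predictions = model.predict(iris_data.data)

# Rows are true classes, columns are predicted classes.
# Off-diagonal entries count the misclassified observations.
cm = confusion_matrix(iris_data.target, predictions)
print(cm)
```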

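The `score` method provides a quick way to gauge the accuracy of the model. For classifiers, it returns the mean accuracy on the given data. Because we score the same data that was used for training, the result is an optimistic estimate: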

In [10]:
model.score(iris_data.data, iris_data.target)

0.98666666666666669
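
To estimate how the model performs on data it has not seen, a common approach is to hold out part of the dataset for testing. The sketch below assumes `sklearn.model_selection.train_test_split` is available (scikit-learn 0.18 or later); the 25% test size, the `random_state`, and the variable names are arbitrary choices for illustration.

```python
import sklearn.svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()

# Hold out 25% of the observations; stratify keeps the class balance
# roughly equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target,
    test_size=0.25, random_state=0, stratify=iris_data.target)

model = sklearn.svm.SVC()
model.fit(X_train, y_train)  # train on the training split only

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

The accuracy on the test split is usually the more honest figure to report.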

## Regression

Training a regression model in scikit-learn follows the same three steps:

1. Initialize a model.
2. Train, i.e. learn, using the `fit` method.
3. Test, i.e. use the model, using the `predict` method.

The only difference is the nature of the target variable, which can be any real number rather than a class label.

First, initialize a model:

In [11]:
import sklearn.linear_model
model = sklearn.linear_model.LinearRegression()

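Second, train the model. Here, the first two features (sepal length and sepal width) are the inputs, and the third feature (petal length) is the target: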

In [12]:
model.fit(iris_data.data[:, 0:2], iris_data.data[:, 2])



LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Finally, test the model by applying it to new observations. Here, we feed it two new observations and retrieve their predicted petal lengths (in cm):

In [13]:
model.predict([[7.2, 2.8], [6.2, 2.5]])

array([ 6.5148047 ,  5.13576377])
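
A fitted linear regression predicts by taking a weighted sum of the input features plus an intercept. As a sketch of that arithmetic, the block below recomputes the predictions by hand from the model's `coef_` and `intercept_` attributes; the names `new_observations` and `manual` are introduced here for illustration.

```python
import numpy as np
import sklearn.linear_model
from sklearn.datasets import load_iris

iris_data = load_iris()
model = sklearn.linear_model.LinearRegression()
model.fit(iris_data.data[:, 0:2], iris_data.data[:, 2])

new_observations = np.array([[7.2, 2.8], [6.2, 2.5]])

# prediction = (features . coefficients) + intercept
manual = np.dot(new_observations, model.coef_) + model.intercept_

print(manual)
print(model.predict(new_observations))  # should agree with the line above
```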

Evaluate the model. For regressors, `score` returns the coefficient of determination, R², on the given data:

In [14]:
model.score(iris_data.data[:, 0:2], iris_data.data[:, 2])

0.86697289449897486
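
As with classification, scoring on the training data tends to flatter the model. A minimal sketch of a held-out evaluation follows, assuming `sklearn.model_selection` and `sklearn.metrics` are available in the installed version; the split size and `random_state` are arbitrary illustrative choices.

```python
import sklearn.linear_model
from sklearn.datasets import load_iris
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X = iris_data.data[:, 0:2]  # sepal length, sepal width
y = iris_data.data[:, 2]    # petal length

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = sklearn.linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 on held-out data; this is the same quantity model.score reports.
test_r2 = r2_score(y_test, y_pred)
# Mean squared error, in squared centimetres.
test_mse = mean_squared_error(y_test, y_pred)
print(test_r2, test_mse)
```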