# An introduction to machine learning with scikit-learn

http://scikit-learn.org/stable/tutorial/basic/tutorial.html

## Loading an example dataset

scikit-learn comes with a few standard datasets, for instance the <i>iris</i> and <i>digits</i> datasets for classification and the <i>Boston house prices</i> dataset for regression (shipped with older releases; load_boston was removed in scikit-learn 1.2).

Load the iris and digits datasets:

In [16]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the .data member, which is an array of shape (n_samples, n_features). In the case of a supervised problem, one or more response variables are stored in the .target member. More details on the different datasets can be found in the dedicated section.
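As a quick check of the layout described above, the following sketch prints the shapes of the .data and .target members. It also looks at digits.images, an extra member of the digits dataset that holds each sample as an 8x8 image before flattening.

```python
from sklearn import datasets

iris = datasets.load_iris()
digits = datasets.load_digits()

# .data is a 2D array: one row per sample, one column per feature
print(iris.data.shape)      # (150, 4)
print(digits.data.shape)    # (1797, 64)

# .target holds one response value per sample
print(iris.target.shape)    # (150,)
print(digits.target.shape)  # (1797,)

# For digits, each row of .data is a flattened 8x8 image
print(digits.images.shape)  # (1797, 8, 8)
```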

## Learning and predicting

In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).
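To make the fit/predict contract concrete, here is a minimal sketch of a toy classifier written from scratch. The class name MajorityClassifier and its majority_ attribute are hypothetical and not part of scikit-learn; the class only illustrates the two-method interface and does not derive from any scikit-learn base class.

```python
import numpy as np


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Learn the single "parameter" of this model: the majority label
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]
        return self  # returning self allows chaining, as scikit-learn estimators do

    def predict(self, T):
        # One prediction per row of T, regardless of the feature values
        return np.full(len(T), self.majority_)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, 0, 1])

toy = MajorityClassifier().fit(X, y)
print(toy.predict(np.array([[10.0], [20.0]])))  # [1 1]
```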

An example of an estimator is the class sklearn.svm.SVC that implements support vector classification. The constructor of an estimator takes as arguments the parameters of the model, but for the time being, we will consider the estimator as a black box:

In [17]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

#### Choosing the parameters of the model
In this example we set the value of gamma manually. It is possible to automatically find good values for the parameters by using tools such as grid search and cross-validation.
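One way to automate this choice is sklearn.model_selection.GridSearchCV, which tries every combination in a parameter grid and scores each with cross-validation. The sketch below is only illustrative: the candidate values for gamma and C and the choice of 3 folds are assumptions, not recommendations from this tutorial.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Candidate values are illustrative assumptions, not tuned recommendations
param_grid = {"gamma": [0.0001, 0.001, 0.01], "C": [1.0, 10.0, 100.0]}

# Each combination is evaluated with 3-fold cross-validation
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(digits.data[:-1], digits.target[:-1])

print(search.best_params_)  # the best combination found in the grid
print(search.best_score_)   # its mean cross-validated accuracy
```

After fitting, the search object itself can be used like a classifier: by default it is refitted on the whole training set with the best parameters found.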

We call our estimator instance clf, as it is a classifier. It must now be fitted to the data, that is, it must learn from the data. This is done by passing our training set to the fit method. As a training set, let us use all the images of our dataset apart from the last one. We select this training set with the [:-1] Python syntax, which produces a new array that contains all but the last entry of digits.data:

In [22]:
clf.fit(digits.data[:-1], digits.target[:-1])  

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
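The slicing used for the training set can be checked on a small array. This sketch uses a made-up 5x2 array in place of digits.data and also shows why a one-row slice [-1:] is useful when a 2D input is expected, in contrast to plain indexing with [-1].

```python
import numpy as np

# Stand-in for digits.data: 5 samples, 2 features
a = np.arange(10).reshape(5, 2)

train = a[:-1]  # all rows except the last one
last = a[-1:]   # the last row, kept as a 2D array with a single row

print(train.shape)  # (4, 2)
print(last.shape)   # (1, 2)
print(a[-1].shape)  # (2,)  plain indexing drops the sample dimension
```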

Now you can predict new values. In particular, we can ask the classifier which digit is shown in the last image of the digits dataset, which we did not use to train the classifier:

In [23]:
clf.predict(digits.data[-1:])

array([8])
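Because the last sample also has a known label in digits.target, the prediction can be compared with it. The following self-contained sketch repeats the steps above and prints the predicted label, the stored label, and whether they agree; the variable names predicted and actual are introduced here for illustration.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# Same estimator and training split as above
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

predicted = clf.predict(digits.data[-1:])[0]  # label predicted for the held-out image
actual = digits.target[-1]                    # label stored in the dataset

print(predicted, actual, predicted == actual)
```

A single held-out sample only serves as a sanity check; it says little about how accurate the classifier is in general.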