In [0]:
%pylab inline

In [0]:
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.metrics
import sklearn.multiclass
import sklearn.preprocessing

# Machine learning

The most relevant modules of the `scikit-learn` library are imported above. The API reference is available at:

http://scikit-learn.org/stable/modules/classes.html

It is also worth checking out:

http://scikit-learn.org/stable/user_guide.html

Let's first load and examine the digits dataset returned by the function `sklearn.datasets.load_digits()`:
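
One possible way to explore the dataset is sketched below. The variable name `digits` is our own choice; the fields printed are those of the `Bunch` object that `load_digits()` returns.

```python
import sklearn.datasets

# Load the dataset as a Bunch object (a dictionary with attribute access).
digits = sklearn.datasets.load_digits()

print(digits.keys())        # names of the available fields
print(digits.data.shape)    # flattened images, one row per sample
print(digits.images.shape)  # the same images as 8x8 arrays
print(digits.target[:10])   # class labels of the first ten samples
```

The `DESCR` field holds a longer textual description of the dataset and can be printed in the same way.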

Load the database again with the argument `return_X_y` set to `True`:
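
A minimal sketch of this step, assuming we call the feature matrix `X` and the label vector `y` (both names are our own):

```python
import sklearn.datasets

# With return_X_y=True the function returns a (data, target) tuple
# instead of a Bunch object.
X, y = sklearn.datasets.load_digits(return_X_y=True)

print(X.shape)  # one row of 64 pixel values per image
print(y.shape)  # one label per image
```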

Draw a few sample images from the database:
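
The sketch below shows one way to do this. It imports `matplotlib.pyplot` explicitly so that it also runs outside of `%pylab`; the 2x5 grid and the `gray_r` colormap are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt
import sklearn.datasets

X, y = sklearn.datasets.load_digits(return_X_y=True)

# Show the first ten samples on a 2x5 grid.
fig, axes = plt.subplots(2, 5, figsize=(8, 4))
for ax, pixels, label in zip(axes.ravel(), X, y):
    # Each row of X is a flattened 8x8 image.
    ax.imshow(pixels.reshape(8, 8), cmap='gray_r')
    ax.set_title(f'label: {label}')
    ax.axis('off')
```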

## Splitting the data into training and test sets

This is easily done with the `sklearn.model_selection.train_test_split()` function. The `test_size` argument sets the fraction of the data used for testing.

Divide the dataset randomly: 90% for training and 10% for testing. Draw a histogram of the classes in the test set:
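
A possible solution is sketched here. The `_trn`/`_tst` variable names and the fixed `random_state=0` (used only to make the split reproducible) are our own assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
import sklearn.model_selection

X, y = sklearn.datasets.load_digits(return_X_y=True)

# Random 90/10 split; random_state only makes the result reproducible.
X_trn, X_tst, y_trn, y_tst = sklearn.model_selection.train_test_split(
    X, y, test_size=0.1, random_state=0)

# Count how many test samples fall into each of the ten classes.
counts = np.bincount(y_tst, minlength=10)
plt.bar(np.arange(10), counts)
plt.xticks(np.arange(10))
plt.xlabel('Class')
plt.ylabel('Number of test samples')
```

Because the split is purely random, the bars will usually not be of equal height.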

### Stratification

Scikit-learn makes it easy to stratify the data, so that each class is represented in the training and test sets in the same proportion as in the whole dataset. Use the `sklearn.model_selection.StratifiedShuffleSplit` class to split the data as above. This class can split the data several times, but for this example we only need to set `n_splits` to 1. The `split` method of this class returns a generator (normally used in loops), but you can use the built-in `next` function to retrieve the first (and only) split.

Generate a stratified split of the data (just like above) and draw the class histogram of its test set:
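
The following sketch uses `StratifiedShuffleSplit` as described above. The names `splitter`, `trn_idx` and `tst_idx` and the fixed `random_state=0` are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
import sklearn.model_selection

X, y = sklearn.datasets.load_digits(return_X_y=True)

splitter = sklearn.model_selection.StratifiedShuffleSplit(
    n_splits=1, test_size=0.1, random_state=0)

# split() yields (train indices, test indices) pairs; take the only one.
trn_idx, tst_idx = next(splitter.split(X, y))
X_trn, X_tst = X[trn_idx], X[tst_idx]
y_trn, y_tst = y[trn_idx], y[tst_idx]

# The class counts in the test set should now be nearly equal,
# because the classes are nearly balanced in the full dataset.
counts = np.bincount(y_tst, minlength=10)
plt.bar(np.arange(10), counts)
plt.xticks(np.arange(10))
plt.xlabel('Class')
plt.ylabel('Number of test samples')
```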

## Classification

We will use a very simple linear model called "logistic regression". The `sklearn.linear_model.LogisticRegression` class has many initialization parameters, but we'll use the defaults for now.

Use the `fit` method with the training data and then the `predict` method with the test data:
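
A minimal sketch of training and prediction follows. For brevity it obtains a stratified split through the `stratify` argument of `train_test_split`, which is a shortcut of our own rather than the class used above; the names `model` and `y_pred` are also our own.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = sklearn.model_selection.train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# Default parameters, as suggested in the text.
model = sklearn.linear_model.LogisticRegression()
model.fit(X_trn, y_trn)

# Hard class predictions for the held-out samples.
y_pred = model.predict(X_tst)
print(y_pred[:10])
print(y_tst[:10])
```

Depending on the scikit-learn version, the default settings may produce a convergence warning on this data. If so, raising `max_iter` or scaling the features is one way to address it.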

### Evaluation

The `sklearn.metrics` module contains many evaluation methods, but it's important to understand exactly what they do. Some work only for specific classification methods, so it's a good idea to read the documentation if something goes wrong.

The simplest and most versatile metric is `sklearn.metrics.accuracy_score`. Calculate it for the above result.
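
As a sketch, the accuracy can be computed as follows. The data preparation repeats the earlier steps so that the block runs on its own; the variable names and `random_state=0` are assumptions.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = sklearn.model_selection.train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

model = sklearn.linear_model.LogisticRegression().fit(X_trn, y_trn)
y_pred = model.predict(X_tst)

# Fraction of test samples whose predicted label equals the reference label.
acc = sklearn.metrics.accuracy_score(y_tst, y_pred)
print('Accuracy: {:.2%}'.format(acc))
```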

You can also try computing precision, recall and F1. Note, however, that these metrics are defined for binary classification. They can nevertheless be computed for a multiclass problem by evaluating each class individually and averaging the results. Set the `average` parameter to `micro`, `macro` or `weighted`.
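
The sketch below compares the three averaging modes. The dictionary `scores` is a helper of our own; leaving `average` at its default value of `'binary'` is expected to raise an error for this ten-class target.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = sklearn.model_selection.train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)
y_pred = sklearn.linear_model.LogisticRegression().fit(X_trn, y_trn).predict(X_tst)

scores = {}
for average in ('micro', 'macro', 'weighted'):
    # Each metric is computed per class and then combined as requested.
    p = sklearn.metrics.precision_score(y_tst, y_pred, average=average)
    r = sklearn.metrics.recall_score(y_tst, y_pred, average=average)
    f = sklearn.metrics.f1_score(y_tst, y_pred, average=average)
    scores[average] = (p, r, f)
    print('{:>8}: P={:.3f} R={:.3f} F1={:.3f}'.format(average, p, r, f))
```

For a single-label multiclass problem such as this one, the micro-averaged precision, recall and F1 all coincide with the accuracy.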

#### Confusion matrix

A confusion matrix is a very useful tool for analyzing errors. Compute and draw it using `sklearn.metrics.confusion_matrix`:
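
One way to compute and display the matrix is sketched below. Plotting with `imshow` and the `Blues` colormap is our own choice; passing `labels` explicitly is an extra precaution that fixes the matrix size at 10x10.

```python
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = sklearn.model_selection.train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)
y_pred = sklearn.linear_model.LogisticRegression().fit(X_trn, y_trn).predict(X_tst)

# Rows correspond to true classes, columns to predicted classes.
cm = sklearn.metrics.confusion_matrix(y_tst, y_pred, labels=np.arange(10))

plt.imshow(cm, cmap='Blues')
plt.colorbar()
plt.xticks(np.arange(10))
plt.yticks(np.arange(10))
plt.xlabel('Predicted class')
plt.ylabel('True class')
```

Correct predictions lie on the diagonal, so any off-diagonal cell with a non-zero count points to a pair of classes that the model confuses.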

## Cross validation

For small datasets, a more reliable estimate can be obtained using cross-validation. Scikit-learn provides several convenience classes, such as `sklearn.model_selection.KFold` or, even better, `sklearn.model_selection.StratifiedKFold`.

Create a 5-fold cross-validation object and use a for loop to repeat the above experiment, saving the result for each fold. Finally, report the mean and standard deviation of the results.
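
The loop can be sketched as follows, assuming accuracy is the result saved for each fold. The settings `shuffle=True` and `random_state=0` and the names `cv` and `scores` are our own choices.

```python
import numpy as np
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

X, y = sklearn.datasets.load_digits(return_X_y=True)

cv = sklearn.model_selection.StratifiedKFold(
    n_splits=5, shuffle=True, random_state=0)

scores = []
for trn_idx, tst_idx in cv.split(X, y):
    # Train a fresh model on four folds and evaluate it on the fifth.
    model = sklearn.linear_model.LogisticRegression()
    model.fit(X[trn_idx], y[trn_idx])
    y_pred = model.predict(X[tst_idx])
    scores.append(sklearn.metrics.accuracy_score(y[tst_idx], y_pred))

print('Accuracy: {:.2%} +/- {:.2%}'.format(np.mean(scores), np.std(scores)))
```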

## ROC Curve

The ROC curve is defined only for binary problems. That is why we need an "averaging" method similar to the one used with precision/recall. Furthermore, the curve can be made more precise if we are given a probability or score instead of only a hard classification result. This allows us to set a threshold value on that score.

Retrain the logistic regression model, but use `decision_function` instead of `predict` to get a continuous score for each class rather than a hard label:
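
A minimal sketch of this step is shown below. The name `h_tst` matches the variable used in the final plotting cell of this notebook; the split settings are assumptions.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = sklearn.model_selection.train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

model = sklearn.linear_model.LogisticRegression().fit(X_trn, y_trn)

# One real-valued score per (sample, class) pair instead of a single label.
h_tst = model.decision_function(X_tst)
print(h_tst.shape)
print(h_tst[0])
```

The class with the highest score in each row is the one that `predict` would return.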

To compare the result with the reference labels, we first need to convert the labels to a binary indicator matrix. Use `sklearn.preprocessing.label_binarize`:
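
The conversion can be sketched as follows. The name `y_tst_bin` matches the variable used in the final plotting cell; the split settings are assumptions.

```python
import numpy as np
import sklearn.datasets
import sklearn.model_selection
import sklearn.preprocessing

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = sklearn.model_selection.train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# One column per class; each row holds a single 1 at the true class.
y_tst_bin = sklearn.preprocessing.label_binarize(y_tst, classes=np.arange(10))

print(y_tst[:3])
print(y_tst_bin[:3])
```

The resulting matrix has the same shape as the output of `decision_function`, so the two can be compared column by column.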

Now we can simply compute the ROC curve for any class using `sklearn.metrics.roc_curve` and choosing a specific class (column) from the arrays. This function returns 3 values, of which we need only the first two (FPR, TPR). Pass them to the `plot` function, as well as to `sklearn.metrics.auc` to compute the area under the curve. Also draw a line between (0,0) and (1,1) to mark the performance of a random classifier:
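
A sketch for a single class is given below. The class index `cls = 3` is an arbitrary choice for illustration, and `roc_auc` is our own variable name.

```python
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = sklearn.model_selection.train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

model = sklearn.linear_model.LogisticRegression().fit(X_trn, y_trn)
h_tst = model.decision_function(X_tst)
y_tst_bin = sklearn.preprocessing.label_binarize(y_tst, classes=np.arange(10))

cls = 3  # arbitrary class picked for this example
# Treat "is class cls" versus "is any other class" as a binary problem.
fpr, tpr, _ = sklearn.metrics.roc_curve(y_tst_bin[:, cls], h_tst[:, cls])
roc_auc = sklearn.metrics.auc(fpr, tpr)

plt.plot(fpr, tpr, label='class {} (AUC: {:.3f})'.format(cls, roc_auc))
plt.plot([0, 1], [0, 1], 'k--')  # random classifier
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
```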

Finally, you can also draw the averaged curve for all classes (the so-called "micro" method) by calling `ravel()` on the scores and the reference labels.

More info about this: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

In [0]:
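# Flatten both arrays so that every (sample, class) pair counts as one binary decision.
fpr, tpr, _ = sklearn.metrics.roc_curve(y_tst_bin.ravel(), h_tst.ravel())
auc = sklearn.metrics.auc(fpr, tpr)
plot(fpr, tpr)
plot([0, 1], [0, 1], 'k--')  # diagonal: random classifier
xlim(0, 1)
ylim(0, 1.05)
title('ROC curve (AUC: {:.2%})'.format(auc))
xlabel('False positive rate')
ylabel('True positive rate')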