# Classification

### MA755 Machine Learning - 21 Feb 2017

### Dataset: MNIST

- http://yann.lecun.com/exdb/mnist/

In [1]:
import pickle
import matplotlib
import matplotlib.pyplot as plt
import numpy         as np



In [2]:
%matplotlib inline

In [3]:
from sklearn.datasets.mldata import fetch_mldata
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model    import SGDClassifier
from sklearn.linear_model    import LogisticRegression
from sklearn.preprocessing   import LabelBinarizer
from sklearn.metrics         import confusion_matrix
from sklearn.multiclass      import OneVsRestClassifier
from sklearn.multiclass      import OneVsOneClassifier

When running this notebook for the first time change the following cell to a code cell and then run it. 

The two commands in that cell:
- get the MNIST data using `fetch_mldata`
- store the data locally in a pickle file

mnist = fetch_mldata('MNIST original')
pickle.dump(mnist, open( "mnist.p", "wb" ))

Use the following command, which loads the dataset from the pickle file, after you have saved the data in a pickle file with the previous command.

In [4]:
mnist = pickle.load(open( "mnist.p", "rb" ))

In [5]:
print("mnist.data.shape  ", mnist.data.shape)
print("mnist.target.shape", mnist.target.shape)

mnist.data.shape   (70000, 784)
mnist.target.shape (70000,)


Notice that the target variable has `10` distinct values, as expected.

In [6]:
print(np.unique(mnist.target.astype(int)))

[0 1 2 3 4 5 6 7 8 9]


### Train and test datasets

Our goal is to find the best predictive model on the given (mnist) dataset. 

The dataset should be split into a _training dataset_, a _validation dataset_ and a _testing dataset_ where:

- The training dataset is used to create the models
- Each model is evaluated on the validation dataset
- The best model is then evaluated on the testing dataset

The reason behind this splitting of the datasets is that a model shouldn't be evaluated on the same data used to create it. This applies in two cases:

1. Each of the initial models, created from the training dataset, should be evaluated on a different dataset (the validation dataset) than the one used to create it
1. The best model was chosen by comparing it to many other models on the validation dataset, so its validation score is optimistically biased. This selection advantage won't be present when the model is used in practice, so we evaluate this (best) model on unseen data, which is the testing dataset

For more details on this process see [Wikipedia](https://en.wikipedia.org/wiki/Test_set#Validation_set).
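
The three-way split described above can be sketched with plain NumPy. This is only an illustration: the dataset size of `1000` rows and the 60/20/20 proportions are assumptions chosen for the example, not values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000                        # hypothetical dataset size
shuffled = rng.permutation(n_rows)   # shuffle the row indices once

n_test = n_rows // 5                 # 20% held out for the final test
n_valid = n_rows // 5                # 20% used to compare candidate models

test_idx = shuffled[:n_test]
valid_idx = shuffled[n_test:n_test + n_valid]
train_idx = shuffled[n_test + n_valid:]   # remaining 60% used to fit models

print(len(train_idx), len(valid_idx), len(test_idx))
```

Because the three index arrays are slices of one permutation, no row can appear in more than one of the datasets.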

We split the MNIST dataset into training datasets and testing datasets as follows. 

We have a single dataset `mnist.data` of independent variables, which is split into a training dataset and a testing dataset. 

We have three target datasets:
1. `mnist.target` which is an integer variable with values `0` through `9`
1. `mnist_target6` which is a binary variable that is:
    - 1 if the corresponding value in `mnist.target` is a `6` 
    - 0 if the corresponding value in `mnist.target` is not a `6`
1. `mnist_target_1hot` which is ten (10) dummy variables corresponding to each of the ten (10) possible values of `mnist.target`

The three dependent datasets are used to demonstrate prediction on three types of categorical target variables:
- `mnist_target6`: use of binary categorical target variables
- `mnist.target`: use of multi-class categorical variables with integer values
- `mnist_target_1hot`: use of multi-class categorical variables with dummy variables
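
To make the three encodings concrete, here is a small sketch on a made-up array of five labels (the array `labels` is an assumption for illustration, not MNIST data). The dummy variables are built by hand with NumPy broadcasting, one column per distinct label in sorted order.

```python
import numpy as np

labels = np.array([6, 0, 9, 6, 3])           # hypothetical integer target

# Binary target: 1 where the label is 6, 0 otherwise
target_6 = (labels == 6).astype(int)

# Dummy (1hot) target: one column per distinct label, in sorted order
classes = np.unique(labels)                  # array([0, 3, 6, 9])
target_1hot = (labels[:, None] == classes[None, :]).astype(int)

print(target_6)
print(target_1hot)
```

Each row of `target_1hot` contains exactly one `1`, in the column of that row's label.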

These training and testing datasets are created below, but first the creation of dummy variables is demonstrated. 

In [7]:
mnist_target_1hot = LabelBinarizer().fit_transform(mnist.target)
mnist_target_1hot, mnist.target.shape, mnist_target_1hot.shape

(array([[1, 0, 0, ..., 0, 0, 0],
        [1, 0, 0, ..., 0, 0, 0],
        [1, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 1],
        [0, 0, 0, ..., 0, 0, 1],
        [0, 0, 0, ..., 0, 0, 1]]), (70000,), (70000, 10))

Now the datasets are created.

In [8]:
(train_data,        test_data, 
 train_target,      test_target, 
 train_target_6,    test_target_6, 
 train_target_1hot, test_target_1hot
) = train_test_split(mnist.data, 
                     mnist.target,
                     (mnist.target==6).astype(int), 
                     LabelBinarizer().fit_transform(mnist.target),
                     test_size=0.2, 
                     random_state=42)

(train_data.shape,        test_data.shape, 
 train_target.shape,      test_target.shape,
 train_target_6.shape,    test_target_6.shape, 
 train_target_1hot.shape, test_target_1hot.shape
)

((56000, 784),
 (14000, 784),
 (56000,),
 (14000,),
 (56000,),
 (14000,),
 (56000, 10),
 (14000, 10))

Notice that each training dataset has 56,000 rows (80% of the 70,000) and each testing dataset has 14,000 rows (20%), and that the 1hot datasets have the same number of rows but ten columns.

The training datasets are now split into training and validation datasets using __cross-validation__.

The steps in the cross-validation process are:
1. Partition the training dataset into `k` (equally sized) subsets. You can choose `k`; `10` is a common choice (the examples below use `cv=3`).
1. For each subset of the training dataset: 
    1. Designate that subset as the validation dataset; the remaining subsets together form the training dataset
    1. Train your model on the training dataset
    1. Using this model, make predictions on the validation set 
    1. Evaluate the performance of this model by checking the predictions made (with it) on the validation set
1. Return a list containing the performance of each model on its corresponding validation dataset
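
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation: the folds are contiguous and unshuffled, and the "model" is a hypothetical stand-in that always predicts the most common training label. The function names `k_fold_scores`, `fit_majority` and `predict_majority` are made up for this example.

```python
import numpy as np

def k_fold_scores(data, target, k, fit, predict):
    """Return one accuracy score per validation fold."""
    folds = np.array_split(np.arange(len(data)), k)   # step 1: partition
    scores = []
    for i, valid_idx in enumerate(folds):             # step 2: loop over subsets
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(data[train_idx], target[train_idx])
        predictions = predict(model, data[valid_idx])
        scores.append(float(np.mean(predictions == target[valid_idx])))
    return scores                                     # step 3: list of scores

# Hypothetical stand-in model: remember the most common training label
def fit_majority(data, target):
    values, counts = np.unique(target, return_counts=True)
    return values[np.argmax(counts)]

def predict_majority(model, data):
    return np.full(len(data), model)

data = np.arange(12).reshape(12, 1)
target = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1])
scores = k_fold_scores(data, target, k=3, fit=fit_majority, predict=predict_majority)
print(scores)   # [0.75, 0.75, 0.5]
```

Every row is used for validation exactly once, and each score comes from a model that never saw the rows it is scored on.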

The cross-validation procedure is used below in the demonstrations of the different types of categorical target variables. 

### Types of target categorical variables 

Categorical target variables come in the four types listed below. We demonstrate predictions with the first three:
- __binary__: two values (`0` and `1`)
- __multi-class__ (integers): more than two integer values
- __multi-class__ (1hot): two or more binary/dummy variables
- __multi-label__: two or more categorical variables (see [sklearn](http://scikit-learn.org/stable/modules/multiclass.html))

### Classifying binary target variables

Now we walk through the steps of creating a classifier and using cross-validation to check its performance.

First create a classifier. (We use the defaults as the details of the classifier are not important at this point.)

In [9]:
sgd_clf = SGDClassifier()

In [10]:
test_predict = sgd_clf.fit(train_data, train_target_6).predict(test_data)

The `confusion_matrix` function creates a table to compare the two variables:

- the rows represent values of `test_target_6` (in this case `0` and `1`)
- the columns represent values of `test_predict` (in this case `0` and `1`)
- the cells display the number of rows (with those two values)

In [11]:
confusion_matrix(test_target_6, test_predict)

array([[12334,   259],
       [   68,  1339]])
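
As a rough illustration of how such a table is tallied, the sketch below counts (actual, predicted) pairs by hand on six made-up rows. The helper `count_confusions` is hypothetical and assumes the labels are the integers `0` through `n_classes - 1`.

```python
import numpy as np

def count_confusions(actual, predicted, n_classes):
    """Tally rows by (actual value, predicted value)."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        matrix[a, p] += 1        # row = actual value, column = predicted value
    return matrix

actual    = np.array([0, 0, 0, 1, 1, 0])   # made-up target values
predicted = np.array([0, 1, 0, 1, 0, 0])   # made-up predictions

matrix = count_confusions(actual, predicted, n_classes=2)
print(matrix)
# [[3 1]
#  [1 1]]
```

The diagonal cells count correct predictions; the off-diagonal cells count the two kinds of error.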

Use `cross_val_predict` with the `SGDClassifier` to create predictions for every row in the dataset `train_data`. 

In [12]:
train_predict = cross_val_predict(sgd_clf, train_data, train_target_6, cv=3)
train_target_6.shape, train_predict.shape

((56000,), (56000,))

In [13]:
confusion_matrix(train_target_6, train_predict)

array([[49947,   584],
       [  965,  4504]])

The `cross_val_score` function is used to score each of the three (3) validation sets.

In [14]:
cross_val_score(sgd_clf, train_data, train_target_6, cv=3, scoring="accuracy")

array([ 0.97503616,  0.97632185,  0.97878496])
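
Accuracy can also be read off a confusion matrix as the diagonal total divided by the overall total. The sketch below applies this to the matrix printed by `In [13]`. It also computes, for comparison, the accuracy of a hypothetical classifier that always predicts `0` ("not a 6"). The figure differs slightly from the scores above, which is expected if the classifier was refit with a different random initialization on each run (no `random_state` was set).

```python
import numpy as np

# Confusion matrix copied from the output of In [13]
cm = np.array([[49947,  584],
               [  965, 4504]])

accuracy = np.trace(cm) / cm.sum()     # correct predictions / all predictions
baseline = cm[0].sum() / cm.sum()      # share of rows whose true value is 0

print(round(accuracy, 4))   # 0.9723
print(round(baseline, 4))   # 0.9023
```

Since about 90% of the rows are not a `6`, an accuracy near 97% is less impressive than it first appears.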

There are other methods to evaluate (the predictions of) the model, but we will explore them later.

### Multi-class categorical target variables (integers)

In [20]:
sgd_clf = SGDClassifier()

In [21]:
test_predict = sgd_clf.fit(train_data, train_target).predict(test_data)

In [22]:
confusion_matrix(test_target, test_predict)

array([[1309,    0,    9,    3,    2,    8,    4,    3,    7,    4],
       [   0, 1557,    8,    2,    1,    7,    0,    3,    1,    2],
       [  14,   28, 1207,   55,   14,   10,   12,   28,   26,    6],
       [  11,   10,   28, 1266,    4,   58,    4,   31,   11,   11],
       [   2,    4,   10,    1, 1244,    2,    4,    4,    4,   53],
       [  16,   22,   16,   48,   42, 1054,   21,    6,   35,   26],
       [  16,    5,   41,    2,   13,   18, 1305,    4,    3,    0],
       [   6,    6,    9,    5,   16,    5,    2, 1341,    2,   84],
       [  20,  106,   25,   78,   87,   79,   21,   33,  849,   93],
       [   7,    9,    9,   27,  104,   19,    0,   97,    4, 1072]])

Use `cross_val_predict` with the `SGDClassifier` to create predictions for every row in the dataset `train_data`. 

In [23]:
train_predict = cross_val_predict(sgd_clf, train_data, train_target, cv=3)
train_target.shape, train_predict.shape

((56000,), (56000,))

Use the `confusion_matrix` function to compare the target values and the predictions.

In [24]:
confusion_matrix(train_target, train_predict)

array([[5296,    2,   46,   23,   44,   28,   28,    7,   76,    4],
       [   1, 6080,   77,   13,    9,   40,    6,   15,   42,   13],
       [  35,   48, 4947,   72,  123,   15,   62,   73,  197,   18],
       [  33,   39,  311, 4728,   31,  205,   15,   61,  219,   65],
       [   9,   31,   31,   16, 5187,    5,   26,   30,   44,  117],
       [  74,   45,   97,  341,  215, 3630,   85,   34,  439,   67],
       [  42,   44,  184,    7,  146,   58, 4918,   16,   53,    1],
       [  32,   35,  128,   21,  115,   20,    2, 5221,   55,  188],
       [  36,  204,  564,  171,  214,  459,   25,   39, 3665,   57],
       [  35,   37,   62,   99,  658,   61,    1,  489,   80, 4088]])
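
One way to read a table this size is to look for the largest off-diagonal cell, that is, the most frequent mistake. The sketch below does this for the matrix printed above (copied by hand into the array `cm`).

```python
import numpy as np

# Confusion matrix copied from the output of In [24]
cm = np.array([
    [5296,    2,   46,   23,   44,   28,   28,    7,   76,    4],
    [   1, 6080,   77,   13,    9,   40,    6,   15,   42,   13],
    [  35,   48, 4947,   72,  123,   15,   62,   73,  197,   18],
    [  33,   39,  311, 4728,   31,  205,   15,   61,  219,   65],
    [   9,   31,   31,   16, 5187,    5,   26,   30,   44,  117],
    [  74,   45,   97,  341,  215, 3630,   85,   34,  439,   67],
    [  42,   44,  184,    7,  146,   58, 4918,   16,   53,    1],
    [  32,   35,  128,   21,  115,   20,    2, 5221,   55,  188],
    [  36,  204,  564,  171,  214,  459,   25,   39, 3665,   57],
    [  35,   37,   62,   99,  658,   61,    1,  489,   80, 4088]])

errors = cm.copy()
np.fill_diagonal(errors, 0)            # ignore the correct predictions

actual, predicted = np.unravel_index(np.argmax(errors), errors.shape)
print(actual, predicted, errors[actual, predicted])   # 9 4 658
```

In this run the most common mistake was predicting a `4` when the digit was actually a `9`.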

The `cross_val_score` function is used to score each of the three (3) validation sets.

In [25]:
cross_val_score(sgd_clf, train_data, train_target, cv=3, scoring="accuracy")

array([ 0.87059454,  0.85192328,  0.87291042])

### Multi-class categorical target variables (1hot)

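Problem: scikit-learn does __NOT__ treat 1hot vectors as a multi-class target. A two-dimensional array of dummy variables is interpreted as a __multi-label__ target, which is probably why some of the cells below raise an error.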
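
The helper `type_of_target` in `sklearn.utils.multiclass` reports how scikit-learn classifies a target array, which is one way to check this. The sketch below uses small made-up arrays rather than the MNIST targets, and also shows that a 1hot array can be converted back to integer labels with `argmax`, assuming the columns are in sorted class order.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

integer_target = np.array([0, 3, 9, 6])            # made-up labels
binary_target = (integer_target == 6).astype(int)
classes = np.unique(integer_target)                # array([0, 3, 6, 9])
one_hot_target = (integer_target[:, None] == classes[None, :]).astype(int)

print(type_of_target(binary_target))     # binary
print(type_of_target(integer_target))    # multiclass
print(type_of_target(one_hot_target))    # multilabel-indicator

# Recover the integer labels from the 1hot columns
recovered = classes[one_hot_target.argmax(axis=1)]
print(recovered)                         # [0 3 9 6]
```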

In [None]:
sgd_clf = SGDClassifier()

Wrap a `LogisticRegression` classifier in `OneVsRestClassifier`, fit it on the integer target `train_target`, and then predict class probabilities for every row in the dataset `test_data`. 

In [None]:
clf = LogisticRegression() #SGDClassifier(loss='log')
# The result of fit() is a fitted classifier, not an array of predictions
ovr_clf = OneVsRestClassifier(clf).fit(train_data, train_target)

In [None]:
ovr_clf.predict_proba(test_data)

In [None]:
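sgd_clf = SGDClassifier(loss='log')
# OneVsRestClassifier accepts a 1hot (indicator) target; OneVsOneClassifier expects a 1-D target
train_predict_1hot = OneVsRestClassifier(sgd_clf).fit(train_data,
                                                      train_target_1hot).predict(train_data)
train_target_1hot.shape, train_predict_1hot.shape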

Use the `confusion_matrix` function to compare the target values and the predictions.

In [None]:
confusion_matrix(train_target, train_predict)

The `cross_val_score` function is used to score each of the three (3) validation sets.

In [None]:
cross_val_score(sgd_clf, train_data, train_target, cv=3, scoring="accuracy")

In [26]:
from sklearn.utils.testing import all_estimators

estimators = all_estimators()

for name, class_ in estimators:
    if hasattr(class_, 'predict_proba'):
        print(name)

ImportError: No module named 'nose'

`SGDClassifier` fits a logistic regression model when given the parameter `loss='log'`, which also makes `predict_proba` available.

In [None]:
sgd_clf = SGDClassifier(loss='log', random_state=42)
sgd_clf.fit(train_data, train_target)

In [None]:
sgd_clf.predict_proba(train_data)

In [None]:
log_reg = LogisticRegression()
log_reg.fit(train_data, train_target)

In [None]:
test_p = log_reg.predict(test_data)

In [None]:
test_p.shape, test_target.shape, test_data.shape

In [None]:
plt.imshow(test_data[11000].reshape(28, 28), 
           cmap = matplotlib.cm.binary,
           interpolation="nearest")

### Classifying multi-class variables

The `SGDClassifier` can take multi-class categorical variables as integers, and will treat them correctly (as categorical variables).
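
Internally, multi-class problems are commonly reduced to several binary problems. The sketch below only counts the binary problems that the two standard reductions, one-vs-rest and one-vs-one (the strategies behind `OneVsRestClassifier` and `OneVsOneClassifier`), would need for ten digit classes. It trains nothing and is meant as an illustration of the idea, not of scikit-learn's internals.

```python
from itertools import combinations

classes = list(range(10))                        # the digits 0 through 9

# One-vs-rest: one binary problem per class ("is it this digit or not?")
one_vs_rest = [(c, "rest") for c in classes]

# One-vs-one: one binary problem per unordered pair of classes
one_vs_one = list(combinations(classes, 2))

print(len(one_vs_rest))   # 10
print(len(one_vs_one))    # 45
```

The binary target "is a `6` / is not a `6`" used earlier corresponds to one of the ten one-vs-rest problems.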

In [None]:
sgd_clf = SGDClassifier()
log_reg = LogisticRegression()

In [None]:
train_p = cross_val_predict(sgd_clf, train_data, train_target, cv=3)
train_target.shape, train_p.shape

In [None]:
train_data.shape, train_target.shape

In [None]:
train_p = cross_val_predict(sgd_clf, train_data, train_target, cv=3)
train_target.shape, train_p.shape

In [None]:
train_p

In [None]:
train_p = cross_val_predict(log_reg, train_data, train_target, cv=3,  
                            method='predict_proba')
train_target.shape, train_p.shape
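
With `method='predict_proba'` the result has one column of probabilities per class rather than one predicted label per row. A minimal sketch of turning such a table into labels, using a made-up three-class probability array (`proba` and `classes` are assumptions for illustration):

```python
import numpy as np

classes = np.array([0, 1, 2])                # hypothetical class labels, in column order
proba = np.array([[0.70, 0.20, 0.10],        # made-up probabilities, one row per sample
                  [0.10, 0.30, 0.60],
                  [0.25, 0.50, 0.25]])

# Pick, for each row, the class whose column holds the largest probability
predicted = classes[proba.argmax(axis=1)]
print(predicted)    # [0 2 1]
```

Functions such as `confusion_matrix` expect labels like `predicted`, not the probability table itself.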

In [None]:
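# train_p holds class probabilities; take the most probable class (column index 0 through 9) per row
cm_y_p = confusion_matrix(train_target, train_p.argmax(axis=1))
cm_y_p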