## Problem: MNIST digit recognition

The goal of this notebook is to compare logistic regression and Bayesian logistic regression on the problem of digit recognition.

In particular, our goal is to implement a binary classifier that is able to distinguish
the digit 5 from the digit 8. I have implemented logistic regression in Sklearn for you. Your goal is to implement Bayesian logistic regression (similar to the one used for the Titanic dataset) using PyMC3. You have to compute the accuracy of the Bayesian logistic classifier (use the posterior mean for prediction) on the test set and compare it with that of Sklearn.
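
For reference, one common way to write a Bayesian logistic regression model is sketched below. The Gaussian priors and their scale $\sigma$ are assumptions made for illustration; the Titanic example may have used different priors. Here $x_i$ is the vector of 784 pixel values of image $i$ and $y_i \in \{0, 1\}$ is its label.

$$
\begin{aligned}
\alpha &\sim \mathcal{N}(0, \sigma^2), \qquad \beta_j \sim \mathcal{N}(0, \sigma^2), \quad j = 1, \dots, 784,\\
p_i &= \frac{1}{1 + \exp\left(-(\alpha + x_i^\top \beta)\right)},\\
y_i &\sim \text{Bernoulli}(p_i).
\end{aligned}
$$

Each posterior sample of $(\alpha, \beta)$ yields one value of $p_i$ per test instance, which is where the sampled probabilities discussed next come from.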

More importantly, you have to use the uncertainty to highlight the instances in the test set that are more difficult for the Bayesian classifier to classify. Each sampled probability determines a prediction, either class 0 or class 1. A way
to evaluate the uncertainty of the prediction for a given instance in the test set is to compute
the standard deviation of these different predictions.
If the standard deviation is small, the decision was almost always the same class (0 or 1), while if the standard deviation is high, the probabilistic classifier is undecided about how to classify that instance.

1. Plot this uncertainty against the mean predicted probability, as we did for the Titanic dataset.
2. Plot this uncertainty against the predicted probability returned by Sklearn, as we did for the Titanic dataset.

What do you notice?
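
The uncertainty measure described above can be sketched with NumPy alone. This is a minimal illustration, not the notebook's solution: the array `p_samples` is a hypothetical stand-in for the posterior samples of the class-1 probability (in the exercise it would come from the PyMC3 trace), and the values of `mean_logit` are made up so that two instances are easy and two lie near the decision boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for posterior samples of P(y=1 | x).
# In the notebook these would be computed from the PyMC3 trace;
# here they are simulated for four imaginary test instances.
n_samples, n_test = 2000, 4
mean_logit = np.array([-4.0, 4.0, 0.2, -0.1])  # two confident, two borderline
logits = mean_logit + rng.normal(scale=0.5, size=(n_samples, n_test))
p_samples = 1.0 / (1.0 + np.exp(-logits))  # shape (n_samples, n_test)

# Posterior mean probability and the point prediction derived from it
p_mean = p_samples.mean(axis=0)
y_pred = (p_mean > 0.5).astype(int)

# Each sampled probability gives a 0/1 decision; the standard deviation
# of these decisions across samples measures how undecided the model is.
decisions = (p_samples > 0.5).astype(int)
uncertainty = decisions.std(axis=0)

print("mean probability:", p_mean.round(2))
print("prediction:      ", y_pred)
print("uncertainty:     ", uncertainty.round(2))
```

Because each decision is a 0/1 value, its standard deviation is at most 0.5, reached when the samples split evenly between the two classes. Plotting `uncertainty` against `p_mean` is one way to produce the first plot requested above.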


In [1]:
from scipy.io import loadmat
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
%matplotlib inline
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML
import random
import arviz as az
import theano as tt

The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.

You can download it from here:

https://osf.io/jda6s/

In [2]:
mnist = loadmat("../datasets/mnist-original.mat")
mnist_data = mnist["data"].T
mnist_label = mnist["label"][0]

In [3]:
mnist_label

array([0., 0., 0., ..., 9., 9., 9.])

In [4]:
## We only consider two digits to perform binary classification and only use 500 instances
# per digit to make learning faster
np.random.seed(0)
digits = [5,8]
N_per_digit =500
X = []
labels = []
for d in digits:
    imgs = mnist_data[np.where(mnist_label==d)[0],:]
    X.append(imgs[np.random.permutation(imgs.shape[0]),:][0:N_per_digit,:])
    labels.append(np.ones(N_per_digit)*d)
X = np.vstack(X).astype(np.float64)
y = np.hstack(labels)
X = X / 255.0
y = np.where(y==5,1,0)  # digit 5 -> class 1, digit 8 -> class 0

In [83]:
X.shape

(1000, 784)

In [86]:
## Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

In [87]:
X_train.shape

(850, 784)

## Sklearn  Logistic regression

In [92]:
clf = LogisticRegression(random_state=0, C=900, max_iter=2000, solver='lbfgs').fit(X_train, y_train)
y_pred_LR = clf.predict(X_test)
y_pred_prob_LR = clf.predict_proba(X_test)[:,1]  # class 1 probability
print(y_pred_LR[0:3])
# accuracy_score expects (y_true, y_pred); the value is the same either way
print("Accuracy=",accuracy_score(y_test,y_pred_LR))

[0 0 1]
Accuracy= 0.9
