# Handwriting Recognition Using Logistic Regression

## Identify the Problem

Identifying handwritten characters is a common problem in computer vision. For example, the USPS uses algorithms to recognize handwritten addresses on envelopes, which enables automated mail sorting. Solving this problem with traditional strategies (e.g., edge detection) can be challenging. Can we use data science techniques to recognize handwritten digits (0, 1, 2, ..., 9) with a reasonable degree of accuracy?

## Acquire the Data
The MNIST data set is a collection of 70,000 labeled images of handwritten digits: http://yann.lecun.com/exdb/mnist/

We use a pre-processed and pickled version of the dataset made by deeplearning.net. The pickled version can be downloaded from: http://deeplearning.net/data/mnist/mnist.pkl.gz

In [1]:
import os
from sklearn import linear_model, cross_validation, metrics, preprocessing
import pandas as pd
import matplotlib.pyplot as plt
import cPickle
import gzip
import numpy as np

In [2]:
# for convenience, we use a pickled version of the MNIST dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
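
The cell above targets Python 2 (`cPickle`). As a rough sketch for Python 3, assuming the same `mnist.pkl.gz` file: the standard `pickle` module replaces `cPickle`, and because the file was pickled under Python 2, `encoding="latin1"` is typically needed to unpickle the NumPy arrays. The helper name `load_mnist` is hypothetical.

```python
import gzip
import pickle


def load_mnist(path):
    """Load a gzipped pickle holding (train_set, valid_set, test_set)."""
    # The context manager closes the file even if unpickling fails.
    with gzip.open(path, "rb") as f:
        # encoding="latin1" lets Python 3 read arrays pickled by Python 2.
        train_set, valid_set, test_set = pickle.load(f, encoding="latin1")
    return train_set, valid_set, test_set
```

With the file in the working directory, this would be called as `train_set, valid_set, test_set = load_mnist('mnist.pkl.gz')`.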

## Parse, Mine, and Refine the Data

'train_set' is a tuple. The first element is an array where each row represents an image. Each image consists of 28x28 pixels, flattened into a single row of 784 values, where each pixel is represented as a float.

The second element is an array of the corresponding labels for each image.

We can extract a single training example and plot it using pyplot.

<img src='assets\example_image.png'>

In [3]:
# PLOT IMAGE
# use the 'binary' colormap for images
plt.rc('image', cmap='binary')
# reshape first training image into 28x28 format
plt.matshow((train_set[0][0]).reshape(28, 28))
# label of first training image
plt.title("Label: %d" % train_set[1][0])
plt.show()

print(repr(train_set[0]))

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32)
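
To make the layout concrete, here is a small sketch that uses a randomly generated stand-in with the same structure as `train_set`. The names `fake_images`, `fake_labels`, and `fake_set` are hypothetical, and the values are not MNIST data. Each row is a flattened image of 28 x 28 = 784 floats, and reshaping a row recovers the 2-D pixel grid.

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for train_set: (images, labels), with 5 fake images instead of 50,000.
fake_images = rng.rand(5, 28 * 28).astype(np.float32)
fake_labels = rng.randint(0, 10, size=5)
fake_set = (fake_images, fake_labels)

# Unpack the tuple into the image array and the label array.
images, labels = fake_set
# Turn the first row vector back into a 28x28 grid;
# pixel (row, col) lives at flat index row * 28 + col.
first_image = images[0].reshape(28, 28)

print(images.shape)       # (5, 784)
print(first_image.shape)  # (28, 28)
```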


The 70,000 images are split as follows:
* 50,000 are put into the training set
* 10,000 are put into the validation set
* 10,000 are put into the test set

In [4]:
print("Length of training set: %d" % len(train_set[0]))
print("Length of validation set: %d" % len(valid_set[0]))
print("Length of test set: %d" % len(test_set[0]))

Length of training set: 50000
Length of validation set: 10000
Length of test set: 10000
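
Beyond the split sizes, it can be useful to check how many examples of each digit a split contains. A minimal sketch using `np.bincount`, shown on a small made-up label array; with the real data one would pass `train_set[1]` instead. The helper `digit_counts` is hypothetical.

```python
import numpy as np


def digit_counts(labels):
    """Return an array whose i-th entry is the number of labels equal to i."""
    # minlength=10 keeps a slot for every digit, even one that never occurs.
    return np.bincount(np.asarray(labels, dtype=np.int64), minlength=10)


example_labels = [5, 0, 4, 1, 9, 2, 1, 3, 1, 4]  # made-up labels
print(digit_counts(example_labels))
# [1 3 1 1 2 1 0 0 0 1]
```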


## Build a Model

Since this is a classification task, we train a logistic regression model to classify images into one of 10 categories: 0, 1, 2, ..., 9. To start, we fit the model on the first 20,000 training examples.

In [5]:
model = linear_model.LogisticRegression(verbose = 1)
model.fit(train_set[0][0:20000], train_set[1][0:20000])

[LibLinear]

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=1, warm_start=False)
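
The printed parameters show `multi_class='ovr'` (one-vs-rest): one binary classifier is fit per digit, and the prediction is the digit whose classifier gives the highest score. The sketch below illustrates that decision rule with NumPy. The weights are made up for a toy problem with 3 classes and 2 features, and `ovr_predict` is a hypothetical helper, not scikit-learn's implementation. With a fitted model, `coef` and `intercept` would correspond to `model.coef_` and `model.intercept_`.

```python
import numpy as np


def ovr_predict(X, coef, intercept):
    """One-vs-rest prediction: pick the class with the highest score."""
    # scores[i, k] = w_k . x_i + b_k, the linear score of class k for sample i
    scores = X.dot(coef.T) + intercept
    # The sigmoid turns each score into a per-class probability estimate.
    probs = 1.0 / (1.0 + np.exp(-scores))
    # The sigmoid is monotonic, so argmax over probs equals argmax over scores.
    return probs.argmax(axis=1), probs


# Toy weights: 3 classes, 2 features (made up for illustration).
coef = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
intercept = np.zeros(3)
X = np.array([[2.0, 0.5], [0.1, 3.0], [-2.0, -2.0]])

labels, probs = ovr_predict(X, coef, intercept)
print(labels)  # [0 1 2]
```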

In [6]:
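pred = model.predict(test_set[0])
# fraction of test images classified correctly
print(metrics.accuracy_score(test_set[1], pred))
# (predicted, actual) pairs for the first 100 test images
pairs = list(zip(pred[0:100], test_set[1][0:100]))
print(pairs)
# show only the misclassified pairs
for predicted, actual in pairs:
    if predicted != actual:
        print((predicted, actual))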

0.9114
[(7, 7), (2, 2), (1, 1), (0, 0), (4, 4), (1, 1), (4, 4), (9, 9), (6, 5), (9, 9), (0, 0), (6, 6), (9, 9), (0, 0), (1, 1), (5, 5), (9, 9), (7, 7), (5, 3), (4, 4), (9, 9), (6, 6), (6, 6), (5, 5), (4, 4), (0, 0), (7, 7), (4, 4), (0, 0), (1, 1), (3, 3), (1, 1), (3, 3), (4, 4), (7, 7), (2, 2), (7, 7), (1, 1), (3, 2), (1, 1), (1, 1), (7, 7), (4, 4), (2, 2), (3, 3), (5, 5), (1, 1), (2, 2), (4, 4), (4, 4), (6, 6), (3, 3), (5, 5), (5, 5), (6, 6), (0, 0), (4, 4), (1, 1), (9, 9), (5, 5), (7, 7), (2, 8), (9, 9), (2, 3), (7, 7), (4, 4), (7, 6), (4, 4), (3, 3), (0, 0), (7, 7), (0, 0), (2, 2), (9, 9), (1, 1), (7, 7), (3, 3), (1, 2), (9, 9), (7, 7), (7, 7), (6, 6), (2, 2), (7, 7), (8, 8), (4, 4), (7, 7), (3, 3), (6, 6), (1, 1), (3, 3), (6, 6), (9, 9), (3, 3), (1, 1), (4, 4), (1, 1), (7, 7), (6, 6), (9, 9)]
(6, 5)
(5, 3)
(3, 2)
(2, 8)
(2, 3)
(7, 6)
(1, 2)
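
Listing mismatched pairs shows individual errors; a confusion matrix summarizes which digits get confused with which. A minimal NumPy sketch, using the first ten (predicted, actual) pairs from the output above. The helper `count_confusions` is hypothetical and hand-rolled; scikit-learn also offers `metrics.confusion_matrix`.

```python
import numpy as np


def count_confusions(actual, predicted, n_classes=10):
    """Return cm where cm[a, p] counts samples with true label a predicted as p."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when an (a, p) pair repeats.
    np.add.at(cm, (np.asarray(actual), np.asarray(predicted)), 1)
    return cm


# First ten (predicted, actual) pairs printed above.
pairs = [(7, 7), (2, 2), (1, 1), (0, 0), (4, 4),
         (1, 1), (4, 4), (9, 9), (6, 5), (9, 9)]
predicted = [p for p, a in pairs]
actual = [a for p, a in pairs]

cm = count_confusions(actual, predicted)
print(cm[5, 6])                        # 1: one true 5 was predicted as 6
print(float(cm.trace()) / cm.sum())    # 0.9: the diagonal holds correct predictions
```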


## Mean Normalization
Let's normalize the data (rescale each pixel to zero mean and unit variance) to see if that improves performance!

In [7]:
train_X = train_set[0][0:20000]
train_y = train_set[1][0:20000]
X_scaled = preprocessing.scale(train_X)

In [8]:
model_scaled = linear_model.LogisticRegression(verbose = 1)
model_scaled.fit(X_scaled, train_y)

[LibLinear]

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=1, warm_start=False)

In [9]:
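# note: scale() standardizes the test set using its own mean and standard deviation
pred = model_scaled.predict(preprocessing.scale(test_set[0]))
print(metrics.accuracy_score(test_set[1], pred))
# show the misclassified (predicted, actual) pairs among the first 100 test images
for predicted, actual in zip(pred[0:100], test_set[1][0:100]):
    if predicted != actual:
        print((predicted, actual))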

0.9013
(6, 5)
(8, 3)
(3, 2)
(2, 8)
(1, 6)
(1, 2)


Our accuracy went down by about 1 percentage point (from 0.9114 to 0.9013) after we scaled the data. One possible explanation is that the data is sparse: most pixel values are 0, and the non-zero values are small. For many pixels the standard deviation is therefore very close to 0, and dividing by such a small number can cause numeric issues when scaling the data. Note also that the test set was scaled using its own statistics rather than those of the training set. In this case, it is probably better to use the raw data.
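
To illustrate the point about near-constant pixels, the sketch below standardizes a toy column that is zero in 999 of 1,000 rows. It also computes the mean and standard deviation on the training data only and reuses them for the test data, which differs from the cell above, where `preprocessing.scale` is applied to the test set separately. The helper names are hypothetical; scikit-learn's `preprocessing.StandardScaler` offers the same fit-then-transform pattern.

```python
import numpy as np


def fit_standardizer(X):
    """Compute per-column mean and standard deviation from training data."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have std 0; use 1 instead to avoid dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def apply_standardizer(X, mean, std):
    """Scale any data set with statistics learned from the training data."""
    return (X - mean) / std


# Toy training data: column 0 is always 0; column 1 is 0 except one faint pixel.
X_train = np.zeros((1000, 2))
X_train[0, 1] = 0.01
X_test = np.array([[0.0, 0.01]])

mean, std = fit_standardizer(X_train)
X_test_scaled = apply_standardizer(X_test, mean, std)
print(X_test_scaled)  # the faint 0.01 pixel becomes roughly 31.6
```

A barely visible pixel ends up about 31.6 standard deviations from the mean, so rarely lit pixels can dominate the scaled features.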

## Present the Results

Using logistic regression, we correctly recognize images in the test set ~91% of the time. This is fairly accurate, but our model is probably not good enough to be used in critical services like the USPS.

We can also graph the learned coefficients for each digit to get a visualization of what our logistic regression model has learned from the data.

In [11]:
# display coefficients for each digit
plt.rc('image', cmap='binary')
for val, rep in enumerate(model.coef_):
    # reshape the 784 coefficients for this digit into 28x28 format
    plt.matshow(rep.reshape(28, 28))
    # title the plot with the digit these coefficients belong to
    plt.title(val)
    plt.show()

<img src='assets\1.png'>
<img src='assets\2.png'>
<img src='assets\3.png'>
<img src='assets\4.png'>
<img src='assets\5.png'>
<img src='assets\6.png'>
<img src='assets\7.png'>
<img src='assets\8.png'>
<img src='assets\9.png'>
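
Instead of one figure per digit, the ten coefficient maps could be tiled into a single image. This is a sketch under the assumption that the coefficient array has shape (10, 784), as `model.coef_` does for ten classes and 784 pixels. Random numbers stand in for the learned weights, and `tile_coefficients` is a hypothetical helper.

```python
import numpy as np


def tile_coefficients(coef, rows=2, cols=5, side=28):
    """Arrange rows*cols flattened side x side maps into one 2-D mosaic."""
    maps = coef.reshape(rows, cols, side, side)
    # Reorder axes to (rows, side, cols, side) so that the final reshape
    # places each map in its own tile of the mosaic.
    return maps.transpose(0, 2, 1, 3).reshape(rows * side, cols * side)


rng = np.random.RandomState(0)
fake_coef = rng.randn(10, 28 * 28)   # stand-in for model.coef_

mosaic = tile_coefficients(fake_coef)
print(mosaic.shape)  # (56, 140)
# With matplotlib available: plt.matshow(mosaic); plt.show()
```

Digits 0 to 4 fill the top row of tiles and digits 5 to 9 the bottom row.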
