# Handwritten Digits Prediction

<p align="center">
<img src="img/digits.gif">
</p>

The MNIST database contains grey-level images of **handwritten digits**. The original black-and-white images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels because of the anti-aliasing used by the normalization algorithm. Each image was then centered in a 28x28 field by computing the center of mass of its pixels and translating the image so that this point sits at the center of the field.

The database has a training set of 60,000 examples and a test set of 10,000 examples, spread over 10 classes (one for each digit). **The task at hand is to train a model on the 60,000 training images and then measure its classification accuracy on the 10,000 test images**.
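The centering step described above can be illustrated with a small sketch. It is not the code used to build MNIST: the function name `center_by_mass` is hypothetical, and rounding the shift to whole pixels is an assumption made to keep the example short. It uses plain Python lists so that it runs without any dependency.

```python
def center_by_mass(image, size=28):
    """Place a small grey-level image in a size x size field so that its
    center of mass lands (to the nearest pixel) at the center of the field."""
    total = sum(sum(row) for row in image)
    if total == 0:
        raise ValueError("image has no non-zero pixels")

    # Intensity-weighted mean row and column (the center of mass)
    cy = sum(r * v for r, row in enumerate(image) for v in row) / total
    cx = sum(c * v for row in image for c, v in enumerate(row)) / total

    # Whole-pixel shift that moves the center of mass to the field center
    mid = (size - 1) / 2
    dy, dx = round(mid - cy), round(mid - cx)

    field = [[0] * size for _ in range(size)]
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            rr, cc = r + dy, c + dx
            if 0 <= rr < size and 0 <= cc < size:
                field[rr][cc] = v
    return field


# A 20x20 box whose only ink is a 2x2 block in the top-left corner
box = [[0] * 20 for _ in range(20)]
for r in (0, 1):
    for c in (0, 1):
        box[r][c] = 255

centered = center_by_mass(box)
```

After the call, the 2x2 block sits in the middle of the 28x28 field instead of the corner, which is the effect the normalization aims for.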

## 0.0. Imports

In [21]:
import random
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import scikitplot as skplt

from mnist import MNIST
from sklearn.preprocessing import MinMaxScaler

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings( "ignore" )

## 0.1. Helper Functions

In [None]:
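# Plot a confusion matrix of true labels against predicted labels
def confusion_matrix( y, predictions ):
    # Pass the size to scikit-plot directly; calling plt.figure() after
    # plotting would only open a second, empty figure
    skplt.metrics.plot_confusion_matrix( y, predictions, figsize=(12, 8) )
    plt.show()

    return None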

## 0.2. Loading Data

In [1]:
# Load data
mndata = MNIST( 'datasets/images_handwritten_digits' )

image_train, label_train = mndata.load_training( )
image_test, label_test = mndata.load_testing( )

In [4]:
# View an image
index = random.randrange( 0, len( image_train ) ) 
print( mndata.display( image_train[index] ) )


............................
............................
...........@@...............
..........@@@...............
.........@@@@@..............
.........@@@@...............
.........@@@................
........@@@@................
........@@@@................
........@@@.................
........@@@....@@@..........
........@@@@..@@@@@@........
........@@@@.@@@@@@@@.......
........@@@@.@@@@@@@@@......
........@@@@@@@@@.@@@@@.....
.........@@@@@@@@...@@@.....
.........@@@@@@@@...@@@@....
..........@@@@@@@@..@@@@....
..........@@@@@@@@@@@@@@....
...........@@@@@@@@@@@@@....
.............@@@@@@@@@@.....
.................@@@@@......
............................
............................
............................
............................
............................
............................


In [5]:
# Data transformation
image_train = pd.DataFrame( image_train )
image_test = pd.DataFrame( image_test )
label_train = pd.DataFrame( label_train )
label_test = pd.DataFrame( label_test )

In [6]:
# Joining image Dataframes
X = pd.concat( [image_train, image_test], ignore_index=True )

# Joining target Dataframes
y = pd.concat( [label_train, label_test], ignore_index=True )
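Because the training rows are concatenated first, the original split can be recovered later by position, which matters if the model is to be trained on the 60,000 training images and tested on the remaining 10,000. The sketch below shows the idea on a tiny stand-in list; the helper name `split_back` is hypothetical and not part of the notebook.

```python
N_TRAIN = 60_000  # training-set size stated in the introduction


def split_back(combined, n_train):
    """Split a row-wise concatenation back into its leading 'train' part
    and its trailing 'test' part."""
    if not 0 <= n_train <= len(combined):
        raise ValueError("n_train must lie between 0 and len(combined)")
    return combined[:n_train], combined[n_train:]


# Stand-in data: 6 "training" rows followed by 2 "test" rows
train_rows = [[i, i + 1] for i in range(6)]
test_rows = [[100, 101], [200, 201]]
combined = train_rows + test_rows

X_train, X_test = split_back(combined, n_train=len(train_rows))
```

With the DataFrames built above, the equivalent positional slices would be `X.iloc[:N_TRAIN]` and `X.iloc[N_TRAIN:]`, and likewise for `y`.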

## 1.0. Data Description

In [8]:
# Data dimensions
print('Number of rows: ', X.shape[0])
print('Number of columns: ', X.shape[1])

Number of rows:  70000
Number of columns:  784


In [12]:
# Data types
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Columns: 784 entries, 0 to 783
dtypes: int64(784)
memory usage: 418.7 MB


In [20]:
# Check for missing values in each pixel column
X.isnull().sum().sort_values(ascending=False)

783    0
268    0
266    0
265    0
264    0
      ..
520    0
519    0
518    0
517    0
0      0
Length: 784, dtype: int64

## 2.0. Data Preparation

In [22]:
# Rescale every pixel column to the [0, 1] range
X = pd.DataFrame( MinMaxScaler().fit_transform( X ) )
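The scaling step above maps each column to the range [0, 1] using that column's minimum and maximum. The sketch below reproduces the arithmetic in plain Python to make it explicit. It also shows one common variation, which is an assumption here and not what the notebook does: computing the statistics on the training rows only and reusing them for the test rows. The function names `fit_min_max` and `apply_min_max` are hypothetical.

```python
def fit_min_max(rows):
    """Return the (min, max) of every column in a list of equal-length rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, stats):
    """Scale each value to (v - min) / (max - min) using precomputed stats.
    Constant columns (max == min) are mapped to 0.0 to avoid dividing by zero."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, (lo, hi) in zip(row, stats)]
        for row in rows
    ]


# Stand-in pixel data: the second column is constant, like an always-blank border pixel
train = [[0, 0, 10], [255, 0, 20]]
test = [[128, 0, 30]]

stats = fit_min_max(train)                 # statistics from the training rows only
train_scaled = apply_min_max(train, stats)
test_scaled = apply_min_max(test, stats)   # test values may fall outside [0, 1]
```

Fitting on the training rows alone keeps information from the test set out of the preprocessing, at the cost that test values outside the training range are scaled beyond [0, 1].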