# USPS Data Prediction Using Daimensions

This dataset comes from OpenML, which describes the data as "Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service."

## 0. Setup

We'll download the CSV from OpenML, load it into a pandas DataFrame, and split it into training and validation CSV files.

In [1]:
# Load the CSV into a pandas DataFrame and summarize its columns
import pandas as pd
from sklearn.model_selection import train_test_split

dataset_url = 'https://www.openml.org/data/get_csv/19329737/usps.csv'
data = pd.read_csv(dataset_url)
data.describe()

| | int0 | double1 | double2 | double3 | double4 | double5 | double6 | double7 | double8 | double9 | ... | double247 | double248 | double249 | double250 | double251 | double252 | double253 | double254 | double255 | double256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | ... | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 | 9298.0 |
| mean | 4.89202 | -0.9918 | -0.972226 | -0.930421 | -0.852805 | -0.733673 | -0.578239 | -0.391187 | -0.22826 | -0.220399 | ... | -0.292865 | -0.118513 | -0.138364 | -0.357547 | -0.595574 | -0.766226 | -0.874332 | -0.936784 | -0.970873 | -0.989597 |
| std | 3.001086 | 0.050814 | 0.118296 | 0.195285 | 0.284053 | 0.372653 | 0.435317 | 0.452878 | 0.454537 | 0.446069 | ... | 0.483898 | 0.453286 | 0.449512 | 0.456625 | 0.422421 | 0.340464 | 0.254392 | 0.183444 | 0.120247 | 0.058028 |
| min | 1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 25% | 2.0 | -1.0 | -1.0 | -1.0 | -0.999914 | -0.996085 | -0.96311 | -0.787003 | -0.620084 | -0.571667 | ... | -0.742622 | -0.430494 | -0.46596 | -0.770638 | -0.968697 | -0.997447 | -0.999957 | -1.0 | -1.0 | -1.0 |
| 50% | 5.0 | -1.0 | -0.999992 | -0.999608 | -0.991661 | -0.932991 | -0.747495 | -0.447743 | -0.138583 | -0.147614 | ... | -0.2836 | -0.022176 | -0.039908 | -0.392889 | -0.755935 | -0.946957 | -0.993475 | -0.999771 | -0.999996 | -1.0 |
| 75% | 7.0 | -0.999969 | -0.998444 | -0.979572 | -0.861493 | -0.589829 | -0.260331 | 0.000547 | 0.143727 | 0.148815 | ... | 0.153227 | 0.251788 | 0.220543 | 0.033934 | -0.306862 | -0.654382 | -0.885085 | -0.979766 | -0.99804 | -0.999942 |
| max | 10.0 | 0.000308 | 0.332928 | 0.479436 | 0.523534 | 0.52737 | 0.531509 | 0.531319 | 0.531368 | 0.531327 | ... | 0.53138 | 0.531834 | 0.531857 | 0.53183 | 0.531472 | 0.523678 | 0.52467 | 0.470479 | 0.314115 | -0.162598 |


In [2]:
# Split the data into training and validation CSVs; y is the target column (int0)
y = data.int0
X = data.drop('int0', axis=1)
# Hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Write features and target back out together, one CSV per split
pd.concat([X_train, y_train], axis=1).to_csv('usps_train.csv', index=False)
pd.concat([X_test, y_test], axis=1).to_csv('usps_valid.csv', index=False)
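
As a quick sanity check on the split, the row counts can be worked out by hand. The sketch below assumes scikit-learn's documented behavior for a fractional `test_size` (the test count is rounded up and the remainder goes to training); the helper name `split_sizes` is ours, not part of any library.

```python
import math

def split_sizes(n_rows, test_size):
    """Row counts for a fractional test_size: test rounded up, rest is training."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

# 9298 is the row count reported by data.describe() above.
n_train, n_valid = split_sizes(9298, 0.2)
print(f"training rows: {n_train}, validation rows: {n_valid}")
```

These counts can be compared against the instance counts that Brainome reports for each file. Because the split is random, passing a fixed `random_state` to `train_test_split` would make the two CSVs reproducible across runs.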

### Installing Brainome via Pip
Uncomment and run the cell below to install Brainome so that it can be used from the terminal.

In [3]:
# ! pip install brainome

## 1. Get Measurements

We always want to measure our data before building a predictor to make sure we are building the right model. For more information about how to use Daimensions and why we measure the data beforehand, check out the Titanic notebook. Don't forget to pass -target int0 so that Brainome knows which column is the target: in the original dataset, int0 is the first column rather than the last.
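
To check where a target column sits in a CSV before choosing flags, one can inspect the header row. This is a minimal sketch using only the standard library; the helper name `target_index` and the three-column sample are illustrative and are not part of Brainome.

```python
import csv
import io

def target_index(csv_text, target):
    """Return (position, number_of_columns) of `target` in the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return header.index(target), len(header)

# Hypothetical miniature CSV with the target in the first column.
sample = "int0,double1,double2\n7,-1.0,-0.5\n"
pos, n_cols = target_index(sample, "int0")
is_last = pos == n_cols - 1
print(f"target is column {pos} of {n_cols}; last column: {is_last}")
```

For a real file, the same function can be given the text of the first line of the CSV.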

In [4]:
! brainome -measureonly usps_train.csv -target int0


Brainome Table Compiler v1.005-7-prod
Copyright (c) 2019-2021 Brainome, Inc. All Rights Reserved.
Licensed to:                 Alexander Makhratchev  (Evaluation)
Expiration Date:             2021-08-31   29 days left
Maximum File Size:           30 GB
Maximum Instances:           unlimited
Maximum Attributes:          unlimited
Maximum Classes:             unlimited
Connected to:                daimensions.brainome.ai  (local execution)

Command:
    btc -measureonly usps_train.csv -target int0

Start Time:                 08/02/2021, 13:05 PDT

Cleaning...done. 
Splitting into training and validation...done. 
Pre-training measurements...done. 


Pre-training Measurements
Data:
    Input:                      usps_train.csv
    Target Column:              int0
    Number of instances:       7438
    Number of attributes:       256 out of 256
    Number of classes:           10

Class Balance:                
                               9: 7.64%
   

## 2. Build the Predictor

Based on the measurements, Daimensions recommends a neural network (higher expected generalization) and more effort for this dataset, which is what -f NN and -e 5 request in the command below. As before, pass -target int0 to name the target column explicitly.

In [5]:
! brainome -f NN usps_train.csv -o usps_predict.py -target int0 -e 5 --yes


Brainome Table Compiler v1.005-7-prod
Copyright (c) 2019-2021 Brainome, Inc. All Rights Reserved.
Licensed to:                 Alexander Makhratchev  (Evaluation)
Expiration Date:             2021-08-31   29 days left
Maximum File Size:           30 GB
Maximum Instances:           unlimited
Maximum Attributes:          unlimited
Maximum Classes:             unlimited
Connected to:                daimensions.brainome.ai  (local execution)

Command:
    btc -f NN usps_train.csv -o usps_predict.py -target int0 -e 5 --yes

Start Time:                 08/02/2021, 13:06 PDT

Cleaning...done. 
Splitting into training and validation...done. 
Pre-training measurements...done. 


Pre-training Measurements
Data:
    Input:                      usps_train.csv
    Target Column:              int0
    Number of instances:       7438
    Number of attributes:       256 out of 256
    Number of classes:           10

Class Balance:                
                    

## 3. Validate the Model

Now we can validate our model on our test data, a separate set of data that wasn't used for training.

In [6]:
! python3 usps_predict.py -validate usps_valid.csv 

Classifier Type:                    Neural Network
System Type:                        10-way classifier

Accuracy:
    Best-guess accuracy:            15.38%
    Model accuracy:                 93.54% (1740/1860 correct)
    Improvement over best guess:    78.16% (of possible 84.62%)

Model capacity (MEC):               322 bits
Generalization ratio:               17.71 bits/bit

Confusion Matrix:

      Actual |          Predicted           
    --------------------------------------------------
           9 |128   0   3   1   0   0   2   0   6   0
           3 |  1 178   0   0   2   3   1   0   3   2
           2 |  0   0 264   0   0   0   0   1   0   0
          10 |  1   1   0 146   1   4   2   0   1   7
           1 |  0   2   0   1 275   2   3   2   1   0
           5 |  0   5   1   4   2 131   0   3   1   0
           6 |  1   0   0   0   2   1 149   0   5   1
           7 |  1   3   1   0   0   1   1 151   0   0
           4 |  5   5   0   0   0   0   8   0 157   0
           

Hooray! We have validated our model and found that it is 93.54% accurate on the held-out data. The confusion matrix shows how many data points of each actual class (rows) were predicted to be in each class (columns). The diagonal entries are the correctly predicted data points.
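
The figures in the validation report can be re-derived with a little arithmetic. The sketch below copies its numbers from the output above and assumes the confusion matrix columns follow the same class order as the rows, so the i-th entry of the i-th row is the diagonal; the helper name `row_recall` is ours.

```python
# Counts and percentages copied from the validation output.
correct, total = 1740, 1860
model_accuracy = 100 * correct / total         # close to the reported 93.54%

best_guess = 15.38                             # reported best-guess accuracy (%)
reported_accuracy = 93.54                      # reported model accuracy (%)
improvement = reported_accuracy - best_guess   # percentage points gained
possible = 100 - best_guess                    # largest gain available

# First three rows of the confusion matrix: (actual label, counts per predicted class).
rows = [
    (9, [128, 0, 3, 1, 0, 0, 2, 0, 6, 0]),
    (3, [1, 178, 0, 0, 2, 3, 1, 0, 3, 2]),
    (2, [0, 0, 264, 0, 0, 0, 0, 1, 0, 0]),
]

def row_recall(row, diagonal_index):
    """Fraction of one actual class that was predicted correctly."""
    return row[diagonal_index] / sum(row)

print(f"accuracy from counts: {model_accuracy:.2f}%")
print(f"improvement: {improvement:.2f} of a possible {possible:.2f} points")
for i, (label, row) in enumerate(rows):
    print(f"class {label}: {row[i]}/{sum(row)} correct ({100 * row_recall(row, i):.2f}%)")
```

Reading the matrix row by row this way shows which actual classes the model handles best; each off-diagonal entry in a row counts examples of that class assigned to some other class.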