### General Usage of the Notebook: 

The code below reads the downloaded SVHN images, uses the blue per-digit bounding boxes of each image to construct one larger bounding box that encloses all digits in the image, enlarges that box by 30%, and resizes the cropped image to 54 × 54 × 3.
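
As a rough illustration of the box arithmetic described above, the following sketch merges per-digit boxes into one enclosing box and grows it by a fraction of its own width and height. The names `union_box` and `expand_box` and the `frac` parameter are hypothetical and not part of the notebook; the notebook's own cropping code applies its margins differently.

```python
def union_box(lefts, tops, widths, heights):
    """Return (left, top, right, bottom) of the smallest box enclosing all digit boxes."""
    left = min(lefts)
    top = min(tops)
    right = max(l + w for l, w in zip(lefts, widths))
    bottom = max(t + h for t, h in zip(tops, heights))
    return left, top, right, bottom


def expand_box(box, frac=0.30):
    """Grow a box by `frac` of its width and height in total, split evenly per side."""
    left, top, right, bottom = box
    dx = (right - left) * frac / 2
    dy = (bottom - top) * frac / 2
    # Clamp the top-left corner at 0 so the crop never starts outside the image.
    return (max(0, int(round(left - dx))), max(0, int(round(top - dy))),
            int(round(right + dx)), int(round(bottom + dy)))


# Two hypothetical digit boxes side by side.
box = union_box(lefts=[10, 30], tops=[20, 22], widths=[20, 20], heights=[40, 40])
print(box)
print(expand_box(box))
```

Splitting the margin evenly across both sides keeps the digits centred in the crop, which is one reasonable reading of "enlarge by 30%".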

### Data Source & Description:

SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements for data preprocessing and formatting. It is similar in flavor to MNIST but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real-world problem. The images are obtained from house numbers in Google Street View. Three subsets are available: training data (33,402 images), test data (13,068 images), and an extra set.


source: http://ufldl.stanford.edu/housenumbers/

### Final Output:

The final output of this Jupyter notebook is two gzip-compressed pickle files named "trainpkl.gz" and "testpkl.gz". These two datasets are used later in the model training process.


In [2]:
# Total number of images in the test dataset
Test_no = 13068

In [3]:
# Read the test images from the local folder into a list of arrays
# (the list keeps the name X_train because later cells refer to it)
import imageio
import numpy as np

X_train = []
# Change this to the local directory where the unzipped files are stored
directory = "./test/"
img_type = ".png"
# Images are named 1.png ... Test_no.png, so the range must include Test_no
for i in range(1, Test_no + 1):
    path = directory + str(i) + img_type
    # scipy.misc.imread was deprecated in SciPy 1.0.0 and removed in 1.2.0;
    # imageio.imread is the replacement suggested by the SciPy warning
    image = imageio.imread(path)
    X_train.append(image)
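
The loop above builds file names from consecutive integers, and the later cells pair `X_train[i]` with entry `i` of `digitStruct.mat`, so the order of the list matters. If the file list were gathered with `glob` instead, a plain string sort would place `10.png` before `2.png`. A minimal sketch of a numeric sort key, where `numeric_key` is a hypothetical helper and not part of the notebook:

```python
import os


def numeric_key(path):
    """Sort key that orders '2.png' before '10.png' by the integer in the file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem)


# Hypothetical file list in the order a plain string sort would give.
paths = ["./test/1.png", "./test/10.png", "./test/2.png"]
ordered = sorted(paths, key=numeric_key)
print(ordered)
```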


In [4]:
# Function to get the per-digit bounding-box data (the blue boxes) from the .mat file
def get_box_data(index, hdf5_data):
    """Return a dict of lists (height, label, left, top, width) for image `index`."""
    meta_data = dict()
    meta_data['height'] = []
    meta_data['label'] = []
    meta_data['left'] = []
    meta_data['top'] = []
    meta_data['width'] = []

    def print_para(name, obj):
        vals = []
        if obj.shape[0] == 1:
            vals.append(obj[0][0])
        else:
            for k in range(obj.shape[0]):
                vals.append(int(hdf5_data[obj[k][0]][0][0]))
        meta_data[name] = vals

    box = hdf5_data['/digitStruct/bbox'][index]
    hdf5_data[box[0]].visititems(print_para)
    return meta_data

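def get_para(index, hdf5_data):
    """Return the file name of image `index` as stored in the .mat file."""
    name = hdf5_data['/digitStruct/name']
    # Dataset.value was removed in h5py 3.0; indexing with [()] reads the whole dataset
    return ''.join([chr(v[0]) for v in hdf5_data[name[index][0]][()]])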
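
SVHN stores the digit 0 with the label 10, which is why the cropping cell takes each entry of the `label` list modulo 10 before joining the digits into a string. A small sketch of that conversion, with `labels_to_string` as a hypothetical helper name that is not part of the notebook:

```python
def labels_to_string(labels):
    """Join SVHN digit labels into a string, mapping label 10 to digit 0."""
    return ''.join(str(int(x) % 10) for x in labels)


# Labels may arrive as ints or floats depending on how they were read.
print(labels_to_string([1, 10, 5.0]))
```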

In [5]:
# Running this cell produces the cropped images in 54*54*3 format for the test dataset
import h5py
import cv2

# Change the path below to the local directory where the .mat file is stored
mat_data = h5py.File('./test/digitStruct.mat', 'r')
# Number of annotated images in the .mat file
size = mat_data['/digitStruct/name'].size

label_Final = []
Pic_crop_Final = []
# Obtain the label and bounding-box parameters of each image
for i in range(len(X_train)):
    Array = np.array(X_train[i])
    # File name of image i (not used below)
    pic = get_para(i, mat_data)
    box = get_box_data(i, mat_data)
    label = box['label']
    # SVHN stores the digit 0 as label 10, hence the modulo
    label_Final.append(''.join(str(int(x%10)) for x in label))

    # Enlarge the enclosing box. The stated goal is a 30% margin, but note that
    # the factors used here (1.03 and 0.97) scale each edge coordinate by 3%.
    # H is the bottom edge, L the left edge, T the top edge, W the right edge.
    H = int(round((max(box['top'])+max(box['height'])) *1.03))
    L = int(round (min(box['left']) *(0.97)))
    # Clamp at 0 because some images have boxes with negative coordinates
    if L<0:
        L =0
    T = int(round (min(box['top'])*(0.97)))
    if T<0:
        T = 0
    W = int(round((max (box ['left'])+ max(box['width']))*1.03))
    # Crop the image (rows T:H, columns L:W)
    Pic_crop = Array [T:H,L:W]

    # Resize the crop to 54 x 54 and append it to the list
    res_Pic_crop = cv2.resize(Pic_crop, dsize=(54, 54), interpolation=cv2.INTER_CUBIC)
    Pic_crop_Final.append(res_Pic_crop)



In [6]:
# Flatten each image from shape (54, 54, 3) to a vector of length 54*54*3,
# and convert both images and labels into arrays
import numpy as np
c = np.asarray(Pic_crop_Final).reshape(len(Pic_crop_Final), 54*54*3)
d = np.asarray(label_Final)
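
The cropping loop above clamps only the left and top edges at zero; NumPy slicing then truncates a right or bottom edge that extends past the image. Where explicit bounds are preferred, a sketch like the following clamps all four edges. `clamp_box` and its parameters are hypothetical names, not part of the notebook.

```python
def clamp_box(left, top, right, bottom, img_width, img_height):
    """Clamp a (left, top, right, bottom) box so it lies inside the image."""
    left = max(0, min(left, img_width))
    top = max(0, min(top, img_height))
    # Keep right >= left and bottom >= top so the box never has negative size.
    right = max(left, min(right, img_width))
    bottom = max(top, min(bottom, img_height))
    return left, top, right, bottom


# A hypothetical box that overshoots a 100 x 32 image on three sides.
print(clamp_box(-3, 5, 120, 40, img_width=100, img_height=32))
```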

In [7]:
# Create the gzip-compressed pickle file
import six.moves.cPickle as pickle
import os
import gzip

# Change this directory to your own local directory
data_dir = '/Users/Shimeng/Documents/Master_Columbia_2018/E4040 NN and Deep Learning/Final Project/'
output_file = 'testpkl.gz'
out_path = os.path.join(data_dir, output_file)

# Build the output dictionary
out = {}
out['labels'] = d
out['images'] = c

# Save the data; the with-block closes the file even if dump() fails
with gzip.open(out_path, 'wb') as p:
    pickle.dump(out, p)
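
To confirm that the archive can be read back, a sketch along these lines may help. `load_dataset` is a hypothetical helper, not part of the notebook; it assumes the file was written as above, as a gzip-compressed pickle of a dict with `labels` and `images` keys, and it uses the standard-library `pickle` module under Python 3. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.

```python
import gzip
import pickle


def load_dataset(path):
    """Load a gzip-compressed pickle and return its (images, labels) pair."""
    with gzip.open(path, 'rb') as f:
        data = pickle.load(f)
    return data['images'], data['labels']


# Example call; the path is an assumption and should point at your own output file:
# images, labels = load_dataset('testpkl.gz')
```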