## Data Prep

Raw data can be downloaded from the following link:<br>
https://datahack.analyticsvidhya.com/contest/practice-problem-identify-the-digits/


***

- This notebook restructures the data into the required format.
- The hackathon supplies all images in a single folder, whereas some CNN training workflows expect a separate folder for each class.
- Here we split the raw image files into per-class folders.


*** 
- Loading every image into memory at once, together with its upscaled version, can take far more memory than you might expect, even for a dataset as small as MNIST.
- To handle larger datasets, instead of loading all the data in one go, we can use Keras's `flow_from_directory`, which reads the images in batches, so the dataset size is not limited by available memory.
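
To make the batching idea concrete, here is a minimal standard-library sketch of reading a class-per-folder layout one batch at a time. It is not how `flow_from_directory` is implemented (that also decodes, resizes, and can shuffle images); the function name `iter_batches` and its parameters are hypothetical and used only for illustration.

```python
import os
from typing import Iterator, List, Tuple


def iter_batches(root: str, batch_size: int) -> Iterator[List[Tuple[str, int]]]:
    """Yield lists of (file_path, class_index) pairs, batch_size at a time.

    Hypothetical helper: each sub-folder of `root` is treated as one class.
    """
    # Sorting the folder names keeps the class index stable between runs.
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    batch: List[Tuple[str, int]] = []
    for class_index, name in enumerate(class_names):
        class_dir = os.path.join(root, name)
        for filename in sorted(os.listdir(class_dir)):
            batch.append((os.path.join(class_dir, filename), class_index))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final, possibly smaller, batch
        yield batch
```

Something like `for batch in iter_batches('data/train', 32): ...` would then hold only 32 file paths at a time, which is why the per-class folder layout built in this notebook matters.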

In [1]:
# Create folder structure for available data

In [2]:
from shutil import copyfile, move, rmtree

In [3]:
import os
import numpy as np
import pandas as pd

Rename the CSV files to `train.csv` and `test.csv` respectively.

In [5]:
# Location where the downloaded data was extracted
location='/home/arpit/notebooks/data/av/mnist/'

In [6]:
train_raw=pd.read_csv(location + 'train.csv')
test_raw=pd.read_csv(location + 'test.csv')

In [7]:
# Get the list of available classes/ labels
labels = list(train_raw.label.unique())
labels

[4, 9, 1, 7, 3, 2, 6, 0, 8, 5]

In [8]:
# Clean the directories
def clean_data():
    # Remove any previous output; ignore_errors skips folders that do not exist yet
    rmtree('data/train', ignore_errors=True)
    rmtree('data/validation', ignore_errors=True)
    rmtree('data/test', ignore_errors=True)

In [10]:
# Creates the required folder structure for train, validation and test
def create_folders(labels):
    # exist_ok=True makes each call safe to repeat; parent folders are created as needed
    os.makedirs("data/test", exist_ok=True)
    for digit in labels:
        os.makedirs('data/train/' + str(digit), exist_ok=True)
        os.makedirs('data/validation/' + str(digit), exist_ok=True)
    print("Successfully created Folder Structure...!!!")

In [13]:
# Prepares the training dataset
def prepare_data(data, labels, location):
    
    clean_data()
    create_folders(labels)
    
    for index, row in data.iterrows():
        if index % 5000 == 0:
            print("# Files copied: " + str(index))
        
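        # Column 0 holds the image filename, column 1 its label
        filename, label = row.iloc[0], row.iloc[1]
        src_file = location + 'Images/train/' + filename
        dst_file = 'data/train/' + str(label) + '/' + filename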
        copyfile(src_file, dst_file)
        
    print("Process Complete...!!!")

In [14]:
def file_count(directory):
    # Count only the files directly inside `directory`, not those in sub-folders
    path, dirs, files = next(os.walk(directory))
    print(path, len(files))

In [15]:
# Prepare validation dataset
def validation_data(valid_count):
    # Move the first `valid_count` files of every class folder from train to validation
    for digit in os.listdir('data/train'):
        path = 'data/train/' + digit + '/'

        for index, filename in enumerate(os.listdir(path)):
            if index >= valid_count:
                print("Copied " + str(valid_count) + " in data/validation/" + digit + '/')
                break

            src_file = path + filename
            dst_file = 'data/validation/' + digit + '/' + filename

            move(src_file, dst_file)
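
The function above takes whichever files `os.listdir` happens to return first, and that order is arbitrary. If a reproducible random hold-out is preferred, one possible variant is sketched below. It is an alternative, not what this notebook ran; `move_random_sample` and its `seed` parameter are hypothetical names.

```python
import os
import random
from shutil import move


def move_random_sample(src_dir, dst_dir, count, seed=0):
    """Move `count` randomly chosen files from src_dir to dst_dir.

    Hypothetical helper; returns the number of files actually moved.
    """
    os.makedirs(dst_dir, exist_ok=True)
    # Sort first so the same seed picks the same files on every machine.
    filenames = sorted(os.listdir(src_dir))
    rng = random.Random(seed)
    chosen = rng.sample(filenames, min(count, len(filenames)))
    for filename in chosen:
        move(os.path.join(src_dir, filename), os.path.join(dst_dir, filename))
    return len(chosen)
```

It could be called once per digit folder, for example `move_random_sample('data/train/3', 'data/validation/3', 500)`.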


In [16]:
%%time
prepare_data(train_raw, labels, location)

Successfully created Folder Structure...!!!
# Files copied: 0
# Files copied: 5000
# Files copied: 10000
# Files copied: 15000
# Files copied: 20000
# Files copied: 25000
# Files copied: 30000
# Files copied: 35000
# Files copied: 40000
# Files copied: 45000
Process Complete...!!!
CPU times: user 6.53 s, sys: 2.35 s, total: 8.88 s
Wall time: 14.6 s


In [17]:
%%time
validation_data(500)

Copied 500 in data/validation/6/
Copied 500 in data/validation/7/
Copied 500 in data/validation/3/
Copied 500 in data/validation/9/
Copied 500 in data/validation/4/
Copied 500 in data/validation/0/
Copied 500 in data/validation/8/
Copied 500 in data/validation/2/
Copied 500 in data/validation/1/
Copied 500 in data/validation/5/
CPU times: user 51.9 ms, sys: 63.2 ms, total: 115 ms
Wall time: 133 ms
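
A quick sanity check after the split is to count the files in each class folder. The sketch below is one way to do that with the standard library only; `count_per_class` is a hypothetical helper, not part of the original notebook.

```python
import os


def count_per_class(root):
    """Return {class_folder_name: number_of_files} for each sub-folder of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):
            # Count regular files only; nested folders are ignored.
            counts[name] = sum(
                1 for entry in os.scandir(class_dir) if entry.is_file()
            )
    return counts
```

Given the log above, `count_per_class('data/validation')` would be expected to report 500 files for each digit, with the remainder left under `data/train`.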


In [18]:
def prepare_test_data(data, location):
    
    # exist_ok=True makes the call safe if the folder is already there
    os.makedirs("data/test/images", exist_ok=True)

    for index, row in data.iterrows():
        filename = row.iloc[0]  # column 0 holds the image filename
        src_file = location + 'Images/test/' + filename
        dst_file = 'data/test/images/' + filename
        copyfile(src_file, dst_file)
    
    print("Test data created...!!!")

In [20]:
%%time
prepare_test_data(test_raw, location)

Test data created...!!!
CPU times: user 1.92 s, sys: 540 ms, total: 2.46 s
Wall time: 2.47 s
