### Create Sample Datasets
> When the "training" and "testing" datasets used to train neural networks are large, it is good practice to first test the model on a smaller batch of data to ensure that your pipeline is working correctly.  I call these smaller batches of data the "sample training" and "sample testing" datasets.  They are random subsets copied from the larger training and testing datasets.  This notebook contains the code for creating the sample datasets.

As an example, I will use the "Street View House Numbers" (SVHN) datasets, which are already downloaded to my computer:<br>
1) Print the current directory to find out how the folders are structured.<br>
2) Create path variables for the following folders: "train", "test", "sample/train", "sample/validation", and "sample/test".<br>
3) Create the directories "sample/train" and "sample/validation" if they do not exist.<br>
4) Find the number of images in the "train" dataset.<br>
5) Randomly select 120 images from the "train" dataset and copy them to the "sample/train" folder.  Randomly select 20 images from the "sample/train" dataset and "move" these 20 images to the "sample/validation" dataset. Finally, check that the images are correctly located in "sample/train" and "sample/validation".<br>
6) Repeat steps 3 to 5 to create the "sample/test" dataset.<br>
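
The steps above can be condensed into two small helpers. This is only a minimal sketch, not the code used in the cells below: the function names `copy_random_sample` and `move_random_subset` are hypothetical, and the sketch assumes all images are `.png` files sitting directly inside the source folder.

```python
import random
import shutil
from pathlib import Path


def copy_random_sample(src_dir, dst_dir, n):
    """Copy n randomly chosen .png files from src_dir into dst_dir.

    Hypothetical helper: assumes the images sit directly in src_dir.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    names = [p.name for p in src_dir.glob('*.png')]
    chosen = random.sample(names, n)  # raises ValueError if n > len(names)
    for name in chosen:
        shutil.copyfile(str(src_dir / name), str(dst_dir / name))
    return chosen


def move_random_subset(src_dir, dst_dir, names, n):
    """Move n randomly chosen files (given by name) from src_dir to dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    moved = random.sample(names, n)
    for name in moved:
        shutil.move(str(src_dir / name), str(dst_dir / name))
    return moved
```

With the folder layout of this notebook, a call such as `copy_random_sample('data/svhn/train', 'data/svhn/sample/train', 120)` followed by `move_random_subset('data/svhn/sample/train', 'data/svhn/sample/validation', names, 20)` would correspond to step 5.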

In [19]:
# Import requirements.
from shutil import copyfile
from pathlib import Path
import random
import collections
import os
import glob
import pandas as pd
import json

Step 1: Print data dir.

In [2]:
# data/svhn is the directory where the SVHN datasets and related files are located.
PATH = Path('data/svhn')
list(PATH.iterdir())
# On a first run, the "sample" folder does not exist yet.

[PosixPath('data/svhn/sample'),
 PosixPath('data/svhn/models'),
 PosixPath('data/svhn/train'),
 PosixPath('data/svhn/test.json'),
 PosixPath('data/svhn/testing.json'),
 PosixPath('data/svhn/validation'),
 PosixPath('data/svhn/svhn_dataextract_tojson.py'),
 PosixPath('data/svhn/train.json'),
 PosixPath('data/svhn/test'),
 PosixPath('data/svhn/digitStruct.json'),
 PosixPath('data/svhn/digitStruct.mat'),
 PosixPath('data/svhn/extra')]

Step 2: Create path variables.

In [22]:
# type: pathlib.PosixPath
PATH_TRA = PATH/'train'  # Create path->'svhn/train' folder.
PATH_TES = PATH/'test'  # Create path->'svhn/test' folder.
P_SAMPLE = PATH/'sample'  # Create path->'svhn/sample' folder.

Step 3: Create dirs: "sample/train" and "sample/validation".

In [4]:
# Create path->'sample/train' folder.
P_SAMPLE_TRA = P_SAMPLE/'train'
P_SAMPLE_TRA.mkdir(parents=True, exist_ok=True)
# Create path->'sample/validation' folder.
P_SAMPLE_VAL = P_SAMPLE/'validation'
P_SAMPLE_VAL.mkdir(parents=True, exist_ok=True)
# Use Path.exists() to confirm that the folder has been created.
# P_SAMPLE_TRA.exists() # True

Step 4: Find total number of images in "svhn/train" folder.    

In [5]:
# Collect the file names of all .png images under "train".
fn_sample = [p.name for p in PATH_TRA.rglob('*.png')]
print('Total number of images in folder "train" is {}.'.format(len(fn_sample)))

Total number of images in folder "train" is 33402.


Step 5: Randomly select 120 images and copy them to "sample/train", then randomly move 20 of them to "sample/validation", leaving 100 in "sample/train".

In [6]:
# Randomly select 120 image names to copy to sample/train (P_SAMPLE_TRA).
random.shuffle(fn_sample)
fn_sample_train = fn_sample[0:120]
# Notes:
# fn_sample_train # type list
# fn_sample_train[0] # type str
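
Because `random.shuffle` draws from the global random state, each run of the cell above picks a different set of images. If a repeatable sample is wanted, one option is a seeded `random.Random` instance combined with `random.sample`. The sketch below is an illustration only: the helper name `pick_sample`, the seed value, and the demo file names are all made up for the example.

```python
import random


def pick_sample(names, k, seed=42):
    """Return k names chosen without replacement, repeatably for a given seed.

    Hypothetical helper. The names are sorted first so that the result
    does not depend on the order in which the file system lists them.
    """
    rng = random.Random(seed)  # private generator, global state untouched
    return rng.sample(sorted(names), k)


# Made-up file names standing in for the contents of fn_sample.
demo_names = ['{}.png'.format(i) for i in range(1, 1001)]
first = pick_sample(demo_names, 120)
second = pick_sample(demo_names, 120)
print(first == second)  # True: same seed and same input give the same sample
```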

In [7]:
# Copy 120 images to P_SAMPLE_TRA
for name in fn_sample_train:
    copyfile(str(PATH_TRA/name), str(P_SAMPLE_TRA/name))

# Note:
# The pathlib.PosixPath objects are converted to str before being passed to copyfile().

In [8]:
# Count the files in P_SAMPLE_TRA.
collections.Counter(p.suffix for p in P_SAMPLE_TRA.rglob('*.png*'))

Counter({'.png': 220})
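
The count above is 220 rather than the 120 images just selected. One plausible explanation (an assumption, not something the notebook states) is that the folder still held images from an earlier run, since `mkdir(..., exist_ok=True)` keeps existing contents. A minimal sketch of emptying a sample folder before copying is shown below; `reset_dir` is a hypothetical helper, and it deletes everything inside the folder it is given, so it should only be pointed at the sample folders.

```python
import shutil
from pathlib import Path


def reset_dir(path):
    """Delete path and everything in it, then recreate it empty.

    Hypothetical helper: destructive, intended only for sample folders.
    """
    path = Path(path)
    if path.exists():
        shutil.rmtree(str(path))
    path.mkdir(parents=True)
    return path
```

Calling `reset_dir(P_SAMPLE_TRA)` and `reset_dir(P_SAMPLE_VAL)` in place of the `mkdir` calls in step 3 would make each run start from empty folders.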

In [9]:
# Randomly move 20 files from P_SAMPLE_TRA to P_SAMPLE_VAL.
random.shuffle(fn_sample_train)
fn_sample_val = fn_sample_train[0:20]

for name in fn_sample_val:
    os.rename(str(P_SAMPLE_TRA/name), str(P_SAMPLE_VAL/name))

In [10]:
# Verify that 100 images are in P_SAMPLE_TRA and 20 images are in P_SAMPLE_VAL.
val = collections.Counter(p.suffix for p in P_SAMPLE_VAL.rglob('*.png*'))
train = collections.Counter(p.suffix for p in P_SAMPLE_TRA.rglob('*.png*')) 
print('There are {} images in "sample/validation" folder.'.format(val['.png']))
print('There are {} images in "sample/train" folder'.format(train['.png']))

There are 40 images in "sample/validation" folder.
There are 200 images in "sample/train" folder
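
Counting files does not show whether any image ended up in both folders. A stricter check compares the file names in the two folders and reports any overlap. The sketch below is one way to do that; `png_names` and `check_split` are hypothetical names, and it assumes the images sit directly inside each folder.

```python
from pathlib import Path


def png_names(folder):
    """Return the set of .png file names directly inside folder."""
    return {p.name for p in Path(folder).glob('*.png')}


def check_split(train_dir, val_dir):
    """Return (train count, validation count, sorted names found in both)."""
    train = png_names(train_dir)
    val = png_names(val_dir)
    return len(train), len(val), sorted(train & val)
```

For example, `check_split(P_SAMPLE_TRA, P_SAMPLE_VAL)` returns the two counts and a list that should be empty when the split is clean.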


Step 6: Repeat steps 3 to 5, using "svhn/test" as the source and "svhn/sample/test" as the destination.

In [11]:
# Create path->'sample/test' folder.
P_SAMPLE_TES = P_SAMPLE/'test'
P_SAMPLE_TES.mkdir(parents=True, exist_ok=True)

In [12]:
# Find the total number of images in the "svhn/test" folder.
fn_sample_1 = [p.name for p in PATH_TES.rglob('*.png')]

print('Total number of images in folder "test" is {}.'.format(len(fn_sample_1)))

Total number of images in folder "test" is 13068.


In [13]:
# Randomly select 50 images and copy them to sample/test (P_SAMPLE_TES).
random.shuffle(fn_sample_1)
fn_sample_test = fn_sample_1[:50]

for name in fn_sample_test:
    copyfile(str(PATH_TES/name), str(P_SAMPLE_TES/name))

In [14]:
# Count the images in P_SAMPLE_TES.
test = collections.Counter(p.suffix for p in P_SAMPLE_TES.rglob('*.png*'))
print('There are {} images in "sample/test" folder.'.format(test['.png']))

There are 100 images in "sample/test" folder.


### End