### Create CSV Files for Data Inputs
>There are two ways to prepare image data for classification:<br> a) Organize the images into "train", "validation", and "test" folders. Inside "train" and "validation", add one subfolder per category. For example, to classify dolphins and sharks, create subfolders named "dolphin" and "sharks" in both "train" and "validation".<br> <br>b) Suppose you have 20 categories to classify instead of 2. Creating and filling 20 folders for each split is burdensome. Instead, you can describe the dataset in a single CSV file that lists each image's "filename" (or "image_name") and its "label".<br><br>
The code in this notebook creates the CSV files used as data inputs.<br><br>
In the fast.ai library, to load data prepared with method 'a)', use:<br> * **ImageClassifierData.from_paths()**. <br>
On the other hand, to load data prepared with method 'b)', use:<br> * **ImageClassifierData.from_csv()**.
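
To make the relationship between the two layouts concrete, here is a minimal sketch that converts a method 'a)' folder tree into the (filename, label) rows that a method 'b)' CSV would hold. It uses only the standard library. The helper name `folders_to_rows`, the file name `train_labels.csv`, and the tiny temporary dataset are made up for illustration and are not part of fast.ai.

```python
import csv
import tempfile
from pathlib import Path

def folders_to_rows(split_dir):
    """Return sorted (filename, label) rows for a folder-per-category layout."""
    rows = []
    for class_dir in Path(split_dir).iterdir():
        if class_dir.is_dir():
            for img in class_dir.glob('*.png'):
                rows.append((img.name, class_dir.name))
    return sorted(rows)

# Build a throwaway 'train' folder with two categories (empty placeholder files).
root = Path(tempfile.mkdtemp())
for label, names in {'dolphin': ['a.png', 'b.png'], 'sharks': ['c.png']}.items():
    (root / 'train' / label).mkdir(parents=True)
    for name in names:
        (root / 'train' / label / name).touch()

rows = folders_to_rows(root / 'train')

# Write the rows in the two-column form described in method 'b)'.
csv_path = root / 'train_labels.csv'
with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['filename', 'label'])
    writer.writerows(rows)

print(rows)
```

With a CSV, adding a category only means adding rows with a new label; no folders need to be created or moved.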






In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.conv_learner import *

In [3]:
# svhn data location.
PATH = Path('data/svhn') # Path(), pathlib lib
# sample data location 
PATH_S = Path(PATH/'sample') # 'svhn/sample'
S_TRAIN = Path(PATH_S/'train') # 'sample/train'
S_VAL = Path(PATH_S/'validation') # 'sample/validation'
S_TEST = Path(PATH_S/'test') # 'sample/test'
TRAIN_JSON = Path(PATH/'train.json') # 'svhn/train.json'
TEST_JSON = Path(PATH/'test.json') # 'svhn/test.json'
list(PATH.iterdir()) # iterdir(), pathlib lib

[PosixPath('data/svhn/sample'),
 PosixPath('data/svhn/train.csv'),
 PosixPath('data/svhn/sample_train.csv'),
 PosixPath('data/svhn/models'),
 PosixPath('data/svhn/train'),
 PosixPath('data/svhn/test.json'),
 PosixPath('data/svhn/testing.json'),
 PosixPath('data/svhn/validation'),
 PosixPath('data/svhn/sample_test.csv'),
 PosixPath('data/svhn/svhn_dataextract_tojson.py'),
 PosixPath('data/svhn/train.json'),
 PosixPath('data/svhn/test'),
 PosixPath('data/svhn/digitStruct.json'),
 PosixPath('data/svhn/digitStruct.mat'),
 PosixPath('data/svhn/extra'),
 PosixPath('data/svhn/test.csv')]

2. Open "train.json" and "test.json". These two files contain the 'filename', the 'label' values, and the bounding-box dimensions (under the 'boxes' key) for every image in the SVHN dataset. I will extract them so that they can be used later to create the CSV files.

In [4]:
# Open "train.json" and "test.json" file. 
def open_json(path):
    with open(path) as f:
        data = json.load(f) # type list
    return data

In [8]:
# What's inside "train.json"? 'train' is a list of dicts; show the first record.
train = open_json(TRAIN_JSON); train[0]

{'filename': '1.png',
 'boxes': [{'height': 219.0,
   'label': 1.0,
   'left': 246.0,
   'top': 77.0,
   'width': 81.0},
  {'height': 219.0, 'label': 9.0, 'left': 323.0, 'top': 81.0, 'width': 96.0}]}

In [9]:
# What's inside "test.json"?
test = open_json(TEST_JSON);test[0]

{'filename': '1.png',
 'boxes': [{'height': 30.0,
   'label': 5.0,
   'left': 43.0,
   'top': 7.0,
   'width': 19.0}]}

In [10]:
# "train.json"
# Extract the first label value, 1.0, from key 'label' (this image has 2 boxes):
label_1_train = train[0]['boxes'][0]['label']
# Extract the second label value, 9.0:
label_2_train = train[0]['boxes'][1]['label']
# Extract the value of key 'filename' and drop the '.png' extension:
filename_train = Path(train[0]['filename']).stem
print('Train -> filename: {}, first label: {}, second label: {}'.format(filename_train, label_1_train, label_2_train))

Train -> filename: 1, first label: 1.0, second label: 9.0


In [11]:
# "test.json"
# Extract the first label value, 5.0, from key 'label' (this image has only 1 box):
label_1_test = test[0]['boxes'][0]['label']
# Extract the value of key 'filename' and drop the '.png' extension:
filename_test = Path(test[0]['filename']).stem
print('Test -> filename: {}, first label: {}'.format(filename_test, label_1_test))

Test -> filename: 1, first label: 5.0


In [12]:
# Build 'train.csv'
# Create a list of rows in the format ['filename', label_1 + ... + label_n concatenated as a string].
result_train = []
for full in train:
    file_r = [full['filename']]
    box_r = []
    for full_2 in full['boxes']:
        box_r.append(str(int(full_2['label'])))  # 1.0 -> '1'
    file_r.append(''.join(box_r))                # ['1', '9'] -> '19'
    result_train.append(file_r)
print(result_train[0:10]) # print first 10 results

[['1.png', '19'], ['2.png', '23'], ['3.png', '25'], ['4.png', '93'], ['5.png', '31'], ['6.png', '33'], ['7.png', '28'], ['8.png', '744'], ['9.png', '128'], ['10.png', '16']]
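
The same loop is needed again for "test.json", so the per-record step can be factored into a small helper. The sketch below is one way to do that; `record_to_row` and its `map_ten_to_zero` flag are hypothetical names, not part of the original notebook. The flag rests on an assumption: as far as I know, the SVHN annotations store the digit 0 as label 10, and if that holds for these JSON files, plain concatenation turns the digits 2, 1, 0 into '2110'. The default keeps the notebook's behavior.

```python
def record_to_row(record, map_ten_to_zero=False):
    """Turn one record into [filename, concatenated digit labels]."""
    digits = []
    for box in record['boxes']:
        label = int(box['label'])          # 1.0 -> 1
        if map_ten_to_zero and label == 10:
            label = 0                      # assumption: label 10 stands for digit 0
        digits.append(str(label))
    return [record['filename'], ''.join(digits)]

# The first training record shown earlier in this notebook (box geometry omitted).
sample = {'filename': '1.png',
          'boxes': [{'label': 1.0}, {'label': 9.0}]}
# A made-up record to show the effect of the optional mapping.
made_up = {'filename': 'x.png',
           'boxes': [{'label': 2.0}, {'label': 1.0}, {'label': 10.0}]}

row = record_to_row(sample)
row_raw = record_to_row(made_up)
row_mapped = record_to_row(made_up, map_ten_to_zero=True)
print(row, row_raw, row_mapped)
```

With such a helper, `[record_to_row(r) for r in train]` would produce the same list of rows as the loop above.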


In [13]:
# Build a DataFrame with "filename" and "label" columns next to each other.
columns = ['filename', 'label']
df_train = pd.DataFrame.from_records(result_train, columns = columns); df_train.head()

| | filename | label |
|---|---|---|
| 0 | 1.png | 19 |
| 1 | 2.png | 23 |
| 2 | 3.png | 25 |
| 3 | 4.png | 93 |
| 4 | 5.png | 31 |


In [14]:
# Remove ".png" from "filename": keep only the numeric part and convert it to int.
df_train = pd.DataFrame.from_records(result_train, columns = columns)
df_train['filename'] = df_train.filename.str.extract(r'(\d+)', expand = True).astype(int)
df_train.head()

| | filename | label |
|---|---|---|
| 0 | 1 | 19 |
| 1 | 2 | 23 |
| 2 | 3 | 25 |
| 3 | 4 | 93 |
| 4 | 5 | 31 |


In [15]:
# Convert DataFrame to CSV, save it in 'data/svhn':
df_train.to_csv(PATH/'train.csv')
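
By default, `to_csv` also writes the DataFrame index as a first column without a name, and `read_csv` then loads it back as a column called 'Unnamed: 0'. That is why later cells in this notebook filter out columns matching '^Unnamed'. The sketch below shows the round trip with an in-memory buffer and the `index=False` alternative; the two-row DataFrame is a stand-in for the real data.

```python
import io
import pandas as pd

df = pd.DataFrame({'filename': [1, 2], 'label': [19, 23]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written.
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_default)
print(cols_no_index)
```

Either approach works; the notebook keeps the index column in the file and drops it after reading.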

In [16]:
# Build 'test.csv'
# Create a list of rows in the format ['filename', label_1 + ... + label_n concatenated as a string].
result_test = []
for full in test:
    file_r = [full['filename']]
    box_r = []
    for full_2 in full['boxes']:
        box_r.append(str(int(full_2['label'])))  # 5.0 -> '5'
    file_r.append(''.join(box_r))                # join all digit labels of one image
    result_test.append(file_r)
print(result_test[0:10]) # print first 10 results

[['1.png', '5'], ['2.png', '2110'], ['3.png', '6'], ['4.png', '1'], ['5.png', '9'], ['6.png', '1'], ['7.png', '183'], ['8.png', '65'], ['9.png', '144'], ['10.png', '16']]


In [17]:
# Build a DataFrame with "filename" and "label" columns next to each other.
columns = ['filename', 'label']
df = pd.DataFrame.from_records(result_test, columns = columns); df.head()

| | filename | label |
|---|---|---|
| 0 | 1.png | 5 |
| 1 | 2.png | 2110 |
| 2 | 3.png | 6 |
| 3 | 4.png | 1 |
| 4 | 5.png | 9 |


In [18]:
# Remove ".png" from "filename": keep only the numeric part and convert it to int.
df_test = pd.DataFrame.from_records(result_test, columns = columns)
df_test['filename'] = df_test.filename.str.extract(r'(\d+)', expand = True).astype(int)
df_test.head()

| | filename | label |
|---|---|---|
| 0 | 1 | 5 |
| 1 | 2 | 2110 |
| 2 | 3 | 6 |
| 3 | 4 | 1 |
| 4 | 5 | 9 |


In [19]:
# Convert DataFrame to CSV, save it in 'data/svhn':
df_test.to_csv(PATH/'test.csv')

3. From the data in 'train.csv', we create the CSV files 'sample_train.csv' and 'sample_validation.csv' (saved below as 'sample_val.csv'). From 'test.csv', we create 'sample_test.csv'. These three files, all written to 'data/svhn', are used as inputs when working with the sample datasets.

In [20]:
# Function 'folder_inf()' prints the folder path and image count, and returns the numeric file names.

def folder_inf(folder, formato):
    file_names = []
    for i in folder.rglob(formato):
        s = i.name
        num = int(re.findall(r'\b\d+\b', s)[0]) # keep the number, drop ".png"
        file_names.append(num)
    print('Total number of images(type {}) in folder "{}" is "{}".'.format(formato, folder,
                                                                           len(file_names)))
    return file_names
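
As an alternative to the regular expression, `pathlib` already exposes the file name without its extension as `Path.stem`. The sketch below is a hypothetical variant (`image_numbers` is not part of the notebook) that is run on a throwaway folder. It assumes every image is named with digits only, such as '12.png', and it skips anything else.

```python
import tempfile
from pathlib import Path

def image_numbers(folder, pattern='*.png'):
    """Return the sorted integer stems of files matching pattern under folder."""
    return sorted(int(p.stem) for p in Path(folder).rglob(pattern) if p.stem.isdigit())

# Throwaway folder with a nested subfolder and one non-image file.
folder = Path(tempfile.mkdtemp())
(folder / 'nested').mkdir()
for name in ['12.png', '7.png', 'nested/305.png', 'notes.txt']:
    (folder / name).touch()

numbers = image_numbers(folder)
print(numbers)
```

Sorting here is a design choice: `rglob` does not guarantee any order, so sorting makes the result reproducible across runs.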


In [21]:
# 'sample_train_csv' holds the numeric file names (without '.png'); the call also prints the folder path and image count:
sample_train_csv = folder_inf(S_TRAIN, "*png")
sample_train_csv[0:10]

Total number of images(type *png) in folder "data/svhn/sample/train" is "100".


[32934, 14893, 11264, 31961, 21096, 7297, 5827, 24429, 12173, 2984]

In [22]:
# 'sample_validation_csv' holds the numeric file names (without '.png'); the call also prints the folder path and image count:
sample_validation_csv = folder_inf(S_VAL, "*png")
sample_validation_csv[0:10]

Total number of images(type *png) in folder "data/svhn/sample/validation" is "20".


[21505, 12870, 31452, 4044, 13839, 11411, 22649, 23695, 30651, 12558]

In [23]:
# 'sample_test_csv' holds the numeric file names (without '.png'); the call also prints the folder path and image count:
sample_test_csv = folder_inf(S_TEST, "*png")
sample_test_csv[0:10]

Total number of images(type *png) in folder "data/svhn/sample/test" is "50".


[4844, 2907, 8390, 11314, 8860, 9381, 5397, 3217, 7952, 2619]

In [24]:
# We will extract information from 'train.csv' and 'test.csv' to create 'sample_train.csv',
#   'sample_validation.csv' and 'sample_test.csv'
read_train = pd.read_csv(PATH/'train.csv') # read data in 'train.csv'
read_test = pd.read_csv(PATH/'test.csv') # read data in 'test.csv'

In [25]:
# Create 'sample_train.csv'
sample_train_file = (read_train.take(sample_train_csv))
df_s_train = pd.DataFrame(sample_train_file)
df_s_train = df_s_train.loc[:, ~df_s_train.columns.str.contains('^Unnamed')]
df_s_train.head()

| | filename | label |
|---|---|---|
| 32934 | 32935 | 94 |
| 14893 | 14894 | 4 |
| 11264 | 11265 | 342 |
| 31961 | 31962 | 5 |
| 21096 | 21097 | 38 |
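
One thing to watch in this step: `DataFrame.take` selects rows by position, not by the value in the 'filename' column. Filenames start at 1 while positions start at 0, so the output above pairs index 32934 with filename 32935, and each sampled image appears to receive the row of the next image. The same applies to the validation and test samples. The sketch below illustrates the difference on a small stand-in DataFrame (its five rows are the first rows of 'train.csv' shown earlier). Selecting with `isin` is one possible alternative, not necessarily what was intended here.

```python
import pandas as pd

# Stand-in for 'train.csv': row position 0 holds filename 1, and so on.
read_train = pd.DataFrame({'filename': [1, 2, 3, 4, 5],
                           'label': [19, 23, 25, 93, 31]})
wanted = [2, 4]   # numeric names of the sampled images, e.g. 2.png and 4.png

by_position = read_train.take(wanted)                        # rows at positions 2 and 4
by_value = read_train[read_train['filename'].isin(wanted)]   # rows whose filename is 2 or 4

print(by_position['filename'].tolist())
print(by_value['filename'].tolist())
```

Note that `isin` returns rows in the DataFrame's own order rather than in the order of `wanted`, which is harmless here because the notebook sorts by 'filename' before saving.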


In [26]:
# Sorting in ascending order.
df_s_train = df_s_train.sort_values(by = 'filename', ascending = True)
# Save 'sample_train.csv'.
df_s_train.to_csv(PATH/'sample_train.csv')
df_s_train.head()

| | filename | label |
|---|---|---|
| 3 | 4 | 93 |
| 259 | 260 | 36 |
| 473 | 474 | 2367 |
| 1098 | 1099 | 31 |
| 1796 | 1797 | 44 |


In [27]:
# Create 'sample_validation.csv'
sample_validation_file = (read_train.take(sample_validation_csv))
df_s_val = pd.DataFrame(sample_validation_file)
df_s_val = df_s_val.loc[:, ~df_s_val.columns.str.contains('^Unnamed')]
df_s_val.head()

| | filename | label |
|---|---|---|
| 21505 | 21506 | 18 |
| 12870 | 12871 | 1 |
| 31452 | 31453 | 79 |
| 4044 | 4045 | 33 |
| 13839 | 13840 | 3 |


In [28]:
# Sorting in ascending order.
df_s_val = df_s_val.sort_values(by = 'filename', ascending = True)
# Saving 'sample_val.csv'
df_s_val.to_csv(PATH/'sample_val.csv')
df_s_val.head()

| | filename | label |
|---|---|---|
| 4044 | 4045 | 33 |
| 6148 | 6149 | 14 |
| 10981 | 10982 | 78 |
| 11010 | 11011 | 3 |
| 11411 | 11412 | 45 |


In [29]:
# Create 'sample_test.csv'
sample_test_file = (read_test.take(sample_test_csv))
df_s_test = pd.DataFrame(sample_test_file)
df_s_test = df_s_test.loc[:, ~df_s_test.columns.str.contains('^Unnamed')]
df_s_test.head()

| | filename | label |
|---|---|---|
| 4844 | 4845 | 19 |
| 2907 | 2908 | 27 |
| 8390 | 8391 | 1108 |
| 11314 | 11315 | 91 |
| 8860 | 8861 | 81 |


In [30]:
# Sorting in ascending order
df_s_test = df_s_test.sort_values(by = 'filename', ascending = True)
# Saving 'sample_test.csv'
df_s_test.to_csv(PATH/'sample_test.csv')
df_s_test.head()

| | filename | label |
|---|---|---|
| 10 | 11 | 34 |
| 323 | 324 | 55 |
| 356 | 357 | 127 |
| 1164 | 1165 | 8101 |
| 1890 | 1891 | 2 |


### End