# Splitting the dataset and generating required files

### This notebook focuses on splitting the dataset into training and testing sets and on generating the train.txt and test.txt files required for training the object detector

In [1]:
#import libraries
import glob
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

CSV files were already created for the GTSDB and MakeML datasets. A similar file is generated here for the collected images set, based on the annotations made for it. Its main purpose is to help visualize the data distribution.

Define the path to the annotations, create the CSV file and write its header

In [27]:
ann_dir = "collected_images/annotations/"
csv_file = open("gt_collected_yolo.csv", "w")
csv_file.write("file,xcenter,ycenter,width,height,class\n")

40

Read all the annotations and add their information to the CSV file

In [28]:
for filename in glob.glob(ann_dir + "*.txt"):
    # the image shares its base name with the annotation file
    basename = os.path.splitext(os.path.basename(filename))[0]

    # the with block closes each annotation file after it is read
    with open(filename) as ann:
        for line in ann:
            # YOLO annotation line: class xcenter ycenter width height
            cls, xc, yc, w, h = line.split()
            # reorder the fields to match the csv header
            csv_file.write(",".join([basename + ".jpg", xc, yc, w, h, cls]) + "\n")

csv_file.close()
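
Each annotation line holds the class index followed by the box centre, width and height. The conversion above can be sketched as a small helper that also rejects malformed lines. `yolo_line_to_csv_row` is a hypothetical name that is not part of this notebook, and the range check assumes the usual YOLO convention of coordinates normalised to [0, 1].

```python
def yolo_line_to_csv_row(line, image_name):
    """Turn 'class xc yc w h' into 'file,xc,yc,w,h,class' (hypothetical helper)."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    cls = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:])
    # assumption: coordinates are normalised to the image size, as in standard YOLO labels
    if not all(0.0 <= v <= 1.0 for v in (xc, yc, w, h)):
        raise ValueError(f"coordinates outside [0, 1]: {line!r}")
    return f"{image_name},{xc},{yc},{w},{h},{cls}"


row = yolo_line_to_csv_row("5 0.354306 0.471851 0.131175 0.08567\n", "img1.jpg")
print(row)
```

Failing early on a malformed line makes annotation mistakes visible before they reach the CSV file and the plots.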

Now that all the information about the dataset is stored in a CSV file, let's visualize it before proceeding any further

Read the CSV files as dataframes and show the first rows of each one

In [3]:
df_gtsdb = pd.read_csv("gt_GTSDB_yolo.csv")
df_makeml = pd.read_csv("gt_MakeML_yolo.csv")
df_collected = pd.read_csv("gt_collected_yolo.csv")

In [4]:
df_gtsdb.head()

Unnamed: 0,file,xcenter,ycenter,width,height,class
0,00000.jpg,0.584191,0.535625,0.030147,0.04375,4
1,00001.jpg,0.737868,0.5125,0.030147,0.055,5
2,00001.jpg,0.304412,0.65375,0.041176,0.0725,5
3,00001.jpg,0.736765,0.453125,0.042647,0.06875,7
4,00002.jpg,0.697794,0.6675,0.083824,0.145,5


In [5]:
df_makeml.head()

Unnamed: 0,file,xcenter,ycenter,width,height,class
0,road0.png,0.573034,0.3675,0.411985,0.425,0
1,road1.png,0.515,0.607774,0.26,0.770318,0
2,road10.png,0.4375,0.498127,0.345,0.973783,0
3,road100.png,0.4975,0.42987,0.82,0.833766,2
4,road101.png,0.73375,0.5025,0.4925,0.935,2


In [6]:
df_collected.head()

Unnamed: 0,file,xcenter,ycenter,width,height,class
0,img1.jpg,0.353805,0.362483,0.190921,0.127281,0
1,img1.jpg,0.354306,0.471851,0.131175,0.08567,5
2,img10.jpg,0.428883,0.512039,0.441298,0.606905,1
3,img11.jpg,0.501013,0.380305,0.440683,0.309992,1
4,img12.jpg,0.729427,0.393052,0.275521,0.511319,1


Create plots to visualize data distribution before merging the sets and before augmentation

In [None]:
cls_distribution = np.array(df_gtsdb['class'])
plt.hist(cls_distribution, bins = 8)
plt.title("Class distribution - GTSDB")
plt.show()

In [None]:
cls_distribution = np.array(df_makeml['class'])
plt.hist(cls_distribution, bins = 8)
plt.title("Class distribution - MakeML")
plt.show()

In [None]:
cls_distribution = np.array(df_collected['class'])
plt.hist(cls_distribution, bins = 8)
plt.title("Class distribution - Collected Images")
plt.show()
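
`plt.hist` with `bins = 8` spreads its bins between the smallest and the largest class present, so the bars line up with the class indices only when both class 0 and class 7 occur in the dataframe. As a complement to the plots, exact per-class counts could be printed with `np.bincount`. This is a minimal sketch that assumes 8 classes indexed 0 to 7 (inferred from the bin count used above) and uses a tiny made-up dataframe in place of the real CSV files.

```python
import numpy as np
import pandas as pd

NUM_CLASSES = 8  # assumption: classes are indexed 0..7

# illustrative stand-in for df_gtsdb / df_makeml / df_collected
df_example = pd.DataFrame({"class": [2, 2, 5, 7, 2, 4, 5]})

# minlength keeps a slot for classes that never occur
counts = np.bincount(df_example["class"].to_numpy(), minlength=NUM_CLASSES)
for cls, n in enumerate(counts):
    print(f"class {cls}: {n} annotations")
```

The same `counts` array could be passed to `plt.bar(range(NUM_CLASSES), counts)` if a plot with one bar per class index is preferred.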

And let's see how the dataset looked before adding extra images, but after applying the reclassification as described in the report

In [None]:
df = pd.concat([df_gtsdb, df_makeml], ignore_index=True)

cls_distribution = np.array(df['class'])
plt.hist(cls_distribution, bins = 8)
plt.title("Class distribution - GTSDB + MakeML")
plt.show()

It is obvious that the set is unbalanced. However, given the time limitations, it would not be feasible to gather and annotate enough images to balance every class. Therefore, another approach was taken: work on balancing all the classes except class 2, which requires far fewer images (hence the collected_images set), and perform the augmentation on all classes but class 2. Let's see the data distribution of the final set across classes

In [None]:
final = pd.concat([df_gtsdb, df_makeml, df_collected], ignore_index=True)

cls_distribution = np.array(final['class'])
plt.hist(cls_distribution, bins = 8)
plt.title("Class distribution - Final Dataset")
plt.show()

And also see the number of annotated images available in the set

In [10]:
unique_files = final['file'].unique().tolist()
print("Number of images: ", len(unique_files))

Number of images:  1672


Proceeding to split the set into training and testing subsets with a ratio of 60:40

In [None]:
#use this cell to install scikit-learn if needed
#!pip install scikit-learn

In [11]:
from sklearn.model_selection import train_test_split

Get the unique filenames of the images

In [12]:
aux = final[['file']]
aux = aux.drop_duplicates()

Perform the actual split and show the size of the subsets. Note that the results of splitting are different for each individual run, so the files may differ, but the number of images should be consistent

In [13]:
train, test = train_test_split(aux, test_size=0.4)
print("Images for training: ", train.shape[0])
print("Images for testing: ", test.shape[0])

Images for training:  1003
Images for testing:  669
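
The sizes are stable because the test share is rounded up (1672 * 0.4 = 668.8, so 669 images) and the remaining 1003 images go to training. If the same files are also wanted on every run, `train_test_split` accepts a `random_state` seed. A minimal sketch on made-up filenames follows; the seed value 42 is arbitrary and the dataframe only stands in for `aux`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# illustrative stand-in for the `aux` dataframe of unique filenames
aux_example = pd.DataFrame({"file": [f"img{i}.jpg" for i in range(10)]})

# fixing random_state makes the shuffle, and therefore the split, repeatable
train_a, test_a = train_test_split(aux_example, test_size=0.4, random_state=42)
train_b, test_b = train_test_split(aux_example, test_size=0.4, random_state=42)

print(len(train_a), len(test_a))
print(train_a["file"].tolist() == train_b["file"].tolist())
```

A fixed seed keeps the saved CSV files, train.txt and test.txt consistent with each other when the notebook is re-run.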


With the subsets defined, extract the information about each image from the 'final' dataframe for the training and testing sets respectively. Plot the data distribution.

In [21]:
# DataFrame.append was removed in pandas 2.0, so the rows of each image are gathered with pd.concat
train_df = pd.concat([final[final['file'] == file] for file in train['file']], ignore_index=True)
    
train_df.head(15)

Unnamed: 0,file,xcenter,ycenter,width,height,class
0,road274.png,0.641667,0.7525,0.183333,0.145,2.0
1,road770.png,0.515,0.4475,0.11,0.075,2.0
2,00653.jpg,0.586029,0.536875,0.016176,0.02875,5.0
3,00653.jpg,0.584926,0.494375,0.022794,0.03875,7.0
4,00430.jpg,0.712132,0.56375,0.058088,0.0925,7.0
5,road52.png,0.48,0.66625,0.68,0.5025,1.0
6,road28.png,0.44875,0.078603,0.0425,0.139738,0.0
7,road28.png,0.705,0.320961,0.045,0.144105,0.0
8,road28.png,0.665,0.329694,0.035,0.135371,0.0
9,00881.jpg,0.845221,0.724375,0.027206,0.04625,2.0


In [22]:
# DataFrame.append was removed in pandas 2.0, so the rows of each image are gathered with pd.concat
test_df = pd.concat([final[final['file'] == file] for file in test['file']], ignore_index=True)
    
test_df.head(15)

Unnamed: 0,file,xcenter,ycenter,width,height,class
0,road417.png,0.478333,0.49125,0.116667,0.0875,2.0
1,road636.png,0.285,0.15375,0.223333,0.1625,1.0
2,00179.jpg,0.512132,0.538125,0.015441,0.02375,7.0
3,road353.png,0.441667,0.4925,0.196667,0.155,2.0
4,road360.png,0.323333,0.515,0.146667,0.11,2.0
5,00373.jpg,0.131985,0.665625,0.022794,0.03875,2.0
6,00373.jpg,0.614706,0.66,0.025,0.0425,2.0
7,00443.jpg,0.611765,0.705,0.022059,0.035,4.0
8,00443.jpg,0.6125,0.735625,0.014706,0.02375,2.0
9,00443.jpg,0.357353,0.705625,0.023529,0.03375,4.0
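
The same row selection could also be written without a loop by using a boolean mask, and it is worth confirming that no image ends up in both subsets, since a shared image would leak test data into training. The sketch below uses small made-up dataframes standing in for `final`, `train` and `test`. Note that, unlike the loop, the mask keeps the row order of the source dataframe.

```python
import pandas as pd

# illustrative stand-ins for `final`, train['file'] and test['file']
final_example = pd.DataFrame({
    "file": ["a.jpg", "a.jpg", "b.jpg", "c.jpg", "c.jpg", "d.jpg"],
    "class": [5, 7, 2, 0, 0, 4],
})
train_files_example = pd.Series(["c.jpg", "a.jpg", "d.jpg"])
test_files_example = pd.Series(["b.jpg"])

# keep every annotation whose image belongs to the subset
train_rows = final_example[final_example["file"].isin(train_files_example)].reset_index(drop=True)
test_rows = final_example[final_example["file"].isin(test_files_example)].reset_index(drop=True)

# an empty intersection means no image is shared between the subsets
overlap = set(train_rows["file"]) & set(test_rows["file"])
print(len(train_rows), len(test_rows), overlap)
```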


In [None]:
cls_distribution = np.array(train_df['class'])
plt.hist(cls_distribution, bins = 8)
plt.title("Class distribution - Train Set (Not Augmented)")
plt.show()

In [None]:
cls_distribution = np.array(test_df['class'])
plt.hist(cls_distribution, bins = 8)
plt.title("Class distribution - Test Set")
plt.show()

Save this information to CSV files in case it is needed later

In [19]:
train_df.to_csv("gt_train_not_augmented.csv", index=False)
test_df.to_csv("gt_test.csv", index=False)

Augment all classes except speed limits (class 2)

In [56]:
to_augment = train_df[train_df['class'] != 2]
to_augment = to_augment.reset_index(drop=True)
aug_files = to_augment['file'].unique().tolist()

In [57]:
aug = pd.DataFrame({'file':[], 'xcenter':[], 'ycenter':[], 'width':[], 'height':[], 'class':[]})

In [60]:
index = 0
suffixes = ["_dark", "_bright", "_blur", "_blur2"]
for file in aug_files:
    # copy every annotation of the image, since each augmented copy keeps the same boxes
    for i in train_df[train_df['file'] == file].index:
        base, ext = os.path.splitext(train_df['file'][i])
        box = [train_df['xcenter'][i], train_df['ycenter'][i], train_df['width'][i], train_df['height'][i], train_df['class'][i]]
        # one new row per augmented copy of the image
        for suffix in suffixes:
            aug.loc[index] = [base + suffix + ext] + box
            index += 1
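
The box coordinates are copied unchanged for every augmented copy, which is valid as long as the augmentations only alter pixel values and not geometry. The image processing itself is not shown in this notebook. The sketch below is one possible way to produce darker, brighter and blurred variants with plain NumPy, using made-up factors and a random array standing in for an image. The function names and parameters are illustrative assumptions, not the method actually used for this dataset.

```python
import numpy as np


def scale_brightness(img, factor):
    """Multiply pixel values by factor and clip to the valid 8-bit range."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)


def box_blur(img, k=3):
    """Average each pixel with its k x k neighbourhood (edges are padded)."""
    pad = k // 2
    padded = np.pad(img.astype(np.float32), ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)


# random array standing in for a height x width x channels image
img = np.random.default_rng(0).integers(0, 256, size=(32, 48, 3), dtype=np.uint8)

variants = {
    "_dark": scale_brightness(img, 0.5),    # assumed factor
    "_bright": scale_brightness(img, 1.5),  # assumed factor
    "_blur": box_blur(img, 3),
    "_blur2": box_blur(img, 5),
}
for suffix, out in variants.items():
    print(suffix, out.shape)
```

Every variant keeps the shape of the input, so normalised box coordinates remain valid without any recalculation.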

Save the final training set to a CSV file in case it is needed later

In [64]:
final_train = pd.concat([aug, train_df], ignore_index=True)
final_train.to_csv("gt_train_augmented.csv", index=False)

Plot the distribution after augmentation

In [None]:
cls_distribution = np.array(final_train['class'])
plt.hist(cls_distribution, bins = 8)
plt.title("Class distribution - Train Set (Augmented)")
plt.show()

### Generate train.txt and test.txt files

Specify the path to the directory of the files (assuming all images will be stored in the same folder)

In [65]:
images_dir = 'data/images/'

For each unique file in the final_train and test_df dataframes, add the corresponding line to train.txt and test.txt respectively

In [66]:
train_files = final_train['file'].unique().tolist()
test_files = test_df['file'].unique().tolist()

# the with blocks close the files even if a write fails
with open("train.txt", "w") as train_out:
    for file in train_files:
        train_out.write(images_dir + file + "\n")

with open("test.txt", "w") as test_out:
    for file in test_files:
        test_out.write(images_dir + file + "\n")
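
Training can fail if a listed image is missing, for example when an augmented copy was never written to the images folder. Below is a hedged sketch of a check that reads a list file back and reports the paths that do not exist on disk. `find_missing_images` is a hypothetical helper, and the temporary directory only stands in for the real data folder.

```python
import os
import tempfile


def find_missing_images(list_path):
    """Return the paths listed in list_path that do not exist (hypothetical helper)."""
    with open(list_path) as f:
        paths = [line.strip() for line in f if line.strip()]
    return [p for p in paths if not os.path.exists(p)]


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "img1.jpg")
    absent = os.path.join(tmp, "img1_dark.jpg")
    open(present, "wb").close()  # empty placeholder file

    list_file = os.path.join(tmp, "train.txt")
    with open(list_file, "w") as out:
        out.write(present + "\n" + absent + "\n")

    missing = find_missing_images(list_file)
    print(len(missing), "missing file(s)")
```

On the real data the call would be `find_missing_images("train.txt")`, run from the directory against which the paths in the file are resolved.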