# Splitting train, validation and test sets

- After collecting the dataset images for the classes with_mask and without_mask, and before training the YOLO network, we need to partition the dataset into train, validation and test sets. The first idea would be to shuffle the whole dataset and assign random partitions to each set. However, the detector is trained on 4 different sources of data (Kaggle, RMDF, PASCAL and scraped images), and not all images come from the same distribution: some datasets contain a larger share of high-quality images taken with professional cameras, while others contain noisy, low-quality images in which people's faces are small. For that reason each source is split separately, so that every source is represented in each set.

- The default path for storing the train, validation and test split files is: **'data/voc2012_raw/VOCdevkit/VOC2012/ImageSets/Main'**.

- We will partition our data into approximately 70% train and 30% validation, plus a small test set. The total number of images in our dataset is currently 2749. To generalize better we would need to collect many more labeled images for our classes.

- The following steps are the ones I performed to obtain the final train, validation and test partitions. Other dataset partitions were tried, but they resulted in messy, inaccurate predictions. The final split files are stored in the data/voc2012_raw/VOCdevkit/VOC2012/ImageSets/Main directory.
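The idea of splitting each source on its own, instead of shuffling everything together, can be sketched as follows. This is only a minimal illustration, not the exact procedure used in the cells below: the helper `split_source`, the toy file names, the counts and the fixed seed are all assumptions.

```python
import random

def split_source(names, n_train, n_val, seed=0):
    """Shuffle one source's sample names and cut them into train/val/test.

    Hypothetical helper: whatever is left after train and val goes to test.
    """
    names = list(names)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(names)  # local RNG keeps the split reproducible
    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:]
    return train, val, test

# Toy stand-ins for two sources with different image quality
source_a = ['a_%03d' % i for i in range(100)]
source_b = ['b_%03d' % i for i in range(50)]

train_a, val_a, test_a = split_source(source_a, n_train=70, n_val=25)
train_b, val_b, test_b = split_source(source_b, n_train=35, n_val=12)

# Every final split now contains samples from both sources
samples_train = train_a + train_b
samples_val = val_a + val_b
samples_test = test_a + test_b
print(len(samples_train), len(samples_val), len(samples_test))
```

Splitting per source and merging afterwards guarantees that no source is missing from the validation or test set, which a single global shuffle cannot promise for small sets.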

# Reading dataset samples

We read the filenames of all samples and store them in different lists depending on their class or dataset:

In [7]:
import os
import random

#Path for all images from our dataset
my_dataset_and_voc_path = 'data/voc2012_raw/VOCdevkit/VOC2012/JPEGImages'

In [8]:
mask_samples_internet_and_rmdf = [] #samples from RMDF dataset and scraped data
mask_samples_kaggle = [] #samples from kaggle dataset
no_mask_samples_pascal =  [] #samples from PASCAL dataset


#Reading all image filenames on the my_dataset_and_voc_path directory
for filename in os.listdir(my_dataset_and_voc_path):
    name = filename.split('.')[0]
    if name[:4]=='2020': #Filenames from  RMDF dataset and scraped data
        mask_samples_internet_and_rmdf.append(name)
    elif name[:5]=='makss': #Filenames from kaggle dataset
        mask_samples_kaggle.append(name)
    else: #samples from PASCAL dataset starting with 2008 or 2007
        no_mask_samples_pascal.append(name)

In [9]:
len(no_mask_samples_pascal), len(mask_samples_internet_and_rmdf), len(mask_samples_kaggle)

(1017, 879, 853)
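The three counts add up to 1017 + 879 + 853 = 2749, the total dataset size. The prefix rule itself can be tried on a few names before running it over the whole directory. The sketch below is an illustration only: the function name `source_of` and the file extensions in the examples are assumptions.

```python
def source_of(filename):
    """Return the source a file belongs to, based on its name prefix."""
    name = filename.split('.')[0]
    if name.startswith('2020'):
        return 'internet_and_rmdf'  # RMDF dataset and scraped data
    if name.startswith('makss'):
        return 'kaggle'
    return 'pascal'  # everything else, e.g. names starting with 2007 or 2008

# Example names; the extensions are made up for the illustration
examples = ['2020_112.jpg', 'maksssksksss758.png', '2008_002718.jpg']
print([source_of(f) for f in examples])
```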

Randomly shuffle the samples in every list:

In [10]:
random.shuffle(no_mask_samples_pascal)
random.shuffle(mask_samples_internet_and_rmdf)
random.shuffle(mask_samples_kaggle)
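Because `random.shuffle` gives a different order on every run, the splits change each time the notebook is executed. If reproducible splits are wanted, one option is to shuffle with a seeded generator. A small sketch, where the seed value 42 and the toy sample names are arbitrary assumptions:

```python
import random

samples = ['sample_%d' % i for i in range(10)]  # toy names, not real files

first = list(samples)
random.Random(42).shuffle(first)   # 42 is an arbitrary seed

second = list(samples)
random.Random(42).shuffle(second)  # same seed, same order

print(first == second)
```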

# Splitting PASCAL VOC dataset

In [1]:
import os
import random

- First, we are going to split our subset of images from the PASCAL VOC dataset, contained in the **no_mask_samples_pascal** list. As a reminder, the PASCAL VOC dataset has images for several object classes, but we picked a small subset containing the 'person' class to adapt it to our problem. Each class in the PASCAL dataset comes with txt files that contain the partitions for training and validation.


- We will first read and store all validation samples of the 'person' class that came with the original dataset, and then intersect this list with our subset. That way we use the same validation samples indicated by the PASCAL dataset.

In [2]:
#Extracting images from person validation file

In [3]:
person_val_file = 'data/voc2012_raw/VOCdevkit/VOC2012/ImageSets/Main/person_val.txt'

In [6]:
#Each line holds an image id and a label; '1' marks images that contain a person
with open(person_val_file,"r") as person_val:
    lines = [line.split() for line in person_val]

#Storing all validation samples for our subset as a list
val_samples_person = []

for line in lines:
    if line[1]=='1':
        val_samples_person.append(line[0])
        
print("Printing first 5 validation samples: ",val_samples_person[:5])        


Printing first 5 validation samples:  ['2008_000003', '2008_000026', '2008_000032', '2008_000034', '2008_000051']
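For reference, each line of a PASCAL class file such as person_val.txt holds an image id followed by a label: 1 when the class is present, -1 when it is absent and 0 for difficult cases. The filtering step can be sketched on made-up lines as follows; the helper name `positive_samples` and the toy lines are assumptions.

```python
def positive_samples(lines):
    """Return the image ids whose label is '1' in a VOC class split file."""
    positives = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines instead of failing on them
        if len(fields) == 2 and fields[1] == '1':
            positives.append(fields[0])
    return positives

# Made-up lines in the same layout as person_val.txt
toy_lines = ['2008_000001 -1\n', '2008_000003  1\n', '2008_000008  0\n', '\n']
print(positive_samples(toy_lines))
```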


Finding intersection between our PASCAL samples and validation samples:

In [8]:
train = []
val = []

for sample in no_mask_samples_pascal:
    #If a sample in our subset is present on the original validation txt, add it to our list 
    if sample in val_samples_person:
        val.append(sample)
    else:
        train.append(sample)

In [9]:
len(val), len(train)

(536, 481)
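Checking `sample in val_samples_person` against a list scans the whole list for every sample. With a few thousand names this is fast enough, but converting the validation ids to a set is a common alternative. A sketch on toy ids (the names below are made up for the example):

```python
ours = ['2008_000003', '2008_000010', '2008_000026', '2008_000040']  # toy subset
official_val = {'2008_000003', '2008_000026', '2008_000032'}         # toy validation ids, as a set

# Set membership is a constant-time lookup on average
val = [s for s in ours if s in official_val]
train = [s for s in ours if s not in official_val]
print(val, train)
```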

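Approximately 30% of our PASCAL samples will be used for validation (306). Since the official validation list covers 536 of our 1017 samples, the remaining 230 validation samples are moved to the training set (701 in total), and the last 10 samples are kept for testing: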

In [10]:
val_pascal = val[:306]
train_pascal = val[306:] + train[:-10]
test_pascal = train[-10:]

In [11]:
len(val_pascal), len(train_pascal), len(test_pascal)

(306, 701, 10)
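The slicing above can be replayed with integer stand-ins to see where the three counts come from and to confirm that no sample lands in two sets. The integer ids are an assumption made only for this check:

```python
val = list(range(536))          # stand-in for the 536 official validation samples
train = list(range(536, 1017))  # stand-in for the remaining 481 samples

val_pascal = val[:306]                  # 306 validation samples
train_pascal = val[306:] + train[:-10]  # 230 + 471 = 701 training samples
test_pascal = train[-10:]               # last 10 samples for testing

print(len(val_pascal), len(train_pascal), len(test_pascal))
```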

# Splitting RMDF and scraped dataset

Now we split the samples from the RMDF and scraped datasets. There are 879 samples in total. We use almost 80% of them for training (700) and the rest for validation (165) and test (14):

In [78]:
train_internet_and_rmdf = mask_samples_internet_and_rmdf[:700]
val_internet_and_rmdf = mask_samples_internet_and_rmdf[700:700+165]
test_internet_and_rmdf = mask_samples_internet_and_rmdf[700+165:]

# Splitting Kaggle dataset

For the Kaggle dataset (853 samples), approximately 59% (500) will be used for training and the rest for validation (338) and testing (15):

In [None]:
train_kaggle = mask_samples_kaggle[:500]
val_kaggle = mask_samples_kaggle[500:500+338]
test_kaggle = mask_samples_kaggle[500+338:]


In [79]:
len(train_pascal),len(train_internet_and_rmdf),len(train_kaggle)

(701, 700, 500)

In [80]:
len(val_pascal),len(val_internet_and_rmdf),len(val_kaggle)

(306, 165, 338)

In [81]:
len(test_pascal),len(test_internet_and_rmdf),len(test_kaggle)

(10, 14, 15)
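Using the counts printed above, the share of each source that goes to every set can be worked out directly. This is plain arithmetic on the reported numbers; the dictionary layout is just one way to organize it.

```python
# (train, val, test) counts per source, copied from the outputs above
counts = {
    'pascal': (701, 306, 10),
    'internet_and_rmdf': (700, 165, 14),
    'kaggle': (500, 338, 15),
}

for source, (n_train, n_val, n_test) in counts.items():
    total = n_train + n_val + n_test
    print('%s: train %.1f%%, val %.1f%%, test %.1f%%' % (
        source, 100 * n_train / total, 100 * n_val / total, 100 * n_test / total))

# Totals per set across all sources
totals = [sum(c[i] for c in counts.values()) for i in range(3)]
print(totals, sum(totals))
```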

# Final dataset splits

Now we merge the training, validation and test samples from every dataset to obtain our final splits:

In [83]:
samples_train = train_pascal + train_internet_and_rmdf + train_kaggle
samples_val = val_pascal + val_internet_and_rmdf + val_kaggle
samples_test = test_pascal + test_internet_and_rmdf + test_kaggle

In [84]:
samples_train[:5]

['2008_000278', '2008_002042', '2008_002069', '2008_001356', '2008_001275']

Randomly shuffle all samples:

In [85]:
random.shuffle(samples_train)
random.shuffle(samples_val)
random.shuffle(samples_test)

In [86]:
samples_train[:5]

['2008_002718',
 '2020_02_285',
 '2020_112',
 'maksssksksss758',
 'maksssksksss389']

These are the final split lengths:

In [87]:
len(samples_train),len(samples_val),len(samples_test)

(1901, 809, 39)
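Matching lengths do not prove that the splits are disjoint, since an off-by-one slice can place the same sample in two sets. A quick check along these lines can catch that. The helper `check_disjoint` is hypothetical and the sample names are toy values:

```python
from itertools import combinations

def check_disjoint(**splits):
    """Raise ValueError if any sample name appears in more than one split."""
    for (name_a, a), (name_b, b) in combinations(splits.items(), 2):
        overlap = set(a) & set(b)
        if overlap:
            raise ValueError('%s and %s share %d samples'
                             % (name_a, name_b, len(overlap)))

check_disjoint(train=['a', 'b'], val=['c'], test=['d'])  # passes silently

try:
    check_disjoint(train=['a', 'b'], val=['b'], test=['d'])
except ValueError as err:
    print(err)
```

In the notebook the call would be `check_disjoint(train=samples_train, val=samples_val, test=samples_test)`.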

Finally, given our splits, we store the corresponding txt files that will be used for training the model. The model doesn't read information directly from the XML files; instead it reads it from TFRecord files, which are generated from these txt files.

In [82]:
#Path for storing txt files with the dataset splits
main = 'data/voc2012_raw/VOCdevkit/VOC2012/ImageSets/Main/'

In [88]:
#Writing train, validation and test files

label = 1 #Every listed sample is written with label 1
splits = {'train': samples_train, 'val': samples_val, 'test': samples_test}

for split_name, samples in splits.items():
    #One "name label" entry per line, without a trailing newline
    with open(main + split_name + '.txt',"w") as split_file:
        split_file.write('\n'.join(name+' '+str(label) for name in samples))
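To confirm that a split file has the expected layout, it can be written and read back. The sketch below does this in a temporary directory so it does not touch the real split files; the helper names `write_split` and `read_split` are assumptions.

```python
import os
import tempfile

def write_split(path, samples, label=1):
    """Write one '<name> <label>' entry per line, without a trailing newline."""
    with open(path, 'w') as f:
        f.write('\n'.join('%s %d' % (name, label) for name in samples))

def read_split(path):
    """Read the sample names back from a split file."""
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]

toy_samples = ['2008_002718', '2020_112', 'maksssksksss758']
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.txt')
    write_split(path, toy_samples)
    restored = read_split(path)

print(restored == toy_samples)
```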

# Generating TFRecord train and validation files

The pre-trained YOLO network comes with tools for generating TFRecord files from our train.txt, val.txt and test.txt files. The commands below cover the train and validation splits:

In [None]:
python tools/voc2012.py \
  --data_dir './data/voc2012_raw/VOCdevkit/VOC2012' \
  --split train \
  --output_file ./data/voc2012_train.tfrecord

In [None]:
python tools/voc2012.py \
  --data_dir './data/voc2012_raw/VOCdevkit/VOC2012' \
  --split val \
  --output_file ./data/voc2012_val.tfrecord
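Since the two commands differ only in the split name, they can also be built in a loop. The sketch below only assembles and prints the command lines, using the same script path and flags shown above; the helper name `tfrecord_command` is an assumption.

```python
import shlex

def tfrecord_command(split):
    """Build the argument list for generating one TFRecord file."""
    return [
        'python', 'tools/voc2012.py',
        '--data_dir', './data/voc2012_raw/VOCdevkit/VOC2012',
        '--split', split,
        '--output_file', './data/voc2012_%s.tfrecord' % split,
    ]

for split in ('train', 'val'):
    print(shlex.join(tfrecord_command(split)))
```

Each list could then be passed to `subprocess.run(cmd, check=True)` from the repository root to run the conversion.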