## End-to-End Procedure: Data cleaning

### Procedure Outline
1. Filter the dataset ***<--- This notebook***
    - Detect faces in every image. Reject images that contain more than one face or no face at all.
    - Save filtered dataset to new location.
2. Generate Train-Test Splits ***<--- This notebook***
    - Create folds.
3. Evaluate 
    - Generate embeddings from the splits
    - Train classifier on the embeddings
    - Test classifier on the embeddings
4. Tune classifier
    - Tune the classifier 
5. Save the model

### Imports

In [18]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [19]:
import os
import cv2
import pprint
import logging
import tqdm
import math

import face_trigger

In [20]:
from face_trigger.model.deep.FaceRecognizer import FaceRecognizer
from face_trigger.process.post_process import FaceDetector, LandmarkDetector, FaceAlign
from face_trigger.utils.common import RepeatedTimer, clamp_rectangle
from face_trigger.utils.data import dataset_filter, Dataset, get_jittered_images

In [21]:
unnormalized_dataset_path = "/media/ankurrc/new_volume/softura/facerec/datasets/softura_emp"
dataset_path = "/media/ankurrc/new_volume/softura/facerec/softura_filtered"
split_path = "/media/ankurrc/new_volume/softura/facerec/softura_split_path"

In [22]:
logging.basicConfig(level=logging.DEBUG)

### Filter dataset
 **Note:** While filtering the dataset, we assume that the original dataset has the following structure:
 1. At the root level, there is one directory per person. The directory names may or may not be numeric.
 2. Each directory contains only images of the face of the person it is named after. The file names may or may not be numeric.
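
The layout assumed above can be checked before filtering. The following is a minimal sketch, not part of `face_trigger`: the helper name `summarize_layout` and the set of accepted image extensions are assumptions made for illustration. It counts the images per person directory and reports anything that does not fit the expected structure.

```python
import os
import tempfile


def summarize_layout(dataset_root):
    """Return ({person_dir: image_count}, [unexpected entries]) for a dataset root."""
    image_exts = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions
    counts, unexpected = {}, []
    for person in sorted(os.listdir(dataset_root)):
        person_dir = os.path.join(dataset_root, person)
        if not os.path.isdir(person_dir):
            unexpected.append(person)  # stray file at the root level
            continue
        counts[person] = 0
        for name in sorted(os.listdir(person_dir)):
            ext = os.path.splitext(name)[1].lower()
            if os.path.isfile(os.path.join(person_dir, name)) and ext in image_exts:
                counts[person] += 1
            else:
                unexpected.append(os.path.join(person, name))
    return counts, unexpected


# Build a tiny throwaway dataset (made-up names) to demonstrate the check.
demo_root = tempfile.mkdtemp()
for person, files in {"1": ["a.jpg", "b.jpg"], "alice": ["0.png", "notes.txt"]}.items():
    os.makedirs(os.path.join(demo_root, person))
    for name in files:
        open(os.path.join(demo_root, person, name), "w").close()
open(os.path.join(demo_root, "README"), "w").close()

layout_counts, layout_unexpected = summarize_layout(demo_root)
print(layout_counts)
print(layout_unexpected)
```

Running such a check on `unnormalized_dataset_path` first makes it easier to tell structural problems apart from images that are later rejected by the face detector.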


In [23]:
def filter_the_dataset(unnormalized_dataset_path=None, dataset_path=None):
    return dataset_filter(dataset_path=unnormalized_dataset_path, output_path=dataset_path)

In [24]:
rejected_dirs = filter_the_dataset(unnormalized_dataset_path=unnormalized_dataset_path, dataset_path=dataset_path)

100%|██████████| 40/40 [01:51<00:00,  2.06s/it]
INFO:face_trigger.utils.data:Filtered dataset created at /media/ankurrc/new_volume/softura/facerec/softura_filtered


Rejected files:
{'20': [{'img': 'IMG_0093.jpg', 'reason': 'faces detected: 2'},
        {'img': 'IMG_0095.jpg', 'reason': 'faces detected: 2'}]}
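
The rejection rule itself (keep an image only if exactly one face is detected) can be sketched independently of the detector. This is an illustration, not the implementation of `dataset_filter`: `filter_by_face_count` and the `count_faces` callable are hypothetical names, and the detector results are hard-coded stand-ins. The rejected entries mirror the shape of the output printed above.

```python
def filter_by_face_count(images_by_person, count_faces):
    """Split images into kept and rejected using the exactly-one-face rule.

    images_by_person: {person_id: [image_name, ...]}
    count_faces: hypothetical callable (person_id, image_name) -> number of faces
    """
    kept, rejected = {}, {}
    for person, images in images_by_person.items():
        for img in images:
            n_faces = count_faces(person, img)
            if n_faces == 1:
                kept.setdefault(person, []).append(img)
            else:
                # Zero faces and multiple faces are both grounds for rejection.
                rejected.setdefault(person, []).append(
                    {"img": img, "reason": "faces detected: {}".format(n_faces)})
    return kept, rejected


# Made-up detector results for the demo; a real run would call a face detector.
fake_counts = {
    ("20", "IMG_0093.jpg"): 2,
    ("20", "IMG_0094.jpg"): 1,
    ("21", "IMG_0001.jpg"): 0,
}
demo_images = {"20": ["IMG_0093.jpg", "IMG_0094.jpg"], "21": ["IMG_0001.jpg"]}
kept, rejected = filter_by_face_count(demo_images, lambda p, i: fake_counts[(p, i)])
print(kept)
print(rejected)
```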


In [25]:
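# Count the images in each person's directory of the filtered dataset.
count_dict = {}
for root, dirs, files in os.walk(dataset_path):
    # Register every sub-directory so that it is listed even before it is visited.
    for direc in dirs:
        count_dict[direc] = None

    # Every directory below the dataset root holds one person's images.
    if root != dataset_path:
        count_dict[os.path.split(root)[1]] = len(files)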

In [26]:
empty_dir = []
for key, val in count_dict.items():
    if val == 0:
        directory = os.path.join(dataset_path, key)
        os.rmdir(directory)
        empty_dir.append(key)

print("The following directories were empty, and were removed: {}".format(empty_dir))

for direc in empty_dir:
    count_dict.pop(direc)

The following directories were empty, and were removed: []


### Generate Splits (Optional)
**Note:** Generating manual splits is optional. The training section uses scikit-learn's cross-validation instead.
  If you decide to use manual splits, you need to change the training routine as well.
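
To make the idea of a manual split concrete, here is a minimal pure-Python sketch that draws a fixed number of training images per subject for each fold and leaves the rest for testing. It is an assumption about what such a split could look like, not the implementation of `Dataset.split`; the function name `make_folds`, the `seed` parameter, and the demo file names are hypothetical.

```python
import random


def make_folds(files_by_subject, num_train, folds, seed=0):
    """For each fold, draw num_train training files per subject; the rest are test files."""
    rng = random.Random(seed)  # fixed seed so the demo is repeatable
    result = []
    for _ in range(folds):
        train, test = [], []
        for subject in sorted(files_by_subject):
            files = sorted(files_by_subject[subject])
            if len(files) <= num_train:
                raise ValueError("subject {} has too few images".format(subject))
            chosen = set(rng.sample(files, num_train))
            train += [(subject, f) for f in files if f in chosen]
            test += [(subject, f) for f in files if f not in chosen]
        result.append((train, test))
    return result


# Made-up file names for two subjects.
demo_files = {"1": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"], "2": ["e.jpg", "f.jpg", "g.jpg"]}
demo_folds = make_folds(demo_files, num_train=2, folds=3)
for fold_idx, (train, test) in enumerate(demo_folds, start=1):
    print(fold_idx, len(train), len(test))
```

Each fold is drawn independently here, so the same image may land in the training set of more than one fold. That differs from k-fold cross-validation, where the test sets are disjoint.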

In [11]:
def generate_splits(dataset_path=None, split_path=None):
    dataset = Dataset(dataset_path=dataset_path,
                      split_path=split_path)
    folds = 3
    training_samples = [2, 5, 8]
    
    dataset.split(num_train_list=training_samples, folds=folds)

In [12]:
generate_splits(dataset_path=dataset_path, split_path=split_path)

INFO:face_trigger.utils.data:Generating for 2 training samples per subject.
INFO:face_trigger.utils.data:Generating: Fold 1
INFO:face_trigger.utils.data:Creating directory: /media/ankurrc/new_volume/softura/facerec/att_split_path/2/1
INFO:face_trigger.utils.data:done.
INFO:face_trigger.utils.data:/media/ankurrc/new_volume/softura/facerec/att_split_path/2/1/train.csv
INFO:face_trigger.utils.data:Generating: Fold 2
INFO:face_trigger.utils.data:Creating directory: /media/ankurrc/new_volume/softura/facerec/att_split_path/2/2
INFO:face_trigger.utils.data:done.
INFO:face_trigger.utils.data:/media/ankurrc/new_volume/softura/facerec/att_split_path/2/2/train.csv
INFO:face_trigger.utils.data:Generating: Fold 3
INFO:face_trigger.utils.data:Creating directory: /media/ankurrc/new_volume/softura/facerec/att_split_path/2/3
INFO:face_trigger.utils.data:done.
INFO:face_trigger.utils.data:/media/ankurrc/new_volume/softura/facerec/att_split_path/2/3/train.csv
INFO:face_trigger.utils.data:We have 40 subje

### Augment dataset by jittering

In [27]:
# minimum samples for each class
k_core = 10

In [28]:
# figure out which classes have a count less than k_core
minority_classes = []
for key, val in count_dict.items():
    if val < k_core:
        minority_classes.append(key)

In [29]:
minority_classes

['24',
 '25',
 '26',
 '27',
 '20',
 '21',
 '22',
 '23',
 '28',
 '29',
 '40',
 '1',
 '3',
 '2',
 '5',
 '4',
 '7',
 '6',
 '9',
 '8',
 '39',
 '11',
 '10',
 '13',
 '14',
 '17',
 '16',
 '19',
 '18',
 '31',
 '30',
 '37',
 '36',
 '35',
 '34',
 '33',
 '32']

In [31]:
for idx in tqdm.trange(len(minority_classes)):
    minority = minority_classes[idx]
    # number of samples the minority class currently has
    sample_count = count_dict[minority]
    # number of additional samples needed to reach k_core
    samples_required = k_core - sample_count
    # file names of the minority class
    minority_files = os.listdir(os.path.join(dataset_path, minority))
    # every original image gets the same number of jittered copies
    jitters_per_sample = int(math.ceil(samples_required/sample_count))
    
    for file_idx in tqdm.trange(len(minority_files)):
        fil = minority_files[file_idx]
        file_path = os.path.join(dataset_path, minority, fil)
        jittered_images = get_jittered_images(file_path, num_jitters=jitters_per_sample, disturb_colors=True)
        
        # drop the original extension so the new name does not contain ".jpg" twice
        base_name = os.path.splitext(fil)[0]
        for i, jittered_image in enumerate(jittered_images): 
            file_name = base_name + "_jittered_" + str(i) + ".jpg"
            save_path = os.path.join(dataset_path, minority, file_name)
            # cv2.imwrite returns False when the image could not be written
            if not cv2.imwrite(save_path, jittered_image):
                raise IOError("Failed to write file: {}".format(save_path))
    


In [32]:
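# Recount the images per person after augmentation.
count_dict = {}
for root, dirs, files in os.walk(dataset_path):
    # Register every sub-directory so that it is listed even before it is visited.
    for direc in dirs:
        count_dict[direc] = None

    # Every directory below the dataset root holds one person's images.
    if root != dataset_path:
        count_dict[os.path.split(root)[1]] = len(files)

count_dict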

{'1': 12,
 '10': 12,
 '11': 12,
 '12': 12,
 '13': 10,
 '14': 12,
 '16': 10,
 '17': 12,
 '18': 12,
 '19': 12,
 '2': 14,
 '20': 10,
 '21': 12,
 '22': 10,
 '23': 12,
 '24': 12,
 '25': 16,
 '26': 12,
 '27': 12,
 '28': 12,
 '29': 10,
 '3': 12,
 '30': 12,
 '31': 14,
 '32': 12,
 '33': 10,
 '34': 12,
 '35': 12,
 '36': 12,
 '37': 12,
 '38': 10,
 '39': 12,
 '4': 12,
 '40': 10,
 '41': 28,
 '5': 10,
 '6': 12,
 '7': 12,
 '8': 10,
 '9': 12}
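
Several classes end up with more than `k_core` images because the augmentation loop rounds the number of jitters per image up and applies it to every original image. The sketch below reproduces that arithmetic. The helper name `count_after_jitter` is hypothetical, and the sketch assumes that each call to `get_jittered_images` yields exactly `num_jitters` images.

```python
import math


def count_after_jitter(sample_count, k_core=10):
    """Image count of a class after the augmentation loop, given its starting count."""
    if sample_count <= 0:
        raise ValueError("empty classes must be removed before augmentation")
    if sample_count >= k_core:
        return sample_count  # not a minority class; left untouched
    samples_required = k_core - sample_count
    # float() keeps the division exact on Python 2 as well
    jitters_per_sample = int(math.ceil(samples_required / float(sample_count)))
    # Every original image gets the same number of jittered copies.
    return sample_count + sample_count * jitters_per_sample


for n in (4, 5, 7, 8):
    print(n, count_after_jitter(n))
```

Under that assumption, a class that starts with 4 images ends with 12, one with 5 ends with 10, one with 7 ends with 14, and one with 8 ends with 16.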

Steps 3, 4, and 5 are covered in another notebook: End-to-End: Training