## End-to-End Procedure: Data cleaning

### Procedure Outline
1. Filter the dataset ***<--- This notebook***
    - Detect faces in every image. Reject images that contain no face or more than one face.
    - Save filtered dataset to new location.
2. Generate Train-Test Splits ***<--- This notebook***
    - Create folds.
3. Evaluate 
    - Generate embeddings from the splits
    - Train classifier on the embeddings
    - Test classifier on the embeddings
4. Tune classifier
    - Tune the classifier 
5. Save the model

### Imports

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [24]:
import os
import cv2
import pprint
import logging
import tqdm
import math

import face_trigger

In [3]:
from face_trigger.model.deep.FaceRecognizer import FaceRecognizer
from face_trigger.process.post_process import FaceDetector, LandmarkDetector, FaceAlign
from face_trigger.utils.common import RepeatedTimer, clamp_rectangle
from face_trigger.utils.data import dataset_filter, Dataset

In [4]:
unnormalized_dataset_path = "/media/ankurrc/new_volume/softura/facerec/datasets/softura_emp"
dataset_path = "/media/ankurrc/new_volume/softura/facerec/softura_filtered"
split_path = "/media/ankurrc/new_volume/softura/facerec/softura_split_path"

In [5]:
logging.basicConfig(level=logging.DEBUG)

### Filter dataset
 **Note:** Filtering assumes that the original dataset has the following structure:
 1. The root level contains one directory per person. Directory names may or may not be numeric.
 2. Each directory contains the image files that show that person's face. File names may or may not be numeric.
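
To make the expected layout and the rejection rule concrete, here is a minimal sketch that walks a dataset with this structure and sorts images into kept and rejected. It is not the `dataset_filter` implementation from `face_trigger`: `plan_filter` is a hypothetical helper, and `count_faces` stands in for whatever face detector is used (assumed to return the number of faces found in an image).

```python
import os


def plan_filter(dataset_root, count_faces):
    """Return (kept, rejected) dicts mapping person directory -> image file names.

    `count_faces` is a hypothetical callable: image path -> number of faces found.
    """
    kept, rejected = {}, {}
    for person in sorted(os.listdir(dataset_root)):
        person_dir = os.path.join(dataset_root, person)
        if not os.path.isdir(person_dir):
            continue  # only directories at the root level represent people
        for file_name in sorted(os.listdir(person_dir)):
            faces = count_faces(os.path.join(person_dir, file_name))
            # keep the image only when exactly one face is present
            target = kept if faces == 1 else rejected
            target.setdefault(person, []).append(file_name)
    return kept, rejected
```

The `rejected` dictionary has the same shape as the rejected-directories output printed after filtering: a person directory mapped to the list of rejected files.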


In [6]:
def filter_the_dataset(unnormalized_dataset_path=None, dataset_path=None):
    return dataset_filter(dataset_path=unnormalized_dataset_path, output_path=dataset_path)

In [7]:
rejected_dirs = filter_the_dataset(unnormalized_dataset_path=unnormalized_dataset_path, dataset_path=dataset_path)

100%|██████████| 40/40 [01:45<00:00,  2.23s/it]
INFO:face_trigger.utils.data:Filtered dataset created at /media/ankurrc/new_volume/softura/facerec/softura_filtered


Rejected directories:
{'20': ['IMG_0093.jpg']}


### Generate Splits (Optional)
**Note:** Generating manual splits is optional. The training section uses scikit-learn's cross-validation instead.
  If you decide to use manual splitting, you need to change the training routine as well.
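
As a rough illustration of what a manual split involves, the sketch below draws a fixed number of training images per subject for each fold and leaves the rest for testing. It is an assumption about the general idea, not the behaviour of `Dataset.split`; `make_folds` and its parameters are hypothetical names.

```python
import random


def make_folds(files_by_subject, num_train, folds, seed=0):
    """Return a list of (train, test) pairs, one per fold.

    Each of train/test is a list of (subject, file_name) tuples.
    """
    rng = random.Random(seed)  # seeded so the splits are reproducible
    result = []
    for _ in range(folds):
        train, test = [], []
        for subject, files in sorted(files_by_subject.items()):
            if len(files) <= num_train:
                raise ValueError("subject %r has too few images" % subject)
            shuffled = list(files)
            rng.shuffle(shuffled)
            # first num_train images train the classifier, the rest test it
            train += [(subject, f) for f in shuffled[:num_train]]
            test += [(subject, f) for f in shuffled[num_train:]]
        result.append((train, test))
    return result
```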

In [11]:
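def generate_splits(dataset_path=None, split_path=None):
    dataset = Dataset(dataset_path=dataset_path,
                      split_path=split_path)
    # number of folds generated for each training-set size
    folds = 3
    # number of training samples per subject, one split set per entry
    training_samples = [2, 5, 8]

    dataset.split(num_train_list=training_samples, folds=folds)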

In [12]:
generate_splits(dataset_path=dataset_path, split_path=split_path)

INFO:face_trigger.utils.data:Generating for 2 training samples per subject.
INFO:face_trigger.utils.data:Generating: Fold 1
INFO:face_trigger.utils.data:Creating directory: /media/ankurrc/new_volume/softura/facerec/att_split_path/2/1
INFO:face_trigger.utils.data:done.
INFO:face_trigger.utils.data:/media/ankurrc/new_volume/softura/facerec/att_split_path/2/1/train.csv
INFO:face_trigger.utils.data:Generating: Fold 2
INFO:face_trigger.utils.data:Creating directory: /media/ankurrc/new_volume/softura/facerec/att_split_path/2/2
INFO:face_trigger.utils.data:done.
INFO:face_trigger.utils.data:/media/ankurrc/new_volume/softura/facerec/att_split_path/2/2/train.csv
INFO:face_trigger.utils.data:Generating: Fold 3
INFO:face_trigger.utils.data:Creating directory: /media/ankurrc/new_volume/softura/facerec/att_split_path/2/3
INFO:face_trigger.utils.data:done.
INFO:face_trigger.utils.data:/media/ankurrc/new_volume/softura/facerec/att_split_path/2/3/train.csv
INFO:face_trigger.utils.data:We have 40 subje

### Augment dataset by jittering

In [8]:
# minimum samples for each class
k_core = 10

X_oversampled = []
y_oversampled = []

In [20]:
# count the images available for each person (one directory per person)
count_dict = {}
for root, dirs, files in os.walk(dataset_path):
    # skip the dataset root itself; every other directory is a person
    if root != dataset_path:
        count_dict[os.path.basename(root)] = len(files)

In [22]:
# figure out which classes have a count less than k_core
minority_classes = []
for key, val in count_dict.items():
    if val < k_core:
        minority_classes.append(key)

In [25]:
# randomly resample for each minority class
for minority in minority_classes:
    # num of samples our minority class has
    sample_count = count_dict[minority]
    # num of samples required to make it k-core
    samples_required = k_core - sample_count
    # minority class' filenames
    minority_files = os.listdir(os.path.join(dataset_path, minority))
    jitters_per_sample = int(math.ceil(samples_required/sample_count))

    

NameError: name 'minority_indices' is not defined
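
The cell above stops before any jittering happens. As a hedged sketch of the bookkeeping it starts, the helper below computes, for each minority class, how many extra samples are needed and how many jittered copies each existing image would have to contribute. `jitter_plan` is a hypothetical name, and the jittering itself (producing perturbed copies of an image) is not shown because the original routine is not available here.

```python
import math


def jitter_plan(count_dict, k_core):
    """Map each minority class to (samples_required, jitters_per_sample)."""
    plan = {}
    for person_id, count in count_dict.items():
        if not count or count >= k_core:
            # skip empty or unknown classes (nothing to jitter)
            # and classes that already have k_core samples
            continue
        required = k_core - count
        # float() keeps true division even without the __future__ import
        plan[person_id] = (required, int(math.ceil(required / float(count))))
    return plan
```

For example, a class with 3 images and `k_core = 10` needs 7 more samples, so each image would be jittered 3 times and the surplus discarded.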

Steps 3, 4, and 5 are covered in a separate notebook, End-to-End: Training.