## End-to-End Procedure: Data cleaning

### Procedure Outline
1. Filter the dataset ***<--- This notebook***
    - Detect faces in every image. Reject images that contain no face or more than one face.
    - Save the filtered dataset to a new location.
2. Generate Train-Test Splits ***<--- This notebook***
    - Create folds.
3. Evaluate
    - Generate embeddings from the splits.
    - Train classifier on the embeddings.
    - Test classifier on the embeddings.
4. Tune classifier
    - Tune the classifier.
5. Save the model

### Imports

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [3]:
import os
import cv2
import pprint
import logging
import tqdm

import face_trigger

In [4]:
from face_trigger.model.deep.FaceRecognizer import FaceRecognizer
from face_trigger.process.post_process import FaceDetector, LandmarkDetector, FaceAlign
from face_trigger.utils.common import RepeatedTimer, clamp_rectangle
from face_trigger.utils.data import dataset_filter, Dataset

In [5]:
unnormalized_dataset_path = "/media/ankurrc/new_volume/softura/facerec/datasets/cyber_extruder_ultimate"
dataset_path = "/media/ankurrc/new_volume/softura/facerec/ceu_filtered"
split_path = "/media/ankurrc/new_volume/softura/facerec/ceu_split_path"

In [6]:
logging.basicConfig(level=logging.DEBUG)

### Filter dataset
**Note:** Filtering assumes that the original dataset has the following structure:
1. The root level contains one directory per person. Directory names may or may not be numeric.
2. Each directory contains only images of that person's face. File names may or may not be numeric.
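
As a minimal sketch of the layout described above, the following snippet walks a root directory and maps each person directory to its image files. The helper name `list_dataset` and the demo directory contents are hypothetical and are not part of `face_trigger`; only the standard library is used.

```python
import os
import tempfile


def list_dataset(root):
    """Map each person directory under root to its sorted image file names."""
    layout = {}
    for person in sorted(os.listdir(root)):
        person_dir = os.path.join(root, person)
        if not os.path.isdir(person_dir):
            continue  # ignore stray files at the root level
        layout[person] = sorted(
            name for name in os.listdir(person_dir)
            if os.path.isfile(os.path.join(person_dir, name))
        )
    return layout


# Build a tiny throwaway dataset (hypothetical names) to show the expected structure.
demo_root = tempfile.mkdtemp()
demo_files = {"000001": ["000001.jpg", "000002.jpg"], "alice": ["a.jpg"]}
for person, files in demo_files.items():
    os.makedirs(os.path.join(demo_root, person))
    for name in files:
        open(os.path.join(demo_root, person, name), "wb").close()

layout = list_dataset(demo_root)
print(layout)
```

Running a check like this on the raw dataset before filtering is a cheap way to spot empty directories or files placed at the wrong level.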


In [7]:
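def filter_the_dataset(unnormalized_dataset_path=None, dataset_path=None):
    # dataset_filter reads the raw images from its dataset_path argument and writes the filtered copy to output_path.
    # The result lists the rejected files, grouped by directory (see the output below).
    return dataset_filter(dataset_path=unnormalized_dataset_path, output_path=dataset_path)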
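
The rejection rule from the outline (keep an image only if it contains exactly one face) can be sketched as follows. This is an illustration of the rule, not the implementation of `dataset_filter`; `split_by_face_count` and `count_faces` are hypothetical names, and a lookup table stands in for a real face detector.

```python
def split_by_face_count(images, count_faces):
    """Partition image names into kept and rejected lists.

    count_faces is a hypothetical callable that returns the number of
    faces detected in the given image.
    """
    kept, rejected = [], []
    for name in images:
        if count_faces(name) == 1:
            kept.append(name)      # exactly one face: usable sample
        else:
            rejected.append(name)  # no face, or more than one face
    return kept, rejected


# Stand-in detector: a lookup table instead of a real face detector.
fake_counts = {"000001.jpg": 1, "000002.jpg": 0, "000003.jpg": 2}
kept, rejected = split_by_face_count(sorted(fake_counts), fake_counts.get)
print(kept)
print(rejected)
```

Passing the detector in as a callable keeps the rule itself testable without loading any model.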

In [8]:
rejected_dirs = filter_the_dataset(unnormalized_dataset_path=unnormalized_dataset_path, dataset_path=dataset_path)

100%|██████████| 1000/1000 [07:41<00:00,  2.59it/s]
INFO:face_trigger.utils.data:Filtered dataset created at /media/ankurrc/new_volume/softura/facerec/ceu_filtered


Rejected directories:
{'000008': ['000007.jpg'],
 '000015': ['000023.jpg'],
 '000018': ['000001.jpg'],
 '000036': ['000010.jpg'],
 '000060': ['000010.jpg'],
 '000063': ['000006.jpg'],
 '000067': ['000012.jpg'],
 '000088': ['000013.jpg'],
 '000092': ['000011.jpg'],
 '000106': ['000006.jpg'],
 '000113': ['000011.jpg'],
 '000121': ['000009.jpg'],
 '000126': ['000008.jpg'],
 '000131': ['000019.jpg'],
 '000134': ['000002.jpg'],
 '000149': ['000003.jpg'],
 '000155': ['000026.jpg'],
 '000176': ['000024.jpg'],
 '000180': ['000013.jpg'],
 '000183': ['000002.jpg'],
 '000186': ['000002.jpg'],
 '000187': ['000023.jpg'],
 '000188': ['000013.jpg'],
 '000191': ['000011.jpg'],
 '000203': ['000012.jpg'],
 '000216': ['000024.jpg'],
 '000226': ['000011.jpg'],
 '000242': ['000003.jpg'],
 '000261': ['000007.jpg'],
 '000269': ['000011.jpg'],
 '000309': ['000004.jpg'],
 '000310': ['000014.jpg'],
 '000324': ['000004.jpg'],
 '000334': ['000001.jpg'],
 '000357': ['000007.jpg'],
 '000362': ['000006.jpg'],
 '0003

### Generate Splits
**Note:** Generating manual splits is optional. The training section uses scikit-learn's cross-validation instead.
  If you choose manual splitting, you must change the training routine as well.

In [11]:
def generate_splits(dataset_path=None, split_path=None):
    dataset = Dataset(dataset_path=dataset_path,
                      split_path=split_path)
    folds = 3  # number of folds generated for each training-set size
    training_samples = [2, 5, 8]  # training samples per subject

    dataset.split(num_train_list=training_samples, folds=folds)
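
To make the two parameters concrete, here is a minimal sketch of a per-subject split: for each fold, a fixed number of samples per subject is drawn for training and the remainder is held out. This is an assumption about what `Dataset.split` does, inferred from its log messages, and not the library's code; `make_folds` and the demo data are hypothetical.

```python
import random


def make_folds(samples_by_subject, num_train, folds, seed=0):
    """Return a list of (train, test) pairs, one per fold.

    Each pair holds lists of (subject, sample) tuples. For every subject,
    num_train samples are drawn at random for training; the rest go to test.
    """
    rng = random.Random(seed)  # fixed seed so the folds are reproducible
    result = []
    for _ in range(folds):
        train, test = [], []
        for subject in sorted(samples_by_subject):
            samples = list(samples_by_subject[subject])
            rng.shuffle(samples)
            train += [(subject, s) for s in samples[:num_train]]
            test += [(subject, s) for s in samples[num_train:]]
        result.append((train, test))
    return result


# Hypothetical demo data: two subjects with four and three images.
demo = {"000001": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "000002": ["e.jpg", "f.jpg", "g.jpg"]}
splits = make_folds(demo, num_train=2, folds=3)
for fold, (train, test) in enumerate(splits, start=1):
    print(fold, len(train), len(test))
```

With `training_samples = [2, 5, 8]`, a routine like this would be called once per training-set size, which matches the per-size, per-fold grouping visible in the log paths.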

In [12]:
generate_splits(dataset_path=dataset_path, split_path=split_path)

INFO:face_trigger.utils.data:Generating for 2 training samples per subject.
INFO:face_trigger.utils.data:Generating: Fold 1
INFO:face_trigger.utils.data:Creating directory: /media/ankurrc/new_volume/softura/facerec/att_split_path/2/1
INFO:face_trigger.utils.data:done.
INFO:face_trigger.utils.data:/media/ankurrc/new_volume/softura/facerec/att_split_path/2/1/train.csv
INFO:face_trigger.utils.data:Generating: Fold 2
INFO:face_trigger.utils.data:Creating directory: /media/ankurrc/new_volume/softura/facerec/att_split_path/2/2
INFO:face_trigger.utils.data:done.
INFO:face_trigger.utils.data:/media/ankurrc/new_volume/softura/facerec/att_split_path/2/2/train.csv
INFO:face_trigger.utils.data:Generating: Fold 3
INFO:face_trigger.utils.data:Creating directory: /media/ankurrc/new_volume/softura/facerec/att_split_path/2/3
INFO:face_trigger.utils.data:done.
INFO:face_trigger.utils.data:/media/ankurrc/new_volume/softura/facerec/att_split_path/2/3/train.csv
INFO:face_trigger.utils.data:We have 40 subje

Steps 3, 4, and 5 are covered in a separate notebook, End-to-End: Training.