# Building Datasets For Model Training

Carrying on from [Issue #6](https://github.com/dannylyl/siamese-nn-facial-recognition/issues/6), this issue covers building the datasets used by the downstream steps of this project. This includes:

1. Deciding Train Test Split Methodology and Implementation
2. Building Positive (similar) and Negative (dissimilar) pairs and labels

## 1. Imports

In [1]:
import os
import pandas as pd
import PIL

## 2. Getting Pandas DF of Images

Similar to the notebook for Issue #6, I will start by organising the data into a pandas DataFrame (a personal preference, since I usually work with data in DataFrames).

In [2]:
data_dir = '../data/lfw-deepfunneled/lfw-deepfunneled'

data = []

for person in os.listdir(data_dir):
    person_dir = os.path.join(data_dir, person)
    if os.path.isdir(person_dir):
        for image in os.listdir(person_dir):
            data.append((os.path.join(person_dir, image), person))
            
df = pd.DataFrame(data, columns=['image_path', 'person'])

In [3]:
df.head()

|   | image_path | person |
|---|---|---|
| 0 | ../data/lfw-deepfunneled/lfw-deepfunneled/Alic... | Alice_Fisher |
| 1 | ../data/lfw-deepfunneled/lfw-deepfunneled/Alic... | Alice_Fisher |
| 2 | ../data/lfw-deepfunneled/lfw-deepfunneled/Elle... | Ellen_Barkin |
| 3 | ../data/lfw-deepfunneled/lfw-deepfunneled/Quee... | Queen_Latifah |
| 4 | ../data/lfw-deepfunneled/lfw-deepfunneled/Quee... | Queen_Latifah |


## 3. Train Test Split Methodology

As found in Issue #6, we have:

* A total of `5749` unique persons in our dataset: `1680` people have multiple images and `4069` people have only 1 image (`1680 + 4069 = 5749`).
* A total of `13233` images, of which `9164` belong to people with multiple images and `4069` are solo images (`9164 + 4069 = 13233`).
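As a sanity check on figures like these, the counts can be recomputed from a DataFrame with a `person` column. The snippet below is only a sketch on made-up data: `toy_df`, its paths, and the names in it are hypothetical stand-ins for the real `df`, and the `summary` keys are my own labels.

```python
import pandas as pd

# Hypothetical stand-in for the real `df`; paths and names are made up.
toy_df = pd.DataFrame({
    "image_path": ["a/1.jpg", "a/2.jpg", "b/1.jpg", "c/1.jpg", "c/2.jpg", "c/3.jpg"],
    "person": ["a", "a", "b", "c", "c", "c"],
})

# Number of images for each person.
images_per_person = toy_df.groupby("person").size()

# Separate people with several images from people with exactly one.
multi = images_per_person[images_per_person > 1]
solo = images_per_person[images_per_person == 1]

summary = {
    "unique_persons": int(images_per_person.shape[0]),
    "persons_with_multiple_images": int(multi.shape[0]),
    "persons_with_one_image": int(solo.shape[0]),
    "total_images": int(images_per_person.sum()),
    "images_from_multi_image_persons": int(multi.sum()),
}
print(summary)
```

Running the same steps on the real `df` should give numbers where the two person counts add up to the number of unique persons, and the two image counts add up to the total number of images.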

To reduce the chance of data leakage, and to test the model's generalisability, it would probably be better to perform the train test split on individual people rather than on individual images.

If the end goal is a model that can perform one-shot facial verification on multiple people, the model should be able to perform well on many people, and so the evaluation of the model should involve assessing its performance on unseen people.

One thing to address with that methodology is that some people have many more images than others, so performing a split on unique persons without taking into account the number of images per person might not be the best way to go.

As brought up in Issue #6, we can technically form many positive and negative pairs just by matching different images together, which would increase the size of our dataset. But let's still split by person while taking into account the number of images per person.
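One possible way to do this is to bucket people by how many images they have, then sample test people from each bucket separately. The snippet below is a rough sketch of that idea, not a settled implementation: the function name `split_by_person`, the bucket edges (1 image, 2 to 5 images, 6 or more), the `test_frac` default, and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def split_by_person(df, test_frac=0.2, seed=42):
    """Split rows into train/test so that no person appears in both."""
    # One row per person with their image count.
    counts = df.groupby("person").size().rename("n_images").reset_index()

    # Hypothetical buckets: 1 image, 2-5 images, 6 or more images.
    counts["bucket"] = pd.cut(
        counts["n_images"],
        bins=[0, 1, 5, float("inf")],
        labels=["solo", "few", "many"],
    )

    # Sample the same fraction of people from every bucket.
    test_people = set()
    for _, group in counts.groupby("bucket", observed=True):
        sampled = group["person"].sample(frac=test_frac, random_state=seed)
        test_people.update(sampled)

    is_test = df["person"].isin(test_people)
    return df[~is_test], df[is_test]


# Made-up data: 10 people with 1 image, 6 with 3 images, 4 with 8 images.
rows = [(f"solo_{i}/0.jpg", f"solo_{i}") for i in range(10)]
for i in range(6):
    rows += [(f"few_{i}/{j}.jpg", f"few_{i}") for j in range(3)]
for i in range(4):
    rows += [(f"many_{i}/{j}.jpg", f"many_{i}") for j in range(8)]
toy_df = pd.DataFrame(rows, columns=["image_path", "person"])

train_df, test_df = split_by_person(toy_df, test_frac=0.5)

# No person should end up on both sides of the split.
print(set(train_df["person"]).isdisjoint(test_df["person"]))
```

Because the sampling is done per bucket, both sides of the split get a similar mix of solo and multi-image people. Note that with a small `test_frac`, a bucket containing only a few people can contribute nobody to the test set, so the bucket edges would need tuning against the real distribution.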