# Building Datasets For Model Training

Carrying on from [Issue #6](https://github.com/dannylyl/siamese-nn-facial-recognition/issues/6), this issue covers the steps for building the datasets used in the downstream parts of this project. This includes:

1. Deciding Train Test Split Methodology and Implementation
2. Building Positive (similar) and Negative (dissimilar) pairs and labels

## 1. Imports

In [1]:
import os
import pandas as pd
import PIL
from sklearn.model_selection import train_test_split

## 2. Getting Pandas DF of Images

Similar to the notebook for issue #6, I will start by first organising the data into a pandas dataframe (personal preference I guess since I usually work with data in DFs).

In [2]:
data_dir = '../data/lfw-deepfunneled/lfw-deepfunneled'

data = []

for person in os.listdir(data_dir):
    person_dir = os.path.join(data_dir, person)
    if os.path.isdir(person_dir):
        for image in os.listdir(person_dir):
            data.append((os.path.join(person_dir, image), person))
            
df = pd.DataFrame(data, columns=['image_path', 'person'])

In [3]:
df.head()

|   | image_path | person |
|---|---|---|
| 0 | ../data/lfw-deepfunneled/lfw-deepfunneled/Alic... | Alice_Fisher |
| 1 | ../data/lfw-deepfunneled/lfw-deepfunneled/Alic... | Alice_Fisher |
| 2 | ../data/lfw-deepfunneled/lfw-deepfunneled/Elle... | Ellen_Barkin |
| 3 | ../data/lfw-deepfunneled/lfw-deepfunneled/Quee... | Queen_Latifah |
| 4 | ../data/lfw-deepfunneled/lfw-deepfunneled/Quee... | Queen_Latifah |


## 3. Train Test Split Methodology

As suggested from Issue #6, we have:

* A total of `5749` unique persons in our dataset. `1680` people have multiple images, and `4069` people have only 1 image (`1680 + 4069 = 5749`).
* A total of `13233` images, of which `9164` belong to people with multiple images, and naturally `4069` are solo images (`9164 + 4069 = 13233`).

To reduce the chances of data leakage, and to test the model's generalisability, it would be better to perform the train test split on individual people rather than on an image level. That way, no person in the test set has any image in the train set.

If the end goal is a model that can perform one-shot facial verification on multiple people, the model should be able to perform well on many people, and so the evaluation of the model should involve assessing its performance on unseen people.

One thing to address with that methodology is that some people have many more images than others, so performing a split on unique persons without taking into account the number of images per person might not be the best way to go.

As brought up in Issue #6, we can technically have many permutations of positive and negative labels just from matching different images together, which would increase our dataset. But let's still split them while taking into account the number of images.
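The person-level split described above can be sketched on a toy dataframe. This is only an illustration: `toy_df` and the names in it are made up, and the split here is not yet stratified.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataframe: one row per image (names are made up).
toy_df = pd.DataFrame({
    'image_path': [f'img_{i}.jpg' for i in range(10)],
    'person': ['A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'E', 'F'],
})

# Split the unique identities, not the image rows.
persons = toy_df['person'].unique()
train_ids, test_ids = train_test_split(persons, test_size=0.2, random_state=42)

# Map the identity split back onto the images.
toy_train = toy_df[toy_df['person'].isin(train_ids)]
toy_test = toy_df[toy_df['person'].isin(test_ids)]

# No identity appears on both sides, so the test set only contains unseen people.
print(set(toy_train['person']).isdisjoint(toy_test['person']))
```

An image-level split would instead let images of person `A` land on both sides, which is the leakage we want to avoid.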

#### a. Creating DF with Counts, and Separating the People with Unique Counts of Images

In [4]:
person_counts = df['person'].value_counts().reset_index()
person_counts.columns = ['person', 'num_photos']

In [5]:
person_counts.head()

|   | person | num_photos |
|---|---|---|
| 0 | George_W_Bush | 530 |
| 1 | Colin_Powell | 236 |
| 2 | Tony_Blair | 144 |
| 3 | Donald_Rumsfeld | 121 |
| 4 | Gerhard_Schroeder | 109 |


One good way to split our dataset into train and test sets would be to make sure the distribution of photos per person is roughly the same on both sides of the split, i.e. to stratify on `num_photos`. This should make the test set more representative of what the model saw during training.

Firstly, let's start by checking if there are any people with unique numbers of images in the dataset. We kind of already know that there are from the cell above, which shows 5 people with non repeating `num_photos`.

In [6]:
num_photo_count = person_counts['num_photos'].value_counts()
num_photo_count.tail()

num_photos
530    1
236    1
144    1
27     1
25     1
Name: count, dtype: int64

In [7]:
# Keep the photo counts that belong to exactly one person
unique_counts = num_photo_count[num_photo_count == 1].index.tolist()
        
print(f'Number of people whose count of images is unique: {len(unique_counts)}')

Number of people whose count of images is unique: 18
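The reason these people matter is that scikit-learn cannot stratify on a class that has a single member. A minimal sketch on made-up data (`toy_counts` and the `P1`..`P5` names are hypothetical) shows the failure and the workaround of setting the singletons aside:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy counts table (made-up names); 'P5' is the only person with 9 photos.
toy_counts = pd.DataFrame({
    'person': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'num_photos': [1, 1, 2, 2, 9],
})

try:
    train_test_split(toy_counts, test_size=0.4,
                     stratify=toy_counts['num_photos'], random_state=42)
    stratify_failed = False
except ValueError as err:
    # scikit-learn refuses to stratify on a class with only one member
    stratify_failed = True
    print(err)

# Keeping only photo counts shared by at least two people makes it work.
repeated = toy_counts[toy_counts.duplicated('num_photos', keep=False)]
toy_train, toy_test = train_test_split(
    repeated, test_size=0.5, stratify=repeated['num_photos'], random_state=42)
print(toy_test)
```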


#### b. Train Test Splitting the Datasets, Stratifying on Number of Photos per Person

So we have 18 people whose count of images is unique. Let's extract them from the dataset first and do a train test split on the rest, since stratifying needs at least two people for every value of `num_photos`. We'll do an 80 / 20 split. I am not going to split the dataset into train, validation and test sets, as this is meant to be a quick, small project and I want to keep things simple for now.

In [8]:
unique_count_people = person_counts[person_counts['num_photos'].isin(unique_counts)]
unique_count_people.tail()

|   | person | num_photos |
|---|---|---|
| 21 | Nestor_Kirchner | 37 |
| 22 | Andre_Agassi | 36 |
| 23 | Alvaro_Uribe | 35 |
| 38 | Ricardo_Lagos | 27 |
| 41 | Tom_Daschle | 25 |


In [9]:
person_counts = person_counts[~person_counts['person'].isin(unique_count_people['person'])]

In [10]:
person_counts['person'].nunique()

5731

In [11]:
train_persons, test_persons = train_test_split(person_counts, test_size=0.2, stratify=person_counts['num_photos'], random_state=42)

In [12]:
train_persons['person'].nunique(), test_persons['person'].nunique()

(4584, 1147)

We have 4584 people in the train set, and 1147 people in the test set. Nice! Let's check the total number of images as well.

In [13]:
# Sum only the numeric column (summing the whole frame would also concatenate the names)
num_train_images = train_persons['num_photos'].sum()
num_test_images = test_persons['num_photos'].sum()

print(f'Number of training images: {num_train_images}, Number of testing images: {num_test_images}')

Number of training images: 9234, Number of testing images: 2242


That's a pretty nice split so far: 9234 of the 11476 images are in the train set, which is about 80.5% / 19.5% by image count. The next thing we can do is incorporate the people with unique image counts into the train and test set. For this, we can just manually assign them.
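To check that stratifying really keeps the photos-per-person distribution similar on both sides, the shares of each `num_photos` value can be compared. The sketch below uses a made-up `toy_counts` table; on the real data the same comparison would be run on `train_persons` and `test_persons`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy counts table: 50 people with 1 photo, 30 with 2, 20 with 3 (made up).
toy_counts = pd.DataFrame({
    'person': [f'P{i}' for i in range(100)],
    'num_photos': [1] * 50 + [2] * 30 + [3] * 20,
})

toy_train, toy_test = train_test_split(
    toy_counts, test_size=0.2, stratify=toy_counts['num_photos'], random_state=42)

# Share of people at each photo count, side by side for the two sets.
comparison = pd.DataFrame({
    'train': toy_train['num_photos'].value_counts(normalize=True),
    'test': toy_test['num_photos'].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

If the two columns are close, the split preserved the distribution; large gaps would point to photo counts that are too rare to stratify well.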

#### c. Manually Assigning People with Unique Image Counts into Train or Test Set

So far, we've done a train test split based on unique persons, and stratifying by the number of images. We have 18 people with unique counts, which is relatively few compared to the 4584 people in the train set and 1147 in the test. For these 18 people, let's take a slightly different approach and focus on splitting them by the number of images instead, since the range of the number of images is so wide (25 to 530). 

Total number of images of people with unique image counts:

In [19]:
unique_count_people['num_photos'].sum()

np.int64(1757)

As we split the train and test set to 80 / 20, let's try and get to a similar ratio for the people with unique image counts as well.

In [20]:
train_number = 0.8 * unique_count_people['num_photos'].sum()
test_number = 0.2 * unique_count_people['num_photos'].sum()
print(f'Number of unique images in training set: {train_number}, Number of unique images in testing set: {test_number}')

Number of unique images in training set: 1405.6000000000001, Number of unique images in testing set: 351.40000000000003


In [21]:
unique_count_people

|   | person | num_photos |
|---|---|---|
| 0 | George_W_Bush | 530 |
| 1 | Colin_Powell | 236 |
| 2 | Tony_Blair | 144 |
| 3 | Donald_Rumsfeld | 121 |
| 4 | Gerhard_Schroeder | 109 |
| 5 | Ariel_Sharon | 77 |
| 6 | Hugo_Chavez | 71 |
| 7 | Junichiro_Koizumi | 60 |
| 8 | Jean_Chretien | 55 |
| 9 | John_Ashcroft | 53 |


In [27]:
unique_count_people.tail(9).sum()

person        John_AshcroftVladimir_PutinLuiz_Inacio_Lula_da...
num_photos                                                  354
dtype: object

Alright, getting the last 9 people in the `unique_count_people` dataframe gets us a total of 354 images, not exactly 351, but that's fine of course. Now let's assign them to the test set.

In [32]:
unique_test_people = unique_count_people.tail(9)
unique_train_people = unique_count_people.drop(unique_test_people.index)
print((unique_train_people['num_photos'].sum(), unique_test_people['num_photos'].sum()))

(np.int64(1403), np.int64(354))
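The manual choice above follows a simple rule: move people into the test set, fewest photos first, until the image target is reached. A rough sketch of that rule is below; `split_by_image_budget` and the toy counts are hypothetical and not part of the notebook. On the real data the rule picks the same 9 people, since the 8 smallest counts sum to 301 (below the 351.4 target) and the 9 smallest sum to 354.

```python
import pandas as pd

def split_by_image_budget(counts: pd.DataFrame, test_frac: float = 0.2):
    """Move people into the test set, fewest photos first, until the image target is met."""
    target = test_frac * counts['num_photos'].sum()
    ordered = counts.sort_values('num_photos')             # fewest photos first
    reached = ordered['num_photos'].cumsum() >= target     # True once the target is hit
    n_test = int(reached.to_numpy().argmax()) + 1          # first position that hits it
    test_part = ordered.iloc[:n_test]
    train_part = counts.drop(test_part.index)
    return train_part, test_part

# Made-up counts, not the real 18 people.
toy_unique = pd.DataFrame({
    'person': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
    'num_photos': [120, 40, 16, 10, 8, 6],
})
toy_train, toy_test = split_by_image_budget(toy_unique)
print(toy_train['num_photos'].sum(), toy_test['num_photos'].sum())
```

Because whole people are moved, the test side can overshoot the target slightly, as it does with 354 versus 351 above.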


In [34]:
train_persons = pd.concat([train_persons, unique_train_people])
test_persons = pd.concat([test_persons, unique_test_people])

print((train_persons['num_photos'].sum(), test_persons['num_photos'].sum()))

(np.int64(12040), np.int64(2950))


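We now have train and test sets at roughly an 80/20 split. One caveat: the totals printed above (12040 and 2950) add up to 14990, which is more than the 13233 images in the dataset. The excess is exactly 1403 + 354 = 1757, which is consistent with the concat cell having been run twice, so that the 18 people with unique counts were appended a second time. With a single append the totals would be 9234 + 1403 = 10637 training images and 2242 + 354 = 2596 test images (about 80.4% / 19.6%), so it is worth restarting the kernel and running the cells once, in order, before saving. To make things easier, let's save these dataframes in parquet format for later issues, after merging them back with `df` so that they have the same format as the initial dataframe, with the paths to the images.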
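Before saving, a couple of cheap checks can confirm that the split is sound: no person on both sides, and image totals that add up to the size of the dataset. This is a sketch only; `check_person_split` is a hypothetical helper, shown here on toy data.

```python
import pandas as pd

def check_person_split(train_persons: pd.DataFrame, test_persons: pd.DataFrame,
                       total_images: int) -> dict:
    """Return simple diagnostics for a person-level train/test split."""
    n_train = int(train_persons['num_photos'].sum())
    n_test = int(test_persons['num_photos'].sum())
    return {
        # People present in both sets (should be empty)
        'overlap': set(train_persons['person']) & set(test_persons['person']),
        # False if anyone was dropped or counted twice
        'images_match': n_train + n_test == total_images,
        'train_share': n_train / (n_train + n_test),
    }

# Toy example with made-up people: 10 images in total.
toy_train = pd.DataFrame({'person': ['A', 'B', 'C'], 'num_photos': [5, 2, 1]})
toy_test = pd.DataFrame({'person': ['D'], 'num_photos': [2]})
report = check_person_split(toy_train, toy_test, total_images=10)
print(report)

# Appending the same person twice is caught by the image total.
doubled_train = pd.concat([toy_train, toy_train.tail(1)])
bad_report = check_person_split(doubled_train, toy_test, total_images=10)
print(bad_report['images_match'])
```

On the real data, the call would be `check_person_split(train_persons, test_persons, total_images=len(df))`.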

## 4. Saving Dataframes

In [46]:
train_df = pd.merge(train_persons, df, on='person', how='left')
test_df = pd.merge(test_persons, df, on='person', how='left')
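A merge like this silently multiplies rows if a person appears more than once on the left side. As a hedged sketch on toy data (all names below are made up), pandas' `validate='one_to_many'` option turns that situation into an error:

```python
import pandas as pd

toy_persons = pd.DataFrame({'person': ['A', 'B'], 'num_photos': [2, 1]})
toy_images = pd.DataFrame({
    'image_path': ['A_1.jpg', 'A_2.jpg', 'B_1.jpg'],
    'person': ['A', 'A', 'B'],
})

# validate='one_to_many' requires each person to appear only once on the left.
toy_merged = pd.merge(toy_persons, toy_images, on='person', how='left',
                      validate='one_to_many')
print(len(toy_merged) == toy_persons['num_photos'].sum())

# A person listed twice on the left is rejected instead of doubling their rows.
doubled = pd.concat([toy_persons, toy_persons.tail(1)])
try:
    pd.merge(doubled, toy_images, on='person', how='left', validate='one_to_many')
    caught = False
except pd.errors.MergeError:
    caught = True
print(caught)
```

After a clean merge, the number of rows should equal the sum of `num_photos`, which is a quick way to confirm that every image path was matched exactly once.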

In [45]:
train_df

|   | person | num_photos | image_path |
|---|---|---|---|
| 0 | Alicia_Hollowell | 1 | ../data/lfw-deepfunneled/lfw-deepfunneled/Alic... |
| 1 | Dwayne_Johnson | 2 | ../data/lfw-deepfunneled/lfw-deepfunneled/Dway... |
| 2 | Dwayne_Johnson | 2 | ../data/lfw-deepfunneled/lfw-deepfunneled/Dway... |
| 3 | Jim_Doyle | 1 | ../data/lfw-deepfunneled/lfw-deepfunneled/Jim_... |
| 4 | Pierre_Lacroix | 1 | ../data/lfw-deepfunneled/lfw-deepfunneled/Pier... |
| ... | ... | ... | ... |
| 12035 | Jean_Chretien | 55 | ../data/lfw-deepfunneled/lfw-deepfunneled/Jean... |
| 12036 | Jean_Chretien | 55 | ../data/lfw-deepfunneled/lfw-deepfunneled/Jean... |
| 12037 | Jean_Chretien | 55 | ../data/lfw-deepfunneled/lfw-deepfunneled/Jean... |
| 12038 | Jean_Chretien | 55 | ../data/lfw-deepfunneled/lfw-deepfunneled/Jean... |


In [47]:
test_df

|   | person | num_photos | image_path |
|---|---|---|---|
| 0 | Luis_Berrondo | 1 | ../data/lfw-deepfunneled/lfw-deepfunneled/Luis... |
| 1 | Antonio_Palocci | 8 | ../data/lfw-deepfunneled/lfw-deepfunneled/Anto... |
| 2 | Antonio_Palocci | 8 | ../data/lfw-deepfunneled/lfw-deepfunneled/Anto... |
| 3 | Antonio_Palocci | 8 | ../data/lfw-deepfunneled/lfw-deepfunneled/Anto... |
| 4 | Antonio_Palocci | 8 | ../data/lfw-deepfunneled/lfw-deepfunneled/Anto... |
| ... | ... | ... | ... |
| 2945 | Tom_Daschle | 25 | ../data/lfw-deepfunneled/lfw-deepfunneled/Tom_... |
| 2946 | Tom_Daschle | 25 | ../data/lfw-deepfunneled/lfw-deepfunneled/Tom_... |
| 2947 | Tom_Daschle | 25 | ../data/lfw-deepfunneled/lfw-deepfunneled/Tom_... |
| 2948 | Tom_Daschle | 25 | ../data/lfw-deepfunneled/lfw-deepfunneled/Tom_... |


In [49]:
train_df.to_parquet('../data/train.parquet')
test_df.to_parquet('../data/test.parquet')