# Build the dataset

How we collect and build our dataset.

# 1. Download the data

To obtain data for our face verification system, we use several methods:

- Use pre-built datasets, in this case the LFW dataset

- Use web scraping to collect data from the internet

- Capture data from real-world sources using webcam devices or video files

More details about each method are discussed in the following sections. Regardless of the method you choose to add data, create a folder named `data` inside the project folder and add all your data there.

Note that this `data` folder is not committed to the repository, so run the following code to create it:

In [3]:
# Create data directory if it doesn't exist
import os
os.makedirs('data', exist_ok=True)


## 1.1 Pre-built dataset




Access this [link](http://vis-www.cs.umass.edu/lfw/) to download the LFW dataset.

Choose Download -> All images as gzipped tar file -> `lfw.tgz`. Then move the archive to the project workspace and extract it using:

```bash
tar -xvzf lfw.tgz
```
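If `tar` is not available (for example on some Windows setups), the same extraction can be sketched in Python with the standard `tarfile` module. The function name `extract_archive` and its parameters are hypothetical helpers for illustration, not part of the project code.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzipped tar archive into dest_dir.

    Returns the sorted list of top-level names found in the archive.
    (Hypothetical helper, shown only as a sketch.)
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getnames()
        # Newer Python versions offer extraction filters that reject unsafe paths
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    # Keep only the first path component of every member
    return sorted({name.split("/")[0] for name in members})
```

Calling `extract_archive("lfw.tgz")` from the project workspace should have the same effect as the `tar` command above.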

After that, open the extracted `lfw` folder. It contains one subfolder per person, named after that person, and each subfolder holds the images of that person.


![datasetView1](assets/images/datasetView1.png)


The dataset is composed of 13233 images of 5749 people. It is divided into subfolders, each containing the images of one specific person. Each image is named after the person followed by a number, is 250x250 pixels in size, and is in JPEG format. The structure of the dataset is as follows:

```plaintext
lfw
│
|───person_1
│   │   person_1_001.jpg
│   │   person_1_002.jpg
│   │   ...
│
|───person_2
│   │   person_2_001.jpg
│   │   person_2_002.jpg
│   │   ...
```
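To confirm that the extraction is complete, a small sketch like the one below can count the people and images on disk. The helper name `count_images_per_person` and the accepted file extensions are assumptions made for this example.

```python
import os

# Assumed set of image extensions for this sketch
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def count_images_per_person(root):
    """Return {person_folder_name: number_of_images} for each subfolder of root."""
    counts = {}
    for entry in os.scandir(root):
        if entry.is_dir():
            counts[entry.name] = sum(
                1
                for name in os.listdir(entry.path)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts


def print_summary(root):
    """Print the number of people and the total number of images under root."""
    counts = count_images_per_person(root)
    print(f"{len(counts)} people, {sum(counts.values())} images")
```

Running `print_summary("lfw")` on a complete copy of the dataset should report the figures quoted above (5749 people and 13233 images).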

Besides this dataset, we plan to use the open-source datasets provided by the University of Essex (face94, face95 and face96) if time and resources allow.



## 1.2 Collecting data from the internet using web scraping




We use the script below to collect images from the internet with the `simple_image_download` library. Specify the person's name as the keyword, and the script searches for images of that person and downloads them to a folder named after that person. Note that with this method you need to verify the images manually to ensure they show the correct person.

```python
from simple_image_download import simple_image_download as simp

# Reference the simple_image_download class from the simp module
response = simp.simple_image_download

# Keywords used to search for pictures; each keyword gets its own folder
keywords = ["George Wassouf", "Donald Trump", "Selena Gomez"]

# Loop over the keywords
# (kw, 1000) requests up to 1000 images for each keyword
for kw in keywords:
    response().download(kw, 1000)
```




## 1.3 Collecting data from the real world using webcam devices or video format




```python
import cv2
import os

# Input Path
video_path = '/home/jawabreh/Desktop/face_scan.MOV'

# Output Path
output_dir = '/home/jawabreh/Desktop/face-recognition/training/person_1'

# Create the output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Create a VideoCapture object to read the input video
cap = cv2.VideoCapture(video_path)

# Get the total number of frames in the video
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

# Calculate the frame interval needed to capture about 1000 images
# (max(1, ...) avoids a zero interval when the video has fewer than 1000 frames)
frame_interval = max(1, total_frames // 1000)  # change 1000 according to your needs

# Set the initial frame counter to 0
frame_counter = 0

while cap.isOpened():
    # Read a frame from the video
    ret, frame = cap.read()
    
    if not ret:
        break
    
    # Check if this is the frame to capture
    if frame_counter % frame_interval == 0 and frame_counter // frame_interval < 1000:
        # Save the frame as a JPEG image
        output_path = os.path.join(output_dir, f'{frame_counter//frame_interval + 1:03}.jpg')
        cv2.imwrite(output_path, frame)
    
    # Increment the frame counter
    frame_counter += 1
    
    if frame_counter >= total_frames:
        break

# Release the video capture object
cap.release()
print("\n\nDONE\n\n")
```
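The frame-selection arithmetic in the script above can be checked in isolation, without OpenCV. The sketch below reproduces the same condition on plain integers. The function name `sampled_frame_indices` is hypothetical, and the `max(1, ...)` guard is an assumption added here so that videos with fewer frames than requested images do not produce a zero interval.

```python
def sampled_frame_indices(total_frames, num_images):
    """Return the indices of the frames that the extraction loop would save."""
    # Assumed guard: never let the interval drop to zero
    interval = max(1, total_frames // num_images)
    return [
        index
        for index in range(total_frames)
        if index % interval == 0 and index // interval < num_images
    ]


# A 10-frame video sampled for 3 images uses an interval of 3
print(sampled_frame_indices(10, 3))     # [0, 3, 6]
# A 5-frame video cannot provide 1000 images, so every frame is kept
print(sampled_frame_indices(5, 1000))   # [0, 1, 2, 3, 4]
```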

**Alternatively, instead of using a video, capture images directly from a webcam with the following script: press `p` to capture an image and `q` to quit the program**:

```python
# Define the camera ID to use (the right value depends on your device)
CAM_ID = 3  # connects to the IR camera in our setup

import cv2
import os
import uuid

# Function to save the captured image to the specified folder
def save_image(image, folder_path, img_name):
    img_path = os.path.join(folder_path, img_name)
    cv2.imwrite(img_path, image)


cap = cv2.VideoCapture(CAM_ID)


# Get the name of the person to store in training data
name = input("Name of the person to store in training data: ")

# Loop through every frame in the webcam feed
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Display the frame
    cv2.imshow('Collect face training data: press p to capture, q to quit', frame)
    
    # Check for key presses
    key = cv2.waitKey(1) & 0xFF
    if key == ord('p'):
        # Save the frame to './data' folder
        save_path = os.path.join('data', name)
        os.makedirs(save_path, exist_ok=True)
        save_image(frame, save_path, str(uuid.uuid1()) + ".jpg")
        print("Image saved to", save_path)

    elif key == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```

The code above prompts the user to enter a name, then captures an image from the webcam each time `p` is pressed and saves it to a folder named after that person inside the `data` folder.


- CAM_ID = 0 for laptop normal webcam
- CAM_ID = 2 for laptop IR webcam
- CAM_ID = 4 for external webcam


***These numbers can differ from device to device. Try each number starting from 0 to find the one that works on your device.***
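Trying every ID by hand can be tedious, so the search can be sketched as a loop that opens each candidate ID and checks whether a frame can be read. The helper `find_working_cameras` and its `open_capture` parameter are hypothetical names introduced for this example; by default the helper uses `cv2.VideoCapture`, the same call as in the capture script.

```python
def find_working_cameras(max_id=5, open_capture=None):
    """Return the camera IDs in 0..max_id that open and deliver a frame.

    open_capture is a hypothetical hook: any callable that takes a camera ID
    and returns an object with isOpened(), read() and release().
    """
    if open_capture is None:
        # Imported lazily so the probing logic itself does not require OpenCV
        import cv2
        open_capture = cv2.VideoCapture

    working = []
    for cam_id in range(max_id + 1):
        cap = open_capture(cam_id)
        try:
            if cap.isOpened():
                ok, _frame = cap.read()
                if ok:
                    working.append(cam_id)
        finally:
            # Always free the device before probing the next ID
            cap.release()
    return working
```

Calling `find_working_cameras()` lists the IDs that respond on the current machine. It does not tell you which physical camera each ID belongs to, so a quick visual check with the capture script is still needed.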


Set up the camera:

**IR webcam:**

What is an IR webcam? See https://fptshop.com.vn/tin-tuc/danh-gia/ir-camera-la-gi-153147


**External webcam:**

Since the resolution of the laptop webcam is low, we use an external webcam, such as the camera of a mobile phone. To connect it to the laptop:

1. Download the DroidCam app on your phone and the DroidCam client on your laptop: https://www.dev47apps.com/
2. Follow the setup instructions on the website. On Linux:

```bash
wget -O droidcam_latest.zip https://files.dev47apps.net/linux/droidcam_2.1.3.zip
unzip droidcam_latest.zip -d droidcam
cd droidcam && sudo ./install-client
sudo apt install libappindicator3-1

# Fix missing video device
sudo apt install linux-headers-`uname -r` gcc make
sudo ./install-video
```
3. Open DroidCam on the phone and the DroidCam client on the laptop, then connect the phone to the laptop via USB or Wi-Fi.


> **NOTE:** Currently, we are testing only on the LFW dataset, but the second and third methods can be used to increase the diversity of the data or to add the specific data our face verification system needs. Most images in the LFW dataset show people from Western countries, so we can use the second and third methods to collect images of people from our own region or country. This will help the model perform better in our target market.

## 1.4 Reduce the amount of data

If you want to run this notebook on your own machine but the LFW dataset is too large for it, run the following code to delete random subfolders from the dataset and keep only as many as you want.


In [6]:
lfw_dir = './lfw' # REPLACE WITH YOUR PATH to the LFW dataset (after extracting the archive)
# In my case, it is located in the same directory as this notebook

# We do not want to modify or delete the original LFW dataset directly,
# so we copy it to the `data` directory we created earlier and process the copy
data_dir = './data'

import os
import shutil

# Create the data folder if it doesn't exist
os.makedirs(data_dir, exist_ok=True)

# Copy the content of the lfw folder to the data folder
for item in os.listdir(lfw_dir):
    s = os.path.join(lfw_dir, item)
    d = os.path.join(data_dir, item)
    if os.path.isdir(s):
        if not os.path.exists(d):
            shutil.copytree(s, d)
        else:
            for sub_item in os.listdir(s):
                sub_s = os.path.join(s, sub_item)
                sub_d = os.path.join(d, sub_item)
                if os.path.isdir(sub_s):
                    shutil.copytree(sub_s, sub_d, dirs_exist_ok=True)
                else:
                    shutil.copy2(sub_s, sub_d)
    else:
        shutil.copy2(s, d)

In [7]:
# Randomly delete subfolders to reduce the size of the dataset

import random

# Get a list of all subfolders
subfolders = [f.path for f in os.scandir(data_dir) if f.is_dir()]

# Shuffle the list of subfolders
random.shuffle(subfolders)

# Keep only 20 subfolders
subfolders_to_keep = subfolders[:20]

# Delete the remaining subfolders and everything inside them
for subfolder in subfolders[20:]:
    shutil.rmtree(subfolder)

print(f"Kept {len(subfolders_to_keep)} subfolders and deleted the rest.")

Kept 20 subfolders and deleted the rest.
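Because the selection above is random, each run keeps a different set of people. If a repeatable subset is preferred, the choice can be made with a seeded random generator, as in the sketch below. The helper `choose_folders_to_keep` and the default seed are assumptions for illustration; the helper only picks names and deletes nothing.

```python
import random


def choose_folders_to_keep(folders, keep=20, seed=42):
    """Pick `keep` folders reproducibly; the same seed gives the same subset."""
    # Sort first so the result does not depend on the directory listing order
    ordered = sorted(folders)
    if keep >= len(ordered):
        return ordered
    rng = random.Random(seed)
    return sorted(rng.sample(ordered, keep))
```

The folders that are not returned can then be removed in the same way as in the cell above.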


# 2. Data Augmentation

Create augmented images and store them in the same folder as the original images, which is the `data` folder.

Inside the `data` folder, there are many subfolders, each containing images of a person. The number of images in each subfolder varies. Count the number of images in each subfolder. If a subfolder has many images (high density), apply only a few augmentation operations to each image. Otherwise (low density), apply more augmentations to each image.

In [8]:
import cv2
import os
import random

# For augmentation operations
from albumentations import (
    Compose,
    RandomBrightnessContrast,
    VerticalFlip,
    HorizontalFlip,
    Rotate,
    ShiftScaleRotate,
    HueSaturationValue,
    GaussianBlur,
    GaussNoise,
    ElasticTransform,
    GridDistortion,
    CLAHE,
)

In [None]:
# Count the number of images inside each subfolder. If there are many (high density),
# we apply only a few augmentation operations to each image; otherwise (low density), we apply more augmentations to each image.
def count_images(folder):
    """Count number of image files in folder"""
    return len([f for f in os.listdir(folder) if f.lower().endswith(('.jpg', '.jpeg', '.png'))])


# Based on the number of images in a folder (density), decide how many augmentations to apply to each image.
def create_augmentations(density='high'):
    """Create list of augmentations based on density"""
    if density == 'high':
        # Limited augmentations with high density
        augmentations = [
            RandomBrightnessContrast(p=1.0),
            HorizontalFlip(p=1.0),
        ]
    else:
        # More augmentations: brightness, horizontal flip, rotation, zoom, color variation, ...
        augmentations = [
            RandomBrightnessContrast(p=1.0),
            HorizontalFlip(p=1.0),
            Rotate(limit=60, p=1.0),
            ShiftScaleRotate(
                shift_limit=0.3,
                scale_limit=0.3,
                rotate_limit=20,
                p=1.0
            ),
            HueSaturationValue(p=1.0),
        ]
    return augmentations


# Augment images in a folder using predefined density transformations
def augment_folder(folder_path, density):
    """Augment images in folder using predefined density transformations"""
    augmentations = create_augmentations(density)
    image_files = [f for f in os.listdir(folder_path) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    if not image_files:
        # Nothing to augment; random.randint(1, 0) would raise an error
        print(f"No images found in folder: {folder_path}")
        return

    # Randomly decide how many images to augment (both the number and which images)
    num_to_augment = random.randint(1, len(image_files))
    images_to_augment = random.sample(image_files, num_to_augment)
    
    for img_name in images_to_augment:
        img_path = os.path.join(folder_path, img_name)
        try:
            img = cv2.imread(img_path)
            if img is None:
                continue
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            
            for aug in augmentations:
                # Apply each augmentation separately
                augmented = aug(image=img)
                augmented_image = augmented["image"]
                
                filename, ext = os.path.splitext(img_name)
                aug_name = aug.__class__.__name__
                aug_filename = f"{filename}_{aug_name}{ext}"
                aug_path = os.path.join(folder_path, aug_filename)
                
                aug_bgr = cv2.cvtColor(augmented_image, cv2.COLOR_RGB2BGR)
                cv2.imwrite(aug_path, aug_bgr)  # File name: original image name + augmentation type
                
        except Exception as e:
            print(f"Error processing {img_path}: {str(e)}")
            
    print(f"Augmentation completed for folder: {folder_path} with density: {density}")

In [10]:
data_directory = 'data'
# A folder with more than 3 images is considered high density; otherwise, low density
DENSITY_THRESHOLD = 3

# Process each subfolder in the data directory
for subfolder in os.listdir(data_directory):
    folder_path = os.path.join(data_directory, subfolder)
    if os.path.isdir(folder_path):
        num_images = count_images(folder_path)
        if num_images > DENSITY_THRESHOLD:
            density = 'high'
        else:
            density = 'low'
        augment_folder(folder_path, density)

Augmentation completed for folder: data/Robert_Lee_Yates_Jr with density: low
Augmentation completed for folder: data/Stan_Kroenke with density: low
Augmentation completed for folder: data/Raza_Rabbani with density: low
Augmentation completed for folder: data/Bill_Byrne with density: low
Augmentation completed for folder: data/Kevin_Tarrant with density: low
Augmentation completed for folder: data/TA_McLendon with density: low
Augmentation completed for folder: data/Michael_Shelby with density: low
Augmentation completed for folder: data/Arthur_Johnson with density: low
Augmentation completed for folder: data/Francisco_Garcia with density: low
Augmentation completed for folder: data/Zahir_Shah with density: low
Augmentation completed for folder: data/Ellen_Pompeo with density: low
Augmentation completed for folder: data/Elizabeth_Hurley with density: high
Augmentation completed for folder: data/Paul_Coppin with density: low
Augmentation completed for folder: data/Rick_Husband with dens

# 3. Complete data folder

You can add your own data as you like. Just put all of it inside the `data` folder. The structure of the data folder should be as follows:

```plaintext
data
│
|───person_1
│   │   person_1_001.jpg
│   │   person_1_002.jpg
│   │   ...
│
|───person_2
│   │   person_2_001.jpg
│   │   person_2_002.jpg
│   │   ...
```
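Before moving on, it can help to check that the `data` folder really follows this layout. The sketch below reports files placed directly in the root and person folders that contain no images. The helper name `check_data_folder` and the list of accepted extensions are assumptions made for this example.

```python
import os

# Assumed set of image extensions for this sketch
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def check_data_folder(root="data"):
    """Return (stray_files, empty_people) found directly under root.

    stray_files: files that sit in root instead of inside a person folder
    empty_people: person folders that contain no image files
    """
    stray_files, empty_people = [], []
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_file():
            stray_files.append(entry.name)
        elif entry.is_dir():
            has_image = any(
                name.lower().endswith(IMAGE_EXTENSIONS)
                for name in os.listdir(entry.path)
            )
            if not has_image:
                empty_people.append(entry.name)
    return stray_files, empty_people
```

Two empty lists suggest that the folder matches the expected structure.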

Now that we have the dataset, we continue by preprocessing the data for the training process.

- For the first pipeline (Facenet + SVM), the data preprocessing is in the `Pipeline1 DataPreprocessing.ipynb` notebook, and the training phase is in the `SVM_Classifier.ipynb` notebook.

- For the second pipeline (Siamese Architecture + L1 distance), the data preprocessing is in the `Pipeline2 DataPreprocessing.ipynb` notebook, and the training phase is in the `Siamese_Network.ipynb` notebook.