# <b>1. Introduction</b>

### <u>Project Overview</u>

Facial keypoint detection is a computer vision task that involves identifying and locating specific points on a person's face, such as the eyes, nose, and mouth. This project aims to develop a reliable neural network model that can accurately detect these facial keypoints in various images. By utilizing advanced machine learning techniques and data augmentation methods, the project seeks to enhance the precision and consistency of facial feature detection. This is important for applications like facial recognition systems, virtual makeup tools, and interactive technologies that respond to facial expressions.

### <u>Problem Statement</u>

While there have been significant advancements in facial keypoint detection, achieving high accuracy and reliability in real-world conditions remains challenging. Current models often struggle with faces at extreme angles, with partial occlusions, or under varying lighting conditions. Additionally, the limited availability of diverse annotated data makes it difficult to create models that work well for everyone. This project aims to address these issues by developing a facial keypoint detection model that remains accurate and dependable across a wide range of real-world scenarios.

The main goals of this project are:

- Model Development: Create and implement a deep learning model, preferably using convolutional neural networks, optimized for precise facial keypoint detection.

- Data Augmentation: Apply advanced data augmentation techniques to increase the diversity of the training dataset, helping the model generalize better to new and unseen facial variations.

- Performance Optimization: Tune the model’s hyperparameters and use regularization methods to prevent overfitting, so that the model performs well on data it was not trained on.

- Evaluation and Validation: Thoroughly assess the model’s performance using both numerical metrics and visual inspections, comparing it with existing benchmarks to demonstrate its effectiveness.

- Application Integration: Explore how the developed model can be used in areas like facial recognition, emotion detection, and augmented reality, showcasing its practical usefulness and flexibility.

# <b>2. Handling of Training and Inference</b>

## Data Preparation

### <u>Dataset:</u>

The project uses the Facial Keypoints Detection dataset, which contains images of human faces annotated with keypoints on features such as the eyes, nose, and mouth. The dataset is sourced from AFLW and comprises approximately 25,000 images. Each image is labeled with 20 keypoints that give the locations of facial features. The dataset covers diverse facial expressions, lighting conditions, and angles, which helps the model generalize across different scenarios.

### <u>Preprocessing Steps:</u>
Before training, the data undergoes several preprocessing steps to ensure consistency and improve model performance:

1. Data Cleaning: Images with missing or corrupted keypoints are removed to maintain data integrity.

2. Normalization: Pixel values of the images are scaled to a range of [0, 1] to facilitate faster and more stable training. Similarly, keypoint coordinates are normalized relative to the image dimensions.

3. Data Augmentation: To increase the diversity of the training data, techniques such as rotations, horizontal flips, and brightness adjustments are applied using the Albumentations library. This helps the model become more robust to variations in real-world data.
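
The cleaning and normalization steps above can be sketched in a few lines of NumPy. This is only an illustration, not the project's actual preprocessing code: the function names, the `(K, 2)` keypoint layout in `(x, y)` pixel order, and the use of NaN to mark missing keypoints are all assumptions.

```python
import numpy as np


def has_valid_keypoints(keypoints):
    """Data-cleaning check: reject samples with missing (NaN/inf) coordinates."""
    return bool(np.isfinite(keypoints).all())


def normalize_sample(image, keypoints):
    """Scale pixels to [0, 1] and keypoints to fractions of the image size.

    image:     uint8 array of shape (H, W) or (H, W, C)
    keypoints: float array of shape (K, 2) holding (x, y) in pixels (assumed layout)
    """
    height, width = image.shape[:2]
    image_norm = image.astype(np.float32) / 255.0
    # Divide x by the width and y by the height so both land in [0, 1].
    scale = np.array([width, height], dtype=np.float32)
    keypoints_norm = keypoints.astype(np.float32) / scale
    return image_norm, keypoints_norm
```

Normalizing keypoints by the image dimensions keeps the regression targets in the same range as the inputs, which is the reason the predictions must later be scaled back at inference time.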

Note that I (Asajad) trained with data augmentation, while my other group members did not. The results are compared in the Data Augmentation section below.

## Model Architecture:



### <u>Design Choices:</u>

The model is built using a Convolutional Neural Network (CNN) architecture, which is well-suited for image-based tasks. The network consists of multiple convolutional layers followed by pooling layers to extract hierarchical features from the input images. After the convolutional blocks, fully connected layers are used to map the extracted features to the final keypoint coordinates. Dropout layers are included to prevent overfitting by randomly disabling neurons during training.

A CNN was chosen for its proven effectiveness in image recognition and localization tasks. The architecture balances depth and complexity to capture intricate facial features without being overly computationally intensive. Utilizing dropout layers enhances the model's ability to generalize by reducing reliance on specific neurons, thereby improving performance on unseen data.

![Network Architecture](model/network_architecture.png)
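
For readers who want the design in code form, here is a minimal PyTorch sketch of a network with this general shape. It is not the exact architecture shown in the figure: the number of blocks, the channel widths, the single-channel input, the dropout rate, and the class name `KeypointCNN` are all assumptions chosen for illustration. Only the overall pattern (convolution and pooling blocks, then fully connected layers with dropout, ending in one `(x, y)` pair per keypoint) comes from the description above.

```python
import torch
from torch import nn


class KeypointCNN(nn.Module):
    """Illustrative CNN: conv/pool blocks, then fully connected layers with dropout."""

    def __init__(self, num_keypoints=20, in_channels=1, dropout=0.3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            # Pool to a fixed 4x4 grid so the head does not depend on the input size.
            nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Dropout(dropout),  # only active in training mode
            nn.Linear(256, num_keypoints * 2),  # one (x, y) pair per keypoint
        )

    def forward(self, x):
        return self.head(self.features(x))
```

With the default arguments, a batch of shape `(N, 1, H, W)` produces an output of shape `(N, 40)`, i.e. 20 keypoints with two coordinates each.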

## Training Configuration:

### <u>Hyperparameters:</u>
- Learning Rate: Set to 0.0001, it determines the step size during weight updates. A low learning rate favors stable convergence at the cost of slower training.

- Batch Size: Set to 32, it defines the number of samples processed before the model's internal parameters are updated.

- Epochs: The model is trained for 50 epochs, allowing sufficient iterations for learning.

- Weight Decay: Applied at 1e-5 to prevent overfitting by penalizing large weights.
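
The values above can be collected in one place, for example as a small configuration object. The sketch below is an assumption about how this might be organized (the names `TrainConfig` and `updates_per_epoch` are hypothetical), and it also shows how the batch size translates into weight updates per epoch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in the report."""
    learning_rate: float = 1e-4
    batch_size: int = 32
    epochs: int = 50
    weight_decay: float = 1e-5


def updates_per_epoch(num_samples, batch_size):
    """Number of optimizer steps per epoch, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)


config = TrainConfig()
```

As a rough guide, if all of the approximately 25,000 images were used for training, `updates_per_epoch(25000, config.batch_size)` would give 782 updates per epoch. The actual number depends on the train/validation split, which is not stated here.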

### <u>Training Pipeline:</u>

1. Forward Pass: Input images are fed through the CNN to obtain predicted keypoints.

2. Loss Calculation: The Mean Squared Error (MSE) loss function measures the difference between predicted and actual keypoints.

3. Backward Pass: Gradients are computed using backpropagation.

4. Optimization: The Adam optimizer updates the model's weights based on the gradients.

5. Learning Rate Scheduling: The ReduceLROnPlateau scheduler adjusts the learning rate if the validation loss does not improve, aiding in finer convergence.

6. Regularization: Dropout layers within the network help prevent overfitting by randomly deactivating neurons during training.
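
The pipeline above maps onto a standard PyTorch training loop. The following is a minimal sketch rather than the project's actual training script: the function names are hypothetical, the loaders are assumed to yield `(images, targets)` batches, and the scheduler's `factor` and `patience` are assumed values because the report does not state them.

```python
import torch
from torch import nn


def train_one_epoch(model, loader, optimizer, loss_fn):
    """Forward pass, MSE loss, backward pass, and optimizer step for each batch."""
    model.train()  # enables dropout
    total, count = 0.0, 0
    for images, targets in loader:
        optimizer.zero_grad()
        predictions = model(images)           # 1. forward pass
        loss = loss_fn(predictions, targets)  # 2. loss calculation
        loss.backward()                       # 3. backward pass
        optimizer.step()                      # 4. optimization
        total += loss.item() * images.size(0)
        count += images.size(0)
    return total / count


@torch.no_grad()
def evaluate(model, loader, loss_fn):
    """Average loss over a loader without updating the weights."""
    model.eval()  # disables dropout
    total, count = 0.0, 0
    for images, targets in loader:
        total += loss_fn(model(images), targets).item() * images.size(0)
        count += images.size(0)
    return total / count


def fit(model, train_loader, val_loader, epochs=50, lr=1e-4, weight_decay=1e-5):
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    # factor and patience are assumed values, not taken from the report.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3
    )
    history = []
    for _ in range(epochs):
        train_loss = train_one_epoch(model, train_loader, optimizer, loss_fn)
        val_loss = evaluate(model, val_loader, loss_fn)
        scheduler.step(val_loss)  # 5. lower the learning rate on a plateau
        history.append((train_loss, val_loss))
    return history
```

Switching between `model.train()` and `model.eval()` matters here because dropout (step 6) should only deactivate neurons during training, not while measuring validation loss.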

## Data Augmentation


### <u>Techniques Applied:</u>
The Albumentations library is employed to apply the following augmentation methods:

- Horizontal Flips: Randomly flips images horizontally to simulate different face orientations.

- Rotations: Rotates images within a range of ±15 degrees to mimic varied head tilts.

- Shift, Scale, and Rotate (ShiftScaleRotate): Combines shifting, scaling, and rotating to create more diverse training samples.

- Brightness and Contrast Adjustments: Alters the brightness and contrast to account for different lighting conditions.

- Gaussian Blur: Applies blurring to simulate out-of-focus scenarios.
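
Geometric augmentations have to move the keypoints together with the pixels, otherwise the labels no longer match the image. Albumentations handles this internally. The NumPy sketch below is not the library's code; it only illustrates the idea for a horizontal flip, assuming keypoints are stored as `(x, y)` pixel indices. The `swap_pairs` argument is hypothetical: it stands for pairs of left/right keypoint indices, which depend on the dataset's annotation order and are not given in this report.

```python
import numpy as np


def horizontal_flip(image, keypoints, swap_pairs=()):
    """Mirror an image left-to-right and update pixel-space keypoints to match.

    image:      array of shape (H, W) or (H, W, C)
    keypoints:  float array of shape (K, 2) holding (x, y) in pixels
    swap_pairs: hypothetical (left_index, right_index) pairs to exchange so
                that, e.g., "left eye" still labels the eye on the image's left
    """
    width = image.shape[1]
    flipped_image = image[:, ::-1].copy()
    flipped_kps = keypoints.astype(np.float32).copy()
    flipped_kps[:, 0] = (width - 1) - flipped_kps[:, 0]  # mirror x, keep y
    for left, right in swap_pairs:
        flipped_kps[[left, right]] = flipped_kps[[right, left]]
    return flipped_image, flipped_kps
```

Photometric augmentations such as brightness, contrast, and blur change only pixel values, so the keypoints stay where they are.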

Data augmentation had a negligible effect on the results. The lowest validation loss reached by the other group members, who trained without augmentation, was 0.0048, while my augmented run reached 0.0047.

## Inference Pipeline

### <u>Steps Involved:</u>

The inference process follows a streamlined sequence to generate keypoint predictions from input images:

1. Image Input: A new facial image is provided to the trained model.

2. Preprocessing: The image undergoes the same normalization as during training to ensure consistency.

3. Model Prediction: The CNN processes the image and outputs predicted keypoint coordinates.

4. Post-processing: The normalized keypoints are scaled back to the original image dimensions to obtain accurate positions.

5. Visualization: Predicted keypoints are plotted on the image for easy interpretation.


Post-processing consists of two operations:

- Denormalization: The predicted keypoint coordinates, which are normalized, are scaled back to match the original image size by multiplying each x value by the image width and each y value by the image height.

- Clipping: Ensures that keypoints lie within the image boundaries by clipping any coordinates that fall outside the valid range.
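
The two post-processing operations above can be sketched as a single NumPy function. The function name and the assumption that the model outputs a flat vector of alternating `x, y` values are mine; clipping to `width - 1` and `height - 1` assumes coordinates are used as pixel indices.

```python
import numpy as np


def postprocess_keypoints(pred_norm, image_width, image_height):
    """Map normalized predictions back to pixel coordinates and clip them.

    pred_norm: flat sequence of length 2*K or array of shape (K, 2), with
               values expected to lie roughly in [0, 1]
    """
    keypoints = np.asarray(pred_norm, dtype=np.float32).reshape(-1, 2)
    # Denormalization: x scales with the width, y with the height.
    keypoints = keypoints * np.array([image_width, image_height], dtype=np.float32)
    # Clipping: keep every point on a valid pixel index.
    keypoints[:, 0] = np.clip(keypoints[:, 0], 0, image_width - 1)
    keypoints[:, 1] = np.clip(keypoints[:, 1], 0, image_height - 1)
    return keypoints
```

For example, on a 200 by 100 image the prediction `[0.5, 0.25, 1.2, -0.1]` becomes the points `(100, 25)` and `(199, 0)`: the first is simply rescaled, while the second falls outside the image and is clipped to the border.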

In [2]:
import pandas as pd
import matplotlib.pyplot as plt