# CBU5201 mini-project submission

The mini-project has two separate components:


1.   **Basic component** [6 marks]: Using the genki4k dataset, build a machine learning pipeline that takes an image as input and predicts 1) whether the person in the image is smiling and 2) the 3D head pose labels of the person in the image.
2.   **Advanced component** [10 marks]: Formulate your own machine learning problem and build a machine learning solution using the genki4k dataset (https://inc.ucsd.edu/mplab/398/). 

Your submission will consist of two Jupyter notebooks, one for the basic component and another one for advanced component. Please **name each notebook**:

* CBU5201_miniproject_basic.ipynb
* CBU5201_miniproject_advanced.ipynb

then **zip and submit them together**.

Each uploaded notebook should include: 

*   **Text cells**, describing concisely each step and results.
*   **Code cells**, implementing each step.
*   **Output cells**, i.e. the output from each code cell.

and **should have the structure** indicated below. Notebooks might not be run, so please make sure that the output cells are saved.

How will we evaluate your submission?

*   Conciseness in your writing (10%).
*   Correctness in your methodology (30%).
*   Correctness in your analysis and conclusions (30%).
*   Completeness (10%).
*   Originality (10%).
*   Efforts to try something new (10%).

Suggestion: Why don't you use **GitHub** to manage your project? GitHub can be used as a presentation card that showcases what you have done and gives evidence of your data science skills, knowledge and experience. 

Each notebook should be structured into the following 9 sections:


# 1 Author

**Student Name**: yida Wang 
**Student ID**:  210978645(QMUL) 2021212787(BUPT)

**Access this code from github** https://github.com/akawangyida/BUPT-mini-project.git



# 2 Problem formulation

Describe the machine learning problem that you want to solve and explain what's interesting about it.

Developing a Facial Expression and Pose Analysis System: In this initiative, the goal is to build a sophisticated machine learning model leveraging the genki4k dataset. This model is tasked with dual functions: identifying whether the person in an image is smiling, and gauging the 3D orientation of the person's head. The key challenge here is to precisely capture and interpret nuanced facial expressions and the directional positioning of the head. This endeavor holds considerable potential in enhancing interactive technologies and contributing to research in the field of human-machine interaction.

# 3 Machine Learning pipeline

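Pipeline overview: The pipeline takes a face image from the genki4k dataset as input and produces two outputs: a smile label (smiling or not smiling) and three continuous values describing the 3D head pose. It consists of the following stages:

*   **Preprocessing**: each image is converted to RGB, resized to 64x64 pixels, and scaled to the range [0, 1].
*   **Data split**: 20% of the samples are held out as a test set (`test_size=0.2`, `random_state=42`).
*   **Feature extraction**: three Conv2D + MaxPooling2D blocks (32, 64, and 128 filters), followed by a Flatten layer.
*   **Prediction**: a Dense(512) layer with Dropout(0.5) feeds a task-specific output layer: a 2-unit softmax for smile detection, or a 3-unit linear layer for head pose regression.

The two tasks use separate CNNs with the same architecture up to the output layer. Combining expression recognition with spatial orientation analysis is what makes the problem interesting.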

# 4 Transformation stage

Describe any transformations, such as feature extraction. Identify input and output. Explain why you have chosen this transformation stage.

**Transformation stage: feature extraction**

**Input:** the array of preprocessed images. These images have already been resized to 64x64 pixels and normalized (pixel values scaled to the range [0, 1]).

**Feature extraction process:**

*   **Convolutional layers (Conv2D):** The CNN employs multiple convolutional layers. Each layer uses a set of learnable filters to capture various aspects of the image, such as edges, textures, and other patterns. The 'relu' (Rectified Linear Unit) activation function introduces non-linearity, allowing the model to learn more complex features.
*   **Pooling layers (MaxPooling2D):** After each convolutional layer, a pooling layer reduces the spatial dimensions (height and width) of the output, condensing the feature maps. This reduction decreases the computational load and the number of parameters, making the network less prone to overfitting.
*   **Flattening:** The Flatten layer converts the 2D feature maps into a 1D vector. This transformation is necessary to feed the data into the dense layers.
*   **Dense layers and Dropout:** Dense layers further process the features, culminating in the output. Dropout is applied to prevent overfitting by randomly setting a fraction of the input units to zero during training.
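
To make the effect of the convolution and pooling layers concrete, the sketch below traces the feature-map size through the three Conv2D + MaxPooling2D blocks used in this notebook. It assumes the Keras defaults that the model code relies on (3x3 kernels with no padding and stride 1, and 2x2 pooling with stride 2); the helper names `conv_out`, `pool_out`, and `trace_shapes` are illustrative only.

```python
def conv_out(size, kernel=3):
    """Output size of a stride-1 convolution with no padding ('valid')."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output size of non-overlapping max pooling (floor division)."""
    return size // pool


def trace_shapes(size=64, filters=(32, 64, 128)):
    """Return (height, width, channels) after each conv + pool block."""
    shapes = []
    for n_filters in filters:
        size = pool_out(conv_out(size))
        shapes.append((size, size, n_filters))
    return shapes


shapes = trace_shapes()
height, width, channels = shapes[-1]
flattened = height * width * channels  # length of the vector fed to Dense(512)

print(shapes)     # [(31, 31, 32), (14, 14, 64), (6, 6, 128)]
print(flattened)  # 4608
```

Under these assumptions, the Flatten layer hands a 4608-element vector to the first dense layer, so most of the model's parameters sit in the Dense(512) layer rather than in the convolutions.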

# 5 Modelling

Describe the ML model(s) that you will build. Explain why you have chosen them.

**CNN model description**

*   **Convolutional layers (Conv2D):** These layers are the core building blocks of a CNN. They extract features from images by applying filters that capture aspects such as edges, textures, and patterns.
*   **Pooling layers (MaxPooling2D):** These layers follow the convolutional layers and reduce the spatial size of the representation. This lowers the computation required and helps retain dominant features while reducing noise.
*   **Flattening:** The Flatten layer converts the 2D feature maps into a 1D vector, which is necessary for the final dense layers.
*   **Dense layers and Dropout:** After flattening, a Dense(512) layer processes the features. A Dropout(0.5) layer is included to prevent overfitting by randomly dropping a fraction of the neurons during training.
*   **Output layer:** For smile detection, the final Dense layer has 2 units with a 'softmax' activation, giving the probability of each class. For head pose estimation, the final Dense layer has 3 linear units, one per pose value.

**Reasons for choosing a CNN**

*   **Specialization in image processing:** CNNs are designed for data with a grid-like topology, such as images, and handle image data efficiently because they capture spatial hierarchies.
*   **Feature extraction capability:** CNNs learn spatial hierarchies of features automatically. This is crucial for image tasks where manual feature extraction is complex and inefficient.
*   **Robustness and accuracy:** CNNs are robust to variations in the input and have achieved high accuracy in many image classification tasks.
*   **Scalability:** CNNs can be scaled in depth and complexity, allowing fine-tuning based on the requirements of the dataset and task.
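
The choice between a 2-unit softmax output and a 1-unit sigmoid output for a binary task can be illustrated with a small sketch. It shows that the softmax probability of the second class equals a sigmoid applied to the difference of the two scores, so the two formulations express the same model family. The logit values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical output scores for (not smiling, smiling)
logits = [0.3, 1.5]

p_softmax = softmax(logits)[1]               # P(smiling) from a 2-unit softmax
p_sigmoid = sigmoid(logits[1] - logits[0])   # P(smiling) from a sigmoid on the score difference

print(p_softmax, p_sigmoid)
```

The two printed values agree up to floating-point error, which is why a 2-unit softmax trained with `sparse_categorical_crossentropy` is a valid alternative to a single sigmoid unit.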

# 6 Methodology

Describe how you will train and validate your models, how model performance is assessed (i.e. accuracy, confusion matrix, etc).

For the first task, the model's performance will be evaluated using accuracy and a classification report (class_report). Accuracy will provide a quick overview of the overall effectiveness of the model, while the classification report will offer detailed insights into precision, recall, and F1-score for each class, allowing for a more nuanced understanding of the model's performance.
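
As a reminder of what the classification report measures, the following sketch computes precision, recall, F1-score, and accuracy from the four cells of a binary confusion matrix. The counts are hypothetical toy values, not results from this notebook, and `binary_report` is an illustrative helper rather than a scikit-learn function.

```python
def binary_report(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)   # of the predicted positives, how many are correct
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, with "smiling" treated as the positive class
report = binary_report(tp=45, fp=15, fn=5, tn=35)

for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the classifier finds most smiles (high recall) but also labels some non-smiling faces as smiling (lower precision), a distinction that accuracy alone would hide.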

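For the second task, Mean Squared Error (MSE) is the key performance metric. MSE indicates how close the model's predictions are to the actual values, with lower values indicating better performance, which makes it well suited to regression tasks that predict continuous outcomes. R-squared is also reported to show how much of the variance in the head pose values the model explains. Both models are trained for 20 epochs with the Adam optimizer, and 20% of the data is held out as the test set.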
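
The two regression metrics can be written out directly. The sketch below computes MSE and R-squared for a single output on hypothetical toy values; scikit-learn's `r2_score` applies the same formula per output and, by default, averages the result uniformly across outputs.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions for one pose value
y_true = [0.0, 0.1, 0.2, 0.3]
y_pred = [0.05, 0.1, 0.15, 0.3]

model_mse = mse(y_true, y_pred)
model_r2 = r_squared(y_true, y_pred)

# A model that always predicts the mean of the targets scores R-squared = 0
mean_value = sum(y_true) / len(y_true)
baseline_r2 = r_squared(y_true, [mean_value] * len(y_true))

print(model_mse, model_r2, baseline_r2)
```

Because R-squared compares the model with a constant mean predictor, it stays informative even when the targets span a small numeric range and the raw MSE looks small.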

# 7 Dataset

Describe the dataset that you will use to create your models and validate them. If you need to preprocess it, do it here. Include visualisations too. You can visualise raw data samples or extracted features.

The dataset is genki4k: 4000 face images (in the `files` directory) with a `labels.txt` file containing one row per image. The first label column is the smile label (0 = not smiling, 1 = smiling), and the remaining three columns are the 3D head pose values.

**Preprocessing steps:**

*   **Loading and resizing:** Images are loaded from the directory and resized to 64x64 pixels so that all inputs have the same size.
*   **Normalization:** Pixel values are scaled to the range 0 to 1 to facilitate efficient training.
*   **Shape:** Each image is converted to RGB, so the image array has shape (64, 64, 3) per image and already matches the input shape of the CNN; no extra channel dimension needs to be added.

**Visualization:**

*   **Sample images:** Display a few images from the dataset to understand the diversity and characteristics of the data.
*   **Label distribution:** Visualize the distribution of the labels (classes) in the dataset to identify any imbalance.

These visualizations help in understanding the dataset and in checking that the preprocessing steps suit the models to be trained.

# 8 Results

Carry out your experiments here, explain your results.

**Task 1: Smile Detection**

*   **Test accuracy**: 79.37%
*   **Classification report**:
    *   Class0 (Not Smiling): Precision 76%, Recall 77%, F1-score 77%.
    *   Class1 (Smiling): Precision 82%, Recall 81%, F1-score 81%.

**Analysis**: The model performs fairly well in distinguishing smiling from non-smiling faces, with slightly better performance in correctly identifying smiling faces (Class1). The balance between precision and recall for both classes indicates good model performance, with slightly better results for Class1.
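
One way to put the accuracy in context is to compare it with a majority-class baseline, that is, a classifier that always predicts the most frequent class. The sketch below uses the test-set support values from the classification report (354 and 446); the baseline itself is an added point of comparison, not an experiment that was run in this notebook.

```python
# Test-set support per class, taken from the classification report
support = {"Class0": 354, "Class1": 446}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total  # always predict the majority class

model_accuracy = 0.7937  # reported test accuracy
gain = model_accuracy - baseline_accuracy

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Gain over baseline: {gain:.4f}")
```

Always predicting "smiling" would already be right on 55.75% of the test images, so the CNN's contribution is best read as the gain of roughly 24 percentage points over that baseline.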

**Task 2: 3D Head Pose Estimation**

*   **Test Mean Squared Error (MSE)**: 0.01763
*   **R-squared**: 0.33458

**Analysis**: The MSE is relatively low, suggesting that the model's predictions are, on average, close to the actual 3D head pose values. However, the MSE should be interpreted in the context of the range and scale of the head pose angles. The R-squared value, a measure of how well the observed outcomes are replicated by the model, is around 33.46%. While this indicates some level of predictive power, it also suggests there is significant room for improvement, as a higher R-squared value would be desirable.
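
To relate the MSE to the scale of the angles, it can be converted to a root mean squared error (RMSE), which has the same unit as the labels. The sketch below additionally assumes that the pose labels are stored in radians; this is an assumption about the label file and should be checked against the dataset documentation before the degree value is relied upon.

```python
import math

test_mse = 0.017634311690926552  # reported test MSE, averaged over the three pose outputs

rmse = math.sqrt(test_mse)          # same unit as the pose labels
rmse_degrees = math.degrees(rmse)   # only meaningful if the labels are in radians (assumption)

print(f"RMSE: {rmse:.4f}")
print(f"RMSE in degrees (assuming radians): {rmse_degrees:.2f}")
```

Under the radians assumption, the typical error is on the order of 7 to 8 degrees per angle. Combined with the R-squared of about 0.33, this suggests the small MSE partly reflects the narrow range of the pose values rather than highly accurate predictions.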

# 9 Conclusions

Your conclusions, improvements, etc should go here

**Conclusions**

Smile detection:

*   **Performance**: The model achieves a decent accuracy of 79.37% in smile detection.
*   **Strengths**: It shows slightly better precision and recall for detecting smiles (Class1) than non-smiles (Class0), indicating effectiveness in recognizing smiling faces.
*   **Balanced metrics**: The close alignment of precision, recall, and F1-scores for both classes suggests balanced performance across both categories.

3D head pose estimation:

*   **MSE and R-squared**: The model's Mean Squared Error (MSE) is reasonably low at 0.01763, but the R-squared value is 33.46%, indicating moderate predictive accuracy.
*   **Room for improvement**: The R-squared value suggests that the model is capturing some, but not all, of the variability in the 3D head pose data.
**Suggested improvements**

For smile detection:

*   **Data augmentation**: No augmentation is applied in the current training code. Adding it could help the model generalize better, especially for the non-smiling class.
*   **Model complexity**: Experimenting with deeper or more complex CNN architectures could enhance the model's ability to capture subtle features indicative of smiling or not smiling.

For 3D head pose estimation:

*   **Advanced architectures**: Implementing more advanced or deeper neural network architectures might capture the complexities of 3D head pose estimation more effectively.
*   **Feature engineering**: Investigating additional preprocessing or feature engineering techniques could make the input data more representative.
*   **Hyperparameter tuning**: Systematic tuning of hyperparameters such as learning rate, batch size, and the number of epochs could improve model performance.

General improvements:

*   **Cross-validation**: Implementing k-fold cross-validation could provide a more robust evaluation of the model's performance than a single train/test split.
*   **Regularization techniques**: Both models already use a Dropout(0.5) layer. Additional regularization, such as L1/L2 weight penalties, could further reduce overfitting, particularly for the 3D head pose estimation model.
*   **Post-model analysis**: A detailed analysis of misclassified instances or poorly estimated poses could provide insights into specific areas where the models are underperforming.
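
As an illustration of the cross-validation suggestion, the sketch below generates k-fold train/validation index splits using only the standard library. It is a minimal sketch of the idea, not the procedure used in this notebook; the function name `k_fold_indices` and the choice of 5 folds are illustrative, and in practice `sklearn.model_selection.KFold` offers the same functionality.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_indices, val_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


# 4000 samples, as in the dataset loaded above
splits = list(k_fold_indices(4000, k=5))

for fold_number, (train, val) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: {len(train)} training samples, {len(val)} validation samples")
```

Each sample appears in exactly one validation fold, so training a fresh model per fold and averaging the scores gives an estimate that depends less on one particular split.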

In [24]:
import os
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def Llabels(labels_path):
    """Load labels from a text file."""
    labels = np.loadtxt(labels_path)
    return labels

def Limages(directory, image_size=(64, 64)):
    """Load images from a directory, resize and convert them to numpy arrays."""
    # Sort the file names: os.listdir returns them in arbitrary order, and the images
    # must line up with the rows of labels.txt (assumes the names sort in label order)
    image_files = [os.path.join(directory, file) for file in sorted(os.listdir(directory)) if file.endswith('.jpg')]
    images = [Image.open(file).convert('RGB').resize(image_size) for file in image_files]
    images = np.array([np.array(image) for image in images])
    return images


# Set the dataset directory and paths for files and labels
dataset_dir = '.' 
files_dir = os.path.join(dataset_dir, 'files')
labels_file = os.path.join(dataset_dir, 'labels.txt')

# Load data
labels = Llabels(labels_file)
images = Limages(files_dir)
images = images/255

print(f"Loaded {len(images)} images and {len(labels)} labels.")

Loaded 4000 images and 4000 labels.


In [18]:
# Split data into training and test sets (80/20); column 0 of the labels is the smile label
images_train_, images_test, labels_train_, labels_test = train_test_split(images, labels[:,0], test_size=0.2, random_state=42)

# Define a simple CNN model
def create_custom_cnn():
    custom_model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
        MaxPooling2D(2, 2),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),
        Conv2D(128, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(512, activation='relu'),
        Dropout(0.5),
        Dense(2, activation='softmax')
    ])
    return custom_model

custom_cnn_model = create_custom_cnn()

# Compile the model
custom_cnn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
training_history = custom_cnn_model.fit(images_train_, labels_train_, epochs=20)

# Evaluate the model
test_loss, test_accuracy = custom_cnn_model.evaluate(images_test, labels_test)
print(f"Test accuracy: {test_accuracy}")

# Predict class probabilities on the test set and take the most likely class
predicted_probabilities = custom_cnn_model.predict(images_test)
predicted_classes = np.argmax(predicted_probabilities, axis=1)

# Generate the classification report
class_report = classification_report(labels_test, predicted_classes, target_names=['Class0', 'Class1'])
print(class_report)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test accuracy: 0.793749988079071
              precision    recall  f1-score   support

      Class0       0.76      0.77      0.77       354
      Class1       0.82      0.81      0.81       446

    accuracy                           0.79       800
   macro avg       0.79      0.79      0.79       800
weighted avg       0.79      0.79      0.79       800



In [19]:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def create_regression_cnn():
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
        MaxPooling2D(2, 2),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),
        Conv2D(128, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(512, activation='relu'),
        Dropout(0.5),
        Dense(3)  # Output layer for 3D head pose regression (3 continuous values)
    ])
    return model

regression_cnn_model = create_regression_cnn()

# Compile the model for regression
regression_cnn_model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_squared_error'])

In [22]:
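# Split images and 3D head pose labels together so that they stay aligned.
# The same test_size and random_state as the smile split select the same test samples.
pose_labels = labels[:, 1:]
images_trainval, images_test, pose_labels_trainval, pose_labels_test = train_test_split(
    images, pose_labels, test_size=0.2, random_state=42)
# Holding out 25% of the remaining 80% gives a 60/20/20 train/validation/test split
images_train, images_val, pose_labels_train, pose_labels_val = train_test_split(
    images_trainval, pose_labels_trainval, test_size=0.25, random_state=42)

# Train the model
regression_training_history = regression_cnn_model.fit(
    images_train, pose_labels_train,
    epochs=20,
    validation_data=(images_val, pose_labels_val)
)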

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [23]:
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate the model
test_loss, test_mse = regression_cnn_model.evaluate(images_test, pose_labels_test)
print(f"Test MSE: {test_mse}")

# Predict on the test set
predictions = regression_cnn_model.predict(images_test)

# Calculate R-squared (R²) for regression
r_squared = r2_score(pose_labels_test, predictions)
print(f"R-squared for 3D head pose prediction: {r_squared}")


Test MSE: 0.017634311690926552
R-squared for 3D head pose prediction: 0.33458403762706473
