# Emotion Recognition Classification Task 

## CS985 Deep Learning - GROUP S

#### Aishwarya Kurhe - 202382549                                                                 
#### Ayesha Gaikwad - 202363500
#### Krish Dasani - 202388555                                                                      
#### Santosh Bangera - 202366764

## 1. Overview

For the given classification task, machine learning models were developed, demonstrating the application of deep learning in emotion recognition from images. A RandomForest classifier was established as the baseline model, followed by exploration of deep neural network (DNN) architectures, integrating techniques such as data augmentation and early stopping to boost performance and address overfitting concerns. The best-performing DNN model was selected based on validation accuracy, proving its superiority over the baseline. Overall, the key learnings emphasize on the importance of thorough data preprocessing, selecting appropriate model architecture and evaluation strategies for developing effective machine learning models. Based on the modelling, possible recommendations include additional hyperparameter tuning and investigating more intricate model architectures to further optimize performance.

In [1]:
import pandas as pd
import numpy as np
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split, cross_val_score
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score



## 2. Methodology

### 2.1 Exploring the data

To approach modelling for this task, the process initiated with an extensive analysis of the loaded dataset that provided a better understanding of its characteristics, such as class distribution, potential imbalances, and the nature of the images themselves. The dataset show grayscale images depicting facial expressions across seven emotion classes. While class distribution appeared balanced, some classes may have fewer samples. Images exhibit diverse expressions, lighting conditions, and orientations, indicating the complexity of emotion recognition.

In [2]:
# Load datasets
train_df = pd.read_csv('my_emotion_train.csv')
test_df = pd.read_csv('my_emotion_test.csv')



### 2.2 Data Preprocessing

Subsequently, the data underwent preprocessing, with pixel values originally provided as strings converted into NumPy arrays and normalised to a range between 0 to 1 to facilitate convergence during training. 
In regards to the methodology, one of the most important aspects was the separation of the dataset into three distinct sets: training, validation, and testing. This ensured that models were trained on a portion of the data, validated on another portion to monitor performance and hyperparameters, and ultimately evaluated on unseen test data. 

In [3]:
# Preprocess function to convert pixels from string to numpy array
def preprocess_pixels(pixel_str):
    return np.array([int(pixel) for pixel in pixel_str.split()]).reshape(48, 48, 1)

# Apply preprocessing
X_train = np.array([preprocess_pixels(x) for x in train_df['pixels']])
y_train = to_categorical(train_df['emotion'])
X_test = np.array([preprocess_pixels(x) for x in test_df['pixels']])

# Normalize pixel values
X_train = X_train / 255.0
X_test = X_test / 255.0



### 2.3 Data Augmentation

Given the limited size of the dataset, the utilization of data augmentation techniques was considered crucial. Techniques such as rotation, flipping, and zooming adjustment were employed to enhance model generalization and improve performance by introducing variability in the training data. By diversifying the dataset, these augmentation techniques aimed to give the model access to a bigger and more representative sample of data for learning.

In [4]:
# Split the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Data augmentation
datagen = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
    fill_mode='nearest'
)
datagen.fit(X_train)



## 3. Models

Model selection for the task involved exploring both traditional machine learning approaches and deep neural networks (DNNs).

### 3.1 Standard ML Baseline Model

The RandomForest classifier was selected as the Standard ML Baseline model stems from its simplicity, effectiveness, and suitability for tabular data like the one in this dataset. RandomForest is an ensemble learning method based on decision trees, where multiple decision trees are trained on random subsets of the data and then aggregated to make predictions.
In this implementation, the RandomForestClassifier was instantiated with 100 estimators (decision trees) and a random state of 42 for reproducibility. The 'n_estimators' parameter controlled the number of decision trees in the forest, and a higher number generally led to better performance, albeit with increased computational cost.

Before fitting the RandomForest model, the input data was preprocessed to meet the classifier's specifications. Because RandomForest works with flattened input arrays rather than multidimensional arrays, the input data (X_train) was reshaped using'reshape()' to convert each image into a one-dimensional array while keeping the number of samples. Likewise, the target labels (y_train) were transformed with 'argmax(axis=1)' to return the one-hot encoded labels to their original integer format, which is suitable for RandomForest classification.

The RandomForest model's performance was evaluated using cross-validation. Cross-validation was a reliable technique for estimating a model's performance by dividing the dataset into multiple subsets (folds), training the model on one subset, and testing it on the remaining data. The 'cross_val_score' function from scikit-learn was used in conjunction with a 5-fold cross-validation strategy ('cv=5'). This meant that the dataset was divided into five equal-sized folds, and the model was trained and tested five times, with each fold serving as a validation set once.

The mean accuracy and standard deviation of the cross-validation scores were computed and stored in the 'model_performance' list, along with the model name ('RandomForest'). These metrics provided information about the model's average performance and variability across various folds. Thus, the RandomForest classifier was utilised as a baseline for comparing the performance of more complex models, such as deep neural networks, in subsequent analyses.

In [5]:
# Prepare for model comparisons
model_performance = []

# Standard ML Baseline with RandomForest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Flatten X for RandomForest
X_train_flat = X_train.reshape(X_train.shape[0], -1)
y_train_flat = y_train.argmax(axis=1)
# Cross-validation for RandomForest model
rf_cv_scores = cross_val_score(rf_model, X_train_flat, y_train_flat, cv=5)
model_performance.append({
    'Model': 'RandomForest',
    'Mean Accuracy': rf_cv_scores.mean(),
    'Std Deviation': rf_cv_scores.std()
})



### 3.2 Deep NN models

In the exploration of deep neural network (DNN) models, a sequential model architecture was adopted, featuring convolutional layers followed by max-pooling and batch normalisation. The chosen architecture, a Sequential model, was designed specifically for image classification tasks, beginning with convolutional layers that extracted features from input images. Each convolutional layer was combined with max-pooling to downsample feature maps and batch normalisation to achieve better stability and speed up training.

The model architecture commenced with a Conv2D layer employing 32 filters and a 3x3 kernel size, followed by a ReLU activation function. Subsequent layers deepened the network's capacity by increasing the number of filters in the convolutional layers (64 and 128), further enhancing its ability to discern intricate patterns within the images. Batch normalization after each convolutional layer aided in normalizing activations, promoting faster convergence and mitigating issues related to internal covariate shift.

Following the convolutional layers, the feature maps were flattened into a one-dimensional array to transition into fully connected layers. Here, two Dense layers were employed, each containing 256 neurons activated by ReLU, with a dropout layer to prevent overfitting. The final Dense layer had 7 neurons, representing the 7 emotion classes in the dataset, and utilized a softmax activation function to output class probabilities. During training, the model underwent optimization using the Adam optimizer with a learning rate set to 0.001 and minimized categorical cross-entropy loss. 

In [6]:
# Deep NN models exploration
val_accuracies = []

def create_deep_nn_model():
    model = Sequential([
        Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(48, 48, 1)),
        MaxPooling2D(pool_size=(2, 2)),
        BatchNormalization(),
        Conv2D(64, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        BatchNormalization(),
        Conv2D(128, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        BatchNormalization(),
        Flatten(),
        Dense(256, activation='relu'),
        Dropout(0.5),
        Dense(7, activation='softmax')
    ])
    model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
    return model



To reduce overfitting, two key strategies were applied during training. Early stopping was used to monitor validation accuracy, halting training if performance did not improve after a set number of epochs. This impeded the model from overmemorizing the training data, allowing it to retain generalisation capabilities. Moreover, a learning rate scheduler dynamically adjusted the learning rate during training, starting at a high value for rapid convergence and gradually decreasing to fine-tune model parameters as training progressed.

Five different model configurations were evaluated for optimal performance. This iterative process entailed adjusting various architectural components and training parameters, including convolutional layer configurations, dropout rates, and batch sizes. Each configuration was trained and evaluated on the validation set, with the model that achieved the highest validation accuracy being chosen for further analysis. By systematically refining the model architecture and training regimen, the goal was to develop a robust DNN model capable of accurately recognizing emotions from facial expressions across diverse settings and conditions.

In [7]:
best_val_accuracy = 0
best_model = None
for i in range(5):  # Example: trying 5 different configurations
    model = create_deep_nn_model()
    early_stopping = EarlyStopping(monitor='val_accuracy', patience=5)
    model_checkpoint = ModelCheckpoint('best_model.keras', monitor='val_accuracy', save_best_only=True)
    
    def scheduler(epoch, lr):
        if epoch < 10:
            return float(lr)
        else:
            return float(lr * tf.math.exp(-0.1))

    lr_scheduler = LearningRateScheduler(scheduler)
    history = model.fit(datagen.flow(X_train, y_train, batch_size=64), epochs=50,
                        validation_data=(X_val, y_val),
                        callbacks=[early_stopping, model_checkpoint, lr_scheduler])
    val_accuracy = history.history['val_accuracy'][-1]
    val_accuracies.append(val_accuracy)
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        best_model = model




  super().__init__(


Epoch 1/50


  self._warn_if_super_not_called()


[1m363/363[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m90s[0m 193ms/step - accuracy: 0.2620 - loss: 2.2697 - val_accuracy: 0.2895 - val_loss: 1.9834 - learning_rate: 0.0010
Epoch 2/50
[1m363/363[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m62s[0m 167ms/step - accuracy: 0.3669 - loss: 1.6367 - val_accuracy: 0.4471 - val_loss: 1.4293 - learning_rate: 0.0010
Epoch 3/50
[1m363/363[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m59s[0m 161ms/step - accuracy: 0.4164 - loss: 1.5184 - val_accuracy: 0.4016 - val_loss: 1.5329 - learning_rate: 0.0010
Epoch 4/50
[1m363/363[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m58s[0m 158ms/step - accuracy: 0.4435 - loss: 1.4502 - val_accuracy: 0.4950 - val_loss: 1.3264 - learning_rate: 0.0010
Epoch 5/50
[1m363/363[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m64s[0m 174ms/step - accuracy: 0.4720 - loss: 1.3785 - val_accuracy: 0.5059 - val_loss: 1.2928 - learning_rate: 0.0010
Epoch 6/50
[1m363/363[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m 

The performance of the best deep neural network (DNN) model chosen during the exploration process was assessed by calculating its accuracy on the validation set and appending it, along with the standard deviation, to a list named 'model_performance'. A comparison table was then created from this list, outlining the performance metrics for each model, including the best DNN model. Subsequently, the best model's weights were loaded, and predictions were made on the test dataset, with the predicted classes saved along with the corresponding IDs into a submission file. This streamlined process encapsulated the evaluation and deployment of the top-performing DNN model for emotion recognition on the provided dataset.

In [8]:
# Assuming best model is selected, now add its performance to comparison
model_performance.append({
    'Model': 'Best DNN Model',
    'Mean Accuracy': np.mean(val_accuracies),
    'Std Deviation': np.std(val_accuracies)
})

# Constructing the comparison table
comparison_table = pd.DataFrame(model_performance)

# Output the comparison table
print(comparison_table)

best_model.load_weights('best_model.keras')
predictions = best_model.predict(X_test)
predicted_classes = np.argmax(predictions, axis=1)

# Prepare submission file using the 'id' column from the test dataset
submission_df = pd.DataFrame({'id': test_df['id'], 'emotion': predicted_classes})
submission_df.to_csv('my_emotion_result_final.csv', index=False)

            Model  Mean Accuracy  Std Deviation
0    RandomForest       0.438405       0.007041
1  Best DNN Model       0.615690       0.016857
[1m216/216[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 19ms/step


## 4. Results

### 4.1 Table of results


##### RandomForest

Mean Accuracy - 0.438405

Standard Deviation - 0.007041

Validation Accuracy - 0.4503

##### Best DNN Model

Mean Accuracy - 0.615690 

Standard Deviation - 0.016857

Validation Accuracy - 0.6324

Validation Accuracy without Augmentation - 0.582931


For best results, two methods for emotion recognition have been studied, including a deep neural network (DNN) architecture and the RandomForest baseline model. The RandomForest model achieved a mean accuracy of 0.438405 with a standard deviation of 0.00704 through cross-validation. However, the DNN approach, with five different configurations tested, outperformed the baseline, with the best model achieving a mean accuracy of 0.615690 and a standard deviation of 0.016857.

While experimenting with the DNN models, various configurations were explored, including different convolutional layer depths, pooling strategies, and regularization techniques like dropout. The introduction of data augmentation techniques further enhanced model performance by providing additional training data, since the validation accuracy was lower without it. Early stopping and learning rate scheduling were also employed to prevent overfitting and optimize training dynamics.

Despite the successes, challenges were encountered during the experimentation process. Some configurations may have led to overfitting or failed to converge effectively, highlighting the delicate balance required in optimizing DNN architectures. In addition, hyperparameter tuning had the potential to improve performance even further, but extensive experimentation was limited by computational resources.

In its entirety, the results affirm the potential of deep learning methodologies, particularly convolutional neural networks, in accurately discerning emotions from facial expressions, laying a foundation for continued advancement in emotion recognition research.

## 5. Summary

As both of the baseline Random Forest classifier and deep neural network (DNN) models were tested, the best-performing DNN model was opted based on validation accuracy and utilized for predicting emotions on the test dataset. Hence, the recommended model for this task would be the deep neural network (DNN) model, due to its ability to capture intricate patterns in facial expressions through convolutional layers. The selected DNN model demonstrates superior performance compared to the baseline Random Forest classifier, as indicated by its higher mean accuracy in the validation set. 

Similarly, the DNN model turned out to be the best performing technique on the Kaggle submission, with a test accuracy (Kaggle score) of 0.62. Contrary to this, the Standard baseline Random Forest classifier model merely achieved a test accuracy score of 0.50.

Multiple strategies can be contemplated in order to enhance performance in this task even more. Better outcomes might be obtained by experimenting with more intricate DNN architectures, modifying hyperparameters like learning rate and dropout rates, and looking into advanced techniques like transfer learning. Beyond that, expanding the variety and magnitude of the dataset and modifying the parameters for data augmentation could improve the model's capacity to generalise and adapt to a variety of facial expressions and contextual factors. Regular monitoring and updating of the model with new data can also contribute to continuous improvement in performance over time.