# Training and Evaluation Protocol

| Paper                     | Modalities                        | DL approaches used                      |
| -----------               | -----------                       | -----------                             |
| Yang et al. (2021)        | Video (facial + recovered rPPG)   | MTCNN (face crop), Inflated 3D ConvNet  |
| Rodriguez et al. (2017)   | Video (facial)                    | VGG-16 (VGG_faces) + LSTM               |

## Summaries of papers with relevant setups

### Yang, Ruijing, et al. "Non-contact Pain Recognition from Video Sequences with Remote Physiological Measurements Prediction." arXiv preprint arXiv:2105.08822 (2021).

*The objective of the framework is to learn an enriched facial representation that can distinguish different pain intensities. The paper presents the rPPG-enriched Spatio-temporal Attention Network (rSTAN), which processes a video snippet in two branches: an rPPG recovery branch (Deep-rPPG) that recovers rPPG signals as an auxiliary task, and a facial representation learning branch (STAN). A Visual Feature Enrichment (VFE) module complements the facial representation by considering the spatial ROIs from the rPPG features, with a focus on time steps that contain physiological variations related to pain.*

*Not a purely multi-modal approach since only video data are used as input.* 

**Proposed system**

Experimental setup details: Nvidia P100 using PyTorch.

1. The original video is downsampled to L = 64 frames.

2. MTCNN detects and crops the facial area.

3. Framework built upon Inflated 3D ConvNet. 

4. For parameter selection, all subjects are randomly split into 5 folds and 5-fold CV is used to determine the best parameters.

5. Adam is used as the optimizer with a learning rate of 2e-4, which is decayed after 10 epochs with gamma = 0.8.
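
Steps 4 and 5 can be sketched in plain Python. This is a minimal illustration, not the authors' code: the subject IDs, the seed, and the helper names `subject_folds` and `step_lr` are hypothetical. It also assumes the decay repeats every 10 epochs, which is how `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)` behaves; the notes only say the rate is decayed "after 10 epochs".

```python
import random


def subject_folds(subject_ids, n_folds=5, seed=0):
    """Randomly assign whole subjects to folds so no subject spans two folds."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    return [subjects[i::n_folds] for i in range(n_folds)]


def step_lr(epoch, base_lr=2e-4, step_size=10, gamma=0.8):
    """Learning rate at a 0-indexed epoch under step decay (assumed to repeat)."""
    return base_lr * gamma ** (epoch // step_size)


# Hypothetical subject IDs; the real dataset's identifiers will differ.
subjects = [f"S{i:02d}" for i in range(1, 26)]
folds = subject_folds(subjects)

for k, val_subjects in enumerate(folds):
    # Train on every fold except fold k, validate on fold k.
    train_subjects = [s for j, fold in enumerate(folds) if j != k for s in fold]
    print(f"fold {k}: {len(train_subjects)} train / {len(val_subjects)} val subjects")

for epoch in (0, 9, 10, 20):
    print(f"epoch {epoch}: lr = {step_lr(epoch):.6g}")
```

Splitting by subject rather than by frame keeps all frames of one person on the same side of the split, which is what makes the selected parameters meaningful for unseen subjects.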

**Evaluation and Comparison**

Because no similar multi-task framework existed in the literature, results are compared against multi-modal fusion schemes. All of those schemes use early fusion, because rSTAN also operates at the feature level rather than the decision level.

Performance also compared to uni-modal (appearance-based) approaches.

Comparisons with state-of-the-art methods follow the leave-one-subject-out (LOSO) protocol.
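
As a rough illustration of the LOSO protocol, the sketch below yields one train/test split per subject, with all samples of the held-out subject forming the test set. The function name `loso_splits` and the toy subject labels are assumptions made for this example only.

```python
def loso_splits(sample_subjects):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(sample_subjects)):
        train_idx = [i for i, s in enumerate(sample_subjects) if s != held_out]
        test_idx = [i for i, s in enumerate(sample_subjects) if s == held_out]
        yield held_out, train_idx, test_idx


# Toy example: one subject label per sample (hypothetical data).
sample_subjects = ["A", "A", "B", "C", "C", "C"]
splits = list(loso_splits(sample_subjects))

for held_out, train_idx, test_idx in splits:
    print(f"test on {held_out}: train={train_idx} test={test_idx}")
```

Metrics are then computed per held-out subject and aggregated, so every reported score comes from a subject the model never saw during training.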

### P. Rodriguez et al., "Deep Pain: Exploiting Long Short-Term Memory Networks for Facial Expression Classification," in IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2017.2662199.

*An end-to-end DL approach (it learns to extract features and also learns to use them to predict the level of pain). The approach is based on a CNN and applies temporal modeling with an LSTM to the features learned by the VGG_faces network.*

**Proposed system summary**

1. Images are cropped using facial landmarks and then frontalized. Generalized Procrustes analysis (GPA) is used to align the landmarks.

2. The dataset is imbalanced (8K pain frames and 40K frames labeled as no pain). The training data was balanced by randomly under-sampling the majority class (i.e. no-pain), and the results were complemented with normalized scores (balancing the validation data).

3. Target preprocessing: MSE is used as the loss, as it is very sensitive and best suited to cases where Gaussian noise is present. The labels (pain levels) are standardized before training.

4. Data augmentation: flipping images with 50% probability, adding random noise to the reference landmarks before performing piece-wise affine warping (i.e. introducing small deformations to faces).

5. A CNN is trained for pain-level recognition by fine-tuning the VGG-16 CNN pretrained on faces ("Deep face recognition" by Parkhi et al.). The VGG_faces model was trained on raw face images with some background, whereas in this experiment the background is black, so the difference is compensated for by subtracting the per-pixel mean (NOT the global pixel mean). An L2 loss between the predicted label y_hat and the ground-truth label y is used (instead of log-likelihood).

6. Temporal information is exploited by extracting features from the CNN's fc6 layer and training an LSTM on them. The CNN processes each frame, and the fc6 output is a low-dimensional feature vector for each image. The M feature vectors are grouped into sequences of length p, built so that each frame is the last frame of exactly one sequence. Each sequence carries a single label, that of its last frame, so the hidden state of the last time step is used to compute the network output. The LSTM input data must be balanced at the sequence level (not the frame level) so that no frames are skipped: frames are sorted in time and split into sequences, and entire sequences in which every frame is no-pain are discarded until their number matches the number of sequences with pain.
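
The sequence construction and sequence-level balancing in step 6 can be sketched as follows. This is one possible reading of the description, not the paper's implementation. The helper names are hypothetical. The first p - 1 frames are assumed not to end a sequence (the paper may pad instead), and a "sequence with pain" is assumed to be any sequence containing at least one frame with a label above 0.

```python
import random


def make_sequences(frame_labels, p):
    """Sliding windows of p consecutive frame indices, labeled by the last frame."""
    sequences = []
    for last in range(p - 1, len(frame_labels)):
        indices = list(range(last - p + 1, last + 1))
        sequences.append((indices, frame_labels[last]))
    return sequences


def balance_sequences(sequences, frame_labels, seed=0):
    """Randomly drop all-no-pain sequences until they match the pain sequences."""
    with_pain = [s for s in sequences if any(frame_labels[i] > 0 for i in s[0])]
    no_pain = [s for s in sequences if all(frame_labels[i] == 0 for i in s[0])]
    rng = random.Random(seed)
    kept = rng.sample(no_pain, min(len(no_pain), len(with_pain)))
    # Restore temporal order by the index of each sequence's last frame.
    return sorted(with_pain + kept, key=lambda s: s[0][-1])


# Toy per-frame pain levels, already sorted in time (hypothetical data).
frame_labels = [0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0]
seqs = make_sequences(frame_labels, p=3)
balanced = balance_sequences(seqs, frame_labels)

print(len(seqs), "sequences before balancing,", len(balanced), "after")
```

Because whole sequences are dropped rather than individual frames, every sequence that remains still consists of consecutive frames, which is the point of balancing at the sequence level.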

**Evaluation and Comparison**

Results are compared against continuous prediction models using the intraclass correlation coefficient (ICC), Pearson correlation coefficient (PC), MSE and MAE. Pain levels are aggregated so that 4 and 5 are merged and 6+ becomes the 5th level.
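
The label aggregation and three of the four metrics can be illustrated with a short sketch. It assumes the aggregated scale is 0 to 5, with levels 0 to 3 unchanged, 4 and 5 mapped to 4, and 6 or more mapped to 5. The function names and the toy values are hypothetical, and ICC is left out because its exact variant is not stated here.

```python
import math


def aggregate_level(level):
    """Map a raw pain level to the assumed 0..5 scale: 4 and 5 -> 4, 6+ -> 5."""
    if level <= 3:
        return level
    if level <= 5:
        return 4
    return 5


def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def pearson(y, y_hat):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y)
    mean_y, mean_h = sum(y) / n, sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_h) for a, b in zip(y, y_hat))
    std_y = math.sqrt(sum((a - mean_y) ** 2 for a in y))
    std_h = math.sqrt(sum((b - mean_h) ** 2 for b in y_hat))
    return cov / (std_y * std_h)


# Toy raw levels and continuous predictions (hypothetical data).
raw_levels = [0, 1, 3, 4, 5, 6, 9]
targets = [aggregate_level(v) for v in raw_levels]
predictions = [0.2, 0.8, 2.5, 3.9, 4.4, 4.6, 5.1]

print("targets:", targets)
print("MSE:", mse(targets, predictions))
print("MAE:", mae(targets, predictions))
print("PC:", pearson(targets, predictions))
```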

A binary threshold is also needed to compare binary accuracy. Performance is evaluated with skew-normalized accuracy and AUC scores (to mitigate the effect of imbalanced test data) under leave-one-subject-out CV, since subject-exclusiveness increases confidence that the model will behave similarly on new data. Accuracies are reported with levels in [0,1) treated as no-pain and levels in [1,∞) treated as pain.
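
The thresholding and the effect of skew normalization can be shown on toy data. The sketch below assumes that skew-normalized accuracy is the mean of the per-class recalls (often called balanced accuracy); the paper's exact definition may differ, and the function names and values are hypothetical.

```python
def binarize(level, threshold=1.0):
    """Levels in [0, 1) count as no-pain (0); levels in [1, inf) count as pain (1)."""
    return 1 if level >= threshold else 0


def skew_normalized_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each class weighs the same (assumed definition)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        if idx:
            recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy imbalanced test set: 8 no-pain frames, 2 pain frames (hypothetical data).
true_levels = [0, 0, 0, 0, 0, 0, 0, 0, 2, 3]
pred_levels = [0.1, 0.2, 0.0, 0.4, 0.3, 0.9, 1.2, 0.0, 1.5, 0.7]

y_true = [binarize(v) for v in true_levels]
y_pred = [binarize(v) for v in pred_levels]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("plain accuracy:", plain)
print("skew-normalized accuracy:", skew_normalized_accuracy(y_true, y_pred))
```

In this toy case the plain accuracy is dominated by the no-pain majority, while the skew-normalized score drops because half of the pain frames are missed.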

Most of the model's mistakes are due to frontier effects.

Considering the raw image and temporal information at the pixel level allowed the model to outperform the results obtained by previous canonical normalized appearance approaches.