# Training and Evaluation Protocol

| Paper                         | Modalities                        | DL approaches                             | Metrics                                         |
| -----------                   | -----------                       | -----------                               | -----------                                     |
| Yang et al. (2021)            | Video (facial + recovered rPPG)   | MTCNN (face crop), Inflated 3D ConvNet    | AUC=72.3, Acc(bin)=78.9, Acc(all)=35.0          |
| Wu et al. (2021)              | Video (RGB, optical flow, pose)   | I3D, CNN + LSTM                           | ----                                            |
| Bargshady et al. (2020)       | Video (facial)                    | VGG_faces + PCA, CNN+RNN, Ensemble        | AUC=93.67, MSE=0.081, MAE=0.103                 |
| Xin et al. (2020)             | Video (facial)                    | STN, Attention mechanism, CNN             | Accuracy=51.06, MSE=1.1014                      |
| Lopez-Martinez et al. (2018)  | Physiological (SC, ECG)           | Unidirectional LSTM                       | MAE=1.05, RMSE=1.29, R2=0.24                    |
| Rodriguez et al. (2017)       | Video (facial)                    | VGG-16 (VGG_faces) + LSTM                 | AUC=93.3, MAE=0.5, MSE=0.74, PCC=0.78, ICC=0.45 |
| Lopez-Martinez et al. (2017)  | Physiological (SC, ECG)           | NN with two FC layers                     | Acc(bin)=82.75                                  |
| Zhou et al. (2016)            | Video (facial)                    | RCNN                                      | MSE=1.54, PCC=0.65                              |

## Summaries of papers with relevant setups

### Yang, Ruijing, et al. "Non-contact Pain Recognition from Video Sequences with Remote Physiological Measurements Prediction." arXiv preprint arXiv:2105.08822 (2021).

*The objective of the framework is to learn an enriched facial representation that can distinguish different pain intensities. The paper presents the rPPG-enriched Spatio-temporal Attention Network (rSTAN), which processes a video snippet in two branches: an rPPG recovery branch (Deep-rPPG) that recovers rPPG signals as an auxiliary task, and a facial representation learning branch (STAN). A Visual Feature Enrichment (VFE) module attempts to complement the facial representation with the spatial ROIs from the rPPG features, focusing on time steps that contain pain-related physiological variations.*

*Not a purely multi-modal approach since only video data are used as input.* 

**Proposed system**

Experimental setup details: Nvidia P100 using PyTorch.

1. Original video is downsampled to L = 64

2. MTCNN detects and crops the facial area.

3. Framework built upon Inflated 3D ConvNet. 

4. For parameter selection, all subjects are randomly split into 5 folds and 5-fold CV is used to determine the best parameters.

5. Adam is used as the optimizer with a LR of 2e-4 which is decayed after 10 epochs with gamma = 0.8.
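
Steps 4 and 5 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the helper names `subject_folds` and `step_decay_lr` are hypothetical, and the schedule assumes the decay by gamma = 0.8 repeats every 10 epochs (the summary above could also be read as a single decay after epoch 10).

```python
import random


def subject_folds(subject_ids, n_folds=5, seed=0):
    """Randomly partition the unique subject IDs into n_folds disjoint folds."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    # Striding over the shuffled list keeps fold sizes within one of each other.
    return [subjects[i::n_folds] for i in range(n_folds)]


def step_decay_lr(epoch, base_lr=2e-4, step=10, gamma=0.8):
    """Learning rate at a 0-indexed epoch.

    Assumption: the rate is multiplied by gamma once every `step` epochs.
    """
    return base_lr * gamma ** (epoch // step)


subjects = [f"s{i:02d}" for i in range(20)]
folds = subject_folds(subjects)
print([len(fold) for fold in folds])  # 20 subjects over 5 folds: 4 per fold
print(step_decay_lr(0), step_decay_lr(10), step_decay_lr(20))
```

Splitting by subject rather than by frame keeps all frames of one person in the same fold, which is what makes the cross-validation subject-independent.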

**Evaluation and Comparison**

No similar multi-task framework existed in the literature, so results are compared against multi-modal fusion schemes. These multi-modal schemes all use early fusion because rSTAN also operates at the feature level, not the decision level.

Performance also compared to uni-modal (appearance-based) approaches.

Comparisons to state-of-the-art methods follow the leave-one-subject-out (LOSO) protocol.
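
For reference, a leave-one-subject-out split can be sketched as below. The function name `leave_one_subject_out` and the toy subject list are illustrative assumptions; the paper's own data loading is not described in these notes.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) once per subject.

    subject_ids[i] is the subject that sample i belongs to.
    """
    for held_out in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


sample_subjects = ["a", "a", "b", "c", "c", "c"]
splits = list(leave_one_subject_out(sample_subjects))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

The number of folds equals the number of subjects, and no subject ever appears in both the training and the test set of a fold.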

### Wu Q., Zhu A., Cui R., Wang T., Hu F., Bao Y., Snoussi H. Pose-guided inflated 3D ConvNet for action recognition in videos. Signal Process., Image Commun., 91 (2021), Article 116098, 10.1016/j.image.2020.116098

*Pose-Guided Inflated 3D ConvNet for video action recognition, which processes RGB images and optical flow using I3D and also extracts the skeleton sequence from each video with a pose-based module. The pose features are trained with the RGB images and with the optical flow respectively, and the results are weighted.*

**Proposed system**

Implemented using TensorFlow, experiments on 2 GTX 1080Ti GPUs.

1. All videos are processed at 25 frames per second. Video frames are randomly cropped and resized to 224x224. The model is trained and tested using 50-frame video snippets => the number of RGB images, optical flow images and skeleton sequences = 50 for each video. For shorter videos, the last frame is duplicated to match the input size of the model.

2. Optical flow is computed by the TV-L1 algorithm. 

3. RGB images and optical flow are used to train the I3D ConvNet.

4. The pose-based module is divided into two parts: a CNN learning the spatial features (pairwise distance between joints), and a hierarchical LSTM attending to the temporal information (position of whole body). The skeleton coordinates are normalized.

5. Batch normalization and ReLU activation follow each convolutional layer except the last. Dropout is used on LSTM and FC layers. The loss function Ltotal is minimized using SGD with momentum = 0.9 and initial LR = 0.01, decayed by 0.1 after every 50 epochs.

6. To benefit from both the pose module and the I3D model, the last convolutional layer of I3D is combined with the pose-based model, followed by a fully connected layer. Two cross-entropy losses are introduced to process the appearance feature and the pose feature at the same time. The best fusion strategy was RGB + flow + pose.
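
Two of the preprocessing ideas above, padding short clips by repeating the last frame (step 1) and the pairwise joint distances used as spatial pose features (step 4), can be sketched as follows. The helper names are hypothetical. Truncating clips longer than 50 frames is an assumption made only to keep the example self-contained, and Euclidean distance on 2D joints is likewise assumed.

```python
import math


def pad_to_length(frames, target_len=50):
    """Return exactly target_len frames, repeating the last frame if needed.

    Assumption: clips longer than target_len are simply truncated here.
    """
    if not frames:
        raise ValueError("cannot pad an empty clip")
    clip = list(frames[:target_len])
    clip.extend([clip[-1]] * (target_len - len(clip)))
    return clip


def pairwise_joint_distances(joints):
    """Euclidean distance between every unordered pair of (x, y) joints."""
    return [
        math.dist(joints[i], joints[j])
        for i in range(len(joints))
        for j in range(i + 1, len(joints))
    ]


print(pad_to_length(["f0", "f1", "f2"], target_len=5))
print(pairwise_joint_distances([(0, 0), (3, 4), (0, 4)]))
```

For J joints this yields J(J-1)/2 distances per frame, a representation that does not change when the whole skeleton is translated.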

**Evaluation and Comparison**

Performance is compared against the state of the art using mean accuracy. The reported accuracy is the highest among the compared methods and exceeds that of a similar project which aggregated the features of a joint heat map, RGB image and optical flow. The reason given is that pose information represented by a skeleton sequence is more accurate than a heat map (interesting!).

Failure cases: the performance of the framework is greatly affected by the accuracy of the pose estimation. In addition, facial annotations were insufficient, so the model could not capture subtle expressions in some areas of the face.

### Bargshady, Ghazal, Zhou, X, Deo, Ravinesh C, Soar, Jeffrey, Whittaker, Frank and Wang, Hua (2020) Ensemble neural network approach detecting pain intensity from facial expressions. Artificial Intelligence in Medicine, 109. ISSN 0933-3657

*Proposes an ensemble deep learning model (EDLM) consisting of two steps: feature extraction as early fusion (VGGFace + PCA) and classification as late fusion using different CNN+RNN models, whose outputs are merged into the resulting classification.*

**Proposed system**

Implemented using Keras and TensorFlow.

1. Preprocessing and normalization are applied to the data. The dataset was balanced by undersampling to reduce the majority class. Noise and background were removed from each video frame. An OpenCV face recognition algorithm was used for face detection, then each frame was cropped and centered, and lastly the pixel values were normalized for both the train and test datasets (rescaled to the range [0,1]; this included converting the data type from integer to float and dividing the pixel values by the highest value). The pre-processed data was reshaped to 224x224x3 dimensions.

2. The pre-processed data is passed to an early fusion section for feature extraction. This section contains a fine-tuned VGGFace model whose output is combined with PCA to achieve dimensionality reduction (DR) (and preferably choose the best set of data dimensions). This model was run for 50 epochs and 48 batches.

3. The outputs of three independent hybrid CNN+RNN DL methods are merged to classify pain intensity. The networks are developed using different parameters, weights and architectures. DNN1 and DNN2 each contain two CNNs with a Conv2D architecture whose outputs are stacked and passed to a BiLSTM; however, they have different weighting. DNN3 has a different architecture: a CNN with Conv1D whose output is passed to an LSTM. This section was run for 5 epochs and 48 batches.

How were the LSTM networks' outputs merged? Adaptive weighting? Was it a stacking ensemble?
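
The pixel rescaling in step 1 amounts to an integer-to-float conversion followed by a division. A minimal sketch, assuming 8-bit frames stored as nested lists (the hypothetical `rescale_pixels` helper stands in for whatever array operation the authors actually used):

```python
def rescale_pixels(frame, max_value=255):
    """Map integer pixel values to floats in [0, 1] by dividing by max_value.

    frame is a nested list indexed as frame[row][column][channel].
    """
    return [
        [[channel / max_value for channel in pixel] for pixel in row]
        for row in frame
    ]


tiny_frame = [[[0, 128, 255], [255, 255, 255]]]  # 1 x 2 pixels, 3 channels
scaled = rescale_pixels(tiny_frame)
print(scaled)
```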

**Evaluation and Comparison**

Pain is classified in multi-levels (5 classes) from facial expression video frames.

CV is repeated 10 times.

The experimental results indicated that using a hybrid CNN+RNN in late fusion (classification) gave more accurate results than using an RNN alone.

Classification performance is evaluated using MAE, MSE, accuracy, AUC, F-score, TP/TN/FP/FN and TPR/FPR.

The total running time (reported as time complexity) is 5900 s for the UNBC-McMaster Shoulder Pain DS and 41700 s for the MIntPAIN DS. The feature extraction section was the most time-consuming part. Adding more streams did not affect the algorithm's speed and efficiency.

### Xin X, Lin X, Yang S, Zheng X (2020) Pain intensity estimation based on a spatial transformation and attention CNN. PLoS ONE 15(8): e0232412. https://doi.org/10.1371/journal.pone.0232412

*Addresses the problems of background interference and adaptive weighting of facial regions. The estimation pipeline consists of five modules: the input image is passed to an STN (Spatial Transformation Network) module to address background interference, then the attention mechanism assigns different weights to different face regions. After that, the attentional face image is fed into the CNN module to extract feature descriptions, and finally the output of the CNN module is scored by the softmax function, which is used in the back-propagation process.*

**Proposed system**

1. A 2D affine transformation (translation, cropping, rotation, scaling, skewing) is applied to the face image, which is then normalized to [0, 1].

2. The normalized face image is used as input into the STN to perform a geometric transformation. 

3. After STN, each channel of transformed color face images is fed into the module of attention mechanism to obtain the self-learned weights of different regions. These transformed images are multiplied by the self-learned weights for computing attentional face images. 

4. The CNN module consists of convolutional and ReLU layers. Its purpose is to extract self-learned features from the attentional face image.

5. After the CNN, an FC layer with four neurons is introduced, and a softmax loss function is used to measure the estimation error.

6. Pain intensity is classified based on the probability value of the neuron output of FC layer.
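
Steps 5 and 6 reduce to a softmax over the four FC outputs followed by picking the most probable class. A minimal sketch with made-up logits; the function names are illustrative and not taken from the paper:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_pain_level(logits):
    """Return (class index, class probabilities) for one image."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


fc_outputs = [0.1, 2.0, -1.0, 0.5]  # hypothetical outputs of the four neurons
level, probs = predict_pain_level(fc_outputs)
print(level, [round(p, 3) for p in probs])
```

Subtracting the maximum logit before exponentiating leaves the probabilities unchanged but avoids overflow for large scores.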

**Evaluation and Comparison**

Comparisons were made between the performances of the different modules, and they clearly showed that the CNN with STN achieved the best results. Therefore, the STN and the attention mechanism are combined in the proposed method.

The proposed method only analyzes still face images and does not use the motion information.

### D. Lopez-Martinez and R. Picard, "Continuous Pain Intensity Estimation from Autonomic Signals with Recurrent Neural Networks," 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2018, pp. 5624-5627, doi: 10.1109/EMBC.2018.8513575.

*Conducted experiments comparing traditional ML approaches such as LR, SVMs and a non-recurrent NN against two recurrent NN approaches: fully-connected RNNs, where the output is fed back to the input, and LSTM.*

**Proposed system summary**

Implemented using TensorFlow and Keras.

1. Datasets were balanced by downsampling the over-represented category - between no-pain vs pain levels as well as among the four different pain levels.

2. Early stopping and dropout are applied as regularization. 

3. The best performing architecture was an LSTM that uses N = 10 overlapping windows. It's unidirectional with one FC layer and ReLU activation.
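
The class balancing in step 1 can be illustrated with random undersampling. This is a sketch under the assumption that every class is reduced to the size of the rarest one; the exact procedure and any random seed used in the paper are not given in these notes.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly drop samples so each class keeps as many as the rarest class."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_keep = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_class):
        for sample in rng.sample(by_class[label], n_keep):
            balanced.append((sample, label))
    return balanced


windows = list(range(10))              # stand-ins for feature windows
window_labels = [0] * 7 + [1] * 3      # 7 no-pain vs 3 pain
balanced = undersample(windows, window_labels)
print(balanced)
```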

**Evaluation and Comparison**

Networks are evaluated on their ability to distinguish between no pain (baseline; BLN) and the maximum pain level applied (P4).

Results indicate that SC features are best performing in terms of binary classification accuracy.

Switched between binary classification and the regression problem of continuously estimating pain intensity over entire sequences with a step size of 0.5 seconds. Regression evaluation metrics were MAE, RMSE and the coefficient of determination (R2).

The LSTM-NN algorithm performed best on all metrics.
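
The three regression metrics follow their standard definitions. A small sketch with toy values, where `regression_metrics` is a hypothetical helper name:

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return mae, rmse, 1.0 - ss_res / ss_tot


mae, rmse, r2 = regression_metrics([0, 1, 2, 3], [0, 1, 2, 4])
print(mae, rmse, r2)
```

R2 compares the residual error with the variance of the ground truth, so the low value in the table (0.24) indicates that the model explains only part of the variation even when MAE looks small.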

### P. Rodriguez et al., "Deep Pain: Exploiting Long Short-Term Memory Networks for Facial Expression Classification," in IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2017.2662199.

*End-to-end (it learns to extract features and also learns to use them to predict the level of pain) DL approach. The approach is based on a CNN and applies temporal modeling with an LSTM to the features learned by the VGG_faces network.*

**Proposed system summary**

1. Cropped images using facial landmarks, then frontalized. Used generalized Procrustes analysis (GPA) to align the landmarks. 

2. Imbalanced dataset (8K pain frames and 40K labeled as no pain). => The training data was balanced by randomly undersampling the majority class (i.e. no-pain) + complemented the results by giving normalized scores (balancing the validation data).

3. Target preprocessing: Using MSE as it's very sensitive and most suited for cases where Gaussian noise is present. Labels (the pain levels) are standardized before training. 

4. Data augmentation: flipping images with 50% probability, adding random noise to the reference landmarks before performing piece-wise affine warping (i.e. introducing small deformations to faces).

5. Train a CNN to perform pain level recognition task: fine-tuning VGG-16 CNN pretrained with faces ("Deep face recognition" by Parkhi et al.). The CNN_faces model was trained on raw images of faces with some background but in this experiment, the background was black => need to compensate their differences by subtracting the per-pixel mean (NOT global pixel mean). Used L2 between predicted label y_hat and ground truth label y (instead of log-likelihood).

6. Temporal information is exploited by extracting features from the CNN fc6 layer and training an LSTM on them. The CNN processes each frame, and the output of fc6 is a low-dimensional feature vector for each image. The M feature vectors are grouped into sequences of length p, created so that each frame is the last frame of a sequence exactly once. Each sequence carries one label, that of its last frame. As each sequence has only one label, the hidden state of the last time step is used to compute the output of the network. The LSTM input data needs to be balanced at the sequence level (not frame level) so that no frames are skipped: frames are sorted in time and split into sequences, and entire sequences with no pain in all their frames are discarded until their number matches the number of sequences with pain.
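
The sequence construction in step 6 can be sketched as below. How the first p-1 frames are handled is not stated in these notes; this example assumes they are left-padded by repeating the first frame so that every frame still ends exactly one sequence. `build_sequences` is a hypothetical name.

```python
def build_sequences(features, labels, p):
    """Return one (sequence, label) pair per frame.

    Sequence k has length p, ends at frame k and takes the label of frame k.
    Assumption: early sequences are left-padded by repeating frame 0.
    """
    sequences = []
    for end in range(len(features)):
        start = end - p + 1
        window = [features[max(i, 0)] for i in range(start, end + 1)]
        sequences.append((window, labels[end]))
    return sequences


fc6_features = ["f0", "f1", "f2", "f3"]   # stand-ins for fc6 feature vectors
frame_labels = [0, 0, 2, 1]
sequences = build_sequences(fc6_features, frame_labels, p=3)
for window, label in sequences:
    print(window, label)
```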

**Evaluation and Comparison**

Results are compared against continuous prediction models with the intraclass correlation coefficient (ICC), Pearson correlation coefficient (PCC), MSE and MAE. Aggregated the pain levels so that 4 and 5 are merged and 6+ becomes the 5th level.

A binary threshold is also needed to compare binary accuracy. Performance is evaluated with skew-normalized accuracy and AUC scores (to mitigate the effect of imbalanced test data) under leave-one-subject-out CV, since subject-exclusiveness increases the confidence that the model will behave similarly with new data. Accuracies are reported with a threshold of [0,1) for no-pain and [1,∞) for pain.

Most of the model's mistakes are due to frontier effects.
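
One reading of the level aggregation and the binary threshold is sketched below. The mapping assumes integer pain levels where 0 to 3 are kept, 4 and 5 share class 4, and anything from 6 upward becomes class 5. That interpretation, and both function names, are assumptions rather than details confirmed by the paper.

```python
def aggregate_level(level):
    """Map a non-negative integer pain level onto six classes (0..5).

    Assumed mapping: 0-3 unchanged, 4 and 5 merged into 4, 6 and above into 5.
    """
    if level < 0:
        raise ValueError("pain level must be non-negative")
    if level <= 3:
        return level
    return 4 if level <= 5 else 5


def is_pain(score, threshold=1.0):
    """Binary decision: [0, threshold) is no-pain, [threshold, inf) is pain."""
    return score >= threshold


print([aggregate_level(v) for v in [0, 3, 4, 5, 6, 15]])
print(is_pain(0.99), is_pain(1.0))
```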

Considering the raw image and temporal information at the pixel level allowed the model to outperform the results obtained by previous canonical normalized appearance approaches.

### Lopez-Martinez, D.; Picard, R. Multi-task neural networks for personalized pain recognition from physiological signals. In Proceedings of the 2017 Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), San Antonio, TX, USA, 21–27 October 2017; pp. 181–184.

*Conducted experiments comparing the performance of single-task classifiers (logistic regression and SVM) against a multi-task neural network. Multi-task learning accounts for individual differences in pain responses while still learning from data across the population.*

**Proposed system summary**

Implemented using TensorFlow and Keras. 

1. Only used SC and ECG signals as input since they can be obtained from wrist-sensors.

2. Multi-task learning involves simultaneously training related tasks over shared representations. The network contains M sigmoid classifiers, one for each task, and the corresponding loss functions are optimized simultaneously. The NN consists of an input layer and two FC hidden layers: one person-specific hidden layer with one task defined for each subject in the dataset, and one hidden layer shared between all tasks (hard parameter sharing).

3. Upper bound constraint on the norm of the network weights, dropout and early stopping applied.
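
The hard-parameter-sharing layout in step 2 can be sketched as a forward pass in plain Python. Everything specific here is an assumption for illustration: the class name `MultiTaskNet`, the layer sizes, the ReLU hidden activations, the random initialization, and the ordering (shared layer first, then the person-specific layer). Training, the weight-norm constraint and dropout are omitted.

```python
import math
import random


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    # Two branches keep math.exp from overflowing for large |v|.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    return math.exp(v) / (1.0 + math.exp(v))


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with weights stored as n_out rows of n_in values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


class MultiTaskNet:
    """One hidden layer shared by all subjects, one hidden layer per subject."""

    def __init__(self, n_features, n_shared, n_personal, subjects, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(n_features, n_shared, rng)
        self.personal = {s: init_layer(n_shared, n_personal, rng) for s in subjects}
        self.heads = {s: init_layer(n_personal, 1, rng) for s in subjects}

    def predict(self, features, subject):
        """Probability of pain for one feature vector from the given subject."""
        hidden = dense(features, *self.shared, relu)
        hidden = dense(hidden, *self.personal[subject], relu)
        return dense(hidden, *self.heads[subject], sigmoid)[0]


net = MultiTaskNet(n_features=4, n_shared=3, n_personal=2, subjects=["s1", "s2"])
x = [0.1, 0.2, 0.3, 0.4]  # hypothetical SC/ECG feature vector
print(net.predict(x, "s1"), net.predict(x, "s2"))
```

Every subject's prediction passes through the same shared weights, which is how data from the whole population influences each personalized classifier.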

**Evaluation and Comparison**

Classification accuracy is estimated via 10-fold CV. Results indicate SC features significantly outperform ECG features, and multi-task achieves best performance.

### J. Zhou, X. Hong, F. Su and G. Zhao, "Recurrent Convolutional Neural Network Regression for Continuous Pain Intensity Estimation in Video," 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2016, pp. 1535-1543, doi: 10.1109/CVPRW.2016.191.

*The RCNN predicts pain intensity (PSPI) frame by frame. Average MSE and PCC are calculated for video sequences.*

**Proposed system summary**

Theano framework. Experiments carried out on NVIDIA Tesla K80 GPU.

1. Each frame from a video sequence is aligned and warped to the same frontal pose.
2. Network input is a 3-channel HxW (30x713) = 713x30x3 frame vector sequence. The RCNN requires fixed-height input => a sliding window is applied. Each frame was flattened into a feature vector (might lose some structural information but preserves temporal information), and all 1D flattened warped facial images are concatenated in frame order to obtain the frame vector sequences.
3. In each RCL, a feed-forward convolutional layer is applied first, followed by three recurrent iterations (T = 3). In the CL, a linear function is used as the activation to conduct the regression task, with the MSE function as the loss measurement. Output FC is a softmax layer.
4. Training is performed by minimizing the MSE function using the backpropagation through time (BPTT) algorithm (equivalent to using the standard BP algorithm on the time-unfolded network). LR = 1/1000 of its initial value. Momentum was fixed at 0.9. Weight decay and dropout are applied. Batch normalization follows the first CL and every RCL to accelerate training.
5. The network will output the PSPI predictions frame by frame.

**Evaluation and Comparison**

LOSO strategy => 25-fold CV (all sequences of one chosen subject were held out as the testing set, and the remaining sequences formed the training set).

Average MSE and the Pearson product-moment correlation coefficient (PCC) were calculated over all frames, between the pain intensity estimate y_hat of the ith frame and its ground truth y.

Average testing speed is 25 frames per second (efficient for real-time?).
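
MSE and PCC follow their standard definitions. The sketch below computes both per video sequence and then averages across sequences, which is one plausible reading of "average" here and is labeled as an assumption; the function names and toy values are illustrative.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between ground truth and predictions."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true, y_pred):
    """Pearson product-moment correlation coefficient."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_p)


def average_over_sequences(sequences):
    """sequences: list of (y_true, y_pred) pairs, one per video.

    Assumption: metrics are computed per sequence, then averaged.
    """
    mses = [mse(t, p) for t, p in sequences]
    pccs = [pcc(t, p) for t, p in sequences]
    return sum(mses) / len(mses), sum(pccs) / len(pccs)


videos = [([0, 1, 2, 3], [1, 2, 3, 4]), ([0, 1, 2, 3], [3, 2, 1, 0])]
mean_mse, mean_pcc = average_over_sequences(videos)
print(mean_mse, mean_pcc)
```

The first toy video has a constant offset, so its PCC is perfect while its MSE is not zero; this is why the two metrics are reported together.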