# Deep Facial Expression Recognition: A Survey

#### Shan Li and Weihong Deng

## Abstract

1. Transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions.
2. Deep neural nets used to learn discriminative representations for automatic FER.
3. Focus of recent systems
    - Overfitting because of lack of sufficient training data.
    - Expression unrelated variations such as:
        - Illumination
        - Head Pose
        - Identity Bias
4. In this paper
    - Available datasets
        - Accepted data collection principles
        - Accepted evaluation principles
    - Standard pipeline of a deep FER system.
        - Related background knowledge
        - Suggestions of applicable implementations
    - SotA in deep FER
        - Strategies defined for
            - Static 
            - Dynamic image sequences
        - Their advantages and disadvantages.
        - Performances on widely used benchmarks
5. Additional related issues
6. Application Scenarios
7. Remaining challenges and corresponding opportunities
8. Future direction for the design of robust deep FER systems

## 1. Introduction

1. Facial expressions are universal signals that humans use to convey their emotional states and intentions.
2. Important in sociable robotics, medical treatment, driver fatigue surveillance, and other HCI systems.
3. FER systems have been explored to encode expression information from facial representations.
4. Ekman and Friesen defined six basic emotions: anger, disgust, fear, happiness, sadness and surprise. Contempt was subsequently added as a basic emotion [5].
5. Advanced research in neuroscience and psychology argues that the model of six basic emotions is culture-specific and not universal.

**More things are discussed including what is covered by other surveys (not deep) read everything in highlight again and add to the notes, only taking down things that are relevant to take action for now.**

Some challenges that need to be addressed are:
- Subject identity bias: high inter-subject variations due to age, gender, ethnic background and level of expressiveness.
- Variations in pose
- Variations in illumination
- Occlusions
- Large intra class variability.


Further discussed in this paper:
- Section 2: Datasets
- Section 3: Three Main steps in a deep FER system
- Section 4: Special network tricks for static and dynamic image sequences.
- Section 5: Additional related issues and other practical scenarios.
- Section 6: Challenges and opportunities, potential future directions.

## 2. Datasets

## 3. Deep Facial Expression Recognition

3 main steps in automatic deep FER:
- Pre processing 
- Deep feature learning
- Deep feature classification

### 3.1 Pre-processing 

Variations irrelevant to facial expression recognition, such as *different backgrounds, illuminations and head poses*, are present. Before training the deep neural network to learn meaningful features, preprocessing is required to **align and normalize the visual semantic information** conveyed by the face.

#### 3.1.1 Face Alignment

#####  **Detect and align face, remove non face areas**

Face detection
 - Viola-Jones is commonly used. *TODO: read about Viola-Jones and how to compare its accuracy and computational complexity (fact: it works well for near-frontal faces).*

Face Alignment
- Using coordinates of localized landmarks **reduces variation in face scale and in-plane rotation**.
- Methods (Table 2 compares the performance):
    - Active Appearance Model (AAM) is a generative model that optimizes the required parameters from holistic facial appearance and global shape patterns.
    - Among discriminative models, mixture of trees (MoT) structured models and discriminative response map fitting (DRMF) use part-based approaches that represent the face via the local appearance information around each landmark.
    - A number of discriminative models directly use a cascade of regression functions to map the image appearance to landmark locations and have shown better results, e.g. the supervised descent method (SDM, implemented in IntraFace).
    
    **More approaches exist; make notes later. First build a baseline with the naive triangulation chosen for now, then extend it with what is used in OpenFace, IntraFace, etc.**
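The claim that landmark coordinates reduce scale and in-plane rotation variation can be illustrated with a minimal sketch. It assumes two eye-center landmarks are already available from some detector, and it is not AAM, DRMF or SDM. The function names and the target eye positions are hypothetical choices made for this example. The code estimates the 2D similarity transform (scale, rotation, translation) that moves the detected eyes onto fixed canonical positions.

```python
def similarity_from_eyes(left_eye, right_eye,
                         target_left=(30.0, 40.0), target_right=(70.0, 40.0)):
    """Return (a, b, tx, ty) such that
        x' = a*x - b*y + tx
        y' = b*x + a*y + ty
    maps the detected eye centers onto the target positions.
    The target positions are illustrative values, not from the survey."""
    # Treat 2D points as complex numbers: a similarity is z -> s*z + t.
    z1, z2 = complex(*left_eye), complex(*right_eye)
    u1, u2 = complex(*target_left), complex(*target_right)
    if z1 == z2:
        raise ValueError("eye landmarks must be distinct")
    s = (u2 - u1) / (z2 - z1)   # encodes both scale and in-plane rotation
    t = u1 - s * z1             # translation
    return s.real, s.imag, t.real, t.imag


def apply_similarity(params, point):
    """Apply the transform returned by similarity_from_eyes to one (x, y) point."""
    a, b, tx, ty = params
    x, y = point
    return (a * x - b * y + tx, b * x + a * y + ty)


# Example: a tilted face whose eyes were detected at these pixel positions.
params = similarity_from_eyes((120.0, 150.0), (180.0, 170.0))
aligned_left = apply_similarity(params, (120.0, 150.0))
aligned_right = apply_similarity(params, (180.0, 170.0))
```

After the transform both eyes lie on a horizontal line at a fixed distance, so faces of different sizes and tilts become comparable. Warping the full image with these parameters would need an image library and is left out here.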

#### 3.1.2 Data Augmentation

Data augmentation is vital for successful generalization. There are two kinds of augmentations possible:
- On the fly data augmentation
   - Usually used in deep nets to avoid overfitting. During training, input samples are randomly cropped (from the four corners and the center) and then flipped horizontally, which can yield a dataset ten times larger.
   - Two common prediction modes are adopted: only the center patch is used for prediction, or the prediction value is averaged over all ten crops.
- Offline data augmentation
    - Frequently used operations include random perturbations and transforms, e.g. shifting, skew, scaling, noise, contrast and color jittering. Common noise models: salt-and-pepper, speckle and Gaussian noise.
    - GAN
    - **Other ways, note later.**
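The ten-crop scheme described above (four corners plus center, each with its horizontal flip) and the averaged prediction mode can be sketched as follows. This is a minimal illustration on images stored as nested lists. The function names are hypothetical, and `predict` stands in for any model that returns one score per class.

```python
def crop(image, top, left, size):
    """Cut a size x size patch out of an image given as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(image):
    """Mirror an image horizontally."""
    return [row[::-1] for row in image]


def ten_crop(image, size):
    """Return the four corner crops and the center crop, followed by their flips."""
    h, w = len(image), len(image[0])
    if size > h or size > w:
        raise ValueError("crop size larger than image")
    offsets = [
        (0, 0),                              # top-left
        (0, w - size),                       # top-right
        (h - size, 0),                       # bottom-left
        (h - size, w - size),                # bottom-right
        ((h - size) // 2, (w - size) // 2),  # center
    ]
    crops = [crop(image, top, left, size) for top, left in offsets]
    return crops + [hflip(c) for c in crops]


def average_predictions(predict, image, size):
    """Second prediction mode: average the class scores over all ten crops."""
    scores = [predict(c) for c in ten_crop(image, size)]
    n_classes = len(scores[0])
    return [sum(s[k] for s in scores) / len(scores) for k in range(n_classes)]


# Example: a 6x6 image whose pixel value encodes its position.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
patches = ten_crop(image, 4)
```

The first prediction mode (center patch only) corresponds to calling `predict` on `patches[4]` alone.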

#### 3.1.3 Face Normalization


To account for variations in **illumination** and **pose**

##### 1. Illumination Normalization

Varying illumination causes large intra-class (same expression) variance.
Various algorithms analysed. See paper highlights and **copy here**.
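The notes above do not yet list the specific algorithms. As a placeholder, here is a minimal sketch of histogram equalization, a widely used illumination normalization step. It is chosen here only as a familiar example, not as the method the paper recommends. The function name is hypothetical and the image is assumed to be 8-bit grayscale stored as nested lists.

```python
def equalize_histogram(image, levels=256):
    """Spread the gray levels of an image (list of rows of ints in
    [0, levels-1]) over the full intensity range."""
    pixels = [p for row in image for p in row]

    # Histogram of gray levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:
        # Constant image: nothing to stretch, return a copy.
        return [list(row) for row in image]

    # Look-up table mapping each old gray level to its equalized value.
    lut = [max(0, round((c - cdf_min) / (n - cdf_min) * (levels - 1)))
           for c in cdf]
    return [[lut[p] for p in row] for row in image]


# Example: a dark, low-contrast patch gets stretched to the full range.
dark_patch = [[50, 50], [100, 150]]
normalized = equalize_histogram(dark_patch)
```

Equalization keeps the ordering of gray levels and only changes their spacing, so two images of the same expression under different lighting end up with more similar intensity distributions.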

##### 2. Pose Normalization

Normalization techniques to yield frontal facial views for FER.
- Most popular [92 - Hassner et al.]: after localising facial landmarks, a 3D texture reference model generic to all faces is generated to efficiently estimate visible facial components. The initial frontalized face is synthesised by back-projecting each input face image to the reference coordinate system.
- [93 - Sagonas et al.] proposed an effective statistical model to simultaneously localize landmarks and convert facial poses using only frontal faces.
- Recently, GAN techniques for frontal view synthesis, such as FF-GAN, TP-GAN and DR-GAN, have been proposed.

### 3.2 Deep networks for feature learning

#### 3.2.1 Convolutional Neural Network (CNN)

#### 3.2.2 Deep Belief Network (DBN)

#### 3.2.3 Deep Autoencoder (DAE)

#### 3.2.4 Recurrent Neural Network (RNN)

#### 3.2.5 Generative Adversarial Network (GAN)

### 3.3 Facial Expression Classification

Unlike traditional methods where the feature extraction step and the feature classification step are independent, deep networks can perform FER in an end to end way (loss layer at the end).
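The loss layer at the end of such an end-to-end network is typically a softmax loss. The following minimal sketch shows the computation for a single sample. The class list follows the six basic emotions from the introduction, while the function names and the example logits are hypothetical values chosen for illustration.

```python
import math

EXPRESSIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]


def softmax(logits):
    """Turn raw network outputs into class probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Softmax loss for one sample: negative log-probability of the true class."""
    return -math.log(softmax(logits)[target_index])


def predict_expression(logits):
    """Return the label of the most probable class."""
    probs = softmax(logits)
    return EXPRESSIONS[probs.index(max(probs))]


# Example: made-up logits for one face image labeled "happiness".
logits = [0.2, -1.0, 0.1, 2.5, 0.0, 0.7]
loss = cross_entropy(logits, EXPRESSIONS.index("happiness"))
label = predict_expression(logits)
```

During training the gradient of this loss is back-propagated through the whole network, which is what makes feature learning and classification a single step rather than two independent ones.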

Alternatively, using SVMs (theoretically demonstrated in [130]) or random forests as the classifier is also useful, as in [133, 134].

Another approach showed that covariance descriptors computed on DCNN features, classified with Gaussian kernels on the Symmetric Positive Definite (SPD) manifold, are more efficient than standard classification with the softmax layer.

## 4. The State of the Art

Literature divided into two main groups depending on the type of data. Deep FER networks for:
- Static images
- Dynamic image sequences
    
Overview of current deep FER systems wrt network *architecture* and *performance*.

### 4.1 Deep FER networks for static images

Most existing studies work on static images without considering temporal information, due to the convenience of data processing and the availability of relevant training and test material.
Things covered:
- Pre-training and fine-tuning techniques for FER
- Review novel deep neural networks
- Table 4 shows the current state of the art methods in the field that are explicitly conducted in a person independent protocol.

#### 4.1.1 Pre-training and fine-tuning

Direct training of deep networks on relatively small datasets is prone to overfitting.

- Many studies used **additional task-oriented data to pre-train** their self built networks.

- **Pre-training on larger FR data** positively affects the emotion recognition accuracy, and **further fine-tuning with additional FER datasets** can help improve the performance.

- Instead of directly using the pre-trained or fine-tuned models to extract features on the target dataset, a **multistage fine-tuning strategy** [63] (see “Submission 3” in Fig. 3) can achieve better performance:
    - First stage: fine-tune the pre-trained model on FER2013.
    - Second stage: fine-tune on the training part of the target dataset (EmotiW) to refine the model and adapt it to the more specific target dataset.

- Although pre-training and fine-tuning on external FR data can indirectly avoid the problem of small training data, the networks are trained separately from the FER task, and face-dominated information remains in the learned features, which may weaken the network's ability to represent expressions. To eliminate this effect, a two-stage training algorithm, FaceNet2ExpNet [111], was proposed (see Fig. 4). The fine-tuned face net serves as a good initialization for the expression net and is used to guide the learning of the convolutional layers only. The fully connected layers are then trained from scratch with expression information to regularize the training of the target FER net.
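The multistage fine-tuning idea above (pre-train, then fine-tune on FER2013, then fine-tune on the target training split) boils down to running the same training routine several times, each time starting from the weights left by the previous stage. The sketch below is a toy illustration only. It uses a one-dimensional logistic regression and synthetic data as stand-ins for a deep network and the real datasets, and all function names, dataset sizes and hyperparameters are hypothetical.

```python
import math
import random


def make_dataset(n, boundary, seed):
    """Synthetic 1-D binary data; `boundary` differs per dataset to mimic domain shift."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    return [(x, 1 if x > boundary else 0) for x in xs]


def loss(params, data):
    """Mean logistic loss, written in a numerically stable form."""
    w, b = params
    total = 0.0
    for x, y in data:
        z = w * x + b
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    return total / len(data)


def fine_tune(params, data, lr=0.1, steps=200):
    """Full-batch gradient descent that starts from the given weights."""
    w, b = params
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return (w, b)


def multistage_fine_tune(params, stages):
    """Run the stages in order, passing the weights from one stage to the next."""
    history = []
    for name, data in stages:
        before = loss(params, data)
        params = fine_tune(params, data)
        history.append((name, before, loss(params, data)))
    return params, history


stages = [
    ("pre-training (large generic data stand-in)", make_dataset(500, 0.0, 0)),
    ("stage 1 (FER2013 stand-in)", make_dataset(200, 0.5, 1)),
    ("stage 2 (target training split stand-in)", make_dataset(50, 1.0, 2)),
]
final_params, history = multistage_fine_tune((0.0, 0.0), stages)
```

In a real system each stage would typically also choose its own learning rate and decide which layers to update. The toy version keeps only the warm-start structure that defines the strategy.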

#### 4.1.2 Diverse Network Input

#### 4.1.3 Auxiliary blocks and layers

#### 4.1.4 Network Ensemble

#### 4.1.5 Multitask Networks

#### 4.1.6 Cascaded Networks

#### 4.1.7 Generative Adversarial Networks (GANs)

#### 4.1.8 Discussion

Summarisation of the approaches and corresponding reasons.

### 4.2 Deep FER networks for dynamic image sequences

#### 4.2.1 Frame Aggregation

#### 4.2.2 Expression Intensity Network

#### 4.2.3 Deep Spatio-temporal FER network

#### 4.2.4 Discussion

## 5. Additional Related Issues

### 5.1 Occlusion and non-frontal head pose

### 5.2 FER on infrared data

### 5.3 FER on 3D static and dynamic data

### 5.4 Facial Expression Synthesis

### 5.5 Visualisation Techniques

### 5.6 Other special issues

## 6. Challenges and Opportunities

### 6.1 Facial Expression datasets

### 6.2 Incorporating other affective models

### 6.3 Dataset bias and imbalanced distribution

### 6.4 Multimodal Affect Recognition