## 3D face reconstruction from low-resolution images

### Angelo Garangau Menezes - N°USP: 11413492

#### Abstract:
Since the introduction of models that reconstruct the dense 3D geometry of a face from a single image, academic work has focused on improving this capability in high-resolution scenarios. However, reconstruction also matters in settings where most images are low-resolution, such as general surveillance.
Most software that offers 3D face reconstruction relies on 3D morphable model fitting, which can be computationally expensive and often fails to recover enough detail, especially when the input image is low-resolution.

This project evaluated different resolution scenarios and explored a deep volumetric regression network to infer 3D depth maps directly from face images.

#### Motivation

![title](report_images/motivation.png)

### Dataset

The 300W-LP dataset was used to train both the super-resolution CNN and the face reconstruction algorithm. The dataset contains 3837 images of faces and their volume mappings. It was produced by fitting a 3D morphable model to unconstrained images of the 300W large pose dataset using a multi-feature fitting approach.

![title](report_images/image_samples.png)

#### A summary of the pipeline evaluated can be seen in the figure below

![title](report_images/3D-Face-Reconstruction.png)

### Experiment 1 - 32x32 -> 128x128x200

Firstly, all the images in the database were resized to 32x32 and 128x128 in order to create pairs for mapping low-resolution inputs to targets. The shape 128x128 was chosen because it was the shape of input images for the 3D reconstruction model.
A small comparison between resolutions can be seen in the image below.

![title](report_images/32-128_comparison.png)
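The report does not state which resampling filter was used for resizing. As a minimal sketch, the snippet below assumes simple block averaging to build a low-resolution/target pair from a 128x128 image; the function names `box_downsample` and `make_pair` are hypothetical.

```python
import numpy as np


def box_downsample(image: np.ndarray, factor: int) -> np.ndarray:
    """Downsample an (H, W, C) image by averaging factor x factor blocks.

    Block averaging is an assumption for illustration; the report does not
    name the filter that was actually used.
    """
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("image height and width must be divisible by factor")
    # Split each spatial axis into (block index, offset inside block).
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    # Average over the two "offset inside block" axes.
    return blocks.mean(axis=(1, 3))


def make_pair(image_128: np.ndarray, factor: int = 4):
    """Return (low_res_input, high_res_target) for super-resolution training."""
    target = image_128.astype(np.float32)
    low_res = box_downsample(target, factor)
    return low_res, target
```

With `factor=4` a 128x128 target yields a 32x32 input (Experiment 1), and `factor=2` yields a 64x64 input (Experiment 2).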

Then, the mapping to a higher resolution was obtained by using the following algorithms:

- Bilinear Interpolation
- Bicubic Interpolation
- SubPixel CNN (Shi et al., 2016)

![title](report_images/sub-pixel-cnn.jpg)
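The key step of the sub-pixel CNN is the final rearrangement that turns a feature map with `C * r^2` channels at low resolution into a `C`-channel image that is `r` times larger in each spatial dimension. The NumPy sketch below illustrates only that rearrangement, not the trained network; the function name `pixel_shuffle` and the channel ordering are assumptions chosen for illustration.

```python
import numpy as np


def pixel_shuffle(features: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C * r^2, H, W) feature map into a (C, H * r, W * r) image.

    Assumed ordering: out[c, h*r + i, w*r + j] = features[c*r*r + i*r + j, h, w].
    """
    channels, h, w = features.shape
    if channels % (r * r):
        raise ValueError("channel count must be divisible by r * r")
    c_out = channels // (r * r)
    x = features.reshape(c_out, r, r, h, w)   # axes: [c, i, j, h, w]
    x = x.transpose(0, 3, 1, 4, 2)            # axes: [c, h, i, w, j]
    return x.reshape(c_out, h * r, w * r)


# Example: 4x up-scaling of a 32x32 feature map for an RGB output.
low_res_features = np.zeros((3 * 4 * 4, 32, 32), dtype=np.float32)
high_res = pixel_shuffle(low_res_features, r=4)   # shape (3, 128, 128)
```

Because every convolution runs at the low input resolution and only this last step produces the large image, the network avoids convolving over the up-scaled grid.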

The subpixel CNN for this experiment was trained with the following parameters:
- Batch size: 32
- Number of epochs: 300
- Loss: MSE
- Optimizer: Adam
- Learning Rate = 0.01

### Experiment 2 - 64x64 -> 128x128x200

Following the same idea of the first experiment, images were resized to a lower resolution and then up-scaled to check if the 3D reconstruction would work properly.
A small comparison of the difference between 64x64 and 128x128 can be seen below.

![title](report_images/64-128_comparison.png)

For this experiment, the subpixel CNN was trained with the following parameters:
- Batch size: 64
- Number of epochs: 500
- Loss: MSE
- Optimizer: Adam
- Learning Rate = 0.01 with decay to 10% every 100 epochs
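One way to read the learning-rate schedule above is as a step decay in which the rate is multiplied by 0.1 every 100 epochs. The sketch below encodes that reading; it is an interpretation, and the function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(epoch: int, base_lr: float = 0.01,
                  factor: float = 0.1, step: int = 100) -> float:
    """Learning rate at a given epoch under step decay.

    Assumes "decay to 10% every 100 epochs" means the rate is multiplied
    by `factor` once per completed block of `step` epochs.
    """
    return base_lr * factor ** (epoch // step)


# Rates at the start of each 100-epoch block of the 500-epoch run.
schedule = [step_decay_lr(e) for e in range(0, 500, 100)]
```

Under this reading the rate starts at 0.01 and ends the 500-epoch run four decay steps lower.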

### 3D Face Reconstruction Algorithm Evaluated
#### - Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression (Jackson et al., 2017)

![title](report_images/Hourglass-Net.png)

The volumetric regression network took a 128x128 image as input and produced a 128x128x200 depth map as output.
The network architecture and parameters were taken from the original paper. The settings were the following:
- Batch size: 4
- Number of epochs: 55
- Loss: Cross-Entropy
- Optimizer: RMSProp
- Learning Rate: 0.001 with decay of 90% every 5 steps
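The settings list the loss only as "Cross-Entropy". As a hedged sketch, the snippet below assumes a per-voxel sigmoid (binary) cross-entropy between the raw network output and a binary occupancy volume, averaged over all voxels; the function name `voxel_cross_entropy` and the use of a mean rather than a sum are assumptions.

```python
import numpy as np


def voxel_cross_entropy(logits: np.ndarray, target: np.ndarray) -> float:
    """Mean sigmoid cross-entropy over all voxels.

    logits: raw (pre-sigmoid) network outputs, e.g. shape (128, 128, 200).
    target: binary occupancy volume of the same shape (1 = inside the face).
    Uses the numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|)).
    """
    logits = np.asarray(logits, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    per_voxel = (np.maximum(logits, 0.0)
                 - logits * target
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(per_voxel.mean())
```

Treating every voxel as an independent binary classification lets the network regress the whole volume in one forward pass.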

#### Evaluation Criteria
##### Super-Resolution
- PSNR (Peak Signal-to-Noise Ratio)
$= 10\log_{10}(\frac{S^{2}}{MSE})$

Here $S$ is the maximum possible pixel value and $MSE$ is the mean squared error between the two images. PSNR is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation.
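The PSNR formula translates directly into a few lines of NumPy. This is a minimal sketch, assuming 8-bit images so that the peak value is 255; the function name `psnr` and its `peak` parameter are illustrative.

```python
import numpy as np


def psnr(reference: np.ndarray, estimate: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Higher values indicate an up-scaled image that is closer to the 128x128 original.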



- SSIM (Structural Similarity)
$= \frac{(2 \mu_{x} \mu_{y} + c_{1})(2\sigma_{xy} + c_{2})}{(\mu_{x}^2 + \mu_{y}^2 + c_{1})(\sigma_{x}^2 + \sigma_{y}^2 + c_{2})}$

A perceptual metric that quantifies image quality degradation based on the measurement of the perceptual difference between two similar images.
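The SSIM formula can be evaluated as written by computing the means, variances and covariance over the whole image. The sketch below does exactly that, which is a simplification: common implementations evaluate the formula in local windows and average the result. The constants $c_{1} = (0.01 L)^2$ and $c_{2} = (0.03 L)^2$, with $L$ the dynamic range, are the conventional choices and are assumed here; the function name `global_ssim` is hypothetical.

```python
import numpy as np


def global_ssim(x: np.ndarray, y: np.ndarray, dynamic_range: float = 255.0) -> float:
    """SSIM computed from global image statistics (single window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * dynamic_range) ** 2   # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2   # stabilises the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

A value of 1 means the two images are identical; lower values indicate structural differences.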

##### Face Reconstruction
- NME (Normalised Mean Error)$=\frac{1}{N}\sum_{k=1}^{N}\frac{\left \| x_{k} - y_{k} \right \| _{2}}{d}$

The per-vertex average of the Euclidean distance between the estimated vertices $x_{k}$ and the ground-truth vertices $y_{k}$, normalised by the outer 3D interocular distance $d$.
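The NME can be sketched as follows, assuming the estimated and ground-truth reconstructions are given as `(N, 3)` arrays of corresponding vertices. The function name and the two eye-corner index parameters are hypothetical; they stand in for whichever vertices define the outer interocular distance in the mesh being used.

```python
import numpy as np


def normalised_mean_error(pred: np.ndarray, gt: np.ndarray,
                          left_eye_idx: int, right_eye_idx: int) -> float:
    """Mean per-vertex Euclidean error divided by the interocular distance.

    pred, gt: (N, 3) arrays of corresponding 3D vertices.
    left_eye_idx, right_eye_idx: indices of the outer eye corners in `gt`
    (hypothetical parameters; the actual indices depend on the mesh).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    d = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    per_vertex = np.linalg.norm(pred - gt, axis=1)
    return float(per_vertex.mean() / d)
```

Dividing by $d$ makes the error comparable across faces of different sizes.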



### Results

#### Up-Sampling 4x

Original 32x32

![title](report_images/orig-32.png)

Bilinear Interpolation

![title](report_images/lin-32-result.png)

Bicubic Interpolation

![title](report_images/cub-32-result.png)

SubPixel CNN

![title](report_images/cnn-32-result.png)

|PSNR/SSIM| Img1 | Img2 | Img3 | Img4 | Img5 |
|----------|----------|----------|----------|----------|----------|
| Bilinear |25.086 / 0.755|23.402 / 0.740|24.437 / 0.700|24.838 / 0.720|18.353 / 0.607|
| Bicubic  |24.673 / 0.747|24.673 / 0.747|23.999 / 0.696|24.574 / 0.722|17.603 / 0.602|
| CNN      |23.972 / 0.737|22.225 / 0.688|22.811 / 0.661|23.301 / 0.699|16.291 / 0.559|

3D Depth maps for the third image

![title](report_images/3D-shapes-32_.png)

#### Up-Sampling 2x

Original 64x64

![title](report_images/orig-64.png)

Bilinear Interpolation

![title](report_images/lin-64-result.png)

Bicubic Interpolation

![title](report_images/cub-64-result.png)

SubPixel CNN

![title](report_images/cnn-64-result.png)

|PSNR/SSIM| Img1 | Img2 | Img3 | Img4 | Img5 |
|----------|----------|----------|----------|----------|----------|
| Bilinear |29.693 / 0.896|27.891 / 0.894|29.232 / 0.883|29.501 / 0.903|21.755 / 0.817|
| Bicubic  |29.374 / 0.894|27.466 / 0.886|29.020 / 0.888|29.361 / 0.910|20.965 / 0.817|
| CNN      |28.619 / 0.883|25.789 / 0.865|28.039 / 0.873|28.302 / 0.896|20.028 / 0.795|

3D Depth maps for the third image

![title](report_images/3D-shapes-64.png)

As the results show, the accuracy of the reconstruction depends strongly on the resolution of the input image. The CNN also scored below both interpolation methods on PSNR and SSIM for every test image, despite being the more computationally expensive up-scaling method. This may be because the training set contained too few examples for the model to generalize good pixel approximations of the original images. To obtain better results, more sophisticated architectures and different data augmentation techniques could be explored.

The face reconstruction algorithm was able to reconstruct faces even when they were occluded. The authors list this property as one of its advantages; however, it failed on some images containing shadows and darker skin tones. This happened mainly because, at low resolution, the difference in pixel intensity between the background and the face is not discriminative enough, and after super-resolving the image the reconstruction model had trouble separating the two regions.

It is worth mentioning that the direct volumetric regression network took around 70 minutes to train a single epoch on an Intel i7-6500U with a GeForce GTX 950M (4GB). This made hyperparameter tuning and the search for different architectures extremely time-consuming.

### Final Considerations

When the input images had resolutions as low as 32x32, the up-sampled results lacked accurate high-frequency components in most cases, since there was not enough detail to produce good-looking 128x128 images. Consequently, the deep network responsible for regressing the mesh could not generate good approximations either. However, when the images were 64x64 and only 2x up-scaling was needed, the up-sampling algorithms performed better, as their PSNR and SSIM values show. In this case, the 3D depth estimation algorithm successfully recovered fine details of the facial shape and pose.

### References

- Jackson, Aaron S., et al. "Large pose 3D face reconstruction from a single image via direct volumetric CNN regression." Proceedings of the IEEE International Conference on Computer Vision. 2017.
- Shi, Wenzhe, et al. "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
- Winkler, Rouven, et al. "3D Face Reconstruction from Low-Resolution Images with Convolutional Neural Networks." Proceedings of the 2018 the 2nd International Conference on Video and Image Processing. ACM, 2018.