FaceViT: A lightweight multitask Vision Transformer for face detection, age prediction and gender classification
FaceViT is a small multi-task Vision Transformer trained from scratch for face detection, age estimation, and gender prediction. It demonstrates that a single Vision Transformer can perform well on several tasks at once, here object detection (bounding-box regression) together with multiple classification tasks.
For this project I used the UTK Faces dataset, which you can download with a Kaggle account from here:
NOTE: This dataset is heavily imbalanced across age groups and contains many relatively low-quality images. Both factors may limit performance on all three tasks.
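
A quick way to see the age imbalance is to count images per age bin. The sketch below assumes the usual UTKFace file naming, where the age is the first underscore-separated field (`[age]_[gender]_[race]_[timestamp].jpg`). The helper names `parse_age` and `age_histogram` and the sample filenames are hypothetical and not part of this project's code.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, Optional


def parse_age(filename: str) -> Optional[int]:
    """Return the age encoded at the start of a UTKFace-style filename.

    Assumes the pattern [age]_[gender]_[race]_[timestamp].jpg and returns
    None when the first field is not a number.
    """
    head = Path(filename).name.split("_", 1)[0]
    return int(head) if head.isdigit() else None


def age_histogram(filenames: Iterable[str], bin_width: int = 10) -> Dict[int, int]:
    """Count images per age bin; keys are the lower edge of each bin."""
    counts: Counter = Counter()
    for name in filenames:
        age = parse_age(name)
        if age is not None:  # skip files that do not follow the pattern
            counts[(age // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Made-up filenames for illustration only.
sample = ["1_0_2_a.jpg", "26_1_0_b.jpg", "28_0_0_c.jpg", "not_a_face.jpg"]
histogram = age_histogram(sample)
print(histogram)
```

To run it on the real data, pass the file names from the extracted dataset folder instead, for example `age_histogram(p.name for p in Path("UTKFace").glob("*.jpg"))`, where the folder name is an assumption.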
NOTE: No data augmentation was used in this training run.
Metric Name | Value | Training Epochs |
---|---|---|
Top-3 Age Accuracy | 61 % | 43 |
Face Bounding Box MSE | 0.0054 | 43 |
Gender Accuracy | 75 % | 43 |
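
For reference, here is a minimal sketch of how the three metrics in the table could be computed. It rests on assumptions not stated above: age is treated as a classification over classes (so "Top-3" means the true class is among the three highest-scoring predictions), bounding boxes are four coordinates normalized to [0, 1], and gender is a single probability thresholded at 0.5. The function names and toy values are hypothetical and may differ from the evaluation code actually used.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 3) -> float:
    """Fraction of samples whose true class is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


def bbox_mse(pred: Sequence[Sequence[float]], true: Sequence[Sequence[float]]) -> float:
    """Mean squared error over every box coordinate."""
    squared = [(p - t) ** 2 for pb, tb in zip(pred, true) for p, t in zip(pb, tb)]
    return sum(squared) / len(squared)


def binary_accuracy(probs: Sequence[float], labels: Sequence[int], threshold: float = 0.5) -> float:
    """Accuracy of thresholded probabilities against 0/1 labels."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)


# Toy values for illustration only, not results from FaceViT.
age_scores = [[0.10, 0.50, 0.15, 0.25], [0.70, 0.05, 0.10, 0.15]]
age_labels = [3, 1]
print(top_k_accuracy(age_scores, age_labels, k=3))

pred_boxes = [[0.1, 0.1, 0.5, 0.5]]
true_boxes = [[0.2, 0.1, 0.5, 0.7]]
print(bbox_mse(pred_boxes, true_boxes))

print(binary_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]))
```

Note that an MSE on normalized coordinates is not directly comparable to an IoU-based detection metric; a small MSE can still correspond to a visibly offset box on a small face.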