The aim of this project is to predict words or phrases from videos. This task can be solved with both machine learning and deep learning techniques (for example, you can use a CNN to extract features from video frames and process extracted features with an RNN). Pre-trained CNNs on human faces could improve classification results. Data augmentation techniques should be applied to increase the number of samples.
-
CNN (pre-trained VGG16) + Fine-Tuning + RNN (with LSTM)
-
CNN (pre-trained VGG16) + Fine-Tuning + RNN (with GRU)
-
CNN (pre-trained VGG16) + Fine-Tuning + RNN
Implement the architectures and make some benchmarks on them
- know how works VGG16
- how to use Keras
- how to transfer-learining
- how to do fine-tuning
- how to create a RNN (with custom layers)
- how to make benchmarks