In [1]:
import os
import util
import model_builder
import numpy as np
import pandas as pd
import tensorflow as tf

Working plan: since MobileNetV2 + GRU gave reasonably good results, I want to try that architecture (either the single best model or the top 5) on some variations:
1. the same videos decomposed into more frames (I used 16; try 32 and 64)
2. a different dataset (I used DFD; try CelebDF)
3. Xception or another pretrained model (adjusting the image sizes accordingly)

Make generic functions so that any data, any number of frames, and any pretrained model can be used. Save all the best ones.
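
As a rough illustration of the "any number of frames" part, the sketch below picks evenly spaced frame indices from a video of arbitrary length. The function name `sample_frame_indices` is hypothetical and is not taken from `util` or `model_builder`, whose actual frame-extraction logic is not shown in this notebook.

```python
import numpy as np


def sample_frame_indices(total_frames, num_frames):
    """Return `num_frames` evenly spaced frame indices in [0, total_frames - 1].

    Hypothetical helper for illustration only. If the video has fewer
    frames than requested, some indices are repeated.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # linspace spreads the samples across the whole clip, endpoints included
    positions = np.linspace(0, total_frames - 1, num=num_frames)
    return positions.round().astype(int)
```

The same helper could be called with 16, 32, or 64 so that the rest of the pipeline only has to deal with a fixed-length sequence of frames per video.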

In [2]:
base_dir = r'data'
data_sources = ['DFD', 'CelebDF']
num_frames = [16, 32, 64, 128, 256]

In [3]:
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)

# DFD More Frames

In [None]:
labels1, classifier1, (classifier_loss1, classifier_acc1) = model_builder.train_test_classifier(
    data_dir=os.path.join(base_dir, 'DFD'),
    num_frames=64)

Image Model: MobileNetV2, Image Size: (224, 224)
TRAIN set: 140 videos
VAL set: 30 videos
TEST set: 30 videos
TRAIN set: 140 videos


# DFD EfficientNetB3

In [None]:
labels2, classifier2, (classifier_loss2, classifier_acc2) = model_builder.train_test_classifier(
    data_dir=os.path.join(base_dir, 'DFD'),
    num_frames=16,
    img_model_name='EfficientNetB3')

# CelebDF (best pretrained model) & (best num frames)

In [None]:
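# TODO: fill in num_frames and img_model_name with the best values from the
# DFD runs above; both are left blank until those results are available.
# Results are stored under new names so the EfficientNetB3 run is not overwritten.
labels3, classifier3, (classifier_loss3, classifier_acc3) = model_builder.train_test_classifier(
    data_dir=os.path.join(base_dir, 'CelebDF'),
    num_frames=,
    img_model_name='')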

* Adding more FC layers and more GRUs seems to improve performance, but when the number of FC layers exceeds the number of GRUs, performance drops.
* Including dropout between the GRU and FC layers, and between the FC layers, also results in better-performing models, while including BatchNormalization gives mixed results.
* The best-performing architecture has 2 GRUs and 2 FC layers, with dropout between the GRU and FC layers and between the FC layers. Its accuracy of **66.7%** is the highest among all the models tried. The second-highest accuracy is **63.3%**, from the model with 3 GRUs (BatchNormalization between every pair) and 2 FC layers (dropout before and after each).
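
For reference, the best-performing layout described above might look roughly like the following Keras sketch. Several details are assumptions and are not taken from `model_builder`: the layer widths, the dropout rate, the Adam optimizer, the 1280-dimensional input (the size of MobileNetV2's pooled features in Keras), and reading "2 FCs" as one hidden dense layer plus the sigmoid output layer. Only the learning rate of 1e-5 comes from these experiments.

```python
import tensorflow as tf


def build_gru_classifier(num_frames=16, embedding_dim=1280,
                         gru_units=64, fc_units=32, dropout_rate=0.3):
    """Sketch: 2 GRUs + 2 FC layers with dropout after the GRUs and between the FCs.

    All default sizes and rates are illustrative assumptions.
    """
    inputs = tf.keras.Input(shape=(num_frames, embedding_dim))
    # The first GRU returns the full sequence so the second GRU can consume it
    x = tf.keras.layers.GRU(gru_units, return_sequences=True)(inputs)
    x = tf.keras.layers.GRU(gru_units)(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)   # dropout between GRU and FC
    x = tf.keras.layers.Dense(fc_units, activation='relu')(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)   # dropout between the FCs
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```

The model takes one sequence of per-frame embeddings per video and outputs a single real/fake probability.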

**It should also be noted that these models were extremely quick to train, which made trying out several different architectures very easy.**

* This architecture can be tuned further with different regularization parameters, dropout rates, and optimizers, and with momentum-based or scheduled learning rates.
  * Since training is already fast, adding momentum may not help much.
  * Scheduling the learning rate so that it decreases after a while may be more likely to give an improvement, even though all of the above trials already used a small learning rate of 1e-5 (perhaps even smaller learning rates could help in this case).
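
A scheduled learning rate could be sketched as a simple step decay. The function name and the values of `drop_every`, `factor`, and `min_lr` below are illustrative assumptions, not settings used in these experiments.

```python
def step_decay(epoch, lr, drop_every=10, factor=0.5, min_lr=1e-7):
    """Multiply the learning rate by `factor` every `drop_every` epochs.

    The rate is never reduced below `min_lr`. All defaults are
    illustrative assumptions.
    """
    if epoch > 0 and epoch % drop_every == 0:
        return max(lr * factor, min_lr)
    return lr
```

Because it takes the epoch index and the current learning rate and returns the new rate, a function of this shape can be wrapped in `tf.keras.callbacks.LearningRateScheduler` and passed to `fit` through `callbacks`.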

* We could also try other pretrained models for obtaining the embeddings, but the images must be resized to the input size each model expects:
  * DenseNet, EfficientNetB0, MobileNetV2, ResNet50 -> 224x224
  * Xception, InceptionV3 -> 299x299
  * EfficientNetB3 -> 300x300
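
One way to keep these sizes in one place is a small lookup table keyed by model name. This is only a sketch: `IMAGE_SIZES` and `image_size_for` are hypothetical names, and `model_builder` may already handle this differently.

```python
# Input sizes listed above, keyed by lowercase model name
IMAGE_SIZES = {
    'densenet': (224, 224),
    'efficientnetb0': (224, 224),
    'mobilenetv2': (224, 224),
    'resnet50': (224, 224),
    'xception': (299, 299),
    'inceptionv3': (299, 299),
    'efficientnetb3': (300, 300),
}


def image_size_for(img_model_name):
    """Return the (height, width) expected by the named pretrained model."""
    key = img_model_name.lower()
    if key not in IMAGE_SIZES:
        raise KeyError(f"No image size registered for model '{img_model_name}'")
    return IMAGE_SIZES[key]
```

With this, a call such as `image_size_for('EfficientNetB3')` gives the size to resize frames to before they are fed to the embedding model.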

**Regardless of the model and technique used, none of the experiments reached a high accuracy. This is most likely due to the small size of the dataset and to the nature of the data. The deepfake videos aren't entirely AI-generated; only the faces or expressions of the people in the videos have been swapped or altered. So the model has to detect that a video is fake from a very small spatial region of each frame. That is a sensitive task, and a model is likely to handle it only if it is given a significantly larger dataset to learn from.**