<a href="https://colab.research.google.com/github/ai4all-deepfake-project/ai4all-deepfake-detector/blob/data-preprocessing/DeepFake_Detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deepfake Detection with CNN and OpenCV
---

### The Dataset

This project uses the [SDFVD 2.0 dataset](https://data.mendeley.com/datasets/zzb7jyy8w8/1) from Mendeley, which includes:

*   461 real videos
*   461 fake videos

The clips are short, high-quality, and feature diverse faces. The dataset also provides augmented versions of the videos.

---
### Steps

1. Dataset Organization
  The dataset is already structured into two main folders, `SDFVD2.0_real/` and `SDFVD2.0_fake/`, each containing:

   *   `original/` - raw unaltered videos
   *   `augmented/` - videos with transformations (e.g., brightness, noise, blur)

  Each augmented video follows this naming pattern:

  `<prefix>_<original_filename>_aug_<augmentation_index>.mp4`

  This naming pattern makes it easier to manage training splits. Splitting the data into training and testing sets is an important step: it lets us evaluate the model's performance on unseen data and helps detect overfitting.
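
Because each augmented clip is derived from an original video, a random per-file split could place an original in training and its augmented copy in testing. The sketch below shows one way to keep every video and its augmentations on the same side of the split. The helper names (`source_stem`, `grouped_split`) and the 20% test fraction are illustrative assumptions, not part of the dataset or this project's code, and the matching logic relies only on the naming pattern described above.

```python
import random
import re
from collections import defaultdict
from pathlib import Path

# Matches the "_aug_<augmentation_index>" suffix from the naming pattern.
AUG_SUFFIX = re.compile(r"_aug_\d+$")


def source_stem(filename, original_stems):
    """Return the stem of the original video that `filename` derives from."""
    stem = Path(filename).stem
    if stem in original_stems:
        return stem
    base = AUG_SUFFIX.sub("", stem)  # leaves "<prefix>_<original_filename>"
    matches = [s for s in original_stems if base == s or base.endswith("_" + s)]
    if not matches:
        raise ValueError(f"no original video found for {filename!r}")
    return max(matches, key=len)  # prefer the most specific match


def grouped_split(originals, augmented, test_fraction=0.2, seed=0):
    """Split files so that a video and its augmentations stay together."""
    original_stems = {Path(f).stem for f in originals}
    groups = defaultdict(list)
    for f in list(originals) + list(augmented):
        groups[source_stem(f, original_stems)].append(f)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # seeded for reproducibility
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [f for k in keys if k not in test_keys for f in groups[k]]
    test = [f for k in keys if k in test_keys for f in groups[k]]
    return train, test
```

scikit-learn's `GroupShuffleSplit` offers the same kind of group-aware split if that library is already in use.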

2. Load Videos and Extract Frames (OpenCV)

  Extract frames from each `mp4` file.
  Save the extracted frames under `processed_frames/`, organized by label (`real/` or `fake/`) and video source (`original/` or `augmented/`).

3. Detect Faces (OpenCV)

  We want consistency, so it's important to focus only on the relevant part of the image. Run face detection on each frame to isolate the facial region.

4. Preprocess Images
For each detected face:

  *   Crop to face region
  *   Resize to 224 × 224
  * Convert BGR → RGB
  * Normalize pixels for CNN input

5. Train the CNN

  Train a **convolutional neural network** to classify each image as:
    *   `0` → Real
    *   `1` → Fake

  Use training loops with loss functions like `CrossEntropyLoss` and optimizers like `Adam`.

6. Predict from Preprocessed Frames

  Run the trained CNN on new frames and collect predictions, either as labels or as probabilities.

   > **(OPTIONAL) Use PyTorch Grad-CAM to explain the model's predictions**
   >
   > [The PyTorch Grad-CAM library](https://github.com/jacobgil/pytorch-grad-cam) implements several methods for interpreting a CNN's decision when it classifies an image as real or fake.
   
![Example on Github, replace with our own](https://raw.githubusercontent.com/jacobgil/jacobgil.github.io/master/assets/cam_dog.gif)

7.  Apply some logic to classify the entire video as fake or real. Two options:
  * Majority vote: if most frames are predicted fake, the video is fake.
  * Probability averaging: average the frame-level fake probabilities. If the average exceeds a threshold, label the video fake; otherwise, real.
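
The two video-level rules in step 7 can disagree, so it helps to see them side by side. The following is a minimal sketch, assuming the CNN outputs one fake probability per frame. The function name `classify_video` and the default threshold of 0.5 are assumptions chosen for illustration. Labels follow the convention above (`0` = real, `1` = fake).

```python
def classify_video(fake_probs, rule="mean", threshold=0.5):
    """Combine frame-level fake probabilities into one video-level label.

    Returns (label, score), where label is 0 (real) or 1 (fake).
    """
    probs = list(fake_probs)
    if not probs:
        raise ValueError("need at least one frame-level probability")

    if rule == "majority":
        # Score = fraction of frames that are individually predicted fake.
        score = sum(p > threshold for p in probs) / len(probs)
        is_fake = score > 0.5
    elif rule == "mean":
        # Score = average fake probability across all frames.
        score = sum(probs) / len(probs)
        is_fake = score > threshold
    else:
        raise ValueError(f"unknown rule: {rule!r}")

    return (1 if is_fake else 0), score


# A few very confident "fake" frames can flip the averaged result
# even when most frames are individually predicted real.
frame_probs = [0.99, 0.99, 0.4, 0.4, 0.4]
majority_label, _ = classify_video(frame_probs, rule="majority")
mean_label, _ = classify_video(frame_probs, rule="mean")
```

In this example the majority vote labels the video real (only 2 of 5 frames exceed the threshold), while averaging labels it fake, which is why the threshold is worth tuning on held-out videos.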



# 1. Mount Google Drive in Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import sys
import os

project_path = '/content/drive/MyDrive/AI4ALL Group 3C - Technology & Engineering/project'
src_path = os.path.join(project_path, 'src')

sys.path.append(src_path)
os.chdir(project_path)

# 2. Frame Splitting

In [None]:
# project/
# ├── SDFVD2.0 Extension of Small Scale Deep Fake Video Dataset/
# │   ├── SDFVD2.0_real/
# │   │   ├── real_vid1.mp4
# │   │   ├── ...
# │   ├── SDFVD2.0_fake/
# │   │   ├── fake_vid1.mp4
# │   │   ├── ...
# ├── frames/
# │   ├── Real/
# │   │   ├── real_vid1_frame_000.jpg
# │   │   ├── ...
# │   ├── Fake/
# │   │   ├── fake_vid1_frame_000.jpg
# │   │   ├── ...

In [None]:
ls

 DeepFake-Detector.ipynb
 frames/
 requirements.txt
'SDFVD2.0 Extension of Small Scale Deep Fake Video Dataset'/
 src/


In [None]:
import os

print("Current working directory:", os.getcwd())
print("\nSubdirectories:")
print(os.listdir())

Current working directory: /content/drive/MyDrive/AI4ALL Group 3C - Technology & Engineering/project

Subdirectories:
['SDFVD2.0 Extension of Small Scale Deep Fake Video Dataset', 'requirements.txt', 'src', 'DeepFake-Detector.ipynb', 'frames']


In [None]:
dataset_root = "SDFVD2.0 Extension of Small Scale Deep Fake Video Dataset"

print("Contents of dataset folder:")
print(os.listdir(dataset_root))

Contents of dataset folder:
['SDFVD2.0_fake', 'SDFVD2.0_real']
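
Before extracting frames, a quick per-class count can confirm that the download matches the expected dataset size. This is a small sketch and not part of the original notebook: the helper name `count_videos` is hypothetical, and it assumes videos use the `.mp4` extension. It searches recursively, so it also counts files inside `original/` and `augmented/` subfolders if they are present.

```python
from pathlib import Path


def count_videos(dataset_root, class_dirs=("SDFVD2.0_real", "SDFVD2.0_fake")):
    """Count .mp4 files under each class folder, including subfolders."""
    counts = {}
    for name in class_dirs:
        class_path = Path(dataset_root) / name
        # rglob walks the folder recursively; a missing folder counts as 0.
        counts[name] = sum(1 for p in class_path.rglob("*.mp4") if p.is_file())
    return counts
```

Calling `count_videos(dataset_root)` returns a dictionary keyed by class folder name, which can be compared against the video counts listed at the top of this notebook.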


In [None]:
import cv2  # OpenCV: used here to read videos and write frames as images
import os
from tqdm import tqdm  # shows a progress bar so you can track which videos are being processed

dataset_root = 'SDFVD2.0 Extension of Small Scale Deep Fake Video Dataset'
output_root = 'frames'
os.makedirs(output_root, exist_ok=True) # creates 'frames' folder

# Save one frame out of every 5 to keep things manageable. Possible variable to test/experiment with.
frame_interval = 5

# This function takes a folder of videos and saves selected frames into a matching output folder
def extract_frames_from_folder(input_folder, output_folder, label):
    # create "Real" or "Fake" folder
    os.makedirs(output_folder, exist_ok=True)

    video_files = [f for f in os.listdir(input_folder) if f.endswith(".mp4")]

    for video_file in tqdm(video_files, desc=f"Processing {label} videos"):
        video_path = os.path.join(input_folder, video_file)
        video_id = os.path.splitext(video_file)[0] # video id = file name without the extension
        cap = cv2.VideoCapture(video_path) # OpenCV object to read the video, frame by frame.

        frame_count = 0
        saved_count = 0

        # continuously read next frame
        while True:
            success, frame = cap.read()
            if not success: # quit loop if fails or reaches end
                break

            # keep only every frame_interval-th frame (every 5th by default)
            if frame_count % frame_interval == 0:
                out_filename = f"{video_id}_frame_{saved_count:03d}.jpg"
                out_path = os.path.join(output_folder, out_filename)
                cv2.imwrite(out_path, frame) # output named frame
                saved_count += 1

            frame_count += 1

        cap.release()

# Process both Real and Fake folders
label_map = {
    "Real": "SDFVD2.0_real",
    "Fake": "SDFVD2.0_fake"
}

# call extract_frames_from_folder on both folders containing real and fake videos
for label, folder_name in label_map.items():
    input_folder = os.path.join(dataset_root, folder_name)
    output_folder = os.path.join(output_root, label)
    extract_frames_from_folder(input_folder, output_folder, label)

Processing Real videos: 100%|██████████| 456/456 [06:26<00:00,  1.18it/s]
Processing Fake videos: 100%|██████████| 471/471 [06:42<00:00,  1.17it/s]
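
The frames are saved as `<video_id>_frame_<index>.jpg`, so the source video can be recovered from each file name. That becomes useful later, when frame-level predictions need to be combined per video. The sketch below groups frame file names by video; the helper name `group_frames_by_video` is an assumption for illustration and is not defined elsewhere in this notebook.

```python
from collections import defaultdict
from pathlib import Path


def group_frames_by_video(frame_names):
    """Map each video id to its frame file names, ordered by frame index."""
    groups = defaultdict(list)
    for name in frame_names:
        stem = Path(name).stem
        # Split on the last "_frame_" so video ids containing underscores work.
        video_id, sep, index = stem.rpartition("_frame_")
        if not sep or not index.isdigit():
            continue  # skip files that do not follow the naming scheme
        groups[video_id].append((int(index), name))
    return {vid: [n for _, n in sorted(items)] for vid, items in groups.items()}
```

For example, `group_frames_by_video(os.listdir("frames/Real"))` would return one list of frame files per real video.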
