# Deepfake Detection: Proof of Concept with Feature Extraction and Modeling
## Team Quarks (Ali & Belal)
## Objective
The objective of this notebook is to test a proof of concept for deepfake detection. We will employ basic machine learning models to assess the predictive power of the facial landmark variance feature, which was identified as a potential indicator during our EDA.
## Data Description
The dataset consists of facial landmark data extracted from a series of videos. Each entry in the dataset represents a video and includes the variance of facial landmark velocities in the X and Y axes, as well as the percentage of frames in which a face was successfully detected.
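
The extraction code that produced these features is not part of this notebook. As a rough illustration of what "variance of landmark velocity" means, the sketch below assumes each landmark is tracked as one (x, y) point per frame and approximates velocity as the displacement between consecutive frames. The function name `velocity_variance` and the toy track are hypothetical, not taken from our pipeline.

```python
import numpy as np

def velocity_variance(positions):
    """Return (x_var, y_var) of frame-to-frame velocity for one landmark.

    positions: array-like of shape (n_frames, 2) holding (x, y) per frame.
    """
    positions = np.asarray(positions, dtype=float)
    # Approximate velocity as the displacement between consecutive frames.
    velocities = np.diff(positions, axis=0)
    # Population variance over time, computed separately for x and y.
    x_var, y_var = velocities.var(axis=0)
    return float(x_var), float(y_var)

# Toy track: steady motion in x, alternating jitter in y.
track = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]
x_var, y_var = velocity_variance(track)
print(x_var, y_var)
```

In this toy track the x velocity is constant, so its variance is zero, while the y velocity alternates between +1 and -1, so its variance is one.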

In [1]:
import os
import pandas as pd
import json

def create_dataframe_from_json(directory):
    data = []
    errors = []

    # List all files in the given directory
    for filename in os.listdir(directory):
        if filename.endswith('.json'):
            file_path = os.path.join(directory, filename)

            try:
                # Read the JSON file
                with open(file_path, 'r') as file:
                    json_data = json.load(file)
                    
                # Start the dictionary with the video name
                video_data = {'video_name': filename.replace('.json', '')}
                # Update this dictionary with the overall_features
                video_data.update(json_data.get("overall_features", {}))

                data.append(video_data)

            except Exception as e:
                errors.append((filename, str(e)))

    # Create a DataFrame
    df = pd.DataFrame(data)

    # Rename columns to be more descriptive
    column_renaming = {
        'chin_xvel_var': 'Chin X-Axis Velocity Variance',
        'chin_yvel_var': 'Chin Y-Axis Velocity Variance',
        'left_eyebrow_xvel_var': 'Left Eyebrow X-Axis Velocity Variance',
        'left_eyebrow_yvel_var': 'Left Eyebrow Y-Axis Velocity Variance',
        'right_eyebrow_xvel_var': 'Right Eyebrow X-Axis Velocity Variance',
        'right_eyebrow_yvel_var': 'Right Eyebrow Y-Axis Velocity Variance',
        'nose_bridge_xvel_var': 'Nose Bridge X-Axis Velocity Variance',
        'nose_bridge_yvel_var': 'Nose Bridge Y-Axis Velocity Variance',
        'nose_tip_xvel_var': 'Nose Tip X-Axis Velocity Variance',
        'nose_tip_yvel_var': 'Nose Tip Y-Axis Velocity Variance',
        'left_eye_xvel_var': 'Left Eye X-Axis Velocity Variance',
        'left_eye_yvel_var': 'Left Eye Y-Axis Velocity Variance',
        'right_eye_xvel_var': 'Right Eye X-Axis Velocity Variance',
        'right_eye_yvel_var': 'Right Eye Y-Axis Velocity Variance',
        'top_lip_xvel_var': 'Top Lip X-Axis Velocity Variance',
        'top_lip_yvel_var': 'Top Lip Y-Axis Velocity Variance',
        'bottom_lip_xvel_var': 'Bottom Lip X-Axis Velocity Variance',
        'bottom_lip_yvel_var': 'Bottom Lip Y-Axis Velocity Variance',
        'face_detection_percentage': 'Face Detection Percentage',
        'label': 'Video Authenticity Label'
    }
    df = df.rename(columns=column_renaming)

    return df, errors

directory_path = "/data1/belalm/Capstone/data/landmarks"
df, errors = create_dataframe_from_json(directory_path)

print(df) 

if errors:
    print("Errors encountered:")
    for error in errors:
        print(error)


      video_name  Chin X-Axis Velocity Variance  \
0     yygjogokma                       4.484175   
1     rbsqytvobx                       0.486159   
2     zqfsdlgkvx                     311.568123   
3     vrprflgvys                       0.690445   
4     frgdpgbfmh                       1.456422   
...          ...                            ...   
3326  sejttfsefa                            NaN   
3327  miywrjuewa                      11.097915   
3328  hyqszrnkhz                       0.540259   
3329  gjjvdnsuwx                       3.711649   
3330  nwgbsdkryv                       4.956775   

      Chin Y-Axis Velocity Variance  Left Eyebrow X-Axis Velocity Variance  \
0                          9.035285                              10.279010   
1                          1.283731                               1.197899   
2                       1157.834810                              13.817222   
3                          3.933908                               0.527870 

## Model Testing
In this section, we prepare the loaded data (train/test split, mean imputation, and feature scaling) and fit several machine learning models to assess their performance.


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

X = df.drop(['video_name', 'Video Authenticity Label'], axis=1)
y = df['Video Authenticity Label'].map({'FAKE': 0, 'REAL': 1})  # Convert labels to binary

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Impute missing values using the mean of each column
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Scale the features for scale-sensitive models (Logistic Regression, SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)


## Model Selection Rationale
### Logistic Regression, SVM, Random Forest, Gradient Boosting
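- Chosen as standard, well-understood baselines for binary classification on tabular features.
- **Assumptions and Limitations:** Logistic Regression fits a linear decision boundary and, like SVM, is sensitive to feature scale, so it is trained on the standardized features. The tree-based models need no scaling and are trained on the imputed features directly.
- **Model Comparisons:** All models are evaluated on the same 80/20 train/test split and with 5-fold cross-validation on the training set.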
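
`SVC` is imported in this notebook, but no SVM training cell is shown. The sketch below is one way such a cell might look, not a result we ran: it uses synthetic data from `make_classification` as a stand-in for the 19 landmark features, and the names `X_demo`, `y_demo`, and `svm_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the landmark features (19 columns, binary label).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=19, n_informative=8, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling and the SVM are chained so the scaler is fit on training data only.
svm_demo = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm_demo.fit(X_tr, y_tr)

svm_demo_accuracy = svm_demo.score(X_te, y_te)
print(f"SVM accuracy on synthetic data: {svm_demo_accuracy:.3f}")
```

The accuracy printed here says nothing about deepfake detection, since the data is synthetic. The sketch only shows the shape of the step.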

## Cross-Validation Strategy
### 5-Fold Cross-Validation
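- Each model is scored on five train/validation splits of the training set, which gives a more stable performance estimate than a single hold-out split.
- Fold scores that stay close to the hold-out test accuracy suggest that the model generalizes rather than fitting one particular split.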
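
One caveat with cross-validating on data that was already imputed and scaled: the imputer and scaler have seen every fold, so a little information leaks from validation folds into training. A possible refinement, sketched below on synthetic data with hypothetical names (`X_cv`, `y_cv`, `cv_pipeline`), is to wrap the preprocessing and the model in a scikit-learn pipeline so that each fold refits them on its own training portion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the landmark features, with about 5% missing values.
X_cv, y_cv = make_classification(n_samples=300, n_features=19, random_state=0)
rng = np.random.default_rng(0)
X_cv[rng.random(X_cv.shape) < 0.05] = np.nan

# Imputation and scaling are refit inside every fold, on training rows only.
cv_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=folds)
print(f"CV mean: {cv_scores.mean():.3f}, CV std: {cv_scores.std():.3f}")
```

Reporting the standard deviation alongside the mean makes it easier to judge whether differences between models are larger than fold-to-fold noise.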

## Train a Logistic Regression Model

In [3]:
# Initialize the Logistic Regression model
logistic_model = LogisticRegression(max_iter=1000)

# Train the model using the imputed and scaled training data
logistic_model.fit(X_train_scaled, y_train)

# Predict on the imputed and scaled test set
logistic_predictions = logistic_model.predict(X_test_scaled)

# Evaluate the Logistic Regression model
logistic_accuracy = accuracy_score(y_test, logistic_predictions)
print(f"Logistic Regression Accuracy: {logistic_accuracy}")
print(classification_report(y_test, logistic_predictions))


Logistic Regression Accuracy: 0.5907046476761619
              precision    recall  f1-score   support

           0       0.69      0.32      0.44       332
           1       0.56      0.86      0.68       335

    accuracy                           0.59       667
   macro avg       0.63      0.59      0.56       667
weighted avg       0.63      0.59      0.56       667



In [4]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation for Logistic Regression
logistic_cv_scores = cross_val_score(logistic_model, X_train_scaled, y_train, cv=5)
print(f"Logistic Regression CV scores: {logistic_cv_scores}")
print(f"Logistic Regression CV mean score: {logistic_cv_scores.mean()}")

Logistic Regression CV scores: [0.57973734 0.61163227 0.59662289 0.55909944 0.57142857]
Logistic Regression CV mean score: 0.5837041007772715


## Train a Random Forest Classifier

In [7]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest model on the imputed data (no need to scale for tree-based models)
rf_model.fit(X_train_imputed, y_train)

# Predict on the imputed test set
rf_predictions = rf_model.predict(X_test_imputed)

# Evaluate the Random Forest model
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f"Random Forest Accuracy: {rf_accuracy}")
print(classification_report(y_test, rf_predictions))

Random Forest Accuracy: 0.7466266866566716
              precision    recall  f1-score   support

           0       0.75      0.73      0.74       332
           1       0.74      0.76      0.75       335

    accuracy                           0.75       667
   macro avg       0.75      0.75      0.75       667
weighted avg       0.75      0.75      0.75       667



In [8]:
# Perform 5-fold cross-validation for Random Forest
rf_cv_scores = cross_val_score(rf_model, X_train_imputed, y_train, cv=5)
print(f"Random Forest CV scores: {rf_cv_scores}")
print(f"Random Forest CV mean score: {rf_cv_scores.mean()}")

Random Forest CV scores: [0.75984991 0.79362101 0.73358349 0.73921201 0.7387218 ]
Random Forest CV mean score: 0.7529976442043195


## Train a Gradient Boosting Classifier

In [9]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Train the Gradient Boosting model on the imputed data
gb_model.fit(X_train_imputed, y_train)

# Predict on the imputed test set
gb_predictions = gb_model.predict(X_test_imputed)

# Evaluate the Gradient Boosting model
gb_accuracy = accuracy_score(y_test, gb_predictions)
print(f"Gradient Boosting Accuracy: {gb_accuracy}")
print(classification_report(y_test, gb_predictions))

Gradient Boosting Accuracy: 0.7076461769115442
              precision    recall  f1-score   support

           0       0.70      0.71      0.71       332
           1       0.71      0.70      0.71       335

    accuracy                           0.71       667
   macro avg       0.71      0.71      0.71       667
weighted avg       0.71      0.71      0.71       667



In [10]:
# Perform 5-fold cross-validation for Gradient Boosting
gb_cv_scores = cross_val_score(gb_model, X_train_imputed, y_train, cv=5)
print(f"Gradient Boosting CV scores: {gb_cv_scores}")
print(f"Gradient Boosting CV mean score: {gb_cv_scores.mean()}")

Gradient Boosting CV scores: [0.71294559 0.74296435 0.69606004 0.69606004 0.71240602]
Gradient Boosting CV mean score: 0.712087206759864


# Conclusion
In this notebook, we have explored a proof of concept for deepfake detection using facial landmark variance as a distinguishing feature between real and fake videos. Our initial exploratory data analysis suggested that variances in the movement of facial landmarks could be promising indicators of video authenticity.

We did this by extracting facial landmark data from a subset of our video dataset and generating features based on the variance of landmark movements. We then trained basic machine learning models (Logistic Regression, Random Forest, and Gradient Boosting) to classify videos as real or fake based on these features.

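The performance of these initial models is encouraging. On the held-out test set of 667 videos, Random Forest reached about 75% accuracy, Gradient Boosting about 71%, and Logistic Regression about 59%, and the 5-fold cross-validation means (0.753, 0.712, and 0.584) are within one percentage point of those figures. Because the test set is nearly balanced (332 fake, 335 real), chance level is about 50%. While the accuracy is far from perfect, the tree-based models are well above chance, suggesting that the features contain meaningful information about video authenticity.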
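
To make the comparison with chance concrete, the short calculation below takes the class counts and accuracies printed in the classification reports above and compares each model against a majority-class baseline, meaning a classifier that always predicts the more frequent label. The variable names are illustrative.

```python
# Test-set class counts and accuracies as printed in the reports above.
support = {"FAKE (0)": 332, "REAL (1)": 335}
reported_accuracy = {
    "Logistic Regression": 0.5907,
    "Random Forest": 0.7466,
    "Gradient Boosting": 0.7076,
}

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(support.values()) / sum(support.values())
print(f"Majority-class baseline: {majority_baseline:.4f}")

for name, acc in reported_accuracy.items():
    gain = acc - majority_baseline
    print(f"{name}: {acc:.4f} ({gain:+.4f} vs. baseline)")
```

The baseline comes out at roughly 0.50, so Random Forest and Gradient Boosting improve on it by about 24 and 21 percentage points, and Logistic Regression by about 9.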

## Next Steps

The results from this notebook serve as a strong foundation for our next phase of work, which involves several key steps:

1.) Full Dataset Training: We will scale up our efforts by training models on the full dataset, which will likely enhance the robustness and generalizability of our findings.

2.) Refined Model Development: A more sophisticated model will be developed to directly compare pairs of videos (one real and one fake) and identify the inauthentic one. We expect this approach to be more accurate because it can leverage the subtle differences between an original and its corresponding deepfake.

3.) Feature Engineering and Model Tuning: Further feature engineering and hyperparameter tuning will be conducted to improve the models.
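
As a starting point for the hyperparameter tuning mentioned in the next steps, a small grid search over Random Forest settings might look like the sketch below. The data is synthetic, the grid values are arbitrary placeholders rather than recommendations, and names such as `X_tune` and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the landmark features (19 columns, binary label).
X_tune, y_tune = make_classification(n_samples=200, n_features=19, random_state=1)

# Placeholder grid: every combination is scored with 3-fold cross-validation.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tune, y_tune)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

On the real features, the search should be fit on the training split only, so that the test set stays untouched for the final evaluation.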

By following these steps, we aim to develop a robust deepfake detection system that can serve as a valuable tool in the fight against digital misinformation. Our work contributes to the broader effort to maintain integrity and trust in digital media.

