# Deepfake Detection: Proof of Concept with Feature Extraction and Modeling
## Team Quarks (Ali & Belal)
## Objective
The objective of this notebook is to test a proof of concept for deepfake detection. We use basic machine learning models to assess the predictive power of the facial landmark velocity variance features, which our EDA identified as potential indicators.
## Data Description
The dataset consists of facial landmark data extracted from a series of videos. Each entry in the dataset represents a video and includes the variance of facial landmark velocities in the X and Y axes, as well as the percentage of frames in which a face was successfully detected.
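
The feature extraction code is not part of this notebook, so the following is only a minimal sketch of how one such feature could be derived. It assumes that velocity means the frame-to-frame displacement of a landmark position and that the variance is the population variance over all frames of a video. The function name `velocity_variance` and the sample track are hypothetical.

```python
import numpy as np

def velocity_variance(positions: np.ndarray) -> tuple[float, float]:
    """Variance of per-frame velocity along X and Y.

    positions: array of shape (n_frames, 2) holding the (x, y) location
    of one landmark (or the mean of a landmark group) in each frame.
    """
    # Frame-to-frame displacement, used here as the velocity (assumption)
    velocities = np.diff(positions, axis=0)
    # Population variance (ddof=0) of the velocities on each axis
    var_x, var_y = velocities.var(axis=0)
    return float(var_x), float(var_y)

# Hypothetical track: steady motion along X, back-and-forth jitter along Y
track = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 0.0], [3.0, 2.0], [4.0, 0.0]])
x_var, y_var = velocity_variance(track)
print(x_var, y_var)
```

A landmark that moves at a constant speed yields zero velocity variance on that axis, while jitter yields a large one, which is the kind of signal these features are meant to capture.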

In [1]:
import os
import pandas as pd
import json

def create_dataframe_from_json(directory):
    data = []
    errors = []

    # List all files in the given directory
    for filename in os.listdir(directory):
        if filename.endswith('.json'):
            file_path = os.path.join(directory, filename)

            try:
                # Read the JSON file
                with open(file_path, 'r') as file:
                    json_data = json.load(file)
                    
                # Start the dictionary with the video name
                video_data = {'video_name': filename.replace('.json', '')}
                # Update this dictionary with the overall_features
                video_data.update(json_data.get("overall_features", {}))

                data.append(video_data)

            except Exception as e:
                errors.append((filename, str(e)))

    # Create a DataFrame
    df = pd.DataFrame(data)

    # Rename columns to be more descriptive
    column_renaming = {
        'chin_xvel_var': 'Chin X-Axis Velocity Variance',
        'chin_yvel_var': 'Chin Y-Axis Velocity Variance',
        'left_eyebrow_xvel_var': 'Left Eyebrow X-Axis Velocity Variance',
        'left_eyebrow_yvel_var': 'Left Eyebrow Y-Axis Velocity Variance',
        'right_eyebrow_xvel_var': 'Right Eyebrow X-Axis Velocity Variance',
        'right_eyebrow_yvel_var': 'Right Eyebrow Y-Axis Velocity Variance',
        'nose_bridge_xvel_var': 'Nose Bridge X-Axis Velocity Variance',
        'nose_bridge_yvel_var': 'Nose Bridge Y-Axis Velocity Variance',
        'nose_tip_xvel_var': 'Nose Tip X-Axis Velocity Variance',
        'nose_tip_yvel_var': 'Nose Tip Y-Axis Velocity Variance',
        'left_eye_xvel_var': 'Left Eye X-Axis Velocity Variance',
        'left_eye_yvel_var': 'Left Eye Y-Axis Velocity Variance',
        'right_eye_xvel_var': 'Right Eye X-Axis Velocity Variance',
        'right_eye_yvel_var': 'Right Eye Y-Axis Velocity Variance',
        'top_lip_xvel_var': 'Top Lip X-Axis Velocity Variance',
        'top_lip_yvel_var': 'Top Lip Y-Axis Velocity Variance',
        'bottom_lip_xvel_var': 'Bottom Lip X-Axis Velocity Variance',
        'bottom_lip_yvel_var': 'Bottom Lip Y-Axis Velocity Variance',
        'face_detection_percentage': 'Face Detection Percentage',
        'label': 'Video Authenticity Label'
    }
    df = df.rename(columns=column_renaming)

    return df, errors

directory_path = "/data1/belalm/Capstone/data/landmarks"
df, errors = create_dataframe_from_json(directory_path)

print(df) 

if errors:
    print("Errors encountered:")
    for error in errors:
        print(error)


     video_name  Chin X-Axis Velocity Variance  Chin Y-Axis Velocity Variance  \
0    wvrzowftpz                       0.096015                       0.328666   
1    spoezekgpo                       1.402898                      14.888691   
2    zocvwatcsf                       0.237263                       0.784404   
3    ksdohprrko                     244.722785                     177.297338   
4    vzwjztahsh                       0.476414                       2.015065   
..          ...                            ...                            ...   
250  wzgbtkbgkg                       2.250000                       0.778547   
251  fgmqrobblg                       3.213788                      10.255761   
252  odvqbtoczb                       0.435622                       8.914122   
253  bxriwqpced                       0.429208                       0.912364   
254  hyqszrnkhz                       0.540259                       1.310818   

[output truncated]

## Model Testing
In this section, we split the data into training and test sets, impute and scale the features, and train several machine learning models to assess their performance.


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

X = df.drop(['video_name', 'Video Authenticity Label'], axis=1)
y = df['Video Authenticity Label'].map({'FAKE': 0, 'REAL': 1})  # Convert labels to binary

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Impute missing values using the mean of each column
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Standardize the features for the scale-sensitive models (Logistic Regression and SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)


## Train a Logistic Regression model

In [3]:
# Initialize the Logistic Regression model
logistic_model = LogisticRegression(max_iter=1000)

# Train the model using the imputed and scaled training data
logistic_model.fit(X_train_scaled, y_train)

# Predict on the imputed and scaled test set
logistic_predictions = logistic_model.predict(X_test_scaled)

# Evaluate the Logistic Regression model
logistic_accuracy = accuracy_score(y_test, logistic_predictions)
print(f"Logistic Regression Accuracy: {logistic_accuracy}")
print(classification_report(y_test, logistic_predictions))


Logistic Regression Accuracy: 0.5294117647058824
              precision    recall  f1-score   support

           0       0.55      0.42      0.48        26
           1       0.52      0.64      0.57        25

    accuracy                           0.53        51
   macro avg       0.53      0.53      0.52        51
weighted avg       0.53      0.53      0.52        51



In [4]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation for Logistic Regression
logistic_cv_scores = cross_val_score(logistic_model, X_train_scaled, y_train, cv=5)
print(f"Logistic Regression CV scores: {logistic_cv_scores}")
print(f"Logistic Regression CV mean score: {logistic_cv_scores.mean()}")

Logistic Regression CV scores: [0.48780488 0.56097561 0.65853659 0.58536585 0.65      ]
Logistic Regression CV mean score: 0.5885365853658537
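
The cross-validation above runs on `X_train_scaled`, which was imputed and scaled with statistics from the whole training set, so every validation fold has already influenced the preprocessing. The effect is probably small here, but it can be avoided by wrapping the steps in a scikit-learn pipeline so that they are refit inside each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the landmark features; `X_demo`, `y_demo`, and `pipeline_cv_scores` are names introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 19 numeric features (not the project's data)
X_demo, y_demo = make_classification(n_samples=200, n_features=19, random_state=42)

# Imputation and scaling are fit on the training part of each fold only
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

pipeline_cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(f"Pipeline CV mean score: {pipeline_cv_scores.mean()}")
```

With the notebook's data, the unscaled `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.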


## Train a Support Vector Machine model

In [5]:
# Initialize the SVM model
svm_model = SVC()

# Train the SVM model on the scaled data
svm_model.fit(X_train_scaled, y_train)

# Predict on the scaled test set
svm_predictions = svm_model.predict(X_test_scaled)

# Evaluate the SVM model
svm_accuracy = accuracy_score(y_test, svm_predictions)
print(f"SVM Accuracy: {svm_accuracy}")
print(classification_report(y_test, svm_predictions))

SVM Accuracy: 0.5490196078431373
              precision    recall  f1-score   support

           0       0.58      0.42      0.49        26
           1       0.53      0.68      0.60        25

    accuracy                           0.55        51
   macro avg       0.56      0.55      0.54        51
weighted avg       0.56      0.55      0.54        51



In [6]:
# Perform 5-fold cross-validation for SVM
svm_cv_scores = cross_val_score(svm_model, X_train_scaled, y_train, cv=5)
print(f"SVM CV scores: {svm_cv_scores}")
print(f"SVM CV mean score: {svm_cv_scores.mean()}")

SVM CV scores: [0.51219512 0.53658537 0.63414634 0.58536585 0.65      ]
SVM CV mean score: 0.5836585365853659


## Train a Random Forest Classifier

In [7]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest model on the imputed data (no need to scale for tree-based models)
rf_model.fit(X_train_imputed, y_train)

# Predict on the imputed test set
rf_predictions = rf_model.predict(X_test_imputed)

# Evaluate the Random Forest model
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f"Random Forest Accuracy: {rf_accuracy}")
print(classification_report(y_test, rf_predictions))

Random Forest Accuracy: 0.6666666666666666
              precision    recall  f1-score   support

           0       0.63      0.85      0.72        26
           1       0.75      0.48      0.59        25

    accuracy                           0.67        51
   macro avg       0.69      0.66      0.65        51
weighted avg       0.69      0.67      0.65        51



In [8]:
# Perform 5-fold cross-validation for Random Forest
rf_cv_scores = cross_val_score(rf_model, X_train_imputed, y_train, cv=5)
print(f"Random Forest CV scores: {rf_cv_scores}")
print(f"Random Forest CV mean score: {rf_cv_scores.mean()}")

Random Forest CV scores: [0.58536585 0.68292683 0.65853659 0.70731707 0.575     ]
Random Forest CV mean score: 0.6418292682926829


## Train a Gradient Boosting Classifier

In [9]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Train the Gradient Boosting model on the imputed data
gb_model.fit(X_train_imputed, y_train)

# Predict on the imputed test set
gb_predictions = gb_model.predict(X_test_imputed)

# Evaluate the Gradient Boosting model
gb_accuracy = accuracy_score(y_test, gb_predictions)
print(f"Gradient Boosting Accuracy: {gb_accuracy}")
print(classification_report(y_test, gb_predictions))

Gradient Boosting Accuracy: 0.5686274509803921
              precision    recall  f1-score   support

           0       0.57      0.65      0.61        26
           1       0.57      0.48      0.52        25

    accuracy                           0.57        51
   macro avg       0.57      0.57      0.56        51
weighted avg       0.57      0.57      0.57        51



In [10]:
# Perform 5-fold cross-validation for Gradient Boosting
gb_cv_scores = cross_val_score(gb_model, X_train_imputed, y_train, cv=5)
print(f"Gradient Boosting CV scores: {gb_cv_scores}")
print(f"Gradient Boosting CV mean score: {gb_cv_scores.mean()}")

Gradient Boosting CV scores: [0.53658537 0.70731707 0.58536585 0.73170732 0.6       ]
Gradient Boosting CV mean score: 0.6321951219512194
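
The fold scores printed above vary by several points within each model, so the means alone can hide how much the estimates overlap. As a small sketch, the fold scores copied from the outputs above can be summarized with their mean and sample standard deviation; `cv_scores` and `cv_summary` are names introduced here.

```python
from statistics import mean, stdev

# Fold scores copied from the cross-validation outputs in this notebook
cv_scores = {
    "Logistic Regression": [0.48780488, 0.56097561, 0.65853659, 0.58536585, 0.65],
    "SVM": [0.51219512, 0.53658537, 0.63414634, 0.58536585, 0.65],
    "Random Forest": [0.58536585, 0.68292683, 0.65853659, 0.70731707, 0.575],
    "Gradient Boosting": [0.53658537, 0.70731707, 0.58536585, 0.73170732, 0.6],
}

# Mean and sample standard deviation across the five folds
cv_summary = {name: (mean(scores), stdev(scores)) for name, scores in cv_scores.items()}

for name, (avg, spread) in cv_summary.items():
    print(f"{name}: mean = {avg:.3f}, std = {spread:.3f}")
```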


# Conclusion
In this notebook, we have explored a proof of concept for deepfake detection using facial landmark variance as a distinguishing feature between real and fake videos. Our initial exploratory data analysis suggested that variances in the movement of facial landmarks could be promising indicators of video authenticity.

We did this by extracting facial landmark data from a subset of our video dataset and generating features based on the variance of landmark movements. We then trained four basic machine learning models (Logistic Regression, Support Vector Machine, Random Forest, and Gradient Boosting) to classify videos as real or fake based on these features.

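The initial models gave mixed but encouraging results. On the 51-video test set, accuracy ranged from 0.53 (Logistic Regression) to 0.67 (Random Forest), and the 5-fold cross-validation means on the training set ranged from 0.58 (SVM) to 0.64 (Random Forest). The tree-based models score modestly above the 0.5 chance level, which suggests that the landmark variance features carry some information about video authenticity. The Logistic Regression and SVM results are close to chance, and a test set this small leaves considerable uncertainty around every score.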
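
Whether an accuracy measured on 51 test videos can be distinguished from chance can be checked with an exact one-sided binomial test. The sketch below assumes a chance level of 0.5, which is reasonable given the 26/25 class split in the test set, and recovers the number of correct predictions from the reported accuracies (for example, 0.6667 × 51 = 34 for Random Forest). The helper `binomial_tail` is introduced here for illustration.

```python
from math import comb

def binomial_tail(correct: int, total: int, p_chance: float = 0.5) -> float:
    """Probability of at least `correct` hits out of `total` under pure guessing."""
    return sum(
        comb(total, k) * p_chance**k * (1 - p_chance) ** (total - k)
        for k in range(correct, total + 1)
    )

# Correct predictions on the 51 test videos, derived from the reported accuracies
test_size = 51
correct_counts = {
    "Logistic Regression": 27,
    "SVM": 28,
    "Random Forest": 34,
    "Gradient Boosting": 29,
}

p_values = {name: binomial_tail(c, test_size) for name, c in correct_counts.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```

A small p-value means the score is unlikely under guessing. Because four models were evaluated on the same test set, these values should be read as a rough guide and not as a formal result.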

## Next Steps

The results from this notebook serve as a foundation for our next phase of work, which involves several key steps:

1.) Full Dataset Training: We will train models on the full dataset, which should make our findings more robust and generalizable.

2.) Refined Model Development: We will develop a more sophisticated model that directly compares pairs of videos (one real and one fake) to identify the inauthentic one. We expect this approach to be more accurate because it can exploit the subtle differences between an original and its corresponding deepfake.

3.) Feature Engineering and Model Tuning: We will conduct further feature engineering and hyperparameter tuning to improve the models.
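
The paired comparison idea can be sketched in a few lines. This is only an illustration of the decision rule, not the planned model: it assumes some scoring function that returns a higher value for videos that look more fake, and it picks whichever video in the pair scores higher. Both `pick_fake` and `toy_fake_score` are hypothetical names, and the toy scorer is a placeholder with no claim to accuracy.

```python
from typing import Callable, Sequence

Features = Sequence[float]

def pick_fake(
    features_a: Features,
    features_b: Features,
    fake_score: Callable[[Features], float],
) -> int:
    """Return 0 if the first video scores as more likely fake, otherwise 1."""
    return 0 if fake_score(features_a) > fake_score(features_b) else 1

def toy_fake_score(features: Features) -> float:
    """Placeholder scorer for illustration only: total velocity variance."""
    return sum(features)

# Hypothetical feature vectors for the two videos in a pair
pair_guess = pick_fake([0.1, 0.3], [2.4, 1.9], toy_fake_score)
print(pair_guess)
```

In practice, `fake_score` could be the predicted probability of the FAKE class from any of the classifiers trained above.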

By following these steps, we aim to develop a robust deepfake detection system that can serve as a valuable tool in the fight against digital misinformation. Our work contributes to the broader effort to maintain integrity and trust in digital media.

