<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project - DeFakeIt : Detecting Deepfake Audio

>Author: Gilbert

---

**Context:**

The rapid advancement of artificial intelligence (AI) technology has revolutionized audio generation, enabling the creation of remarkably realistic synthetic speech. This innovation holds tremendous potential for enhancing accessibility, language translation, and entertainment, benefiting individuals and industries worldwide. However, the same AI capabilities have been exploited for malicious purposes, leading to the proliferation of deepfake audio used in scams, misinformation campaigns, and hate speech. The ease with which AI-generated audio can impersonate individuals and manipulate recordings has raised serious concerns about authenticity and trust in digital communication channels.

This has been a problem in Singapore as well: a recent deepfake video circulating online impersonated PM Lee's voice to promote cryptocurrency investment. [Source: Straits Times, Dec 2023](https://www.straitstimes.com/singapore/pm-lee-warns-against-responding-to-deepfake-videos-of-him-promoting-investment-scams)

**Problem Statement:**  

How can we develop a model that effectively detects deepfake audio recordings by distinguishing genuine human speech from AI-generated audio, in order to ensure audio authenticity and combat the spread of misinformation and fraud?

**Target Audience:**  

SPF Scam Division  

These are the notebooks for this project:  
 1. [`01 Feature Engineering and EDA`](01_feature_engineering_and_EDA.ipynb)
 2. [`02 Baseline Modelling`](02_baseline_modelling.ipynb)
 3. [`03 Hyperparameter Tuning of Baseline Model`](03_hyperparametertuning_traditional_model.ipynb)
 4. [`04 Deep Learning Modelling`](04_deep_learning_modelling.ipynb)


---

 # This Notebook: 02_Traditional_Classifier_Modelling

In this notebook, we will create various baseline classifier models and select the top 3 models to be tuned further in the next notebook.

To determine the appropriate baseline models, we first weigh both interpretability and performance. Interpretability is how easily one can understand and explain a model's predictions, whereas performance is the model's ability to capture patterns in the data and generalize to new instances.

The table below compares the pros and cons of the different classification models:

| Classification Method           | Pros                                     | Cons                                                       | Usage Suggestions                                                                                                                  |
|--------------------------------|------------------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| Logistic Regression             | - Interpretable coefficients             | - Assumes linear relationship                             | Use for straightforward interpretation of how each audio feature influences the predicted class. Suitable for cases where linear relationships between features and outcome are plausible. |
| Decision Trees                  | - Easy to interpret and visualize       | - Prone to overfitting                                    | Use for initial exploration of feature importance and identification of relevant predictors. Prune the tree to prevent overfitting. Suitable for both numerical and categorical data. |
| Random Forest                   | - Reduces overfitting                   | - Less interpretable than Decision Trees                  | Use for improved generalization by combining multiple decision trees. Utilize feature importance measures to understand which audio features contribute most to the prediction. Suitable for large datasets. |
| Extra Trees                     | - Reduces variance further              | - Sacrifices interpretability for improved performance   | Use for faster training and potentially better performance compared to Random Forest. Particularly useful when computational resources are limited and interpretability is not the primary concern. |
| Support Vector Machines (SVM)   | - Effective in high-dimensional spaces | - Complexity in choosing the appropriate kernel          | Use for finding optimal hyperplanes to separate fake and real audio clips. Requires careful selection of hyperparameters and choice of kernel function. Suitable for cases with complex, non-linear relationships. |
| k-Nearest Neighbors (k-NN)     | - Simple and intuitive                  | - Sensitive to irrelevant features                        | Use for classifying a clip based on its similarity to other labelled clips in the dataset. Normalize features and tune the number of neighbors to improve performance. Suitable for small to medium-sized datasets. |
| Naive Bayes                     | - Computationally efficient            | - Assumes strong independence between features            | Use for quick classification based on conditional probabilities. Suitable for cases where the independence assumptions are not severely violated. |
| Gradient Boosting Machines (GBM)| - Combines weak learners to improve accuracy | - Can be computationally expensive and prone to overfitting | Use for building a strong predictive model by sequentially correcting errors of weak models. Regularize hyperparameters to prevent overfitting. Suitable for datasets with complex relationships and high predictive accuracy requirements. |
| AdaBoost                        | - Sequentially combines weak learners   | - Sensitive to noisy data                                 | Use for iteratively adjusting weights to focus on previously misclassified cases. Prune weak learners to improve generalization. Suitable for ensemble learning when there's a large imbalance between the fake and real classes. |
| XGBoost                         | - High performance and scalability     | - Less interpretable than simpler models                  | Use for maximizing predictive accuracy and handling large datasets. Tune hyperparameters to balance bias and variance. Suitable for situations where interpretability is less critical compared to predictive power. |

Taking both interpretability and performance into account, the table below summarizes the 6 baseline models selected.

| Classifier                   | Interpretability | Performance | Recommendations                                                                                                                                                                  |
|------------------------------|------------------|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Logistic Regression          | High             | Moderate    | Suitable for linearly separable data, easy to interpret coefficients, works well with small to medium-sized datasets.                                                  |
| Random Forest                | Moderate         | High        | Combines multiple decision trees to reduce overfitting, robust to noise and outliers, suitable for large datasets with high dimensionality.                            |
| Support Vector Machines (SVM)| Low              | High        | Effective in high-dimensional spaces, versatile due to different kernel functions, can be memory intensive, suitable for small to medium-sized datasets.            |
| Gradient Boosting Machines (GBM)| Low           | High        | Ensemble method that combines weak learners to improve accuracy, less interpretable due to complexity, suitable for various types of data.                            |
| XGBoost                      | Low              | High        | Optimized implementation of gradient boosting, often outperforms other algorithms, less interpretable but highly accurate, suitable for large datasets.               |
| Decision Trees               | Moderate         | Moderate    | Simple to understand and visualize, prone to overfitting with complex datasets, suitable for small to medium-sized datasets when used in ensemble methods.                |

 ---


#### **Import Libraries and Dataset**

In [1]:
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from imblearn.over_sampling import ADASYN
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier


In [2]:
#import dataset
audio_df = pd.read_csv('../data/audio_file.csv')

#### **Overview of the datasets**

---

This gives an overview of what the dataset looks like and what it contains.

In [3]:
audio_df.head()

Unnamed: 0,audio_file,label,zero_crossing_rate,spectral_centroid,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,...,chroma_feature_4,chroma_feature_5,chroma_feature_6,chroma_feature_7,chroma_feature_8,chroma_feature_9,chroma_feature_10,chroma_feature_11,chroma_feature_12,chroma_feature_13
0,c:\Users\User\GA\sandbox\capstone_project\data...,real,0.089659,1304.235799,-323.98456,110.490395,-15.569651,29.687305,-9.337172,-10.990068,...,0.321698,0.316995,0.316548,0.302425,0.274523,0.28704,0.29607,0.311789,0.331806,0.378438
1,c:\Users\User\GA\sandbox\capstone_project\data...,real,0.039649,799.118275,-343.70306,160.3429,-12.456814,16.461798,-1.550739,-12.00209,...,0.472099,0.447077,0.368103,0.304562,0.384273,0.362903,0.22481,0.212382,0.254745,0.259224
2,c:\Users\User\GA\sandbox\capstone_project\data...,real,0.093555,1297.138552,-311.03366,106.231544,-10.309275,24.133608,-11.460187,-11.904459,...,0.306986,0.329119,0.29285,0.265886,0.319642,0.348089,0.311742,0.355265,0.39882,0.364525
3,c:\Users\User\GA\sandbox\capstone_project\data...,real,0.089239,1316.380153,-353.7442,115.89293,-23.654213,33.060036,-12.907932,-9.533757,...,0.331827,0.310453,0.318776,0.314396,0.279102,0.25403,0.211233,0.242437,0.328217,0.382869
4,c:\Users\User\GA\sandbox\capstone_project\data...,real,0.072209,1167.977057,-338.6359,114.52832,-10.806395,34.839474,-0.268747,-7.760354,...,0.299801,0.330489,0.353679,0.359427,0.333285,0.28002,0.237893,0.272207,0.336213,0.351948


In [4]:
audio_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91700 entries, 0 to 91699
Data columns (total 43 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   audio_file          91700 non-null  object 
 1   label               91700 non-null  object 
 2   zero_crossing_rate  91700 non-null  float64
 3   spectral_centroid   91700 non-null  float64
 4   mfcc_1              91700 non-null  float64
 5   mfcc_2              91700 non-null  float64
 6   mfcc_3              91700 non-null  float64
 7   mfcc_4              91700 non-null  float64
 8   mfcc_5              91700 non-null  float64
 9   mfcc_6              91700 non-null  float64
 10  mfcc_7              91700 non-null  float64
 11  mfcc_8              91700 non-null  float64
 12  mfcc_9              91700 non-null  float64
 13  mfcc_10             91700 non-null  float64
 14  mfcc_11             91700 non-null  float64
 15  mfcc_12             91700 non-null  float64
 16  mfcc

#### **Check null values**

---

This is done to ensure that there are no empty cells in the dataset.
Checking for null values is vital to maintain data integrity and accuracy, because null values can distort analysis results and lead to biased insights and erroneous conclusions.

In [5]:
audio_df.isnull().sum()

audio_file            0
label                 0
zero_crossing_rate    0
spectral_centroid     0
mfcc_1                0
mfcc_2                0
mfcc_3                0
mfcc_4                0
mfcc_5                0
mfcc_6                0
mfcc_7                0
mfcc_8                0
mfcc_9                0
mfcc_10               0
mfcc_11               0
mfcc_12               0
mfcc_13               0
d_mfcc_1              0
d_mfcc_2              0
d_mfcc_3              0
d_mfcc_4              0
d_mfcc_5              0
d_mfcc_6              0
d_mfcc_7              0
d_mfcc_8              0
d_mfcc_9              0
d_mfcc_10             0
d_mfcc_11             0
d_mfcc_12             0
d_mfcc_13             0
chroma_feature_1      0
chroma_feature_2      0
chroma_feature_3      0
chroma_feature_4      0
chroma_feature_5      0
chroma_feature_6      0
chroma_feature_7      0
chroma_feature_8      0
chroma_feature_9      0
chroma_feature_10     0
chroma_feature_11     0
chroma_feature_1

There are no null values identified

### **Data Preprocessing**

---

In this part of the notebook before going into modelling, we will do data preprocessing with the following steps:
1. Set X and Y variable, split the data into train and test data
2. Class proportion check and balancing
3. Standardize the X features with `StandardScaler` for both train and test

---

#### **Step 1: Set X and y, Split Train and Test Data**

Convert the `label` column to a binary number for modelling purposes.

We will convert the labels `real` and `fake` as follows:
1. `real` = 0
2. `fake` = 1

In [6]:
audio_df['label'] = audio_df['label'].apply(lambda x: 0 if x == 'real' else 1)
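The lambda above maps every value other than `real` to 1, so a typo or an unexpected label would silently count as `fake`. Below is a minimal sketch of a stricter alternative on toy labels; the `label_map` name and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy labels; the real notebook reads them from audio_file.csv
labels = pd.Series(["real", "fake", "fake", "real"])

# Explicit mapping: any value other than 'real'/'fake' becomes NaN instead of 1
label_map = {"real": 0, "fake": 1}
encoded = labels.map(label_map)

# Fail loudly if an unexpected label slipped through
if encoded.isna().any():
    raise ValueError("unexpected label values found")

encoded = encoded.astype(int)
print(encoded.tolist())
```

With only the two expected labels present, both approaches give the same encoding.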

Split the dataset into train and test sets. The `train` data are used to fit the model, while the `test` data are used to verify how the model performs on unseen data.

In [7]:
#Set X and y variable 
X = audio_df.drop(columns = ['audio_file','label'])
y = audio_df['label']

X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y, random_state=42)
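To illustrate what `stratify=y` does in the split above, here is a small sketch on toy data with roughly the same 6:1 class ratio. The toy frame and the `_toy` variable names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 6:1 ratio between class 1 and class 0
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(700, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.Series([1] * 600 + [0] * 100, name="label")

# Same call pattern as the notebook: default test_size (25%) with stratification
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

train_share = y_tr.mean()  # share of class 1 in the train split
test_share = y_te.mean()   # share of class 1 in the test split
print(f"class 1 share: train={train_share:.3f}, test={test_share:.3f}")
```

Stratification keeps the class proportions nearly identical in both splits, which matters here because the `real` class is small.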

#### **Step 2: Class Proportion Check and Balancing**

---

Balancing the dataset addresses biases caused by unequal class distributions, so that all classes are represented fairly during model training. With a balanced dataset, models are less prone to favoring the majority class and can generalize better to the minority class, improving overall predictive performance.

Check the proportion of the target variable y

In [28]:
y.value_counts(normalize=True)

label
1    0.857143
0    0.142857
Name: proportion, dtype: float64

Based on the proportions of `fake` and `real`, the classes are highly imbalanced.
Addressing class imbalance is important because imbalanced data can bias model training: the model tends to favor the majority class and overlook the minority class, which leads to misleading conclusions and ineffective decisions based on the model's outputs.

We will use an oversampling method (`ADASYN`) to balance the classes by amplifying the minority class (`real`). We do not use undersampling because the `fake` data are created by different types of GANs, so undersampling could discard valuable information from the `fake` class.

In [9]:
#create oversample with ADASYN
ada = ADASYN(random_state = 42)
X_train_resample, y_train_resample = ada.fit_resample(X_train,y_train)

Check the data proportion after resampling

In [27]:
y_train_resample.value_counts(normalize=True)

label
0    0.502027
1    0.497973
Name: proportion, dtype: float64

The datasets are balanced now.
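For intuition about what ADASYN does, the sketch below implements a simplified ADASYN-style oversampler with NumPy and scikit-learn. It is not the `imblearn` implementation: the function name `adasyn_like_oversample`, the toy data, the random (rather than deterministic) allocation of synthetic points, and the use of minority-only neighbours for interpolation are all simplifying assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_like_oversample(X_min, X_maj, n_new, k=5, seed=42):
    """Simplified ADASYN-style oversampling (illustration only)."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    is_major = np.concatenate(
        [np.zeros(len(X_min), dtype=bool), np.ones(len(X_maj), dtype=bool)]
    )

    # 1. Difficulty of each minority point: share of majority points among
    #    its k nearest neighbours (column 0 is normally the point itself)
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    idx_all = nn_all.kneighbors(X_min, return_distance=False)[:, 1:]
    difficulty = is_major[idx_all].mean(axis=1)

    # 2. Turn difficulty into sampling weights (uniform if no point is "hard")
    total = difficulty.sum()
    weights = difficulty / total if total > 0 else np.full(len(X_min), 1 / len(X_min))

    # 3. Interpolate between a chosen minority point and one of its
    #    minority-class neighbours
    nn_min = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    idx_min = nn_min.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.choice(len(X_min), size=n_new, p=weights)
    neighbour = idx_min[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[neighbour] - X_min[base])

rng = np.random.default_rng(0)
X_real = rng.normal(loc=0.0, size=(30, 2))   # toy minority class
X_fake = rng.normal(loc=1.5, size=(180, 2))  # toy majority class
X_synth = adasyn_like_oversample(X_real, X_fake, n_new=150)
print(X_synth.shape)
```

The key idea is that minority points surrounded by majority points receive more synthetic neighbours, so the new samples concentrate near the class boundary. This may also explain why the resampled classes are close to, but not exactly, 50/50.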

#### **Step 3: StandardScale Datasets**

---

StandardScaler is used to standardize features by removing the mean and scaling to unit variance. It transforms the data such that it has a mean of 0 and a standard deviation of 1, making it easier to compare and interpret the effects of different features on machine learning models.

We first take a look at the resampled training data.

In [11]:
X_train_resample.head()

Unnamed: 0,zero_crossing_rate,spectral_centroid,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,mfcc_7,mfcc_8,...,chroma_feature_4,chroma_feature_5,chroma_feature_6,chroma_feature_7,chroma_feature_8,chroma_feature_9,chroma_feature_10,chroma_feature_11,chroma_feature_12,chroma_feature_13
0,0.144025,1804.273201,-388.875,93.59637,-34.15246,58.156796,-17.14077,-2.440681,-14.372906,-27.444033,...,0.329877,0.328536,0.308672,0.285831,0.325204,0.327426,0.335113,0.467133,0.502557,0.39901
1,0.092919,1374.336393,-356.26257,120.89103,-26.184362,44.641727,-8.690201,-6.37142,-10.669409,-26.60745,...,0.340985,0.364688,0.442239,0.421427,0.356675,0.294047,0.288776,0.329846,0.348246,0.363997
2,0.093125,1375.684725,-390.94904,125.07535,-26.158548,62.868484,-2.694161,-1.098944,-13.236238,-28.45022,...,0.283847,0.32017,0.377925,0.396044,0.402317,0.386759,0.399501,0.392347,0.342137,0.34053
3,0.060562,1026.450473,-352.30856,123.55159,-1.788278,21.13347,-10.293497,-17.433174,-17.481981,-9.52021,...,0.383526,0.414682,0.431017,0.410287,0.348064,0.293355,0.265428,0.295578,0.31372,0.331562
4,0.088338,1290.272797,-374.7139,110.32203,-14.796771,32.152317,-9.729928,-10.60081,-13.255384,-21.11938,...,0.271848,0.260946,0.243622,0.242083,0.298031,0.409493,0.383844,0.415713,0.49369,0.503742


Looking at the `X_train_resample` dataset, the features are on very different scales (for example, `spectral_centroid` is in the thousands while `zero_crossing_rate` is below 1). Using `StandardScaler` will bring all of the features onto a common scale.

In [12]:
#apply scaling to the datasets 
ss = StandardScaler()

#fit and transform X_train
X_train_resample_ss = ss.fit_transform(X_train_resample)

#transform X_test
X_test_ss = ss.transform(X_test)
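As a sanity check on what `fit_transform` and `transform` compute, the sketch below reproduces the scaling by hand on toy values. The toy matrices are assumptions; the point is that the test data are scaled with the mean and standard deviation learned from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test matrices with two features on very different scales
X_train_toy = np.array(
    [[0.09, 1300.0], [0.04, 800.0], [0.14, 1800.0], [0.06, 1000.0]]
)
X_test_toy = np.array([[0.07, 1200.0]])

scaler = StandardScaler().fit(X_train_toy)

# Manual z-score using the TRAIN mean and population standard deviation
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # ddof=0, matching StandardScaler
manual_train = (X_train_toy - mean) / std
manual_test = (X_test_toy - mean) / std

print(np.allclose(scaler.transform(X_train_toy), manual_train))
print(np.allclose(scaler.transform(X_test_toy), manual_test))
```

Fitting the scaler on the training data alone avoids leaking information about the test set into the model.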

With the scaled data, we will move to the next step, `baseline modelling`.

### **Baseline Modelling**

---

After completing data preprocessing, the next step is to establish the baseline model. This initial model serves the purpose of identifying which approach holds the most promise for further optimization. By systematically exploring a diverse range of models, we ensure that every potential solution receives consideration, thereby providing a robust groundwork for subsequent model development and refinement efforts.

In [13]:
#instantiate models for baseline modelling
classifiers = {
    'Logistic Regression' : LogisticRegression(),
    'Random Forest' : RandomForestClassifier(),
    'Gradient Boost' : GradientBoostingClassifier(),
    'ADA Boost' : AdaBoostClassifier(),
    'Decision Tree' : DecisionTreeClassifier(),
    'SVC' : SVC(),
    'XG Boost' : XGBClassifier()
}

In [14]:
cv_score_list = []
train_score_list = []
test_score_list = []

#run prediction model 
for class_name, classifier in classifiers.items(): 
    #fit model
    classifier.fit(X_train_resample_ss, y_train_resample)

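    #cross validate the score (5-fold accuracy; each call re-fits clones of the classifier on the data passed in)
    cv_train_score = cross_val_score(classifier, X_train_resample_ss, y_train_resample, cv=5)
    #note: this cross-validates within the test split itself, so it is not the hold-out score of the model fitted above
    cv_test_score = cross_val_score(classifier, X_test_ss, y_test, cv=5)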

    #append to cross validate score list
    cv_score_list.append({
        "Classifer" : class_name,
        'Train score' : cv_train_score.mean(),
        'Test score' : cv_test_score.mean()
    })
    print('cross_val_score done')

    #train scores
    train_pred = classifier.predict(X_train_resample_ss)
    train_accuracy = accuracy_score(y_train_resample, train_pred)
    train_precision = precision_score(y_train_resample, train_pred)
    train_recall = recall_score(y_train_resample,train_pred)
    train_f1 = f1_score(y_train_resample,train_pred)
    train_score_list.append({
        'Classifer' : class_name,
        'Accuracy' : train_accuracy,
        'Precision' : train_precision,
        'Recall' : train_recall,
        'f1_score' : train_f1
    })
    print('train score done')

    #test scores
    test_pred = classifier.predict(X_test_ss)
    test_accuracy = accuracy_score(y_test, test_pred)
    test_precision = precision_score(y_test, test_pred)
    test_recall = recall_score(y_test,test_pred)
    test_f1 = f1_score(y_test,test_pred)
    test_score_list.append({
        'Classifer' : class_name,
        'Accuracy' : test_accuracy,
        'Precision' : test_precision,
        'Recall' : test_recall,
        'f1_score' : test_f1
    })
    print('test score done')
    print(f'{class_name} cycle is completed')

#convert the list to dataframe
cv_result_df = pd.DataFrame(cv_score_list)
train_result_df = pd.DataFrame(train_score_list)
test_result_df = pd.DataFrame(test_score_list)

cross_val_score done
train score done
test score done
Logistic Regression cycle is completed
cross_val_score done
train score done
test score done
Random Forest cycle is completed
cross_val_score done
train score done
test score done
Gradient Boost cycle is completed




cross_val_score done
train score done
test score done
ADA Boost cycle is completed
cross_val_score done
train score done
test score done
Decision Tree cycle is completed
cross_val_score done
train score done
test score done
SVC cycle is completed
cross_val_score done
train score done
test score done
XG Boost cycle is completed
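One possible refinement of the evaluation loop, sketched here on synthetic data, is to wrap the scaler and the classifier in a pipeline so that each cross-validation fold fits the scaler on its own training part, and to score with recall directly. The dataset, the variable names and the choice of `LogisticRegression` are assumptions for illustration. Oversampling is left out of this sketch; `imblearn.pipeline.Pipeline` could be used in a similar way so that ADASYN is applied to the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the audio features (about 86% positives, like 'fake')
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.14, 0.86], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# Scaler and model in one pipeline: each CV fold fits the scaler on its own training part
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_recall = cross_val_score(pipe, X_tr, y_tr, cv=cv, scoring="recall")

# Hold-out check: fit once on the training split, score once on the untouched test split
pipe.fit(X_tr, y_tr)
test_recall = recall_score(y_te, pipe.predict(X_te))

print(f"CV recall: {cv_recall.mean():.3f}, hold-out recall: {test_recall:.3f}")
```

In this setup the cross-validation score estimates generalization from the training data alone, and the test split is used exactly once as a final check.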


#### **Display the result of each models**

---

The intention is to compare the 7 baseline classifiers trained above and select the best ones for further hyperparameter tuning.

In [21]:
#display cross_val_score result 
cv_result_df

Unnamed: 0,Classifer,Train score,Test score
0,Logistic Regression,0.686391,0.859019
1,Random Forest,0.811573,0.855573
2,Gradient Boost,0.674328,0.857754
3,ADA Boost,0.654477,0.855703
4,Decision Tree,0.675511,0.75747
5,SVC,0.774176,0.857099
6,XG Boost,0.744577,0.849422


In [19]:
#display train result
train_result_df

Unnamed: 0,Classifer,Accuracy,Precision,Recall,f1_score
0,Logistic Regression,0.687684,0.69106,0.674249,0.682551
1,Random Forest,1.0,1.0,1.0,1.0
2,Gradient Boost,0.693276,0.703429,0.664003,0.683148
3,ADA Boost,0.664842,0.669636,0.645327,0.657256
4,Decision Tree,1.0,1.0,1.0,1.0
5,SVC,0.835226,0.855967,0.804478,0.829424
6,XG Boost,0.841629,0.863674,0.809788,0.835863


In [20]:
#display test result
test_result_df

Unnamed: 0,Classifer,Accuracy,Precision,Recall,f1_score
0,Logistic Regression,0.674329,0.92685,0.673181,0.779907
1,Random Forest,0.727895,0.852836,0.824885,0.838628
2,Gradient Boost,0.644057,0.901748,0.656234,0.759647
3,ADA Boost,0.631799,0.897849,0.643664,0.7498
4,Decision Tree,0.622334,0.856051,0.672468,0.753235
5,SVC,0.726107,0.899158,0.766412,0.827495
6,XG Boost,0.695921,0.890965,0.735216,0.805632


To select the best-performing models, we evaluate them in terms of the confusion matrix.
The four possible outcomes are:

1. True Positive: Predict that the audio is fake, actual is fake

2. True Negative: Predict that the audio is not fake, actual is not fake

3. False Positive: Predict that the audio is fake, actual is not fake 

4. False Negative: Predict that the audio is not fake, actual is fake

Our primary metric of concern is `recall`, as it directly relates to minimizing `false negatives`. A `false negative` occurs when our model predicts that an audio clip is genuine when it is actually fake. This poses a significant risk to our users, as they may trust the authenticity of the audio when they should not. For instance, in scenarios involving scams, a high `false negative` rate could result in users being deceived into believing that the audio is legitimate, leading to financial losses or other harm.

Because `recall` is the share of fake clips that the model correctly flags as fake, maximizing `recall` directly reduces `false negatives`, which is why we use it to compare the models.
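The relationship between the confusion matrix and recall can be sketched on a handful of toy predictions. The labels below are made up for illustration and use the same encoding as this notebook (1 = fake, 0 = real).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 1 = fake, 0 = real
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]

# For binary labels [0, 1], ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Recall = TP / (TP + FN): share of fake clips that were flagged as fake
manual_recall = tp / (tp + fn)
print(tn, fp, fn, tp, manual_recall)
```

Here two fake clips are missed (false negatives), so recall is 3 / (3 + 2) = 0.6, which matches `recall_score(y_true, y_pred)`.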

Based on the results above, we will pick the top 3 models by recall score.

---

#### **Selected models:**
1. `Random Forest`
2. `Logistic Regression`
3. `XG Boost`

We will proceed to the next notebook for hyperparameter tuning.

---

Next : [03 Hyperparameter Tuning of Baseline Model](03_hyperparametertuning_traditional_model.ipynb)

