This notebook trains a machine learning capable of detecting DeepFake (spoofed) voice samples and
distinguishing them from real human speech.

Based on LA – (Logical Access) type audio set given in ASVspoof 2019 Dataset.
Logical Access attacks occur when fake speech is directly fed into the system digitally, without using a microphone or speaker. i.e. Clean, artificial signals, like No room noise and No microphone distortion.

In [1]:
import pandas as pd
import numpy as np
import os
import librosa

## Dataset selection & protocol analysis.

In [2]:
# organizing audio samples into training sets.
# creating a dataframe named training_protocols.

training_protocols_path = "/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.train.trn.txt"

training_protocols = pd.read_csv(training_protocols_path, sep=" ", header=None)
training_protocols.columns = ["speaker", "file_id", "attack", "channel", "label"]
training_protocols["split"] = "train"

# Adding paths of Audio files to dataframe according to their file_id s.

AUDIO_BASE = "/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVspoof2019_LA_train/flac/"

training_protocols["audio_path"] = training_protocols["file_id"].apply(
    lambda x: f"{AUDIO_BASE}{x}.flac"
)

In [3]:
# Verifying audio paths exist.
training_protocols["file_exists"] = training_protocols["audio_path"].apply(
    lambda x: os.path.exists(x)
)
training_protocols["file_exists"].value_counts()

file_exists
True    25380
Name: count, dtype: int64

In [4]:
training_protocols.describe(include="all")

Unnamed: 0,speaker,file_id,attack,channel,label,split,audio_path,file_exists
count,25380,25380,25380,25380,25380,25380,25380,25380
unique,20,25380,1,7,2,1,25380,1
top,LA_0082,LA_T_9982036,-,A01,spoof,train,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,True
freq,1308,1,25380,3800,22800,25380,1,25380


In [5]:
training_protocols["label"].value_counts()

label
spoof       22800
bonafide     2580
Name: count, dtype: int64

In [6]:
training_protocols[["audio_path", "label", "split"]].head()

Unnamed: 0,audio_path,label,split
0,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train
1,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train
2,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train
3,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train
4,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train


In [7]:
# organizing audio samples into validation sets.
# creating a dataframe named development_protocols.

development_protocols_path = "/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.dev.trl.txt"

development_protocols = pd.read_csv(development_protocols_path, sep=" ", header=None)
development_protocols.columns = ["speaker", "file_id", "attack", "channel", "label"]
development_protocols["split"] = "development"

# Adding paths of Audio files to dataframe according to their file_id s.

AUDIO_BASE = "/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVspoof2019_LA_dev/flac/"

development_protocols["audio_path"] = development_protocols["file_id"].apply(
    lambda x: f"{AUDIO_BASE}{x}.flac"
)

In [8]:
# Verifying audio paths exist.
development_protocols["file_exists"] = development_protocols["audio_path"].apply(
    lambda x: os.path.exists(x)
)
development_protocols["file_exists"].value_counts()

file_exists
True    24844
Name: count, dtype: int64

In [9]:
development_protocols.describe(include="all")

Unnamed: 0,speaker,file_id,attack,channel,label,split,audio_path,file_exists
count,24844,24844,24844,24844,24844,24844,24844,24844
unique,20,24844,1,7,2,1,24844,1
top,LA_0069,LA_D_9967770,-,A01,spoof,development,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,True
freq,2638,1,24844,3716,22296,24844,1,24844


In [10]:
development_protocols["label"].value_counts()

label
spoof       22296
bonafide     2548
Name: count, dtype: int64

In [11]:
development_protocols[["audio_path", "label", "split"]].head()

Unnamed: 0,audio_path,label,split
0,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development
1,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development
2,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development
3,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development
4,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development


In [12]:
# organizing audio samples into testing sets.
# creating a dataframe named testing_protocols.

testing_protocols_path = "/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.eval.trl.txt"

testing_protocols = pd.read_csv(testing_protocols_path, sep=" ", header=None)
testing_protocols.columns = ["speaker", "file_id", "attack", "channel", "label"]
testing_protocols["split"] = "test"

# Adding paths of Audio files to dataframe according to their file_id s.

AUDIO_BASE = "/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVspoof2019_LA_eval/flac/"

testing_protocols["audio_path"] = testing_protocols["file_id"].apply(
    lambda x: f"{AUDIO_BASE}{x}.flac"
)

In [13]:
# Verifying audio paths exist.
testing_protocols["file_exists"] = testing_protocols["audio_path"].apply(
    lambda x: os.path.exists(x)
)
testing_protocols["file_exists"].value_counts()

file_exists
True    71237
Name: count, dtype: int64

In [14]:
testing_protocols.describe(include="all")

Unnamed: 0,speaker,file_id,attack,channel,label,split,audio_path,file_exists
count,71237,71237,71237,71237,71237,71237,71237,71237
unique,67,71237,1,14,2,1,71237,1
top,LA_0039,LA_E_4689563,-,-,spoof,test,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,True
freq,1784,1,71237,7355,63882,71237,1,71237


In [15]:
testing_protocols["label"].value_counts()

label
spoof       63882
bonafide     7355
Name: count, dtype: int64

In [16]:
testing_protocols[["audio_path", "label", "split"]].head()

Unnamed: 0,audio_path,label,split
0,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test
1,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test
2,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test
3,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test
4,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test


In [17]:
label_map = {"bonafide": 0, "spoof": 1}
training_protocols["label_enc"] = training_protocols["label"].map(label_map)
development_protocols["label_enc"] = development_protocols["label"].map(label_map)
testing_protocols["label_enc"] = testing_protocols["label"].map(label_map)

In [18]:
training_protocols[["audio_path", "label", "split", "label_enc"]].head()

Unnamed: 0,audio_path,label,split,label_enc
0,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train,0
1,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train,0
2,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train,0
3,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train,0
4,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,train,0


In [19]:
development_protocols[["audio_path", "label", "split", "label_enc"]].head()

Unnamed: 0,audio_path,label,split,label_enc
0,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development,0
1,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development,0
2,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development,0
3,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development,0
4,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,bonafide,development,0


In [20]:
testing_protocols[["audio_path", "label", "split", "label_enc"]].head()

Unnamed: 0,audio_path,label,split,label_enc
0,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test,1
1,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test,1
2,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test,1
3,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test,1
4,/kaggle/input/asvpoof-2019-dataset/LA/LA/ASVsp...,spoof,test,1


## Pipeline for Feature Extraction

In [21]:
# 1. Audio preprocessing pipeline.

def audio_preprocessing_pipeline(audio_path, target_sr=22050, target_duration=4.0):
                                     
    target_length = int(target_sr * target_duration)
    
    # Loading & resampling.
    audio, _ = librosa.load(audio_path, sr=target_sr)
    
    # Normalizing.
    audio = librosa.util.normalize(audio)
    
    # Trimming silence.
    audio, _ = librosa.effects.trim(audio)
    
    # Padding or cropping(if any).
    if len(audio) < target_length:
        audio = np.pad(audio, (0, target_length - len(audio)))
    else:
        audio = audio[:target_length]
    
    return audio

# 2. Feature extraction pipeline to generate MFCC feature vectors.

def extract_mfcc_features(
    audio,
    sr=22050,
    n_mfcc=13
):    
    """
    Function Parameters:
        - audio:(output of preprocessing pipeline)
        - sr: sample rate (default 16kHz)
        - n_mfcc: number of MFCC coefficients
    """
    # Extracting MFCCs.
    mfccs = librosa.feature.mfcc(y=audio, n_mfcc=n_mfcc, sr=sr)
    
    # Extracting first / second MFCCs drivatives.
    delta_mfccs = librosa.feature.delta(mfccs)
    delta2_mfccs = librosa.feature.delta(mfccs, order=2)
    #print(mfccs.shape)
    #print(delta_mfccs.shape)
    #print(delta2_mfccs.shape)
    
    mfccs_features_combined = np.vstack([mfccs, delta_mfccs, delta2_mfccs])
    #print(mfccs_features_combined.shape)
    
    # Right now mfccs_features_combined.shape is (39, T), that's a 2d array having time as also a variable,
    # but classical ML models wants fixed lenght vectors or we as 1d arrays example: (78,).
    # For this we will take mean and standard deviation of mfccs_features_combined over duration of audio file (4 secs),
    # and pool them together into one vector of shape (78,) that we can feed into our model.
    
    mfcc_mean = np.mean(mfccs_features_combined, axis=1)
    mfcc_std = np.std(mfccs_features_combined, axis=1)
    
    # Hence MFCC feature vector is created.
    feature_vector = np.concatenate([mfcc_mean, mfcc_std])
    #print(feature_vector.shape)
    
    return feature_vector

## Pipeline for Feature Creation

In [22]:
def full_audio_processing(
    audio_path,
    target_sr=22050,
    target_duration=4.0,
    n_mfcc=13
):
    # Audio preprocessing.
    audio = audio_preprocessing_pipeline(
        audio_path=audio_path,
        target_sr=target_sr,
        target_duration=target_duration
    )

    # MFCC + Delta + Delta-Delta features extraction.
    feature_vector = extract_mfcc_features(
        audio=audio,
        sr=target_sr,
        n_mfcc=n_mfcc
    )

    return feature_vector

## Arranging Target and Predictors

In [23]:
def build_predictors_and_target_arrays(df):
    X = []
    y = []

    for _, row in df.iterrows():
        audio_path = row["audio_path"]
        label = row["label_enc"]

        features = full_audio_processing(audio_path)

        X.append(features)
        y.append(label)

    return np.array(X), np.array(y,)

In [None]:
x_train, y_train = build_predictors_and_target_arrays(training_protocols)
x_dev, y_dev     = build_predictors_and_target_arrays(development_protocols)
x_test, y_test   = build_predictors_and_target_arrays(testing_protocols)

## Scaling Data

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_dev_scaled  = scaler.transform(x_dev)
x_test_scaled = scaler.transform(x_test)

In [None]:
print(x_train.shape, y_train.shape)
print(x_dev.shape, y_dev.shape)
print(x_test.shape, y_test.shape)

## Classifier Using Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(max_iter=1000, class_weight="balanced",)

lr_model.fit(x_train_scaled, y_train)
# Model has been trained.

In [None]:
# evaluating LR model on Development Dataset.
lr_y_dev_predictions = lr_model.predict(x_dev_scaled)

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

print("Dev Accuracy Score:")
print(accuracy_score(y_dev, lr_y_dev_predictions))

print("Dev Classification Report:")
print(classification_report(y_dev, lr_y_dev_predictions))

print("Dev Confusion Matrix:")
print(confusion_matrix(y_dev, lr_y_dev_predictions))

In [None]:
lr_y_dev_probability = lr_model.predict_proba(x_dev_scaled)[:, 1]

from sklearn.metrics import roc_curve
import numpy as np

lr_fpr, lr_tpr, lr_thresholds = roc_curve(y_dev, lr_y_dev_probability)
lr_fnr = 1 - lr_tpr

lr_eer_idx = np.nanargmin(np.abs(lr_fnr - lr_fpr))
lr_eer = lr_fpr[lr_eer_idx]
lr_eer_threshold = lr_thresholds[lr_eer_idx]

print("Dev EER:", lr_eer)
print("EER Threshold:", lr_eer_threshold)

## Classifier Using Support Vector Machine

In [None]:
from sklearn.svm import SVC
svm_model = SVC(kernel="rbf", C=1.0, gamma="scale",class_weight="balanced",  probability=True)

svm_model.fit(x_train_scaled, y_train)
# Model has been trained.

In [None]:
# evaluating SVM model on Development Dataset.
svm_y_dev_predictions = svm_model.predict(x_dev_scaled)

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

print("Dev Accuracy Score:")
print(accuracy_score(y_dev, svm_y_dev_predictions))

print("Dev Classification Report:")
print(classification_report(y_dev, svm_y_dev_predictions))

print("Dev Confusion Matrix:")
print(confusion_matrix(y_dev, svm_y_dev_predictions))

In [None]:
svm_scores = svm_model.decision_function(x_dev_scaled)

from sklearn.metrics import roc_curve
import numpy as np

svm_fpr, svm_tpr, svm_thresholds = roc_curve( y_dev, svm_scores, pos_label=1)

svm_fnr = 1 - svm_tpr

svm_eer_idx = np.nanargmin(np.abs(svm_fnr - svm_fpr))
svm_eer = svm_fpr[svm_eer_idx]
svm_eer_threshold = svm_thresholds[svm_eer_idx]

print("Dev EER:", svm_eer)
print("EER Threshold:", svm_eer_threshold)

## Classifier Using Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=300, max_depth=None, min_samples_split=2, min_samples_leaf=1, class_weight="balanced",random_state=42)

rf_model.fit(x_train_scaled, y_train)
# Model has been trained.

In [None]:
# evaluating RF model on Development Dataset.
rf_y_dev_predictions = rf_model.predict(x_dev_scaled)

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("RF Dev Accuracy:")
print(accuracy_score(y_dev, rf_y_dev_predictions))

print("RF Dev Classification Report:")
print(classification_report(y_dev, rf_y_dev_predictions))

print("RF Dev Confusion Matrix:")
print(confusion_matrix(y_dev, rf_y_dev_predictions))

In [None]:
rf_y_dev_probability = rf_model.predict_proba(x_dev_scaled)[:, 1]

from sklearn.metrics import roc_curve
import numpy as np

rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_dev, rf_y_dev_probability)
rf_fnr = 1 - rf_tpr

rf_eer_idx = np.nanargmin(np.abs(rf_fnr - rf_fpr))
rf_eer = rf_fpr[rf_eer_idx]
rf_eer_threshold = rf_thresholds[rf_eer_idx]

print("RF Dev EER:", rf_eer)
print("RF EER Threshold:", rf_eer_threshold)

## Comparing different classifier based on ROC Curves on Development Set

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(7, 7))

plt.plot(lr_fpr, lr_tpr, label=f"Logistic Regression (EER={lr_eer:.4f})")
plt.plot(svm_fpr, svm_tpr, label=f"SVM (EER={svm_eer:.4f})")
plt.plot(rf_fpr, rf_tpr, label=f"Random Forest (EER={rf_eer:.4f})")

# Random baseline
plt.plot([0, 1], [0, 1], 'k--', label="Random")

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves on Development Set")
plt.legend()
plt.grid(True)
plt.show()

## FInal Results and Model Finalising.

Performance was evaluated using:
* Accuracy
* Precision, Recall, F1-score
* Confusion Matrix
* Receiver Operating Characteristic (ROC) curve
* Equal Error Rate (EER)

Note: “Due to class imbalance, accuracy alone is insufficient; therefore, EER and class-wise recall are emphasized.”

| Model               | Accuracy    | EER        | Strengths              | Weaknesses                  |
| ------------------- | ----------- | ---------- | ---------------------- | --------------------------- |
| Logistic Regression | Medium      | High       | Simple, fast           | Poor non-linearity handling |
| Random Forest       | High        | Medium     | Strong spoof detection | Bias toward majority class  |
| **SVM**             | **Highest** | **Lowest** | Best generalization    | Higher computational cost   |


Final Model Selection is Based on:
* Highest accuracy
* Best ROC curve
* Lowest Equal Error Rate

****Support Vector Machine (SVM) was selected** as the best classical machine learning model for spoof detection in this study.**

## Testing Test data on SVM as baseline model

In [None]:
# evaluating SVM model on Testing Dataset.
svm_y_test_predictions = svm_model.predict(x_test_scaled)

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

print("Test Accuracy Score:")
print(accuracy_score(y_test, svm_y_test_predictions))

print("Test Classification Report:")
print(classification_report(y_test, svm_y_test_predictions))

print("Test Confusion Matrix:")
print(confusion_matrix(y_test, svm_y_test_predictions))

In [None]:
svm_test_scores = svm_model.decision_function(x_test_scaled)

from sklearn.metrics import roc_curve
import numpy as np

svm_test_fpr, svm_test_tpr, svm_test_thresholds = roc_curve( y_test, svm_test_scores, pos_label=1)

svm_test_fnr = 1 - svm_test_tpr

svm_test_eer_idx = np.nanargmin(np.abs(svm_test_fnr - svm_test_fpr))
svm_test_eer = svm_test_fpr[svm_test_eer_idx]
svm_test_eer_threshold = svm_test_thresholds[svm_test_eer_idx]

print("Test EER:", svm_test_eer)
print("EER Threshold:", svm_test_eer_threshold)

In [None]:
# Ploting the ROC curve
plt.figure(figsize=(7,7))
plt.plot(svm_test_fpr, svm_test_tpr, label='SVM ROC Curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')

# Marking the EER point
plt.scatter(svm_test_eer, 1 - svm_test_eer, color='red', s=100, label=f'EER = {svm_test_eer:.4f}')
plt.text(svm_test_eer + 0.02, 1 - svm_test_eer - 0.02, f"EER\n({svm_test_eer:.4f})", color='red')

plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('SVM ROC Curve on Test set with EER')
plt.legend()
plt.grid(True)
plt.show()

| Model       | Accuracy | Precision | Recall | F1-score | ROC-AUC | EER   |
| ----------- | -------- | --------- | ------ | -------- | ------- | ----- |
| SVM (Clean) | 0.88     | 0.98      | 0.89   | 0.93     | 0.97    | 0.133 |

After finalizing the SVM model and freezing all hyperparameters, feature extraction, and scaling strategies, the model was evaluated once on the held-out test set to ensure an unbiased performance estimate.

On the test set, the SVM achieved an accuracy of 88.02%, with a strong F1-score of 0.93 for spoofed samples, indicating effective detection of attack audio. The Equal Error Rate increased to 13.35%, compared to the development set, which is expected due to the unseen nature of test data and the absence of any tuning on test samples.

The ROC curve further confirms stable generalization, with the curve remaining well above the random classifier diagonal, demonstrating good separability between bonafide and spoofed classes.