# Audio Story Authenticity Classification: A Predictive Model Report Based on Machine Learning

# 1 Author

**Student Name**: Junying Lin  
**Student ID**:  bupt:2022213482 qm:221169564



# 2 Problem formulation

The problem is a binary classification task: given an audio recording of a narrated story, predict whether the story is true or false using only features extracted from the audio. The task is interesting because the label concerns the content of the narrative, while the model only sees acoustic cues such as speech patterns and tone, which may or may not carry information about truthfulness or deception.

# 3 Methodology

The methodology consists of training machine learning models to predict the class label ('True Story' or 'False Story') of an audio file and validating them on held-out data. Performance is measured with accuracy and with per-class precision, recall, and F1-score. In the training task, the models are fitted to audio features extracted from preprocessed audio files; in the validation task, the trained models are evaluated on recordings that were not used for training, to assess how well they generalize to unseen data.

# 4 Implemented ML prediction pipelines

The ML prediction pipelines implemented in this project involve three main stages: transformation, model training, and ensemble prediction. Each stage plays a critical role in the overall process of building and evaluating the machine learning models.

## 4.1 Transformation stage

The transformation stage converts each audio file into a fixed-length feature vector. Each file is loaded at 16 kHz in mono, 13 Mel-frequency cepstral coefficients (MFCCs) are computed per frame, and the coefficients are averaged over all frames. MFCCs are widely used in speech recognition because they summarize the spectral envelope, and hence the timbre, of speech. Averaging over time gives every recording the same input size regardless of its length, at the cost of discarding temporal information such as pauses and changes in pace.

Input: Audio files in WAV format.  
Output: A 13-dimensional vector of time-averaged MFCCs for each audio file.
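The time-averaging step can be illustrated without any audio. The following is a minimal sketch, assuming an MFCC matrix of shape `(13, n_frames)`; the random arrays and the helper name `pool_mfcc` are hypothetical stand-ins and are not part of the notebook.

```python
import numpy as np

def pool_mfcc(mfccs):
    """Average an (n_mfcc, n_frames) matrix over time into an (n_mfcc,) vector."""
    return np.mean(mfccs, axis=1)

rng = np.random.default_rng(0)
short_clip = rng.normal(size=(13, 120))    # stand-in for the MFCCs of a short recording
long_clip = rng.normal(size=(13, 4500))    # stand-in for the MFCCs of a longer recording

short_vec = pool_mfcc(short_clip)
long_vec = pool_mfcc(long_clip)

# Both recordings end up with the same feature dimension
print(short_vec.shape, long_vec.shape)
```

Averaging along the frame axis is equivalent to the `np.mean(mfccs.T, axis=0)` form used in the notebook; it is what lets recordings of different lengths share one model input size.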

In [10]:
import os
import pandas as pd
import librosa
import numpy as np

# Function to preprocess an audio file
def preprocess_audio(file_path):
    print("Preprocessing audio file:", file_path)
    # librosa.load resamples to 16 kHz and downmixes to mono (mono=True by default),
    # so the separate pydub conversion, whose result was never used, is not needed
    audio_data, _ = librosa.load(file_path, sr=16000)
    # MFCC matrix of shape (13, n_frames)
    mfccs = librosa.feature.mfcc(y=audio_data, sr=16000, n_mfcc=13)
    # Average over time to obtain one 13-dimensional vector per file
    mfccs_processed = np.mean(mfccs.T, axis=0)
    return mfccs_processed

# Function to load data
def load_data(audio_dir, csv_file):
    print("Loading data...")
    attributes = pd.read_csv(csv_file)
    audio_files = [os.path.join(audio_dir, f"{idx:05d}.wav") for idx in range(1, 101)]
    data = {
        "audio_features": [],
        "labels": []
    }
    for file in audio_files:
        filename = os.path.basename(file)
        story = attributes[attributes['filename'] == filename]
        story_type = story['Story_type'].values[0]
        print("Processing file:", file)
        data['audio_features'].append(preprocess_audio(file))
        data['labels'].append(1 if story_type == 'True Story' else 0)
    print("Data loaded.")
    return data

# Main function to load and return data
def main():
    audio_dir = 'CBU0521DD_stories'
    csv_file = 'CBU0521DD_stories_attributes.csv'
    data = load_data(audio_dir, csv_file)
    return data

if __name__ == "__main__":
    data = main()

Loading data...
Processing file: CBU0521DD_stories\00001.wav
Preprocessing audio file: CBU0521DD_stories\00001.wav
Processing file: CBU0521DD_stories\00002.wav
Preprocessing audio file: CBU0521DD_stories\00002.wav
Processing file: CBU0521DD_stories\00003.wav
Preprocessing audio file: CBU0521DD_stories\00003.wav
Processing file: CBU0521DD_stories\00004.wav
Preprocessing audio file: CBU0521DD_stories\00004.wav
Processing file: CBU0521DD_stories\00005.wav
Preprocessing audio file: CBU0521DD_stories\00005.wav
Processing file: CBU0521DD_stories\00006.wav
Preprocessing audio file: CBU0521DD_stories\00006.wav
Processing file: CBU0521DD_stories\00007.wav
Preprocessing audio file: CBU0521DD_stories\00007.wav
Processing file: CBU0521DD_stories\00008.wav
Preprocessing audio file: CBU0521DD_stories\00008.wav
Processing file: CBU0521DD_stories\00009.wav
Preprocessing audio file: CBU0521DD_stories\00009.wav
Processing file: CBU0521DD_stories\00010.wav
Preprocessing audio file: CBU0521DD_stories\0001

## 4.2 Model stage

The model stage involves the creation and training of two different machine learning models: a Multi-Layer Perceptron (MLP) and a Random Forest Classifier.  

**MLP Model:**  
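Architecture: The MLP consists of three fully connected layers (input → 128 → 64 → 2), with ReLU activations after the first two layers.  
Input: MFCC feature vectors.  
Output: Two logits, one per class (false story or true story). The predicted class is the one with the larger logit, and a softmax over the logits gives a probability distribution over the two classes.  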
Rationale: MLPs are chosen for their ability to model complex, non-linear relationships in the data. They are suitable for capturing the intricate patterns in speech that may be indicative of deception or truth.  
  
  
**Random Forest Classifier:**  
Architecture: An ensemble of 100 decision trees, each with a maximum depth of 10.  
Input: MFCC feature vectors.  
Output: A class label (true story or false story).  
Rationale: Random Forests are chosen for their robustness to overfitting and their ability to handle high-dimensional data. They provide a good baseline for comparison with other models due to their versatility and ease of use.
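To make the relationship between the MLP's raw outputs and class probabilities concrete, here is a minimal NumPy sketch of a softmax over two-class logits. The logit values and the function name `softmax` are illustrative assumptions, not values produced by the notebook.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps the exponentials stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for two recordings: columns are (false story, true story)
logits = np.array([[2.0, 0.5],
                   [-1.0, 1.0]])

probs = softmax(logits)          # each row sums to 1
labels = probs.argmax(axis=1)    # same result as taking argmax over the logits
print(probs, labels)
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same predicted class as taking the argmax of the probabilities.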

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device set to:", device)

# Define a simple Multi-Layer Perceptron (MLP) model
class MLP(nn.Module):
    def __init__(self, input_dim):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 2)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Function to train the MLP model
def train_mlp_model(features, labels):
    print("Training MLP model...")
    features_array = np.array(features)
    features_tensor = torch.tensor(features_array, dtype=torch.float32).to(device)
    labels_tensor = torch.tensor(labels, dtype=torch.long).to(device)
    
    model = MLP(features_tensor.shape[1]).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    model.train()
    for epoch in range(10):  # 10 full-batch epochs; the loss may not have converged after so few updates
        optimizer.zero_grad()
        outputs = model(features_tensor)
        loss = criterion(outputs, labels_tensor)
        loss.backward()
        optimizer.step()
        print(f'Epoch {epoch+1}, Loss: {loss.item()}')
    print("MLP model training complete.")
    return model

# Function to train the Random Forest model
def train_random_forest(audio_features, labels):
    print("Training Random Forest model...")
    clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    clf.fit(audio_features, labels)
    print("Random Forest model training complete.")
    return clf

# Function to split data into training and testing sets, and train both models
def train_models(audio_features, labels):
    X_train, X_test, y_train, y_test = train_test_split(audio_features, labels, test_size=0.2, random_state=42)
    mlp_model = train_mlp_model(X_train, y_train)
    rf_model = train_random_forest(X_train, y_train)
    return mlp_model, rf_model, X_test, y_test

if __name__ == "__main__":
    # Assuming 'data' is the data loaded from the first cell
    mlp_model, rf_model, X_test, y_test = train_models(data['audio_features'], np.array(data['labels']))

Device set to: cpu
Training MLP model...
Epoch 1, Loss: 4.426042079925537
Epoch 2, Loss: 1.0982061624526978
Epoch 3, Loss: 1.663364052772522
Epoch 4, Loss: 0.8481653332710266
Epoch 5, Loss: 1.2898378372192383
Epoch 6, Loss: 0.9568106532096863
Epoch 7, Loss: 0.8622147440910339
Epoch 8, Loss: 1.0425703525543213
Epoch 9, Loss: 0.7478941082954407
Epoch 10, Loss: 0.8406487703323364
MLP model training complete.
Training Random Forest model...
Random Forest model training complete.


## 4.3 Ensemble stage

The ensemble stage combines the predictions of the MLP and the Random Forest Classifier to produce a final prediction.

**Voting System:**  
If the two models agree on a prediction, that prediction is chosen.  
If they disagree, the label is chosen at random between the two classes with equal probability.

Rationale:  
Ensemble methods aim to improve performance by combining the strengths of individual models. With only two voters there is no majority on a disagreement, so a fair random choice breaks the tie without biasing towards either class. Because of this randomness, results can vary from run to run unless the random seed is fixed.
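The voting rule can be sketched in isolation as follows. This is a minimal illustration under stated assumptions: the prediction arrays are made up, and the helper name `vote` and the seeded generator are hypothetical choices, not taken from the notebook.

```python
import numpy as np

def vote(preds_a, preds_b, rng):
    """Keep the shared label where two binary classifiers agree;
    otherwise draw an independent fair coin flip for each disagreeing sample."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    coin_flips = rng.integers(0, 2, size=preds_a.shape)  # one draw per sample
    return np.where(preds_a == preds_b, preds_a, coin_flips)

rng = np.random.default_rng(42)          # fixed seed so the tie-breaks are repeatable
mlp_preds = np.array([0, 1, 1, 0, 1])    # hypothetical MLP predictions
rf_preds = np.array([0, 1, 0, 1, 1])     # hypothetical Random Forest predictions

final = vote(mlp_preds, rf_preds, rng)
print(final)
```

With two binary voters and a fair coin on disagreements, the expected accuracy of this rule equals the average of the two models' accuracies, because on every disagreement exactly one of the two predictions is correct. The rule therefore cannot be expected to outperform the better of the two models.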

In [12]:
from sklearn.metrics import accuracy_score, classification_report

# Function for ensemble prediction using both models
def ensemble_predict(mlp_model, rf_model, audio_features):
    # Convert audio features to a tensor
    audio_features_tensor = torch.tensor(np.array(audio_features), dtype=torch.float32).to(device)
    
    # Use the MLP model for prediction
    mlp_model.eval()
    with torch.no_grad():
        mlp_outputs = mlp_model(audio_features_tensor)
    mlp_predictions = mlp_outputs.argmax(dim=1).cpu().numpy()
    
    # Use the Random Forest model for prediction
    rf_predictions = rf_model.predict(audio_features)
    
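    # Voting system: if both models agree, use that prediction; otherwise draw an independent
    # random label for each sample (a single scalar draw would be shared by all disagreements)
    random_choices = np.random.choice([0, 1], size=mlp_predictions.shape, p=[0.5, 0.5])
    ensemble_predictions = np.where(mlp_predictions == rf_predictions, mlp_predictions, random_choices)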
    
    return ensemble_predictions

# Function to evaluate the ensemble model and print accuracy and classification report
def evaluate_ensemble_model(mlp_model, rf_model, X_test, y_test):
    ensemble_predictions = ensemble_predict(mlp_model, rf_model, X_test)
    ensemble_accuracy = accuracy_score(y_test, ensemble_predictions)
    print("Ensemble Accuracy:", ensemble_accuracy)
    # Add the zero_division parameter to control the behavior when precision is ill-defined
    print("Classification Report:\n", classification_report(y_test, ensemble_predictions, zero_division=0))

if __name__ == "__main__":
    # Assuming 'data' is the data loaded from the first cell
    # Assuming 'mlp_model' and 'rf_model' are the models trained in the second cell
    evaluate_ensemble_model(mlp_model, rf_model, X_test, y_test)

Ensemble Accuracy: 0.65
Classification Report:
               precision    recall  f1-score   support

           0       0.58      0.78      0.67         9
           1       0.75      0.55      0.63        11

    accuracy                           0.65        20
   macro avg       0.67      0.66      0.65        20
weighted avg       0.68      0.65      0.65        20



# 5 Dataset

The dataset is drawn from the MLEnd Deception Dataset, which consists of audio recordings labeled as either true stories or false stories. The notebook loads 100 recordings (`00001.wav` to `00100.wav`) and looks up each label in the `Story_type` column of `CBU0521DD_stories_attributes.csv`. The data is then split at random into 80% for training and 20% for testing, which gives 80 training and 20 test recordings. Holding out the test recordings provides an estimate of how well the models generalize. A random split alone does not guarantee that the samples are independent and identically distributed (IID), for example if recordings from the same speaker fall into both sets.
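The 80/20 split can be sketched with a seeded random permutation of the sample indices. This is an illustrative assumption rather than the notebook's code: the helper name `split_indices` is hypothetical, and the sketch does not reproduce the exact shuffle of scikit-learn's `train_test_split`.

```python
import numpy as np

def split_indices(n_samples, test_fraction, seed):
    """Shuffle the indices 0..n_samples-1 and cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

train_idx, test_idx = split_indices(100, 0.2, seed=42)

# 80 training indices and 20 test indices with no overlap
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, so that differences between runs come from the models and not from a different choice of test recordings.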

Exploration of the dataset involves examining the distribution of the classes and visualizing the MFCC features to understand their variability across different audio files. This step is essential for identifying any potential biases or imbalances in the dataset.
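A class-balance check of this kind can be sketched in a few lines. The example below uses the test-set class counts from the classification report (9 false stories and 11 true stories); the helper name `class_balance` is a hypothetical choice.

```python
import numpy as np

def class_balance(labels):
    """Count false-story (0) and true-story (1) labels and the share of true stories."""
    counts = np.bincount(np.asarray(labels), minlength=2)
    return {
        "false_story": int(counts[0]),
        "true_story": int(counts[1]),
        "true_fraction": float(counts[1] / counts.sum()),
    }

# Test-set labels consistent with the supports in the classification report
test_labels = [0] * 9 + [1] * 11
balance = class_balance(test_labels)
print(balance)
```

The same function can be applied to the full label array (`data['labels']` in the notebook) to check the balance of the whole dataset.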

# 6 Experiments and results

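The experiments involve training the MLP and Random Forest models on the training set and evaluating the ensemble on the test set. The ensemble reaches an accuracy of 0.65 on the 20 test recordings. For the false story class (label 0), precision is 0.58 and recall is 0.78; for the true story class (label 1), precision is 0.75 and recall is 0.55. The ensemble therefore tends to label true stories as false, which lowers precision for the false story class and recall for the true story class. With only 20 test recordings and random tie-breaking, these figures are noisy and can change from run to run.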
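The per-class counts behind the report can be recovered from its rounded figures. The sketch below assumes the confusion matrix `[[7, 2], [5, 6]]` (rows are true classes, columns are predicted classes). This matrix is inferred from the reported supports, precision, and recall; the notebook itself does not print it.

```python
import numpy as np

# Rows: true class, columns: predicted class (0 = false story, 1 = true story).
# Assumed counts, inferred from the rounded classification report.
cm = np.array([[7, 2],
               [5, 6]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # correct / all predicted as that class
recall = true_positives / cm.sum(axis=1)      # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), accuracy)
```

Under this assumption, 5 of the 11 true stories are predicted as false, compared with 2 of the 9 false stories predicted as true.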

# 7 Conclusions

The ensemble reaches an accuracy of 0.65 on the test set, which leaves significant room for improvement. The classification report points to two weaknesses in particular: low precision for the false story class (0.58) and low recall for the true story class (0.55). Suggestions for improvement include exploring richer feature extraction techniques than time-averaged MFCCs, experimenting with different model architectures and longer training, and replacing random tie-breaking with a more advanced ensemble method.

# 8 References

The implementation of the models and the methodology followed in this project were informed by various resources, including textbooks on machine learning, online tutorials on audio processing with Python, and documentation for the libraries used (such as PyTorch, scikit-learn, and librosa). Specific references include the following:

librosa: Audio and Music Analysis in Python  
PyTorch: An Imperative Style, High-Performance Deep Learning Library  
Scikit-learn: Machine Learning in Python