# Authenticity Analysis of Audio Data Based on a CNN Model

# 1 Author

**Student Name**:**Xiongjie Tang** 

**Student ID**:**221169210**

# 2 Problem formulation

With the rapid development of information technology, audio data plays an important role in application scenarios such as voice assistants, customer service, and online education. The information carried by audio is not limited to the language content; it also includes non-language features such as the speaker's tone, speaking speed, and pauses. These non-language features are particularly important in tasks such as deception detection, because deception is often accompanied by subtle changes in speech. How to extract useful features from audio data to identify true and false information has therefore become a challenging research topic. In this project, we aim to develop a deep learning model that combines audio features with the language category of each recording to distinguish true from deceptive audio content.

# 3 Methodology

For training and validation, we divide the prepared dataset into a training set (60%), a validation set (30%) and a test set (10%), so that generalization can be monitored during training and performance can be evaluated in a final test stage. The model uses the Adam optimizer with a learning rate of 1e-4 and the binary cross-entropy loss (binary_crossentropy), which is suitable for binary classification tasks. Accuracy and AUC are added as evaluation metrics to monitor the performance of the model. To control the training process, EarlyStopping monitors the validation loss: if it does not improve for three consecutive epochs, training stops early and the best weights are restored. A learning rate scheduler (ReduceLROnPlateau) is also used: when the validation loss has not decreased for two epochs, the learning rate is halved to allow finer optimization. To deal with class imbalance, class weights are calculated and applied so that the model pays more attention to the minority class during training.

A custom TQDM progress bar callback class is also defined. It displays the training progress in real time and calculates and prints the F1 score, precision and recall on the validation set after each epoch, which helps to follow the training dynamics and adjust the training strategy as needed. In the training run shown in Section 6.2, only the EarlyStopping and ReduceLROnPlateau callbacks are passed to fit().

After training, each of the three independently trained models makes predictions on the test set, and the predictions are combined by simple averaging to obtain the final prediction results. The averaged predictions are then converted to binary labels, and the overall accuracy, F1 score, precision and recall are calculated to evaluate the performance of the ensemble. The aim is to construct an audio classification model with robust performance that can effectively distinguish true stories from deceptive ones.

# 4 Implemented ML prediction pipelines

In this study, we build and implement a complete machine learning prediction pipeline designed to classify stories as true or deceptive by analysing audio data. The pipeline consists of several connected stages, each responsible for a specific task: transforming the data, training the models, and combining their predictions. The input of the entire pipeline is the original audio files and their corresponding label information, and the output is the classification prediction on the test data. Data is passed between stages as NumPy arrays to keep the information consistent.

First, we import the necessary libraries and packages.

In [20]:
from keras.layers import Input, Embedding, Dense, Flatten, Concatenate
from keras.models import Model

## 4.1 Transformation stage

The transformation stage is the first step in the prediction pipeline. It converts the raw audio data and label information into a format suitable for model input. In this stage, we first load the label data from the CSV file and preprocess the ‘Story_type’ column by converting it to lowercase and replacing spaces with underscores to ensure label consistency. Then, we use LabelEncoder to convert the language categories in the ‘Language’ column into numeric codes for the subsequent model processing.

Subsequently, we use the Librosa library to extract features from the audio files by calculating the Mel spectrogram and converting it to a decibel scale. To ensure that all audio samples are consistent in the time dimension, we pad or crop each Mel spectrogram to 256 frames. A channel dimension is then added to match the input shape expected by a CNN. After these steps, the audio features, encoded language labels and target labels are stored as NumPy arrays, and the audio features are standardised to remove differences in scale.

After the data preparation, we divide the data into training, validation and test sets with proportions of 60%, 30% and 10%, respectively. To cope with class imbalance, we calculate class weights and apply them during training so that the model treats the classes fairly. The outputs of this stage are a standardised array of audio features, an encoded array of language labels, and the corresponding array of target labels, which are used for model training, validation and testing.
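
The 60/30/10 proportions follow from two chained hold-out splits (40% held out first, then a quarter of that part used for testing). The arithmetic can be sketched as below. The function name is hypothetical, and rounding the held-out part up is an assumption about how the splitting utility rounds; with 100 samples no rounding is needed.

```python
import math


def two_stage_split_sizes(n_samples, first_test_size, second_test_size):
    """Return (train, validation, test) sizes for two chained hold-out splits."""
    # Assumption: the held-out part is rounded up and the rest is kept.
    n_temp = math.ceil(first_test_size * n_samples)
    n_train = n_samples - n_temp
    # The second split divides only the held-out part.
    n_test = math.ceil(second_test_size * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test


sizes = two_stage_split_sizes(100, first_test_size=0.4, second_test_size=0.25)
print(sizes)  # (60, 30, 10)
```

With roughly 100 samples, the test set therefore holds only about 10 recordings, which is worth keeping in mind when reading the test metrics.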

## 4.2 Model stage

In the modelling phase, we designed and trained multiple CNN models to exploit the contribution of audio features and language embedding information to story type classification. Each model has two inputs. The audio input layer accepts a Mel spectrogram of shape (128, 256, 1), which is converted into a one-dimensional vector by a Flatten layer. The language input layer accepts the numerically encoded language category, which is converted into a low-dimensional embedding vector by the embedding layer and then flattened.

The audio feature vector and the language embedding vector are then concatenated to form a composite feature vector. This vector is passed through two fully connected layers with 64 and 32 neurons respectively, both with ReLU activation functions, to capture higher-order features. Finally, the binary classification result is output by a single neuron with a Sigmoid activation function. To improve generalisation and robustness, we trained three independent models. In the code shown in Section 6.2, all three use the same configuration (batch size 8, at most 5 epochs), so they differ only in their random initialisation and in the order in which batches are drawn.

During training, we employ the Adam optimiser with the learning rate set to 1e-4, use binary cross-entropy as the loss function, and monitor accuracy and AUC as evaluation metrics. We use early stopping on the validation loss: if the validation loss does not decrease for three consecutive epochs, training stops and the best weights are restored. A learning rate scheduler halves the learning rate when the validation loss has not decreased for two epochs, which lets the model take smaller steps once progress stalls. Class weights are computed from the class frequencies in the training set and applied during training, so that errors on the minority class count more than errors on the majority class.
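
The class weighting described above can be illustrated with a small pure-Python sketch of the "balanced" heuristic, in which each class receives the weight n_samples / (n_classes * class_count). The function name and the example labels are hypothetical; the notebook itself obtains the weights from scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Hypothetical imbalanced training labels: six true stories, two deceptive.
example_labels = [1, 1, 1, 1, 1, 1, 0, 0]
weights = balanced_class_weights(example_labels)
print(weights)  # the minority class 0 gets the larger weight
```

A perfectly balanced label set yields a weight of 1.0 for every class, so the weighting only changes the loss when the classes are uneven.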

## 4.3 Ensemble stage

In the ensemble stage, we aim to improve the overall classification performance by combining the predictions of multiple models. We adopt simple averaging over three independently trained models. Specifically, each of the three models first makes predictions on the test set, giving three independent sets of predicted probabilities. These three values are then averaged per sample to obtain the final ensemble prediction.

To convert the averaged predictions into binary labels, we set a threshold of 0.5: samples exceeding this threshold are classified as positive (1), all others as negative (0). Finally, we compute the overall accuracy, F1 score, precision, and recall to evaluate the performance of the ensemble. Accuracy measures the proportion of samples that are correctly predicted. Precision measures the proportion of samples predicted as positive that are actually positive. Recall measures the proportion of actual positive samples that the model recognises. The F1 score combines precision and recall and is suitable for evaluating classification performance on imbalanced datasets.
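
The averaging, thresholding and metric definitions above can be sketched in plain Python. All probabilities and labels below are hypothetical values chosen for illustration, and the helper names are not part of the notebook.

```python
def average_predictions(*model_probs):
    """Average per-sample probabilities across models (simple averaging)."""
    return [sum(sample) / len(sample) for sample in zip(*model_probs)]


def to_binary(probs, threshold=0.5):
    """Label a sample positive only when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical outputs of three models for four test samples.
model_a = [0.9, 0.4, 0.7, 0.2]
model_b = [0.8, 0.6, 0.4, 0.1]
model_c = [0.7, 0.7, 0.3, 0.3]
y_true = [1, 0, 1, 0]

ensemble = average_predictions(model_a, model_b, model_c)
y_pred = to_binary(ensemble)
print(y_pred)
print(binary_metrics(y_true, y_pred))
```

Because the comparison is strict, an averaged probability of exactly 0.5 is labelled negative, which matches the `> 0.5` test used in the notebook code.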

With this ensemble approach, we combine several models so that the errors of a single model, in particular the variance caused by random initialisation, can partly average out, which may improve the accuracy and robustness of the predictions. In the future, we can explore more complex ensemble methods, such as weighted averaging or stacking, to further optimise the prediction performance.
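
As an illustration of the weighted averaging mentioned above, the sketch below weights each model's probability before averaging. The weights are hypothetical (for instance, they could be derived from validation scores); this is not something the notebook implements.

```python
def weighted_average(model_probs, weights):
    """Combine per-model probabilities with one non-negative weight per model."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, sample)) / total
        for sample in zip(*model_probs)
    ]


# Hypothetical probabilities from three models for two samples.
probs = [
    [0.8, 0.2],  # model 1
    [0.4, 0.6],  # model 2
    [0.4, 0.6],  # model 3
]

# Trusting model 1 twice as much as the others changes the outcome.
weighted = weighted_average(probs, weights=[2, 1, 1])
# Equal weights reproduce the simple average.
simple = weighted_average(probs, weights=[1, 1, 1])
print(weighted, simple)
```

Simple averaging is the special case in which all weights are equal, so switching between the two strategies only requires changing the weight vector.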

In [21]:
def create_cnn_with_language_embedding(audio_input_shape, num_languages, embedding_dim=8):
    # Audio input and processing
    audio_input = Input(shape=audio_input_shape, name='audio_input')
    audio_features = Flatten()(audio_input)

    # Language category input and embedding layer
    language_input = Input(shape=(1,), name='language_input')
    language_embedding = Embedding(input_dim=num_languages, output_dim=embedding_dim, name='language_embedding')(language_input)
    language_features = Flatten()(language_embedding)

    # feature fusion
    combined_features = Concatenate()([audio_features, language_features])

    # Fully Connected Layer and Categorical Output
    dense_1 = Dense(64, activation='relu')(combined_features)
    dense_2 = Dense(32, activation='relu')(dense_1)
    output = Dense(1, activation='sigmoid', name='output')(dense_2)

    # build a model
    model = Model(inputs=[audio_input, language_input], outputs=output)
    return model
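
To get a feeling for the size of this architecture, the number of trainable parameters can be estimated by hand. The sketch below assumes the input shape (128, 256, 1) and the embedding size 8 used later in the notebook; the number of languages is not stated in this report, so the value 2 is a hypothetical placeholder, and the dictionary keys are descriptive labels rather than layer names from the model.

```python
def dense_params(n_inputs, n_units):
    """A fully connected layer has one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


audio_shape = (128, 256, 1)
embedding_dim = 8
num_languages = 2  # hypothetical: the real value comes from the label encoder

# Flatten() turns the spectrogram into a single vector.
flattened_audio = 1
for dim in audio_shape:
    flattened_audio *= dim

# Concatenate() appends the flattened language embedding.
fused = flattened_audio + embedding_dim

params = {
    "language_embedding": num_languages * embedding_dim,
    "dense_64": dense_params(fused, 64),
    "dense_32": dense_params(64, 32),
    "output": dense_params(32, 1),
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

Almost all parameters sit in the first fully connected layer, because it is connected to every value of the flattened spectrogram. With only a few dozen training samples, a model of this size can memorise the training set easily.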

# 5 Dataset

First, we import the necessary libraries and packages.

In [30]:
import pandas as pd
import librosa
import numpy as np
import os
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.sequence import pad_sequences

## 5.1 Dataset Description

The datasets for building and evaluating our models will be derived from the MLEnd Deception Dataset. This dataset contains audio recordings of stories along with their corresponding textual transcripts and labels indicating whether the story is true or deceptive. We will create separate training and validation datasets to ensure that our model can be trained effectively and its performance can be accurately evaluated.

## 5.2 Dataset Preprocessing

### 5.2.1 Read label data

First, we use the pandas.read_csv() function to read the label data, which contains the language and the label of each story (true_story or deceptive_story). The story type labels are not formatted consistently, so we normalize the text in the Story_type column by converting it to lowercase and replacing spaces with underscores (e.g. True Story becomes true_story). The Story_type column is then converted to numeric labels with the apply() method: true_story is mapped to 1 and every other value, including deceptive_story, is mapped to 0. Finally, LabelEncoder converts the language labels from strings to integers. The numeric story labels are used as target values for the training data.

In [31]:
def load_labels_with_language_encoding(csv_path):
    labels_df = pd.read_csv(csv_path)
    labels_df['Story_type'] = labels_df['Story_type'].str.lower().str.replace(" ", "_")
    labels_df['label'] = labels_df['Story_type'].apply(lambda x: 1 if x == 'true_story' else 0)

    # Coding of language categories
    label_encoder = LabelEncoder()
    labels_df['language_encoded'] = label_encoder.fit_transform(labels_df['Language'])
    return labels_df, label_encoder
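
The effect of these label transformations can be checked on a few strings without pandas. The sketch below mirrors the same rules in plain Python; the helper names and the example values (including the language names) are hypothetical. The integer codes follow the sorted order of the distinct categories, which is also how LabelEncoder orders its classes.

```python
def normalize_story_type(raw):
    """Lowercase the label and replace spaces with underscores."""
    return raw.lower().replace(" ", "_")


def story_type_to_label(raw):
    """Map true_story to 1 and every other value to 0."""
    return 1 if normalize_story_type(raw) == "true_story" else 0


def encode_categories(values):
    """Assign integer codes following the sorted order of the distinct values."""
    classes = sorted(set(values))
    index = {name: code for code, name in enumerate(classes)}
    return [index[value] for value in values], classes


story_types = ["True Story", "deceptive story", "Deceptive Story"]
labels = [story_type_to_label(s) for s in story_types]
codes, classes = encode_categories(["English", "Chinese", "English"])
print(labels, codes, classes)
```

Note that any value other than true_story, including a misspelled label, ends up as 0, so it can be worth checking the distinct values in the column before encoding.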

### 5.2.2 Read audio data

Next, we use the librosa library to read each audio file and extract its Mel spectrogram. Mel spectrograms are common features of audio signals and are often used in machine learning tasks for speech or audio. The main parameters and calls are:

- n_mels: the number of Mel filters (default 128).
- hop_length: the frame shift, which controls the overlap between consecutive frames.
- n_fft: the size of the FFT (Fast Fourier Transform), which affects the resolution of the spectrum.
- power_to_db(): converts the power spectrogram to a logarithmic (decibel) scale.

We ensure that the time axis of each Mel spectrogram has a fixed length given by target_length. If the number of columns in the spectrogram is smaller than target_length, it is padded; if it is larger, it is cropped. Finally, mel_spec_db[..., np.newaxis] adds a ‘channel’ dimension to match the input format expected by a convolutional neural network (CNN).

In [32]:
def extract_mel_spectrogram(audio_path, n_mels=128, hop_length=512, n_fft=2048, target_length=256):
    try:
        y, sr = librosa.load(audio_path, sr=None)
        mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length, n_fft=n_fft)
        mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

        # Pad or crop the time axis to target_length.
        # Note: mode='constant' pads with 0, which is the maximum value on this
        # dB scale because power_to_db is called with ref=np.max.
        if mel_spec_db.shape[1] < target_length:
            mel_spec_db = np.pad(mel_spec_db, ((0, 0), (0, target_length - mel_spec_db.shape[1])), mode='constant')
        else:
            mel_spec_db = mel_spec_db[:, :target_length]

        # Add a channel dimension
        mel_spec_db = mel_spec_db[..., np.newaxis]
        return mel_spec_db
    except Exception as e:
        print(f"Error processing {audio_path}: {e}")
        return None
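
It can help to translate target_length into seconds of audio. The sketch below uses the frame-count rule for centred framing (librosa's default), where the number of frames is 1 + n_samples // hop_length. The sample rates and the 60-second clip are hypothetical examples, since the recordings are loaded with their native rate (sr=None) and their durations are not stated in this report.

```python
def n_frames(n_samples, hop_length=512):
    """Number of spectrogram frames for a centred short-time transform."""
    return 1 + n_samples // hop_length


def seconds_covered(target_length=256, hop_length=512, sample_rate=44100):
    """Approximate duration of audio represented by target_length frames."""
    return target_length * hop_length / sample_rate


# Hypothetical 60-second recording sampled at 44.1 kHz.
frames_in_clip = n_frames(60 * 44100)
print(frames_in_clip)                       # frames before cropping
print(round(seconds_covered(), 2))          # seconds kept at 44.1 kHz
print(seconds_covered(sample_rate=16000))   # seconds kept at 16 kHz
```

Under these assumptions, cropping to 256 frames keeps only the first few seconds of a longer recording, so a larger target_length or a larger hop_length may be worth trying if the stories are long.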

### 5.2.3 Data Preparation

In the data integration stage, the audio features, language labels, and target labels are combined by the prepare_data_without_bert function. The function reads the label DataFrame row by row, loads the audio file named in each row, extracts its Mel spectrogram features, and obtains the encoded language label and the classification label. If reading an audio file fails (for example, because the path is wrong or the audio is damaged), the sample is skipped. All extracted audio features, language labels, and classification labels are stored in separate lists, which are converted to NumPy arrays after all samples have been processed and then returned. Standardization is applied afterwards in main(): the mean and standard deviation of each feature position are computed over all samples, and the audio features are rescaled to a mean of 0 and a standard deviation of 1 so that the model can learn more easily.

In [33]:
def prepare_data_without_bert(audio_folder, labels_df, feature_extractor, audio_target_length=256):
    audio_features = []
    language_labels = []
    labels = []

    for index, row in labels_df.iterrows():
        # Get audio file paths and tags
        audio_path = os.path.join(audio_folder, row['filename'])
        label = row['label']
        language_label = row['language_encoded']  

        # Extract audio features
        feature = feature_extractor(audio_path, target_length=audio_target_length)
        if feature is None:
            continue   

        # Add features and tags to the list
        audio_features.append(feature)
        language_labels.append(language_label)
        labels.append(label)

    # Convert to NumPy array
    audio_features = np.array(audio_features)
    language_labels = np.array(language_labels, dtype=np.int32)
    labels = np.array(labels, dtype=np.int32)

    return audio_features, language_labels, labels
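
The standardization step applied to the returned audio array can be sketched as follows. The array below is random stand-in data with a deliberately small, hypothetical shape; the function name is illustrative, and the small epsilon guards against division by zero for positions that never vary.

```python
import numpy as np


def standardize(features, eps=1e-9):
    """Rescale each feature position to zero mean and unit variance across samples."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps), mean, std


# Hypothetical batch: 20 samples of a tiny 4 x 5 "spectrogram" with one channel.
rng = np.random.default_rng(0)
batch = rng.normal(loc=-40.0, scale=10.0, size=(20, 4, 5, 1))

scaled, mean, std = standardize(batch)
print(scaled.shape)
print(float(scaled.mean()), float(scaled.std()))
```

Returning the mean and standard deviation makes it possible to reuse the same statistics on new data. A common variation is to compute them on the training split only, so that no information from the validation or test samples enters the preprocessing.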

# 6 Experiments and results

First, we import the necessary libraries and packages.

In [34]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.metrics import f1_score, precision_score, recall_score
from keras.optimizers import Adam
from keras.callbacks import Callback, EarlyStopping, ReduceLROnPlateau
from tqdm import tqdm

## 6.1 Custom TQDM progress bar callback class

TQDMProgressBar inherits from Keras' Callback. It displays the training progress and calculates the F1 score, precision and recall at the end of each epoch.

- on_train_begin is called at the beginning of training and creates the progress bar.
- on_batch_end is called at the end of each batch and prints the loss value for the current batch.
- on_epoch_end is called at the end of each epoch and calculates and prints the evaluation metrics on the validation data.
- on_train_end is called at the end of training and closes the progress bar.

In [35]:
class TQDMProgressBar(Callback):
    def __init__(self, val_data, val_labels):
        super(TQDMProgressBar, self).__init__()
        self.val_data = val_data
        self.val_labels = val_labels

    def on_train_begin(self, logs=None):
        self.epochs = self.params['epochs']
        self.epochs_progress = tqdm(total=self.epochs, desc="Training Progress", position=0, ncols=100)

    def on_epoch_end(self, epoch, logs=None):
        self.epochs_progress.update(1)
        # Calculate and print F1 score, precision and recall on the validation data
        y_pred = self.model.predict(self.val_data)
        y_pred_binary = (y_pred > 0.5).astype(int) 
        f1 = f1_score(self.val_labels, y_pred_binary, zero_division=0)
        precision = precision_score(self.val_labels, y_pred_binary, zero_division=0)
        recall = recall_score(self.val_labels, y_pred_binary, zero_division=0)

        print(f"Epoch {epoch + 1} - F1 Score: {f1:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")

    def on_batch_end(self, batch, logs=None):
        # logs may be None, so guard before looking up the loss
        if logs and 'loss' in logs:
            tqdm.write(f"Batch {batch} - Loss: {logs['loss']:.4f}")

    def on_train_end(self, logs=None):
        self.epochs_progress.close()


## 6.2 Load data and prepare

First, the data needs to be prepared before training begins. The audio features are standardized using the mean and standard deviation computed over all samples, which keeps the inputs in a stable range and helps to accelerate model convergence. The standardized audio features, language labels, and target classification labels are then divided into training, validation, and test sets. The division ratio is 60% training set, 30% validation set, and 10% test set: the code first holds out 40% of the data and then assigns a quarter of that held-out part to the test set. The training set is used to update the model parameters, the validation set is used to evaluate the generalization ability of the model and adjust the hyperparameters, and the test set is used to evaluate the final performance of the model.

Next, three independent models are defined, each with exactly the same structure. Each model has two inputs: audio features and language labels. In the implementation shown above (create_cnn_with_language_embedding), the Mel spectrogram is flattened and passed directly to the fully connected layers, so no convolutional layers are applied to it; the language labels are mapped to dense vectors through an embedding layer that represents the language category. The two feature vectors are fused in the model, and the binary classification is completed by fully connected layers that output the probability of each sample belonging to the true_story category. After definition, the three models are compiled with independent Adam optimizers, the binary_crossentropy loss function, and accuracy and AUC as performance metrics.

The training part consists of three independent model trainings. Each model receives the audio features and language labels of the training set as input and the classification label as the target value. The validation set is used to evaluate the model during training and to monitor whether overfitting or underfitting occurs. The EarlyStopping callback monitors the validation loss: if it does not improve for three consecutive epochs, training stops early. The ReduceLROnPlateau callback adjusts the learning rate dynamically: when the validation loss has not decreased for two epochs, the learning rate is halved, which helps the model adjust its weights more finely. Training of each model continues until the early stopping condition is met or the maximum number of epochs (5) is reached.
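
The interplay of the two callbacks can be illustrated with a simplified simulation. This is a sketch of the patience logic only: it ignores details such as minimum improvement thresholds and cooldown periods, and the validation losses below are hypothetical, not taken from the training run.

```python
def simulate_callbacks(val_losses, lr=1e-4, stop_patience=3, lr_patience=2, factor=0.5):
    """Return a list of (epoch, learning_rate) pairs for the epochs that are run."""
    best = float("inf")
    stop_wait = 0  # epochs without improvement, for early stopping
    lr_wait = 0    # epochs without improvement, for the learning rate schedule
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stop_wait = 0
            lr_wait = 0
        else:
            stop_wait += 1
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor  # halve the learning rate
                lr_wait = 0
        history.append((epoch, lr))
        if stop_wait >= stop_patience:
            break  # early stopping ends the run here
    return history


# Hypothetical validation losses: best value in epoch 2, then steadily worse.
history = simulate_callbacks([1.0, 0.9, 1.1, 1.2, 1.3, 1.4])
for epoch, lr in history:
    print(epoch, lr)
```

In this example the learning rate is halved in epoch 4, after two epochs without improvement, and the run ends after epoch 5, when the third epoch without improvement is reached.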

After training, the three models make predictions on the test set. Each model generates a probability that a test sample belongs to the positive class (true_story). These probabilities are then combined by simple averaging: for each sample, the average of the predicted probabilities of the three models is taken as the final prediction. This approach relies on the models making partly different errors, so that averaging can cancel some of them.

Finally, the code converts the probability value of the ensemble prediction into a binary label and applies a threshold of 0.5 to determine the sample category. Subsequently, the code calculates and outputs the performance indicators of the model on the test set, including accuracy, F1 score, precision, and recall. Accuracy indicates the correct rate of the overall classification; F1 score balances precision and recall and is a comprehensive indicator for measuring classification performance; precision focuses on how many of the samples predicted as positive are correct; recall measures how many of the actual positive samples are correctly identified. Through these indicators, the code comprehensively evaluates the classification performance of the ensemble model on the test set.

In [36]:
def main():
    # Path settings
    audio_folder = './Deception-main/CBU0521DD_stories'
    csv_path = './Deception-main/CBU0521DD_stories_attributes.csv'

    # Load tags and language encoders
    labels_df, label_encoder = load_labels_with_language_encoding(csv_path)

    # Data preparation
    X_audio, X_language, y = prepare_data_without_bert(
        audio_folder, labels_df, extract_mel_spectrogram, audio_target_length=256
    )

    # Standardised audio features
    mean = np.mean(X_audio, axis=0)
    std = np.std(X_audio, axis=0)
    X_audio = (X_audio - mean) / (std + 1e-9)

    # Data segmentation
    X_train_audio, X_temp_audio, X_train_language, X_temp_language, y_train, y_temp = train_test_split(
        X_audio, X_language, y, test_size=0.4, random_state=42
    )
    X_val_audio, X_test_audio, X_val_language, X_test_language, y_val, y_test = train_test_split(
        X_temp_audio, X_temp_language, y_temp, test_size=0.25, random_state=42
    )

    # Category weights calculated
    class_weights = class_weight.compute_class_weight(
        class_weight='balanced',
        classes=np.unique(y_train),
        y=y_train
    )
    class_weights_dict = dict(enumerate(class_weights))

    # Create models
    num_languages = len(label_encoder.classes_)
    cnn_model_1 = create_cnn_with_language_embedding(
        audio_input_shape=(128, 256, 1),
        num_languages=num_languages,
        embedding_dim=8  # Embedded dimensions
    )
    cnn_model_2 = create_cnn_with_language_embedding(
        audio_input_shape=(128, 256, 1),
        num_languages=num_languages,
        embedding_dim=8   
    )
    cnn_model_3 = create_cnn_with_language_embedding(
        audio_input_shape=(128, 256, 1),
        num_languages=num_languages,
        embedding_dim=8   
    )

    # Compile each model
    optimizer_1= Adam(learning_rate=1e-4)
    optimizer_2= Adam(learning_rate=1e-4)
    optimizer_3= Adam(learning_rate=1e-4)
    cnn_model_1.compile(optimizer=optimizer_1, loss='binary_crossentropy', metrics=['accuracy', 'AUC'])
    cnn_model_2.compile(optimizer=optimizer_2, loss='binary_crossentropy', metrics=['accuracy', 'AUC'])
    cnn_model_3.compile(optimizer=optimizer_3, loss='binary_crossentropy', metrics=['accuracy', 'AUC'])

    # Training callbacks
    early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
    lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, verbose=1)

    # Model training
    history_1= cnn_model_1.fit(
        [X_train_audio, X_train_language], y_train,
        validation_data=([X_val_audio, X_val_language], y_val),
        epochs=5,
        batch_size=8,
        class_weight=class_weights_dict,
        callbacks=[early_stopping, lr_scheduler]
    )
    history_2= cnn_model_2.fit(
        [X_train_audio, X_train_language], y_train,
        validation_data=([X_val_audio, X_val_language], y_val),
        epochs=5,
        batch_size=8,
        class_weight=class_weights_dict,
        callbacks=[early_stopping, lr_scheduler]
    )
    history_3= cnn_model_3.fit(
        [X_train_audio, X_train_language], y_train,
        validation_data=([X_val_audio, X_val_language], y_val),
        epochs=5,
        batch_size=8,
        class_weight=class_weights_dict,
        callbacks=[early_stopping, lr_scheduler]
    )
    # Test set predictions
    y_pred_1 = cnn_model_1.predict([X_test_audio, X_test_language])
    y_pred_2 = cnn_model_2.predict([X_test_audio, X_test_language])
    y_pred_3 = cnn_model_3.predict([X_test_audio, X_test_language])
    
    # Simple average integration
    ensemble_predictions = (y_pred_1 + y_pred_2 + y_pred_3) / 3
    
    # Converted to binary labels
    y_pred_binary = (ensemble_predictions > 0.5).astype(int)
    
    accuracy = np.mean(y_pred_binary.flatten() == y_test.flatten()) * 100
    f1 = f1_score(y_test, y_pred_binary)
    precision = precision_score(y_test, y_pred_binary)
    recall = recall_score(y_test, y_pred_binary)


    print(f"Ensemble Accuracy: {accuracy:.2f}%")
    print(f"F1 Score: {f1:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")

Start training:

In [38]:
if __name__ == "__main__":
    main()

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 4: ReduceLROnPlateau reducing learning rate to 4.999999873689376e-05.
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 4: ReduceLROnPlateau reducing learning rate to 4.999999873689376e-05.
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 3: ReduceLROnPlateau reducing learning rate to 4.999999873689376e-05.
Epoch 4/5
Ensemble Accuracy: 70.00%
F1 Score: 0.8000, Precision: 0.7500, Recall: 0.8571


## 6.3 Result analysis

### 6.3.1 Training process analysis

- **Loss value (Loss)**: During the training phase, the loss value drops rapidly from a high initial value (e.g., 1.3098) to close to 0 (e.g., 0.0326). This indicates that the model has learned features on the training set and is able to fit the training data.
- **Accuracy**: The training accuracy quickly rises from an initial low value (such as 36.67%) to 90% or even 100%. This indicates that the model is able to fit the training data well, but because the training accuracy increases so quickly and approaches 100%, this may be a sign of overfitting.
- **AUC value**: During the training phase, the AUC rises quickly to close to 1.0 (e.g., 1.0000). An AUC close to 1 indicates that the model is able to distinguish positive and negative samples well, but due to the weak performance of the validation AUC, the improvement in the training AUC may reflect more of the model overfitting the training data.

### 6.3.2 Validation process analysis

- **Loss values (val_loss)**: The change in validation loss shows large fluctuations. For example, the validation loss of the first model is 1.3521 in epoch 2, but rises to 2.2682 in epoch 5. This fluctuation suggests that the model may start to overfit in the late stages of training, that is, it tends to memorize the training data rather than learn broadly applicable features.
- **Accuracy (val_accuracy)**: In the first model, the validation accuracy is 40.00% in epoch 2 and stagnates or even decreases in subsequent epochs. In the last epoch it is again 40.00%, showing that the model does not generalize well to the validation data.
- **AUC value (val_auc)**: The validation AUC fluctuates greatly and is generally low. For example, the validation AUC of the first model is 0.5526 in the second epoch, but then drops to 0.3445 in the final epoch. An AUC around or below 0.5 is no better than random ranking, so the model fails to distinguish between positive and negative samples on the validation set.

### 6.3.3 Test result analysis

- **Accuracy (Test Accuracy)**: The accuracy of the ensemble on the test set is 70.00%, which is considerably higher than the performance on the validation set. This may indicate that the ensemble strategy improves the overall classification performance and weakens the overfitting tendency of a single model. However, the test set contains only about 10% of the roughly 100 samples, so the gap may equally reflect sampling noise (see Section 7.2).
- **F1 Score, Precision and Recall**: The F1 score of the ensemble model is 0.8000, which is a high value. The F1 score is the harmonic mean of precision and recall, and is particularly useful for tasks where the balance between false positives and false negatives is important. A high F1 score indicates that the overall classification performance of the model on the test set is relatively balanced.
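
The reported F1 score can be checked against the reported precision and recall with the harmonic-mean formula. The confusion counts in the second half of the sketch are not reported in the notebook output; they are an inference, assuming a 10-sample test set, and are included only because they reproduce all four reported figures.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the notebook output.
reported_precision = 0.7500
reported_recall = 0.8571
f1 = f1_from(reported_precision, reported_recall)
print(f"{f1:.4f}")  # 0.8000

# Inferred (not reported) counts that are consistent with the output,
# assuming the test set holds 10 samples.
tp, fp, fn, tn = 6, 2, 1, 1
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, round(recall, 4))
```

Under that assumption, only one negative sample is classified correctly, so the high F1 score mostly reflects good recognition of the positive class rather than balanced performance on both classes.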

# 7 Conclusions

In this task, we built a binary classification model based on audio and language features, with the goal of classifying audio stories as true or deceptive (true_story and deceptive_story). We designed multiple neural network models that process the Mel spectrogram and the language category of each recording, and fused the predictions of these models through an ensemble learning strategy. On the small test set, the ensemble reached an accuracy of 70.00%.

## 7.1 Highlights of the task model

- **Multi-input model design**: The integration of audio and language feature streams enables the model to extract more comprehensive features from multimodal data.
- **Application of integration strategy**: The use of simple average integration strategy effectively alleviates the problem of overfitting of a single model and improves the robustness and generalization ability of the model.
- **Reasonable training process**: EarlyStopping and ReduceLROnPlateau are used to dynamically control the training process, which improves training efficiency and reduces unnecessary overfitting risks.

## 7.2 Problems encountered so far

- The current data size (100 samples) is severely insufficient to train a model with generalisation capabilities. The small data size also leads to overfitting: the model fits the details of the training data and is difficult to generalize to the validation set and test set.

- Inconsistent performance between the validation set and the test set: Due to the insufficient sample size, the validation set may not fully represent the data distribution, resulting in large fluctuations in validation performance, while the test set may happen to be distributed closer to the training set, so the test performance is better than the validation performance.

- From the results, the performance on the validation set is much lower than on the training set and the test set. This may be due to a difference between the distribution of the validation set and those of the training and test sets. Such a difference prevents the validation phase from reflecting the actual generalization performance, which in turn affects the effectiveness of the early stopping mechanism and the learning rate adjustment.

- Although the simple average ensemble strategy improves the overall performance of the model, its potential has not been fully utilized: the three models currently have exactly the same structure and differ only in their weight updates during training, so model diversity is insufficient. In addition, the individual models are not weighted by their performance; only simple averaging is used, which fails to fully utilize the advantages of each model.

## 7.3 Directions for improvement

- **Increase the amount of data**: We can augment the audio data, for example by adding noise, time stretching or pitch shifting, to increase data diversity, or expand the data source and collect more story audio samples to improve the generalization ability of the model.

- **Optimize validation set division**: We can ensure that the proportion of each class is consistent across the training, validation and test sets, or evaluate model performance through K-fold cross-validation to reduce the impact of an unevenly distributed validation set.

- **Model architecture**: More complex neural network structures such as deep CNNs, RNNs or self-attention mechanisms can be explored to capture temporal information and long-range dependencies in audio data. In addition, fusing audio features with other types of data (e.g., text or metadata) through multimodal learning may further enhance the classification performance. To optimise the training process, one can try more advanced optimisers and learning rate scheduling strategies, or adopt a transfer learning approach that fine-tunes models pre-trained on large-scale audio data for this specific classification task.
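
The idea of keeping class proportions consistent across folds can be sketched in plain Python. The function below is a minimal illustration with hypothetical labels, not the notebook's method; it does not shuffle, so in practice the indices of each class would be shuffled first. scikit-learn offers ready-made tools for this purpose, such as StratifiedKFold and the stratify argument of train_test_split.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds so that each fold has similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of each class to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical label set: six true stories and four deceptive ones.
labels = [1] * 6 + [0] * 4
folds = stratified_folds(labels, n_folds=2)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(fold, "positives:", positives, "negatives:", len(fold) - positives)
```

Each fold serves once as the validation set while the remaining folds are used for training, so every sample contributes to the validation estimate exactly once.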

# 8 References

For more details, I have uploaded the structured code to GitHub. It can be accessed through the link below:

https://github.com/daydreamer17/bupt-iot_ml_miniproject2024.git