Audio Quality Assessment with Transformer-Based Learning

This project introduces a novel approach to audio quality assessment using a transformer-based deep learning architecture. The proposed model leverages transformers to process hand-crafted audio features extracted from the UGM dataset provided by Dr. Vinit Jakhetiya's research group, providing enhanced performance over traditional approaches. This README gives an overview of the architecture, the model configuration, and the tools used for this project.

Architecture Overview and Model Configuration

The proposed model employs a transformer-based deep learning approach to assess audio quality. It takes hand-crafted features concatenated into a single vector as input and is trained against the corresponding ground-truth quality labels. The transformer architecture, comprising an encoder-decoder structure with Multi-Head Attention (MHA) and feed-forward layers, processes the data. We use four encoder layers, set the number of heads (h) in each MHA to four, and train with the Adam optimizer. The model outputs a single continuous value representing audio quality on a scale of 1 to 5. These design choices keep the model compact for the available data while letting the attention mechanism weight the contribution of the individual features.
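A minimal PyTorch sketch of this configuration is shown below; the feature dimension, model width, learning rate, and the encoder-only simplification are illustrative assumptions rather than the exact settings used in this repository.

```python
import torch
import torch.nn as nn

class AudioQualityTransformer(nn.Module):
    """Sketch: 4 encoder layers, 4-head MHA, regression head for a 1-5 quality score."""

    def __init__(self, feat_dim=512, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)  # project the hand-crafted feature vector
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)  # single continuous quality value

    def forward(self, x):  # x: (batch, seq_len, feat_dim)
        z = self.encoder(self.input_proj(x))
        return self.head(z.mean(dim=1)).squeeze(-1)  # pool over the sequence, predict the score


model = AudioQualityTransformer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam optimizer, as described above
```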

We integrated the dual-encoder cross-attention mechanism proposed in [2] with the model proposed in [1]. The network has 4 layers, and each layer uses 3 attention blocks with 4 attention heads each. The first two blocks compute self-attention, taking their key, query, and value from their respective inputs. The third block takes the output of block 1 as its query and value, and the output of block 2 as its key. This proposed model gives better results, as shown in the Results section.
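A sketch of one such layer in PyTorch, assuming both input streams share the same embedding dimension and sequence length (so that the key from block 2 and the value from block 1 align); residual connections and feed-forward sublayers are omitted for brevity:

```python
import torch.nn as nn

class DualEncoderCrossAttentionLayer(nn.Module):
    """Sketch of one layer: two self-attention blocks and one cross-attention block."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.block1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.block2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.block3 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x1, x2):
        out1, _ = self.block1(x1, x1, x1)  # block 1: self-attention on the first stream
        out2, _ = self.block2(x2, x2, x2)  # block 2: self-attention on the second stream
        # block 3: query and value from block 1, key from block 2
        fused, _ = self.block3(out1, out2, out1)
        return fused
```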


Results

The proposed architecture with dual-encoder cross attention was trained on the UGM dataset using the concatenated MFCC + Mel spectrogram + Chroma CQT features. The results are reported in Table 1.

Table 1: Comparison of the proposed model against the model in [1], which outperforms other quality assessment techniques

| Model | PLCC | SRCC | KRCC |
| --- | --- | --- | --- |
| Proposed model | 0.828 | 0.823 | 0.629 |
| Model proposed in [1] | 0.816 | 0.812 | 0.613 |
| Proposed model with 4 attention heads in the cross-attention block | 0.823 | 0.821 | 0.619 |
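The reported metrics (PLCC, SRCC, KRCC) can be computed with scipy.stats; a minimal sketch, assuming arrays of predicted and ground-truth quality scores:

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

def correlation_metrics(predicted, ground_truth):
    """Return PLCC, SRCC and KRCC between predicted and ground-truth quality scores."""
    plcc, _ = pearsonr(predicted, ground_truth)    # Pearson linear correlation coefficient
    srcc, _ = spearmanr(predicted, ground_truth)   # Spearman rank correlation coefficient
    krcc, _ = kendalltau(predicted, ground_truth)  # Kendall rank correlation coefficient
    return plcc, srcc, krcc
```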

Ablation Study

Table 2: Ablation study of the model proposed in [1] on individual features.

| Feature | PLCC | SRCC | KRCC |
| --- | --- | --- | --- |
| MFCC | 0.642 | 0.623 | 0.449 |
| Mel spectrogram | 0.578 | 0.566 | 0.400 |
| Chroma CQT | 0.321 | 0.345 | 0.241 |
| Spectral contrast | 0.227 | 0.207 | 0.141 |

To study the contribution of the individual features, we trained the model proposed in [1] on each feature separately. The correlation between the model's predictions and the ground-truth values is reported in Table 2. The features rank, from most to least informative, MFCC > Mel spectrogram > Chroma CQT > Spectral contrast. None of the other features we tried (PNCC, spectral centroid) showed promising results.
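For reference, a minimal librosa sketch of how such hand-crafted features can be extracted and concatenated; mean-pooling each feature over time into one vector is an illustrative assumption, not necessarily the exact pipeline used here:

```python
import librosa
import numpy as np

def extract_features(path, sr=16000, n_mfcc=20):
    """Extract MFCC, Mel spectrogram, Chroma CQT and spectral contrast, pooled over time."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    # mean-pool each feature over time and concatenate into a single vector
    return np.concatenate([f.mean(axis=1) for f in (mfcc, mel, chroma, contrast)])
```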


Table 3: Ablation study of the model proposed in [1] on different combinations of features.

| Feature combination | PLCC | SRCC | KRCC |
| --- | --- | --- | --- |
| MFCC + Mel spectrogram + Chroma CQT | 0.816 | 0.812 | 0.613 |
| MFCC + Mel spectrogram + Spectral contrast | 0.747 | 0.736 | 0.543 |
| MFCC + Mel spectrogram + Chroma CQT + Spectral contrast | 0.730 | 0.726 | 0.538 |
| MFCC + Mel spectrogram + Chroma CQT + Spectral contrast + PNCC | 0.721 | 0.716 | 0.530 |
| MFCC + Mel spectrogram + Chroma CQT + PNCC | 0.297 | 0.445 | 0.305 |

Since individual contributions alone are not enough to draw a conclusion, we also studied the model's performance on different combinations of features. To do this, we trained the model proposed in [1] on the combinations shown in Table 3, which also reports the correlation between the model's predictions and the ground-truth values. Because the dataset contains only 2,075 audio samples, concatenating too many features, or features with large dimensionality, degrades performance due to the curse of dimensionality.

For a larger dataset, which could also be obtained through data augmentation, we propose using MFCC + Mel spectrogram + Chroma CQT + Spectral contrast; for this study we used MFCC + Mel spectrogram + Chroma CQT.
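As an illustration of such augmentation, simple waveform-level transforms could be generated with librosa and NumPy; the specific parameter values are illustrative assumptions:

```python
import librosa
import numpy as np

def augment(y, sr):
    """Return simple augmented copies of a waveform: pitch shift, time stretch, added noise."""
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # shift up by two semitones
    stretched = librosa.effects.time_stretch(y, rate=0.9)       # slow down by about 10%
    noisy = y + 0.005 * np.random.randn(len(y))                 # add light Gaussian noise
    return [shifted, stretched, noisy]
```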

Dependencies

To install the required dependencies, simply run the following command:

pip install -r requirements.txt

Please ensure that you have these libraries installed to run the project.

Usage

To use this project, follow these steps:

  1. Clone this GitHub repository.
  2. Install the required dependencies.
  3. Train the model on your audio quality assessment dataset.
  4. Evaluate the model's performance.

Acknowledgment

We would like to acknowledge the support and contributions of the open-source community in making this project possible. Additionally, we extend our gratitude to the following researchers and their papers:

1. Transformer-based quality assessment model for generalized user-generated multimedia audio content

Dataset Credits: The dataset used in this project was generously provided by Mumtaz, D., Jena, A., Jakhetiya, V., Nathwani, K., and Guntuku, S.C. as described in their paper, "Transformer-based quality assessment model for generalized user-generated multimedia audio content" (Proc. Interspeech 2022, 674-678, doi: 10.21437/Interspeech.2022-10386).

2. Improved Transformer Model for Enhanced Monthly Streamflow Predictions of the Yangtze River

We acknowledge the work of C. Liu, D. Liu, and L. Mu as described in their paper, "Improved Transformer Model for Enhanced Monthly Streamflow Predictions of the Yangtze River" (IEEE Access, vol. 10, pp. 58240-58253, 2022, doi: 10.1109/ACCESS.2022.3178521).

We appreciate the valuable contributions of these researchers and the resources they provided for our project.

Team Members

This project was made possible by the efforts of our team members:

  • Ashutosh Chauhan
  • Dakshi Goel
  • Aman Kumar
  • Devin Chugh
  • Shreya Jain

Contributing

We welcome contributions to enhance this project. If you would like to contribute, please follow the standard GitHub pull request process.

For any questions or issues, please open a GitHub issue in this repository.

Thank you for your interest in our audio quality assessment project!
