This repository contains the dataset produced for the paper Perceiving Music Quality with GANs [1], a collaboration between Peltarion and Epidemic Sound. The purpose is to evaluate methods for rating music quality. It contains 975 segments from songs across 13 genres, with degradations of various intensities applied. Each clip has an associated human-perceived quality rating, from 1 (Bad) to 5 (Excellent), which is the median of the ratings assigned by 5 different people.
The dataset consists of a CSV file with metadata (the rating and information about each segment, including genre, artist, and track) and the corresponding audio segments. These may be downloaded and used directly. For convenience, we also provide code to load the data in both PyTorch and TensorFlow.
The following data is hosted in the releases of this GitHub repo:
URL | Description |
---|---|
https://github.com/carlthome/pmqd/releases/download/v1.0.0/audio.tgz | Archive with all music segments at 48kHz / 24-bit. |
https://github.com/carlthome/pmqd/releases/download/v1.0.0/audio32.tgz | Archive with all music segments at the increased bit depth 48kHz / 32-bit. |
https://github.com/carlthome/pmqd/releases/download/v1.0.0/pmqd.csv | Metadata (ratings and song information). |
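For a quick look at the metadata without installing either framework, the CSV can be read directly. A minimal sketch using pandas, assuming the `pmqd.csv` release asset above and the column names described under the metadata section below:

```python
import pandas as pd

# Read the metadata table straight from the GitHub release asset.
url = "https://github.com/carlthome/pmqd/releases/download/v1.0.0/pmqd.csv"
metadata = pd.read_csv(url)

# Summarize the perceived quality ratings per degradation type.
print(metadata.groupby("degradation_type")["rating"].describe())
```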
Install `pmqd` with additional PyTorch and `torchaudio` dependencies:

```sh
> pip install git+https://github.com/carlthome/pmqd#egg=pmqd[torch]
```
To download the dataset to `download_directory` and use it:

```python
from pmqd.torch import PMQD
from torch.utils.data import DataLoader

dataset = PMQD(root="download_directory", download=True)
dataloader = DataLoader(dataset, batch_size=32)
sample_rate = PMQD.SAMPLE_RATE
...

for batch in dataloader:
    audio, rating = batch["audio"], batch["rating"]
    ...
```
This requires `ffmpeg` on the target system, which has to be installed manually (e.g. on macOS, `brew install ffmpeg`). Alternatively, see the Docker instructions below.
Install `pmqd` with additional TensorFlow and TensorFlow Datasets dependencies:

```sh
> pip install git+https://github.com/carlthome/pmqd#egg=pmqd[tfds]
```
To download the dataset to the default `tfds` location and use it:

```python
import pmqd.tfds
import tensorflow_datasets as tfds

dataset, info = tfds.load("PMQD", split="test", with_info=True)
dataset = dataset.batch(32)
sample_rate = info.features["audio"].sample_rate
...

for batch in dataset:
    audio, rating = batch["audio"], batch["rating"]
    ...
```
The repository contains a Docker image with all dependencies required for both `tfds` and `torch`, which can be built directly from the repository. To build it and open a prompt inside it, do:

```sh
docker build --target pmqd -t pmqd https://github.com/carlthome/pmqd.git#main
docker run -it pmqd bash
```
Example notebooks demonstrate usage with `tfds` and `torch`.
The rated music segments are available as 48 kHz / 24-bit PCM stereo mixes. For compatibility with e.g. `torchaudio`, they are also available at an increased bit depth of 32 bits.
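For example, a single extracted segment can be inspected with `torchaudio`. A sketch, where the path is hypothetical and assumes `audio32.tgz` has been unpacked into a local `audio32/` directory (the filename is taken from the sample table below):

```python
import torchaudio

# Hypothetical path to one segment extracted from audio32.tgz.
waveform, sample_rate = torchaudio.load("audio32/71982b5dbf1a4d6f86e9638fb3574f97.wav")

assert sample_rate == 48000   # All segments are 48 kHz.
print(waveform.shape)         # (channels, samples); stereo, so channels == 2.
```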
The metadata table contains the following columns:

- `id`: ID of the sample.
- `genre`: Genre.
- `artist`: Artist.
- `title`: Title.
- `degradation_type`: Type of applied degradation (see Degrading audio quality).
- `degradation_intensity`: Intensity of the applied degradation, from 0 (none) to 100 (maximum).
- `rating`: Median human rating of perceived quality, from 1 (Bad) to 5 (Excellent).
- `sample_start`: The start location in seconds of this segment in the full song.
- `sample_filename`: Filename of the sample.
Below is a random sample of the contents:
id | genre | artist | title | degradation_type | degradation_intensity | rating | sample_start | sample_filename |
---|---|---|---|---|---|---|---|---|
137 | Blues | Martin Carlberg | Bad Bad Blood (Instrumental Version) | original | 0.00 | 5.00 | 130 | 04694c6cb0cb4833906259ee961d53b8.wav |
448 | Rnb & Soul | Park Lane feat. Vincent Vega | I Don't Wanna Be You | limiter | 90.16 | 4.00 | 199 | 0b6856dacd8d4c19ad9f25e4f2fe3f02.wav |
850 | Blues | Henrik Nagy | Strolling In New Orleans 1 | noise | 20.67 | 2.00 | 34 | 71982b5dbf1a4d6f86e9638fb3574f97.wav |
80 | Country | Martin Carlberg | Appalachian Trail 2 | original | 0.00 | 4.00 | 98 | d4404cbff14244d58450fa5c73c97481.wav |
675 | Funk | Teddy Bergström | Godspel Groove | lowpass | 63.92 | 3.00 | 31 | 5df0c95dda78408aa7f73fbad7c029cc.wav |
The original audio is from the Epidemic Sound catalog, an online service with professionally produced, high-quality music. The catalog covers a wide range of music and is curated to match contemporary music well, as it is intended for use by content creators. Using their catalog, we created a balanced dataset of mutually exclusive genres by randomly sampling 5 songs from each selected genre. From each song, we then randomly sampled 3 segments of constant length, approximately 4 seconds in duration.
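As an illustration of this segment sampling (not the code used to build the dataset), assuming each song is decoded into a NumPy array of shape (samples, channels) at 48 kHz:

```python
import numpy as np

SAMPLE_RATE = 48000                # Segments are 48 kHz.
SEGMENT_LENGTH = 4 * SAMPLE_RATE   # Approximately 4 seconds per segment.

def sample_segments(song, n_segments=3, seed=0):
    """Cut n_segments random fixed-length windows from a (samples, channels) array."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(song) - SEGMENT_LENGTH, size=n_segments)
    return [song[start:start + SEGMENT_LENGTH] for start in starts]
```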
To include tracks of varying quality, we used a set of signal degradations implemented with the following open-source REAPER JSFX audio plugins:

- Distortion (`loser/waveShapingDstr`): Waveshaping distortion, with the waveshape going from a sine-like shape (50%) to square (100%).
- Lowpass (`Liteon/butterworth24db`): Low-pass filtering with a 24 dB Butterworth filter, configured with a frequency cutoff from 20 kHz down to 1000 Hz.
- Limiter (`loser/MGA_JSLimiter`): Mastering limiter with all settings fixed except for the threshold, which was lowered from 0 dB to -30 dB (introducing clipping artifacts).
- Noise (`Liteon/pinknoisegen`): Additive pink noise on a range from -25 dB (subtly audible) to 0.0 dB (clearly audible).
- Original: The original music segment, left unprocessed.
Plugins were applied separately to each segment, without effects chaining. The parameter of each plugin is rescaled to [0, 100] and considered the intensity of the degradation. `notebooks/degrade_audio.ipynb` demonstrates the code used to degrade the original music segments.
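The actual degradations were rendered with the JSFX plugins above, but the intensity-to-parameter mapping can be sketched as a linear rescaling. Below, the noise degradation maps intensity onto the stated -25 dB to 0 dB gain range; white Gaussian noise stands in for the pink noise plugin to keep the sketch short:

```python
import numpy as np

def rescale(intensity, low, high):
    """Map a degradation intensity in [0, 100] linearly onto a plugin parameter range."""
    return low + (high - low) * intensity / 100.0

def degrade_with_noise(audio, intensity, seed=0):
    """Additive noise from -25 dB (intensity 0) up to 0 dB (intensity 100).
    White Gaussian noise is used here for brevity; the dataset used a pink noise plugin."""
    gain = 10.0 ** (rescale(intensity, -25.0, 0.0) / 20.0)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(audio.shape).astype(audio.dtype)
    return audio + gain * noise
```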
From each original segment described in Source, we produce one degraded version per degradation type, with the intensity sampled uniformly at random over the parameter range. With 5 songs per genre, 3 segments per song, and 5 versions per segment (4 degradations plus the original), this yields 75 music segments per genre, and our dataset thus consists of the following number of samples:
Genre | Count |
---|---|
Acoustic | 75 |
Blues | 75 |
Classical | 75 |
Country | 75 |
Electronica & Dance | 75 |
Funk | 75 |
Hip Hop | 75 |
Jazz | 75 |
Latin | 75 |
Pop | 75 |
Reggae | 75 |
Rnb & Soul | 75 |
Rock | 75 |
Total | 975 |
To annotate the music segments with their human-perceived listening quality, we crowdsource the task on Amazon Mechanical Turk (AMT). This has the advantage of allowing significantly larger scale than controlled tests, though it introduces potential problems such as cheating and underperforming participants, which we handle as described in this section.
Tasks to be completed by human participants are created to rate segments for their listening quality. Segments are randomly assigned to tasks such that each task contains 10 segments, no task contains duplicates, and each segment occurs in at least 5 tasks. Each participant may only perform one task, to avoid individual biases. In total, we produce 488 tasks, resulting in 4880 individual segment evaluations.
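One way to construct such an assignment (a sketch; the paper does not specify the exact algorithm) is to shuffle all segment ids once per required occurrence and chunk each shuffle into tasks, so that no task contains duplicates:

```python
import random

def assign_tasks(segment_ids, task_size=10, occurrences=5, seed=0):
    """Assign each segment to `occurrences` tasks of up to `task_size` distinct segments.
    Each pass shuffles all ids and chunks them, so no task repeats a segment. The last
    chunk of each pass may be short; the real assignment topped these up to 488 full tasks."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(occurrences):
        ids = list(segment_ids)
        rng.shuffle(ids)
        tasks.extend(ids[i:i + task_size] for i in range(0, len(ids), task_size))
    return tasks

tasks = assign_tasks(range(975))
print(len(tasks))  # 490 chunks here, since 975 ids leave a short chunk of 5 in each pass.
```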
During a task, each participant is asked to specify which type of device they will use for listening, from the list: `smartphone speaker`, `speaker`, `headphones`, `other`, `will not listen`. If any option other than `speaker` or `headphones` was selected, the submission was rejected and the task re-assigned. For each segment in the task, we ask the user for an assessment of audio quality, not musical content [2]. The question is phrased as: "How do you rate the audio quality of this music segment?", and may be answered on the ordinal scale: Bad, Poor, Fair, Good and Excellent, corresponding to the numerical values 1, 2, 3, 4, 5.
Once all tasks are completed, the ratings are aggregated to produce one perceived quality rating per segment. Since participants listen in their own respective environments, we are concerned about lo-fi audio equipment and about scripted responses trying to game AMT. We therefore use the median rather than the mean rating, to discount outliers.
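With the raw AMT ratings in a table of one row per (segment, participant) pair — hypothetical column names for the raw export — the aggregation is a per-segment median. A sketch:

```python
import pandas as pd

# Hypothetical raw AMT export: one row per (segment, participant) rating.
raw = pd.DataFrame({
    "sample_filename": ["a.wav"] * 5 + ["b.wav"] * 5,
    "rating": [4, 4, 5, 4, 1,   # the 1 is an outlier, e.g. a scripted response
               2, 2, 3, 2, 2],
})

# The median per segment discounts the outlier, where the mean would not.
ratings = raw.groupby("sample_filename")["rating"].median()
print(ratings)  # a.wav -> 4.0, b.wav -> 2.0
```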
The following schemes are applied in an attempt to reduce cheating or participants not following instructions (some of them are sketched in code after this list):

- Filtering out submissions on undesired devices, which also increases the chance of rejecting bots.
- Rejecting all submissions by a participant who submits multiple times, despite warnings that this will lead to rejection.
- Rejecting tasks completed in less time than the total duration of all segments in the task.
- Rejecting tasks where all segments are given the same rating despite large variation in degradation intensity.
- Restricting the number of tasks available at any moment to 50, as a smaller number has been shown to decrease the prevalence of cheating [3].
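Some of these checks are easy to express directly. A sketch with hypothetical field names and an illustrative variation threshold (the source does not specify one):

```python
def reject_submission(task):
    """Screen one submitted task; `task` is a hypothetical dict of the participant's answers."""
    # Undesired listening devices (also filters out many bots).
    if task["device"] not in {"speaker", "headphones"}:
        return True
    # Finished faster than the audio could even have been played back.
    if task["elapsed_seconds"] < sum(task["segment_durations"]):
        return True
    # Constant ratings despite large variation in degradation intensity
    # (the 50-point spread used here is an illustrative threshold).
    intensities = task["degradation_intensities"]
    if len(set(task["ratings"])) == 1 and max(intensities) - min(intensities) > 50:
        return True
    return False
```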
The following are the average ratings assigned, by degradation intensity and degradation type. As expected, some degradations (distortion and noise) have a much larger impact on perceived quality than others. Furthermore, we note that the original tracks are on average rated below the excellent quality mark, despite being high-fidelity recordings. Part of this could be explained by the annotators' expectations.
Degradation intensity | Distortion | Limiter | Lowpass | Noise | Original |
---|---|---|---|---|---|
[0.0, 20.0) | 3.05 | 4.04 | 3.97 | 3.47 | 4.02 |
[20.0, 40.0) | 2.69 | 3.72 | 4.00 | 3.00 | - |
[40.0, 60.0) | 2.39 | 3.86 | 3.82 | 2.37 | - |
[60.0, 80.0) | 2.17 | 3.90 | 3.55 | 1.78 | - |
[80.0, 100.0) | 1.59 | 3.74 | 3.31 | 1.37 | - |
If you use this in your research, please cite our paper as:

```bibtex
@article{hilmkil2020perceiving,
  title={Perceiving Music Quality with GANs},
  author={Hilmkil, Agrin and Thom{\'e}, Carl and Arpteg, Anders},
  journal={arXiv preprint arXiv:2006.06287},
  year={2020}
}
```
[1] Hilmkil, A., Thomé, C. and Arpteg, A., 2020. Perceiving Music Quality with GANs. arXiv preprint arXiv:2006.06287.
[2] Wilson, A. and Fazenda, B.M., 2016. Perception of audio quality in productions of popular music. Journal of the Audio Engineering Society, 64(1/2), pp.23-34.
[3] Eickhoff, C. and de Vries, A.P., 2013. Increasing cheat robustness of crowdsourcing tasks. Information retrieval, 16(2), pp.121-137.