# Exploring AcousticBrainz classifier stability

AcousticBrainz provides a large amount of classifier output, and it is often used as a source for psychological claims about music, such as how listening preferences change over time, how the seasons influence our preferences, or that [pop music is getting sadder](http://www.bbc.com/culture/story/20190513-is-pop-music-really-getting-sadder-and-angrier).

Low-level features are known to be unstable (see reference in my notes), so the hypothesis is that the results from the AcousticBrainz classifiers depend heavily on factors such as source quality. High-level features have additional problems: which emotion model should be used, and how should a label be interpreted (for example, what is a 'party mood')? The 'ground truth' that these models are trained on is also subjective. If scientific claims are based on the output of such classifiers, those claims might not hold if the classifiers are unreliable.

Due to the crowdsourced nature of AcousticBrainz, multiple submissions exist for the same recording, so the classifiers have been run on several different submissions of the same recording. If these classifiers are accurate, the results should remain fairly stable under minor variations in, for example, audio quality: a sad song should not become happy when the quality is higher.

Thus, the first question to answer is: **How stable are the classifiers included in AcousticBrainz?** A second question that arises is: **Which classifiers are relatively stable, and which are relatively unstable?**


First, we import all required packages and load the AcousticBrainz dataset, which was generated by running the scripts in `acousticbrainz_data_generation`:

In [1]:
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Use seaborn style defaults and set the default figure size
sns.set(rc={'figure.figsize':(15, 15)})

from tqdm.notebook import tqdm
# Register progress_apply on pandas objects (called on the class, not on an instance)
tqdm.pandas()


# Load in the acousticbrainz dataset into the variable 'acousticbrainz'
acousticbrainz = pd.read_hdf(Path.cwd() / 'datasets' / 'acousticbrainz.h5')


The dataframe has a two-level index: the first level is the MBID and the second level is the submission id. Cell values are the label probabilities returned by the classifiers. The dataframe looks as follows:

In [None]:
acousticbrainz

Some recordings have many submissions, like `Bohemian Rhapsody`:

In [None]:
acousticbrainz.loc['b1a9c0e9-d987-4042-ae91-78d6a3267d69']

The dataframe contains the following classifications:

In [None]:
acousticbrainz.columns
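
To make this layout concrete, here is a minimal sketch of a dataframe with the same shape. It assumes, as the later cells suggest, that the columns are also two-level, with the classifier on the first level and the label on the second. The MBIDs, submission ids, label names and probabilities are made up for illustration.

```python
import pandas as pd

# Hypothetical recordings and submissions, for illustration only
index = pd.MultiIndex.from_tuples(
    [("mbid-a", 0), ("mbid-a", 1), ("mbid-b", 0)],
    names=["mbid", "submission"],
)
# Assumed column layout: (classifier, label)
columns = pd.MultiIndex.from_tuples(
    [("danceability", "danceable"), ("danceability", "not_danceable"),
     ("mood_sad", "not_sad"), ("mood_sad", "sad")],
)
toy = pd.DataFrame(
    [[0.1, 0.9, 0.2, 0.8],
     [0.3, 0.7, 0.4, 0.6],
     [0.9, 0.1, 0.9, 0.1]],
    index=index, columns=columns,
)

# Selecting on the first index level returns all submissions of one recording
print(toy.loc["mbid-a"])
```

Selecting a single MBID, as done for `Bohemian Rhapsody` above, returns one row per submission of that recording.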

# Classifier variance
There are multiple ways to look at this. We can look at how stable the *probabilities* are, i.e. how stable the classifier's certainty in a specific label is, or at how stable the *labels* are, i.e. whether the label is the same for all submissions or whether it flips.

We'll begin with the first case.

We are interested in how the probability of a given label varies within each MBID. Some MBIDs have only one submission; these give us no information about the variance and should be filtered out:

In [None]:
filt = acousticbrainz.groupby(level=0).size() > 1
acousticbrainz = acousticbrainz[filt[acousticbrainz.index.get_level_values(level=0)].values]

acousticbrainz
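
The filter above can be hard to read at a glance. Below is a minimal sketch of the same idea on a made-up frame, using `isin` on the first index level. The names `toy`, `counts` and `keep` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical data: mbid-a has two submissions, mbid-b only one
index = pd.MultiIndex.from_tuples(
    [("mbid-a", 0), ("mbid-a", 1), ("mbid-b", 0)],
    names=["mbid", "submission"],
)
toy = pd.DataFrame({"p": [0.8, 0.6, 0.1]}, index=index)

# Number of submissions per recording
counts = toy.groupby(level=0).size()
# Recordings that have more than one submission
keep = counts[counts > 1].index
# Keep only the rows that belong to those recordings
filtered = toy[toy.index.get_level_values(0).isin(keep)]
print(filtered)
```

Only the two rows of `mbid-a` remain, since `mbid-b` has a single submission.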

We have many populations (recordings) with relatively small sizes and a few with somewhat more submissions (~30). None of these samples is large enough on its own to give a good estimate of the classifier variance for a single recording.

However, we can calculate the variance for each individual population $i$ (with the $n_i$ probabilities in that population indexed $j=0,\dots,n_{i}-1$):
$$s_{i}^{2} = \frac{1}{n_{i}-1} \sum_{j=0}^{n_{i}-1}(y_{ij} - \bar{y}_{i})^2$$

And then compute the pooled variance for each classifier by taking the weighted average over all $k$ populations, indexed $i=0,\dots,k-1$:
$$s_{p}^{2} = \frac{\sum_{i=0}^{k-1}(n_{i}-1)s_{i}^{2}}{\sum_{i=0}^{k-1}(n_{i}-1)}$$

In [None]:
variances = acousticbrainz.groupby(level=0).var()
samplesizes = acousticbrainz.groupby(level=0).size()

pooledvariance = (variances.mul(samplesizes-1, axis=0).sum()) / (samplesizes.sum() - samplesizes.count())

print(pooledvariance.sort_values().to_string())
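
As a sanity check of the pooled variance formula, the same computation can be run on a tiny made-up series where the result is easy to verify by hand. The recordings and probabilities below are hypothetical, and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical probabilities of one label: three submissions for mbid-a, two for mbid-b
index = pd.MultiIndex.from_tuples(
    [("mbid-a", 0), ("mbid-a", 1), ("mbid-a", 2), ("mbid-b", 0), ("mbid-b", 1)]
)
p = pd.Series([0.2, 0.4, 0.6, 0.9, 0.7], index=index)

# Sample variance per recording (pandas uses the n - 1 denominator by default)
variances = p.groupby(level=0).var()
sizes = p.groupby(level=0).size()

# Weighted average of the variances with weights n_i - 1
pooled = (variances * (sizes - 1)).sum() / (sizes - 1).sum()
print(pooled)
```

Here the per-recording variances are 0.04 and 0.02, so the pooled variance is $(2 \cdot 0.04 + 1 \cdot 0.02) / 3 \approx 0.033$.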

# Discrete label variance
Some classifiers, like `voice_instrumental`, `danceability` and `timbre`, have relatively high variance. In the context of these calculations, this means they can be considered somewhat **unreliable** or **uncertain**, since the probability values for the same recording vary a lot. Other classifiers, like `genre_dortmund`, seem very **reliable**, with very low variance. However, are these classifiers really more reliable, or simply more biased, predicting the same label every time and thereby lowering the variance?

## 'Biasedness' of the classifiers
To quantify this, we first need to transform the probabilities to hard labels. For this we select the most probable label:

In [None]:
grouped = acousticbrainz.groupby(axis=1, level=0).idxmax(axis=1).applymap(lambda x: x[1])
grouped
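
To illustrate what this transformation produces, here is a small sketch on made-up data. It takes a slightly different route than the cell above (one `idxmax` per classifier instead of a column-wise `groupby`), so treat it as an equivalent illustration rather than the notebook's own code. The classifier name `genre` and all values are hypothetical.

```python
import pandas as pd

# Assumed column layout: (classifier, label); values are label probabilities
columns = pd.MultiIndex.from_tuples(
    [("genre", "jazz"), ("genre", "pop"), ("genre", "rock"),
     ("mood_sad", "not_sad"), ("mood_sad", "sad")]
)
toy = pd.DataFrame(
    [[0.2, 0.3, 0.5, 0.2, 0.8],
     [0.7, 0.2, 0.1, 0.6, 0.4]],
    columns=columns,
)

# For each classifier, pick the label with the highest probability in every row
hard = pd.DataFrame({
    clf: toy[clf].idxmax(axis=1)
    for clf in toy.columns.get_level_values(0).unique()
})
print(hard)
```

The result has one column per classifier and one hard label per submission.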

Now, for every column we can calculate the **entropy**. If a classifier always or nearly always produces the same result, the informational value, and thus the entropy, will be low or even zero.
$$S_n(p)=-\sum_{i}p_{i}\log_2 p_{i}$$

With $$S_{max} = \log_2 n$$

To compare the entropies of the different classifiers, normalize them by their maximum values so every entropy falls in $[0,1]$:

$$S=-\sum_{i}\frac{p_i \log_2 p_i}{\log_2 n}$$

Which is equivalent to

$$S=-\sum_{i}p_i \log_n p_i$$

In [None]:
from scipy.stats import entropy

# Relative frequency of every hard label, one row per classifier
probabilities = pd.DataFrame(
    {c: grouped[c].value_counts(normalize=True) for c in grouped.columns}
).T.fillna(value=0)

# Number of labels per classifier, used as the logarithm base for normalization
normalizers = acousticbrainz.groupby(level = 0, axis = 1).size()

norm_entropy = probabilities.apply(lambda row: entropy(row, base=normalizers[row.name]), axis=1)
norm_entropy.sort_values()
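
The behaviour of the normalized entropy at its extremes can be checked with a few hand-made distributions. The helper below is a plain-Python sketch of the formula $S=-\sum_{i}p_i \log_n p_i$, written only for illustration; the notebook itself uses `scipy.stats.entropy` with the `base` argument.

```python
import math

def normalized_entropy(probs):
    """Entropy with logarithm base n (the number of labels), so the result lies in [0, 1]."""
    n = len(probs)
    # Labels with zero probability contribute nothing to the sum
    return sum(-p * math.log(p, n) for p in probs if p > 0)

always_same = normalized_entropy([1.0, 0.0, 0.0])  # classifier always predicts one label
uniform = normalized_entropy([1 / 3, 1 / 3, 1 / 3])  # all labels equally frequent
skewed = normalized_entropy([0.9, 0.1])              # two labels, one dominant

print(always_same, uniform, skewed)
```

A classifier that always predicts the same label gets an entropy of 0, a classifier whose labels are all equally frequent gets 1, and a skewed classifier falls in between.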

## Relation between biasedness and variance
Now we are interested in the relation between the entropy (i.e. roughly how biased the classifier is) and the variance of the probabilities of that classifier.

Note that for classifiers with only two labels, the variances of those two labels are the same, because the two probabilities sum to one. For classifiers with more labels the variances differ. For comparison's sake we take the average variance for each classifier.
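
The claim about two-label classifiers can be checked numerically. The sketch below assumes that the two probabilities of a binary classifier sum to one, and uses made-up values.

```python
import statistics

# Hypothetical probabilities of the label 'sad' for three submissions
p_sad = [0.8, 0.6, 0.3]
# For a two-label classifier the other label gets the remaining probability
p_not_sad = [1 - p for p in p_sad]

var_sad = statistics.variance(p_sad)          # sample variance (n - 1 denominator)
var_not_sad = statistics.variance(p_not_sad)
print(var_sad, var_not_sad)
```

Both variances agree (up to floating point rounding), since subtracting the values from a constant does not change their spread.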

In [None]:
# Average the pooled variances of all labels that belong to the same classifier
avg_variance = pooledvariance.groupby(level=0).mean()
avg_variance.sort_values()

Now, ideally a classifier has low bias (high entropy) and low variance when running on different quality levels of the same recording (low pooled variance, high 'reliability'). We are interested in the relation between these two variables:

In [None]:
toplot = pd.DataFrame(columns=['Normalized entropy', 'Pooled variance'])
toplot['Normalized entropy'] = norm_entropy
toplot['Pooled variance'] = avg_variance

display(toplot)
# plot = toplot.plot(kind='scatter', x='Pooled variance', y='Normalized entropy')

# Scatter plot without a regression line; x, y and data are passed by keyword
p1 = sns.regplot(x='Normalized entropy', y='Pooled variance', data=toplot, fit_reg=False)
for index, val in toplot.iterrows():
    p1.text(val['Normalized entropy'] + 0.005, val['Pooled variance'] + 0.0005, index, horizontalalignment='left')


# Variance over labels instead of over probabilities
It can be argued that variance in the probability distribution over the labels is not harmful as long as the labels themselves remain stable. It is more harmful when a classifier 'flips' the label. To calculate this variance, we take the same approach as with the probabilities and pool the different populations, but now we look at the discrete labels.

For this we can again use the normalized entropy:
$$S=-\sum_{i}\frac{p_i \log_2 p_i}{\log_2 n}$$
However, now in the best case the entropy for a population is 0 (the label does not flip), and higher entropy means more flips and thus less reliability.


We first calculate the label probabilities per recording:


In [None]:
population_probabilities = grouped.stack().groupby(level=[0,2]).value_counts(normalize=True).unstack().fillna(value=0)
population_probabilities

Now we pool the entropy much in the way we pooled the variances, by taking the weighted average of the per-recording entropies $S_i$, weighted by the number of submissions $n_i$:
$$S_w=\frac{\sum_{i=0}^{k-1}n_i S_i}{\sum_{i=0}^{k-1}n_i}$$

In [None]:
pop_entropies = population_probabilities.progress_apply(lambda row: entropy(row, base=normalizers[row.name[1]]), axis=1)

In [None]:
pooled_entropy = (pop_entropies.unstack().mul(samplesizes, axis=0)).sum() / samplesizes.sum()
pooled_entropy.sort_values()
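
A worked example on made-up labels shows how the pooled entropy behaves. This is a plain-Python sketch of the weighted average, not the notebook's pandas pipeline. The recordings, labels and the helper `label_entropy` are hypothetical.

```python
import math
from collections import Counter

def label_entropy(labels, n_labels):
    """Normalized entropy of the hard labels observed for one recording."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log(c / total, n_labels) for c in counts.values())

# Hypothetical hard labels of a two-label classifier, per recording
submissions = {
    "mbid-a": ["sad", "sad", "sad"],   # label never flips
    "mbid-b": ["sad", "not_sad"],      # label flips between the two submissions
}
n_labels = 2

# Weighted average of the per-recording entropies, weighted by the number of submissions
weighted_sum = sum(len(labels) * label_entropy(labels, n_labels) for labels in submissions.values())
pooled = weighted_sum / sum(len(labels) for labels in submissions.values())
print(pooled)
```

The stable recording contributes an entropy of 0 and the flipping one an entropy of 1, giving a pooled entropy of $(3 \cdot 0 + 2 \cdot 1) / 5 = 0.4$.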

The higher the pooled entropy, the more often the discrete label flips. Thus, lower values indicate more stable classifiers.

In [None]:
toplot = pd.DataFrame(columns=['Normalized entropy', 'Label variance (pooled entropy)'])
toplot['Normalized entropy'] = norm_entropy
toplot['Label variance (pooled entropy)'] = pooled_entropy

display(toplot)
# Scatter plot without a regression line; x, y and data are passed by keyword
p2 = sns.regplot(x='Normalized entropy', y='Label variance (pooled entropy)', data=toplot, fit_reg=False)
for index, val in toplot.iterrows():
    p2.text(val['Normalized entropy'] + 0.005, val['Label variance (pooled entropy)'] + 0.0005, index, horizontalalignment='left')

# Classifier correlations
Another way to check whether the classifiers in AcousticBrainz report what they intend to report in a *consistent* way is to look at the correlations between the different classifiers.

The reasoning behind this is as follows: we know some correlations from experience and/or psychology. For example, if a piece of music is happy, then it is probably not sad (and thus we would expect a negative correlation between `mood_happy` and `mood_sad`). Likewise, we would expect `mood_party` and `danceability` to be slightly correlated, since some, but not all, parties involve dancing.

If classifiers do not show the correlations we expect, then either:
- The hypothesis of the correlation is wrong
- The classifier does not (or does not fully) model the expected feature in the music correctly

Thus, if we make a list of likely correlation hypotheses, then we can use the correlation between the classifiers as another metric for the reliability of the classifiers.
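
One possible way to test such hypotheses is sketched below: each hypothesis is stored as a pair of columns with the expected sign of their correlation, and the Pearson correlation is compared against that sign. The probabilities are made up and the structure of `hypotheses` is an assumption for illustration, not the notebook's actual implementation.

```python
import pandas as pd

# Made-up probabilities for four recordings; not taken from AcousticBrainz
toy = pd.DataFrame({
    ("mood_happy", "happy"): [0.9, 0.7, 0.2, 0.1],
    ("mood_sad", "sad"): [0.1, 0.2, 0.7, 0.9],
    ("mood_party", "party"): [0.8, 0.6, 0.3, 0.2],
})

# Each hypothesis: two columns and the expected sign of their correlation
hypotheses = [
    (("mood_happy", "happy"), ("mood_sad", "sad"), -1),
    (("mood_happy", "happy"), ("mood_party", "party"), +1),
]

results = []
for a, b, expected_sign in hypotheses:
    r = toy[a].corr(toy[b])  # Pearson correlation by default
    matches = bool(r * expected_sign > 0)
    results.append((a, b, r, matches))
    print(a, b, round(r, 2), "as expected" if matches else "unexpected")
```

Checking only the sign is a deliberately coarse criterion. Distinguishing 'moderate' from 'strong' correlations would additionally require thresholds on the size of `r`, which are not specified here.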

### Correlation hypotheses
#### Genre classification
- `genre_dortmund, blues` - `genre_tzanetakis, blu`: strong positive correlation
- `genre_dortmund, electronic` - `genre_tzanetakis, hip`: moderate positive correlation
- `genre_dortmund, folkcountry` - `genre_tzanetakis, cou`: positive correlation
- `genre_dortmund, jazz` - `genre_tzanetakis, jaz`: strong positive correlation
- `genre_dortmund, pop` - `genre_tzanetakis, pop`: strong positive correlation
- `genre_dortmund, raphiphop` - `genre_tzanetakis, hip`: positive correlation
- `genre_dortmund, rock` - `genre_tzanetakis, roc`: strong positive correlation
- `genre_dortmund, blues` - `genre_rosamerica, rhy`: positive correlation
- `genre_dortmund, electronic` - `genre_rosamerica, dan`: moderate positive correlation
- `genre_dortmund, jazz` - `genre_rosamerica, jaz`: strong positive correlation
- `genre_dortmund, pop` - `genre_rosamerica, pop`: strong positive correlation
- `genre_dortmund, raphiphop` - `genre_rosamerica, hip`: positive correlation
- `genre_dortmund, rock` - `genre_rosamerica, roc`: strong positive correlation
- `genre_rosamerica, cla` - `genre_tzanetakis, cla`: strong positive correlation
- `genre_rosamerica, hip` - `genre_tzanetakis, hip`: strong positive correlation
- `genre_rosamerica, jaz` - `genre_tzanetakis, jaz`: strong positive correlation
- `genre_rosamerica, pop` - `genre_tzanetakis, pop`: strong positive correlation
- `genre_rosamerica, rhy` - `genre_tzanetakis, blu`: positive correlation
- `genre_rosamerica, roc` - `genre_tzanetakis, roc`: strong positive correlation

#### MIREX clusters
- `moods_mirex, cluster2` - `mood_happy, happy` : positive correlation
- `moods_mirex, cluster2` - `mood_sad, not_sad` : positive correlation
- `moods_mirex, cluster3` - `mood_happy, not_happy` : positive correlation
- `moods_mirex, cluster3` - `mood_sad, sad` : positive correlation
- `moods_mirex, cluster5` - `mood_aggressive, aggressive` : positive correlation
- `moods_mirex, cluster5` - `mood_relaxed, not_relaxed`: positive correlation

#### Other correlations
- `danceability, danceable` - `mood_party, party`: positive correlation
- `danceability, danceable` - `mood_relaxed, not_relaxed`: moderate positive correlation
- `danceability, danceable` - `genre_rosamerica, dan`: positive correlation
- `danceability, danceable` - `genre_tzanetakis, dis`: positive correlation
- `mood_acoustic, acoustic` - `mood_electronic, not_electronic`: positive correlation
- `mood_aggressive, aggressive` - `mood_relaxed, not_relaxed`: positive correlation
- `mood_electronic, electronic` - `genre_dortmund, electronic`: strong positive correlation
- `mood_happy, happy` - `mood_sad, not_sad`: positive correlation
- `mood_happy, happy` - `mood_party, party`: positive correlation


Now, let's calculate these correlations from the sample data and see if they match the hypotheses:

In [None]:
acousticbrainz = pd.read_hdf(Path.cwd() / 'datasets' / 'acousticbrainz.h5')