# Evaluation

This notebook reproduces the evaluation results from the paper.

Note: As of version 2.1b6.dev234, the Essentia library has a [bug](https://github.com/MTG/essentia/issues/1054) that causes an infinite loop for some inputs.
To avoid this, you have to build our patched version of Essentia: https://github.com/cifkao/essentia/tree/patched

Copyright 2020 InterDigital R&D and Télécom Paris.  
Author: Ondřej Cífka

## Obtaining the outputs
Before running the evaluation, we need to obtain the outputs of all the systems on both of our test sets and place them in the `outputs/synth` and `outputs/real` directories (for the artificial and real inputs, respectively). The commands are different for each system:

### VQ-VAE
```sh
python -m ss_vq_vae.models.vqvae_oneshot --logdir=model run ../data/lmd/audio_test/pairs \
    outputs/synth/vqvae_list outputs/synth/vqvae
python -m ss_vq_vae.models.vqvae_oneshot --logdir=model run ../data/mixing_secrets/test/pairs \
    outputs/real/vqvae_list outputs/real/vqvae
```
The first command runs the model on all audio file pairs listed in the `../data/lmd/audio_test/pairs` file, writes the output files to the `outputs/synth/vqvae` directory and their paths to the file `outputs/synth/vqvae_list`. The second command does the same for the other test set.
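
Before moving on, it can be worth checking that a list file has one entry per input pair and that the listed files exist. The following is a minimal sketch, not part of the original code: `check_output_list` is a hypothetical helper, and it assumes one path per line, resolved relative to the list file's directory in the same way as `read_paths` below (absolute paths also work).

```python
import os


def check_output_list(pairs_path, list_path):
    """Return (number of pairs, listed output files that are missing)."""
    with open(pairs_path) as f:
        num_pairs = sum(1 for line in f if line.strip())
    with open(list_path) as f:
        outputs = [line.strip() for line in f if line.strip()]
    if len(outputs) != num_pairs:
        raise ValueError(
            f'{list_path}: expected {num_pairs} entries, found {len(outputs)}')
    # Assumption: relative entries are resolved against the list file's directory.
    list_dir = os.path.dirname(list_path)
    missing = [p for p in outputs
               if not os.path.exists(os.path.join(list_dir, p))]
    return num_pairs, missing
```

For example, `check_output_list('../data/lmd/audio_test/pairs', 'outputs/synth/vqvae_list')` should return the number of pairs and an empty list once all outputs have been written.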

### U+L (Ulyanov and Lebedev)
```sh
python -m ss_vq_vae.models.ulyanov --style-weight-log=-2.1 ../data/lmd/audio_test/pairs \
    outputs/synth/ulyanov_swopt_list outputs/synth/ulyanov
python -m ss_vq_vae.models.ulyanov --style-weight-log=-2.1 ../data/mixing_secrets/test/pairs \
    outputs/real/ulyanov_swopt_list outputs/real/ulyanov
```

### Musaicing (Driedger et al.)
Clone Chris Tralie's [LetItBee repo](https://github.com/ctralie/LetItBee) and run the `Musaicing.py` script on each pair of audio files according to the instructions. Specify the content file using the `--target` option and the style file using the `--source` option, e.g.:
```sh
python LetItBee/Musaicing.py --sr 16000 \
    --source ../data/lmd/audio_test/wav_16kHz/voices1_pitch1/00484d071147e49551de9ffb141e8b9e.style.wav \
    --target ../data/lmd/audio_test/wav_16kHz/voices1_pitch1/00484d071147e49551de9ffb141e8b9e.content.wav \
    --result outputs/synth/driedger/00000.wav
```
You might want to run these commands in parallel as they are time-consuming. Remember to write the list of output files to the `outputs/{synth,real}/driedger_list` file in the correct order, so that the evaluation code can pick them up.
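
One way to produce such a list is sketched below. It assumes that the outputs were named by zero-padded pair index (as in `00000.wav` above) and that the list should contain paths relative to the directory of the list file, which is how `read_paths` resolves them. The helper name `write_output_list` is hypothetical.

```python
import os


def write_output_list(list_path, output_dir, num_pairs, ext='.wav'):
    """Write one output path per input pair, in pair order (hypothetical helper)."""
    list_dir = os.path.dirname(list_path) or '.'
    lines = []
    for i in range(num_pairs):
        # Assumption: the output for the i-th pair is named e.g. 00000.wav
        path = os.path.join(output_dir, f'{i:05d}{ext}')
        # Store the path relative to the directory containing the list file
        lines.append(os.path.relpath(path, start=list_dir))
    with open(list_path, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    return lines
```

For example, `write_output_list('outputs/synth/driedger_list', 'outputs/synth/driedger', num_pairs=n)`, where `n` is the number of lines in the corresponding pairs file.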

In [2]:
import os
import pickle

import essentia.standard as estd
import librosa
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import pretty_midi
import re
import seaborn as sns
from tqdm.auto import tqdm

from ss_vq_vae.models import triplet_network

In [3]:
SR = 16000

MFCC_KWARGS = dict(
    n_mfcc=13,
    hop_length=500
)

In [5]:
triplet_model, triplet_backbone = triplet_network.build_model(num_features=12)
triplet_model.load_weights('timbre_metric/checkpoint.ckpt')

Model: "backbone"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None, 12)]        0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 64)          3136      
_________________________________________________________________
batch_normalization (BatchNo (None, None, 64)          256       
_________________________________________________________________
activation (Activation)      (None, None, 64)          0         
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, None, 64)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 64)          16448     
_________________________________________________________________
batch_normalization_1 (Batch (None, None, 64)          256

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fcd207cba90>

In [6]:
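def read_paths(tsv_path, column_names):
    """Read a TSV file of paths, resolving them relative to the TSV file's directory."""
    parent_dir = os.path.dirname(tsv_path)
    df = pd.read_csv(tsv_path, sep='\t', names=column_names)
    # Column-wise Series.map is equivalent to DataFrame.applymap (deprecated in pandas 2.1)
    df = df.apply(lambda col: col.map(lambda x: os.path.join(parent_dir, x)))
    return df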

In [7]:
def normalize_power(audio):
    return audio / (np.sqrt(np.mean(audio ** 2)) + np.finfo(audio.dtype).eps)

def get_pitches(audio):
    input_sr, sr = SR, 8000  # Need to resample because of EqualLoudness
    audio = estd.Resample(inputSampleRate=input_sr, outputSampleRate=sr)(audio)
    audio = estd.EqualLoudness(sampleRate=sr)(audio)
    rng = np.random.default_rng(seed=(audio > 0).sum())
    audio = rng.normal(loc=audio, scale=1e-4).astype(audio.dtype)  # To prevent Melodia from crashing
    pitches = estd.MultiPitchMelodia(sampleRate=sr)(audio)
    pitches = [[pretty_midi.utilities.hz_to_note_number(p) for p in pl if not np.isclose(0, p)]
               for pl in pitches]
    pitches = [[int(p + 0.5) for p in pl] for pl in pitches]
    return pitches
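
The last steps of `get_pitches` (dropping zero frequencies, converting Hz to MIDI note numbers and rounding to the nearest semitone) can be tried out without Essentia. The sketch below is not part of the evaluation code: `frames_to_midi` is a hypothetical name, the input frequencies are made up, and the conversion 69 + 12·log2(f / 440) is written out with NumPy in place of `pretty_midi.utilities.hz_to_note_number`.

```python
import numpy as np


def frames_to_midi(frames_hz):
    """Map per-frame lists of frequencies (Hz) to rounded MIDI note numbers."""
    result = []
    for frame in frames_hz:
        notes = []
        for f in frame:
            if np.isclose(0, f):
                continue  # Zero frequencies are filtered out, as in get_pitches
            midi = 69 + 12 * np.log2(f / 440.0)
            notes.append(int(midi + 0.5))  # Round to the nearest semitone
        result.append(notes)
    return result


print(frames_to_midi([[440.0, 0.0], [261.63], [0.0]]))  # [[69], [60], []]
```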

In [8]:
def eval_example_target(output, reference):
    def spec(audio):
        audio = normalize_power(audio)
        s = librosa.feature.melspectrogram(y=audio, sr=SR)
        return librosa.power_to_db(s)
    
    s_out, s_ref = spec(output), spec(reference)
    lsd = np.mean(np.sqrt(np.mean((s_out - s_ref) ** 2, axis=1)))

    return {'lsd': lsd}

def eval_example_style(output, reference):
    # Drop the 0th coefficient, leaving the 12 features expected by the triplet network
    mfcc_out = librosa.feature.mfcc(y=output, sr=SR, **MFCC_KWARGS)[1:]
    mfcc_ref = librosa.feature.mfcc(y=reference, sr=SR, **MFCC_KWARGS)[1:]
    
    mfcc_triplet_cos, _ = 1 - triplet_model.predict([
        (mfcc_ref.T[None, :, :], mfcc_out.T[None, :, :], mfcc_out.T[None, :, :])]).reshape(2)

    return {'mfcc_triplet_cos': mfcc_triplet_cos}

def eval_example_content(output, reference):
    pitches_output, pitches_reference = get_pitches(output), get_pitches(reference)
    assert len(pitches_output) == len(pitches_reference)
    jaccard = []
    for pl_output, pl_reference in zip(pitches_output, pitches_reference):
        matches = len(set(pl_output) & set(pl_reference))
        total = len(set(pl_output) | set(pl_reference))
        if total == 0:
            jaccard.append(0)
        else:
            jaccard.append(1 - matches / total)
    jaccard = np.mean(jaccard)
    return {'pitch_jaccard': jaccard}

def pad_or_truncate(audio, reference):
    if len(audio) < len(reference):
        return np.pad(audio, (0, max(0, len(reference) - len(audio))))
    return audio[:len(reference)]

def eval_row_synth(row):
    audio = row.apply(lambda path: librosa.load(path, sr=SR)[0])
    audio = audio.apply(pad_or_truncate, reference=audio['target'])
    
    return pd.DataFrame({
        key: {
            **eval_example_target(audio[key], audio['target']),
            **eval_example_style(audio[key], audio['target']),
            **eval_example_content(audio[key], audio['target'])
        }
        for key in row.keys() if key != 'target'
    }).stack()
        
def eval_row_real(row):
    audio = row.apply(lambda path: librosa.load(path, sr=SR)[0])
    audio_ref = audio[['content', 'style']]
    audio = audio.apply(pad_or_truncate, reference=audio_ref['content'])
    return pd.DataFrame({
        key: {
            **eval_example_style(audio[key], audio_ref['style']),
            **eval_example_content(audio[key], audio_ref['content'])
        }
        for key in row.keys()
    }).stack()
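
To make the content metric concrete, the frame-wise Jaccard distance computed in `eval_example_content` can be reproduced on toy data. The sketch below takes pitch lists directly instead of audio. `mean_jaccard_distance` is a hypothetical name and the pitch values are made up for illustration.

```python
import numpy as np


def mean_jaccard_distance(pitches_a, pitches_b):
    """Average per-frame Jaccard distance between two sequences of pitch lists."""
    assert len(pitches_a) == len(pitches_b)
    distances = []
    for frame_a, frame_b in zip(pitches_a, pitches_b):
        set_a, set_b = set(frame_a), set(frame_b)
        union = set_a | set_b
        if not union:
            distances.append(0.0)  # Both frames empty: counted as a perfect match
        else:
            distances.append(1 - len(set_a & set_b) / len(union))
    return float(np.mean(distances))


output = [[60, 64], [62], [], [67]]
reference = [[60, 64], [62, 65], [], [69]]
print(mean_jaccard_distance(output, reference))  # (0 + 0.5 + 0 + 1) / 4 = 0.375
```

A value of 0 means that the two pitch sets agree in every frame, and 1 means that they never overlap.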

In [9]:
# The list file names match those written by the commands in "Obtaining the outputs"
paths_synth_df = pd.concat([
    read_paths('../data/lmd/audio_test/triplets',
               ['content', 'style', 'target']),
    read_paths('outputs/synth/vqvae_list',
               ['vq-vae']),
    read_paths('outputs/synth/driedger_list',
               ['driedger']),
    read_paths('outputs/synth/ulyanov_swopt_list',
               ['ulyanov']),
], axis=1)

In [10]:
# The list file names match those written by the commands in "Obtaining the outputs"
paths_real_df = pd.concat([
    read_paths('../data/mixing_secrets/test/pairs',
               ['content', 'style']),
    read_paths('outputs/real/vqvae_list',
               ['vq-vae']),
    read_paths('outputs/real/driedger_list',
               ['driedger']),
    read_paths('outputs/real/ulyanov_swopt_list',
               ['ulyanov'])
], axis=1)

In [None]:
with tqdm(total=len(paths_synth_df)) as pbar:
    pbar.update(-1)
    def fn(x):
        y = eval_row_synth(x)
        pbar.update(1)
        return y
    results_synth = paths_synth_df.apply(fn, axis=1)

In [None]:
with tqdm(total=len(paths_real_df)) as pbar:
    pbar.update(-1)
    def fn(x):
        y = eval_row_real(x)
        pbar.update(1)
        return y
    results_real = paths_real_df.apply(fn, axis=1)

In [None]:
results_synth.to_pickle('results_synth.pickle')
results_real.to_pickle('results_real.pickle')

In [None]:
results_synth = pd.read_pickle('results_synth.pickle')
results_real = pd.read_pickle('results_real.pickle')

In [None]:
results_all = pd.concat([results_synth, results_real], axis=1, keys=['synth', 'real'])

In [None]:
pd.DataFrame(results_all.mean()).unstack(level=0).unstack(level=0).droplevel(axis=1, level=0).drop(('real', 'lsd'), axis=1)

In [43]:
latex = (pd.DataFrame(results_all.mean())
      .unstack(level=0).unstack(level=0)
      .droplevel(axis=1, level=0)
      .drop(('real', 'lsd'), axis=1)
      .loc[['content', 'style', 'ulyanov', 'driedger', 'vq-vae']]
      .to_latex(formatters=[x.format for x in ['{:0.2f}', '{:0.4f}', '{:0.4f}', '{:0.4f}', '{:0.4f}']]))
latex = re.sub(r' +', ' ', latex)
print(latex)

\begin{tabular}{lrrrrr}
\toprule
{} & \multicolumn{3}{l}{synth} & \multicolumn{2}{l}{real} \\
{} & lsd & mfcc\_triplet\_cos & pitch\_jaccard & mfcc\_triplet\_cos & pitch\_jaccard \\
\midrule
content & 14.62 & 0.3713 & 0.5365 & 0.4957 & 0.0000 \\
style & 20.36 & 0.2681 & 0.8729 & 0.0000 & 0.9099 \\
ulyanov & 14.50 & 0.3483 & 0.5441 & 0.4792 & 0.1315 \\
driedger & 14.51 & 0.2933 & 0.6445 & 0.2319 & 0.6297 \\
vq-vae & 12.16 & 0.2063 & 0.5500 & 0.2278 & 0.6197 \\
\bottomrule
\end{tabular}

