## Goal: Extract information about the results of each model.

Because state-of-the-art source separation systems report metrics in different ways, we define our own procedure for calculating them on the MUSDB18 dataset:

1. For each song in the MUSDB18 dataset, create overlapping 8-second chunks: the first chunk starts at 0 seconds, and each subsequent chunk starts 4 seconds after the previous one.

2. Remove chunks that do not have adequate sound in all sources. For each song and each source, find the chunk with the maximum power in that source, which we call the reference chunk, and remove every chunk whose power in that source is more than 8 dB below that of the reference chunk.

3. Separate the mix in each chunk using the desired model and record the metrics.

4. The summary statistics are the mean, median, and standard deviation. For a single source, each statistic is calculated over that source's recorded metric across all chunks. For all sources combined, we first average the metric across sources within each chunk, then calculate the summary statistics over these per-chunk means.
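
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the code used to produce the results below. The function names are hypothetical, and several details are assumptions the text does not specify: audio is mono, power is the mean squared amplitude, a trailing partial chunk is dropped, and "adequate sound" means being within 8 dB of the reference chunk in every source.

```python
import numpy as np


def chunk_starts(n_samples, sample_rate, chunk_s=8.0, hop_s=4.0):
    """Start indices of overlapping chunks at 0 s, 4 s, 8 s, ...

    Assumption: only full-length chunks are kept.
    """
    chunk = int(chunk_s * sample_rate)
    hop = int(hop_s * sample_rate)
    if n_samples < chunk:
        return []
    return list(range(0, n_samples - chunk + 1, hop))


def power_db(x, eps=1e-12):
    """Mean power of a signal in dB (eps avoids log of zero on silence)."""
    return 10.0 * np.log10(np.mean(np.square(x)) + eps)


def select_chunks(sources, sample_rate, chunk_s=8.0, hop_s=4.0, threshold_db=8.0):
    """Return start indices of chunks with adequate sound in every source.

    sources: dict mapping source name -> 1-D array, all of equal length.
    For each source, the loudest chunk is the reference chunk; a chunk is
    dropped if any source is more than threshold_db below its reference.
    """
    n_samples = len(next(iter(sources.values())))
    starts = chunk_starts(n_samples, sample_rate, chunk_s, hop_s)
    if not starts:
        return []
    chunk = int(chunk_s * sample_rate)
    keep = np.ones(len(starts), dtype=bool)
    for audio in sources.values():
        levels = np.array([power_db(audio[s:s + chunk]) for s in starts])
        keep &= levels >= levels.max() - threshold_db
    return [s for s, k in zip(starts, keep) if k]
```

Calling `select_chunks({"bass": bass, "vocals": vocals}, sample_rate=44100)` on equal-length arrays would return the sample offsets of the chunks to evaluate. Computing the reference level per source, rather than per mixture, keeps a quiet source from being judged against a loud one.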

This method of calculating metrics was conceived before the summary statistics of any model were calculated, so it carries no intentional bias toward any model, which may not be the case for evaluation methods in other papers.

In [20]:
import pandas as pd
import seaborn as sns
from IPython.display import display
sns.set_theme()
 
# Define helper functions
# Load a CSV with pd.read_csv(path)
def split_into_sources(df):
    """
    Returns a dictionary of dataframes
    """
    groups = df.groupby(by="source")
    return {name: g_df for name, g_df in groups}

def calculate_sum_stats(df):
    """
    Returns a pandas.DataFrame, where each row is a metric,
    and the columns are 'mean', 'median', 'std dev' respectively.
    """
    metrics = []
    data = []
    for col in df:
        if col in ["source", "file", "Unnamed: 0"]:
            continue
        metrics.append(col)
        data.append(
            {
                'mean': df[col].mean(),
                'median': df[col].median(),
                'std dev': df[col].std()
            }
        )
    return pd.DataFrame(
        data=data, index=metrics
    )

def aggreggate_sources(df_dict):
    """
    Takes a dictionary of dataframes and finds the mean
    of each file's metrics across sources.

    Returns a new dataframe containing the mean of each file.
    """
    df_concat = pd.concat(df_dict.values())
    by_file = df_concat.groupby(by="file")
    # numeric_only=True skips the string "source" column, which
    # would otherwise raise a TypeError in pandas >= 2.0.
    df_all = by_file.mean(numeric_only=True)
    return df_all
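
The difference between the per-source statistics and the all-source statistics from step 4 can be seen on a toy example. The sketch below uses made-up values and plain pandas calls rather than the helpers above, and it assumes, as the helpers do, that each `file` entry corresponds to one chunk.

```python
import pandas as pd

# Toy per-chunk results: two chunks ("file"), two sources, one metric.
# The values are invented for illustration only.
toy = pd.DataFrame(
    {
        "file": ["a_0", "a_0", "a_1", "a_1"],
        "source": ["bass", "vocals", "bass", "vocals"],
        "SDR": [2.0, 6.0, 4.0, 8.0],
    }
)

# Per-source statistics: computed directly over the chunks of each source.
per_source = toy.groupby("source")["SDR"].agg(["mean", "median", "std"])

# All-source statistics: average across sources within each chunk first,
# then summarize those per-chunk means.
per_chunk_mean = toy.groupby("file")["SDR"].mean()
overall = per_chunk_mean.agg(["mean", "median", "std"])

print(per_source)
print(overall)
```

Averaging within each chunk first means the all-source standard deviation describes variation between chunks, not variation between sources.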
    


In [21]:
# Gathering data
raw_data = {}
# To add new models, only a single line here must be added
raw_data["OpenUnmix"] = pd.read_csv("results/OpenUnmix/MUSDB18Segmented/aggreggate.csv")
raw_data["Demucs"] = pd.read_csv("results/Demucs/MUSDB18Segmented/aggreggate.csv")
raw_data["ConvTasNet"] = pd.read_csv("results/ConvTasNet/MUSDB18Segmented/aggreggate.csv")
raw_data["Wave-U-Net"] = pd.read_csv("results/Wave-U-Net/MUSDB18Segmented/aggreggate.csv")
raw_source_data = {}
for name, df in raw_data.items():
    raw_source_data[name] = split_into_sources(df)

def summary_stats_sources(df_dict):
    sum_dict = {}
    for name, df in df_dict.items():
        sum_dict[name] = calculate_sum_stats(df)
    return sum_dict

summary_source_data = {}
for name, dict_df in raw_source_data.items():
    summary_source_data[name] = summary_stats_sources(dict_df)

raw_aggreggate_data = {}
for name, dict_df in raw_source_data.items():
    raw_aggreggate_data[name] = aggreggate_sources(dict_df)

summary_aggreggate_data = {}
for name, df in raw_aggreggate_data.items():
    summary_aggreggate_data[name] = calculate_sum_stats(df)


## SDR, SI-SDR, and SI-SDRi of each model
We first consider all sources together, then each source separately.

In [22]:
for name, df in summary_aggreggate_data.items():
    display(name)
    display(df.loc[["SDR", "SI-SDR", "SI-SDRi"]])

'OpenUnmix'

| metric | mean | median | std dev |
|---|---|---|---|
| SDR | 4.977359 | 5.321072 | 2.854907 |
| SI-SDR | 2.987725 | 3.593923 | 3.783498 |
| SI-SDRi | 8.786142 | 9.034659 | 2.929187 |


'Demucs'

| metric | mean | median | std dev |
|---|---|---|---|
| SDR | 5.739507 | 6.110004 | 2.903982 |
| SI-SDR | 4.176797 | 4.719269 | 3.445082 |
| SI-SDRi | 9.975013 | 10.213875 | 2.721571 |


'ConvTasNet'

| metric | mean | median | std dev |
|---|---|---|---|
| SDR | 6.113055 | 6.327604 | 2.670364 |
| SI-SDR | 4.252788 | 5.023714 | 4.19196 |
| SI-SDRi | 10.051004 | 10.513436 | 3.274007 |


'Wave-U-Net'

| metric | mean | median | std dev |
|---|---|---|---|
| SDR | 3.127092 | 3.428261 | 2.402484 |
| SI-SDR | 0.152212 | 0.67699 | 3.159529 |
| SI-SDRi | 5.950428 | 6.134531 | 2.274347 |


### Observations
+ If you rank these algorithms by any combination of {mean, median} and {SDR, SI-SDR, SI-SDRi}, you get the same ranking: ConvTasNet, Demucs, OpenUnmix, Wave-U-Net. This suggests that, for the purpose of ranking algorithms, SDR, SI-SDR, and SI-SDRi are equally informative.
+ ConvTasNet has a higher SI-SDR standard deviation than Demucs (4.19 vs. 3.45), while Demucs has a higher SDR standard deviation than ConvTasNet (2.90 vs. 2.67). This could mean nothing, but it could also be a sign of outliers in the data and of differences in how SDR and SI-SDR handle scale.
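
The ranking claim can be checked mechanically. The sketch below copies the mean and median values from the summary tables above into two small dataframes and tests whether every metric and statistic yields the same ordering. The names `rankings`, `all_rankings`, and `consistent` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

models = ["OpenUnmix", "Demucs", "ConvTasNet", "Wave-U-Net"]

# Values copied from the all-source summary tables above.
means = pd.DataFrame(
    {
        "SDR": [4.977359, 5.739507, 6.113055, 3.127092],
        "SI-SDR": [2.987725, 4.176797, 4.252788, 0.152212],
        "SI-SDRi": [8.786142, 9.975013, 10.051004, 5.950428],
    },
    index=models,
)
medians = pd.DataFrame(
    {
        "SDR": [5.321072, 6.110004, 6.327604, 3.428261],
        "SI-SDR": [3.593923, 4.719269, 5.023714, 0.67699],
        "SI-SDRi": [9.034659, 10.213875, 10.513436, 6.134531],
    },
    index=models,
)


def rankings(table):
    """Model names ordered best to worst, one ordering per metric column."""
    return {
        metric: list(table[metric].sort_values(ascending=False).index)
        for metric in table.columns
    }


all_rankings = {}
for stat_name, table in [("mean", means), ("median", medians)]:
    for metric, order in rankings(table).items():
        all_rankings[f"{stat_name} {metric}"] = order

# True only if all six (statistic, metric) pairs give the same ordering.
consistent = len({tuple(order) for order in all_rankings.values()}) == 1
print(consistent, all_rankings["mean SDR"])
```

Agreement on four models is weak evidence on its own. The gap between ConvTasNet and Demucs in mean SI-SDR is under 0.1 dB, so the ordering of those two may not be stable.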

In [28]:
## How does this hold across sources?
for name, df_dict in summary_source_data.items():
    display(name)
    to_concat = []
    for source, df in df_dict.items():
        # Prefix each metric with its source, e.g. "bass: SDR".
        values = df.loc[["SDR", "SI-SDR", "SI-SDRi"]].rename(
            index=lambda metric: f"{source}: {metric}"
        )
        to_concat.append(values)
    df_concat = pd.concat(to_concat)
    display(df_concat)

'OpenUnmix'

| metric | mean | median | std dev |
|---|---|---|---|
| bass: SDR | 4.60022 | 5.039939 | 6.15294 |
| bass: SI-SDR | 2.47788 | 3.514696 | 7.645522 |
| bass: SI-SDRi | 9.477115 | 10.08216 | 5.250655 |
| drums: SDR | 5.889356 | 5.958959 | 4.1576 |
| drums: SI-SDR | 4.35193 | 4.705449 | 4.898521 |
| drums: SI-SDRi | 9.061415 | 9.441004 | 3.896217 |
| other: SDR | 3.735364 | 3.837369 | 2.83166 |
| other: SI-SDR | 0.744644 | 1.763143 | 5.416857 |
| other: SI-SDRi | 6.266366 | 6.554414 | 5.055254 |
| vocals: SDR | 5.684497 | 6.551987 | 6.366212 |


'Demucs'

| metric | mean | median | std dev |
|---|---|---|---|
| bass: SDR | 5.838194 | 6.333488 | 6.404391 |
| bass: SI-SDR | 4.314169 | 5.268259 | 7.160211 |
| bass: SI-SDRi | 11.317342 | 11.715631 | 4.770529 |
| drums: SDR | 7.070687 | 6.994881 | 4.228774 |
| drums: SI-SDR | 6.088535 | 6.103106 | 4.374326 |
| drums: SI-SDRi | 10.799997 | 11.067704 | 3.45655 |
| other: SDR | 3.973712 | 4.130461 | 3.420524 |
| other: SI-SDR | 1.656913 | 2.227863 | 4.42616 |
| other: SI-SDRi | 7.174239 | 7.324543 | 4.270892 |
| vocals: SDR | 6.075435 | 6.836668 | 5.958214 |


'ConvTasNet'

| metric | mean | median | std dev |
|---|---|---|---|
| bass: SDR | 6.245008 | 6.693001 | 5.917041 |
| bass: SI-SDR | 4.068171 | 5.672799 | 9.019312 |
| bass: SI-SDRi | 11.071345 | 12.219177 | 6.626964 |
| drums: SDR | 7.559869 | 7.563258 | 4.011636 |
| drums: SI-SDR | 6.526681 | 6.769751 | 4.929581 |
| drums: SI-SDRi | 11.238143 | 11.658954 | 3.679415 |
| other: SDR | 4.35364 | 4.488937 | 2.824071 |
| other: SI-SDR | 1.773237 | 2.6768 | 4.983473 |
| other: SI-SDRi | 7.290562 | 7.390878 | 4.541678 |
| vocals: SDR | 6.293704 | 6.867305 | 4.972601 |


'Wave-U-Net'

| metric | mean | median | std dev |
|---|---|---|---|
| bass: SDR | 2.512389 | 3.09165 | 6.071732 |
| bass: SI-SDR | -0.16273 | 0.56672 | 6.468393 |
| bass: SI-SDRi | 6.840443 | 7.557735 | 4.312968 |
| drums: SDR | 3.91215 | 3.961964 | 3.797256 |
| drums: SI-SDR | 1.311155 | 1.819594 | 4.924104 |
| drums: SI-SDRi | 6.022617 | 6.568623 | 4.179983 |
| other: SDR | 2.227695 | 2.251655 | 2.380516 |
| other: SI-SDR | -2.162498 | -1.14646 | 5.002933 |
| other: SI-SDRi | 3.354828 | 3.838635 | 4.108231 |
| vocals: SDR | 3.856134 | 4.455277 | 5.169745 |


### Observations
+ The dataframes above may look like a wall of numbers, but the rankings they imply are worth noting. For both the mean and the median vocals SDR, the ranking is ConvTasNet, Demucs, OpenUnmix, Wave-U-Net. This differs from the ranking on Papers with Code, where ConvTasNet and Demucs are swapped. (The same holds for the other source.)
+ The difference in vocals SDR between ConvTasNet and Demucs is 0.22 in the mean but only 0.03 in the median. This suggests that Demucs has some outliers at the low end of SDR.
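
One rough way to probe the outlier hypothesis is to compare each model's mean with its median: a mean well below the median points to a longer tail on the low side. The sketch below applies this to the vocals SDR values copied from the per-source tables above. It is only a heuristic, since a mean-median gap indicates skew but cannot by itself confirm individual outliers. The column name `mean - median` is introduced here for illustration.

```python
import pandas as pd

# Vocals SDR mean and median copied from the per-source tables above.
vocals_sdr = pd.DataFrame(
    {
        "mean": [5.684497, 6.075435, 6.293704, 3.856134],
        "median": [6.551987, 6.836668, 6.867305, 4.455277],
    },
    index=["OpenUnmix", "Demucs", "ConvTasNet", "Wave-U-Net"],
)

# A negative gap means the mean sits below the median (low-side skew).
vocals_sdr["mean - median"] = vocals_sdr["mean"] - vocals_sdr["median"]

# Sort so the most negatively skewed model comes first.
print(vocals_sdr.sort_values("mean - median"))
```

On these numbers the gap is negative for every model, and it is larger in magnitude for Demucs than for ConvTasNet, which is consistent with the observation above. Inspecting the per-chunk distributions directly, for example with a histogram of `raw_source_data["Demucs"]["vocals"]["SDR"]`, would be a more direct check.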

## SAR

In [30]:
for name, df in summary_aggreggate_data.items():
    display(name)
    display(df.loc[["SAR", "SI-SAR"]])

'OpenUnmix'

| metric | mean | median | std dev |
|---|---|---|---|
| SAR | 7.176612 | 7.091356 | 1.560591 |
| SI-SAR | 3.877729 | 4.443823 | 3.577261 |


'Demucs'

| metric | mean | median | std dev |
|---|---|---|---|
| SAR | 7.43887 | 7.487287 | 1.773577 |
| SI-SAR | 4.852725 | 5.3621 | 3.249555 |


'ConvTasNet'

| metric | mean | median | std dev |
|---|---|---|---|
| SAR | 7.644756 | 7.803368 | 2.040657 |
| SI-SAR | 5.018947 | 5.785666 | 3.989788 |


'Wave-U-Net'

| metric | mean | median | std dev |
|---|---|---|---|
| SAR | 5.130867 | 5.244051 | 1.396923 |
| SI-SAR | 1.258896 | 1.834626 | 3.10818 |
