## Goal: Extract information about the results of each model.

Because different state-of-the-art source separation systems report metrics in different ways, we define our own procedure for calculating them:

1. First, we segment each song in the MUSDB18 test set into 8-second chunks, where each chunk starts 4 seconds after the previous one, so consecutive chunks overlap by 4 seconds. We discard any chunk in which at least one source is silent.

2. We feed the mix of each chunk into the deep model and measure the metrics on its output.

3. We compute the mean, median, and standard deviation of each metric across all tracks. We refer to these three quantities as summary statistics.

4. To calculate the summary statistics of a specific source, we consider only the metrics collected for that source. To calculate summary statistics across all sources, we first average each chunk's metrics across its sources, then calculate the summary statistics over those per-chunk means.
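
Step 1 can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the results: the function names (`chunk_bounds`, `has_sound`, `segment_track`) are hypothetical, a source is assumed to be "silent" when every sample in the chunk is exactly zero, and a trailing chunk shorter than 8 seconds is assumed to be dropped.

```python
def chunk_bounds(num_samples, sample_rate, chunk_seconds=8.0, hop_seconds=4.0):
    """Return (start, end) sample indices of full-length, overlapping chunks."""
    chunk_len = int(chunk_seconds * sample_rate)
    hop_len = int(hop_seconds * sample_rate)
    # Assumption: a trailing chunk shorter than chunk_len is dropped.
    return [
        (start, start + chunk_len)
        for start in range(0, num_samples - chunk_len + 1, hop_len)
    ]


def has_sound(samples, threshold=0.0):
    """Assumed silence test: any sample with magnitude above the threshold."""
    return any(abs(x) > threshold for x in samples)


def segment_track(sources, sample_rate, chunk_seconds=8.0, hop_seconds=4.0):
    """
    sources: dict mapping source name -> sequence of mono samples.
    Returns one dict per kept chunk, mapping source name -> chunk samples.
    """
    num_samples = min(len(s) for s in sources.values())
    kept = []
    for start, end in chunk_bounds(num_samples, sample_rate, chunk_seconds, hop_seconds):
        chunk = {name: s[start:end] for name, s in sources.items()}
        # Keep the chunk only if every source is non-silent in it.
        if all(has_sound(c) for c in chunk.values()):
            kept.append(chunk)
    return kept
```

With real audio, `sample_rate` would be the track's sampling rate and the silence threshold would likely need to be a small positive value rather than zero.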

This method of calculating metrics was conceived before we calculated summary statistics for any model, so it is not intentionally biased toward any model, which may not be the case for evaluation methods in other papers.

In [20]:
import pandas as pd
import seaborn as sns
from IPython.display import display
sns.set_theme()
 
# Define helper functions
# Load a CSV with pd.read_csv(path)
def split_into_sources(df):
    """
    Returns a dictionary mapping each source name to a dataframe
    containing only that source's rows.
    """
    groups = df.groupby(by="source")
    return {name: g_df for name, g_df in groups}

def calculate_sum_stats(df):
    """
    Returns a pandas.DataFrame, where each row is a metric,
    and the columns are 'mean', 'median', 'std dev' respectively.
    """
    metrics = []
    data = []
    for col in df:
        if col in ["source", "file", "Unnamed: 0"]:
            continue
        metrics.append(col)
        data.append(
            {
                'mean': df[col].mean(),
                'median': df[col].median(),
                'std dev': df[col].std()
            }
        )
    return pd.DataFrame(
        data=data, index=metrics
    )

def aggreggate_sources(df_dict):
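    """
    Takes a dictionary of dataframes and finds the mean
    of each file's metrics across sources.

    Returns a new dataframe, indexed by file, containing the mean of each metric.
    """
    df_concat = pd.concat(df_dict.values())
    by_file = df_concat.groupby(by="file")
    # numeric_only=True skips the string-valued "source" column, which
    # recent pandas versions refuse to average.
    df_all = by_file.mean(numeric_only=True)
    return df_all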
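
To make the order of aggregation concrete, here is a small toy example of step 4. The data frame `toy` and its values are made up purely for illustration and are not results from any model; only the column names `file`, `source`, and `SDR` follow the CSV layout used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: two chunks, two sources each (values are invented).
toy = pd.DataFrame({
    "file": ["chunk_a", "chunk_a", "chunk_b", "chunk_b"],
    "source": ["vocals", "drums", "vocals", "drums"],
    "SDR": [4.0, 6.0, 1.0, 3.0],
})

# Per source: summary statistics over the chunks of one source only.
vocals_sdr = toy.loc[toy["source"] == "vocals", "SDR"]
per_source = vocals_sdr.agg(["mean", "median", "std"])

# All sources: first average across sources within each chunk...
per_chunk_mean = toy.groupby("file")["SDR"].mean()
# ...then summarize those per-chunk means.
overall = per_chunk_mean.agg(["mean", "median", "std"])

print(per_source)
print(overall)
```

Note that the standard deviation is taken over the per-chunk means, so it reflects variation between chunks rather than variation between sources within a chunk.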
    


In [21]:
# Gathering data
# Load each model's metrics, then split them by source
model_names = ["OpenUnmix", "Demucs", "ConvTasNet", "Wave-U-Net"]
raw_data = {
    name: pd.read_csv(f"results/{name}/MUSDB18Segmented/aggreggate.csv")
    for name in model_names
}
raw_source_data = {name: split_into_sources(df) for name, df in raw_data.items()}

def summary_stats_sources(df_dict):
    sum_dict = {}
    for name, df in df_dict.items():
        sum_dict[name] = calculate_sum_stats(df)
    return sum_dict

summary_source_data = {}
for name, dict_df in raw_source_data.items():
    summary_source_data[name] = summary_stats_sources(dict_df)

raw_aggreggate_data = {}
for name, dict_df in raw_source_data.items():
    raw_aggreggate_data[name] = aggreggate_sources(dict_df)

summary_aggreggate_data = {}
for name, df in raw_aggreggate_data.items():
    summary_aggreggate_data[name] = calculate_sum_stats(df)


## SDR, SI-SDR, and SI-SDRi of each model
We first consider all sources together, then each source individually.

In [22]:
for name, df in summary_aggreggate_data.items():
    display(name)
    display(df.loc[["SDR", "SI-SDR", "SI-SDRi"]])

'OpenUnmix'

| Metric | mean | median | std dev |
|---|---|---|---|
| SDR | 4.977359 | 5.321072 | 2.854907 |
| SI-SDR | 2.987725 | 3.593923 | 3.783498 |
| SI-SDRi | 8.786142 | 9.034659 | 2.929187 |


'Demucs'

| Metric | mean | median | std dev |
|---|---|---|---|
| SDR | 5.739507 | 6.110004 | 2.903982 |
| SI-SDR | 4.176797 | 4.719269 | 3.445082 |
| SI-SDRi | 9.975013 | 10.213875 | 2.721571 |


'ConvTasNet'

| Metric | mean | median | std dev |
|---|---|---|---|
| SDR | 6.113055 | 6.327604 | 2.670364 |
| SI-SDR | 4.252788 | 5.023714 | 4.19196 |
| SI-SDRi | 10.051004 | 10.513436 | 3.274007 |


'Wave-U-Net'

| Metric | mean | median | std dev |
|---|---|---|---|
| SDR | 3.127092 | 3.428261 | 2.402484 |
| SI-SDR | 0.152212 | 0.67699 | 3.159529 |
| SI-SDRi | 5.950428 | 6.134531 | 2.274347 |


Observations: If you rank these algorithms by any combination of {mean, median} and {SDR, SI-SDR, SI-SDRi},
you get the same ranking: 'ConvTasNet', 'Demucs', 'OpenUnmix', 'Wave-U-Net'. This suggests that, for the purpose of ranking algorithms, SDR, SI-SDR, and SI-SDRi are equally informative.
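
The ranking claim can be checked mechanically. The sketch below copies the mean and median values from the summary tables in this section into a plain dictionary (the names `summary` and `rank_models` are introduced here only for illustration) and collects the ranking for each of the six combinations.

```python
# (mean, median) pairs copied from the summary tables in this section.
summary = {
    "OpenUnmix": {"SDR": (4.977359, 5.321072), "SI-SDR": (2.987725, 3.593923), "SI-SDRi": (8.786142, 9.034659)},
    "Demucs": {"SDR": (5.739507, 6.110004), "SI-SDR": (4.176797, 4.719269), "SI-SDRi": (9.975013, 10.213875)},
    "ConvTasNet": {"SDR": (6.113055, 6.327604), "SI-SDR": (4.252788, 5.023714), "SI-SDRi": (10.051004, 10.513436)},
    "Wave-U-Net": {"SDR": (3.127092, 3.428261), "SI-SDR": (0.152212, 0.67699), "SI-SDRi": (5.950428, 6.134531)},
}


def rank_models(metric, stat_index):
    """Return model names ordered from highest to lowest value."""
    return tuple(
        sorted(summary, key=lambda m: summary[m][metric][stat_index], reverse=True)
    )


rankings = {
    (metric, stat): rank_models(metric, i)
    for metric in ("SDR", "SI-SDR", "SI-SDRi")
    for i, stat in enumerate(("mean", "median"))
}

# A single distinct ranking means all six combinations agree.
distinct = set(rankings.values())
print(distinct)
```

All six combinations produce one and the same ordering, which is what the observation relies on. Agreement on four models is limited evidence, so this supports the observation for these models rather than proving it in general.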
