# Sorting

This notebook analyzes the results of the sorting experiments. It runs independent two-sample t-tests that compare performance under the sorting perturbation against baseline performance, and reports any significant differences.
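As a rough illustration of the kind of test used here, the sketch below runs `scipy.stats.ttest_ind` on two synthetic score samples. The helper name `compare_scores`, the sample sizes, and the score distributions are assumptions made up for the example; they are not taken from the experiments.

```python
import numpy as np
import scipy.stats as stats


def compare_scores(baseline_scores, perturbed_scores, alpha=0.05):
    """Run an independent two-sample t-test and flag significance at level alpha.

    Hypothetical helper for illustration only; it is not part of the project code.
    """
    statistic, pvalue = stats.ttest_ind(baseline_scores, perturbed_scores)
    return statistic, pvalue, bool(pvalue < alpha)


# Synthetic scores (assumed values): the perturbed sample has a higher mean
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.40, scale=0.05, size=200)
perturbed = rng.normal(loc=0.45, scale=0.05, size=200)

statistic, pvalue, significant = compare_scores(baseline, perturbed)
print(f"t = {statistic:.2f}, p = {pvalue:.3g}, significant = {significant}")
```

By default `ttest_ind` assumes equal variances in the two samples; passing `equal_var=False` switches to Welch's t-test.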

Run the following cells to import the required packages and load some helper functions.

In [30]:
from pathlib import Path
import scipy.stats as stats
from retrieval_exploration.common import util

Point the variable `data_dir` to a directory that contains the results of running the [`run_summarization.py`](../scripts/run_summarization.py) script for one or more models.
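If a shell is not available, the directory can also be checked from Python. The sketch below is one possible way to do that with `pathlib`; the helper name `list_result_subdirs` is hypothetical, and the demo uses a temporary directory rather than the real results. Listing only subdirectories also shows whether stray files are present, which matters because the analysis loop iterates over every entry in `data_dir`.

```python
import tempfile
from pathlib import Path


def list_result_subdirs(data_dir):
    """Return the sorted names of the immediate subdirectories of data_dir.

    Hypothetical helper for illustration only; it is not part of the project code.
    """
    root = Path(data_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return sorted(path.name for path in root.iterdir() if path.is_dir())


# Demo on a temporary directory with two dataset folders and one stray file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("wcep", "ms2"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "notes.txt").write_text("not a dataset")
    print(list_result_subdirs(tmp))  # ['ms2', 'wcep']
```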

In [31]:
data_dir = "../output/results/"
# Make sure the directory exists and contains the expected subdirectories
!ls $data_dir

[1m[36mcochrane[m[m      [1m[36mms2[m[m           [1m[36mmultinews[m[m     [1m[36mmultixscience[m[m [1m[36mwcep[m[m


Then run the following block to perform the significance tests. Any difference with a p-value below 0.05 is printed, along with the mean baseline and perturbed scores.

In [32]:
for subdir in Path(data_dir).iterdir():
    # Some datasets have blind test splits, and so we evaluate on the validation set
    # HuggingFace assigns a different prefix to the keys in the output json, so set that here
    metric_key_prefix = "eval" if subdir.name in {"ms2", "cochrane"} else "predict"

    # The metrics we want to run the significance tests on
    metric_columns = [
        f"{metric_key_prefix}_rouge_avg_fmeasure",
        f"{metric_key_prefix}_bertscore_f1",
    ]
    # Load the results as dataframes
    baseline_df, perturbed_df = util.load_results_dicts(
        data_dir=subdir,
        metric_columns=metric_columns,
        metric_key_prefix=metric_key_prefix,
    )

    # Only care about sorting results
    perturbed_df = perturbed_df[perturbed_df[f"{metric_key_prefix}_perturbation"] == "sorting"]

    # Perform the significance test for all models, selection strategies, and metrics
    for model_name_or_path in perturbed_df.model_name_or_path.unique():
        for selection_strategy in perturbed_df[f"{metric_key_prefix}_selection_strategy"].unique():
            for metric in metric_columns:
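                baseline_scores = baseline_df[baseline_df.model_name_or_path == model_name_or_path][metric]
                # Combine both conditions into a single mask. Chained boolean indexing
                # makes pandas reindex the second mask and emit a UserWarning.
                perturbed_mask = (perturbed_df.model_name_or_path == model_name_or_path) & (
                    perturbed_df[f"{metric_key_prefix}_selection_strategy"] == selection_strategy
                )
                perturbed_scores = perturbed_df.loc[perturbed_mask, metric]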

                _, pvalue = stats.ttest_ind(baseline_scores, perturbed_scores)

                if pvalue < 0.05:
                    print(
                        f"Model {model_name_or_path} with selection strategy {selection_strategy} has a"
                        f" significant difference in {metric} with p-value {pvalue}."
                        f" Baseline: {baseline_scores.mean()}, Perturbed: {perturbed_scores.mean()}"
                    )

100%|██████████| 48/48 [00:24<00:00,  1.99it/s]
0it [00:00, ?it/s]
100%|██████████| 48/48 [01:00<00:00,  1.27s/it]
100%|██████████| 3/3 [00:03<00:00,  1.30s/it]
100%|██████████| 3/3 [00:03<00:00,  1.21s/it]
100%|██████████| 48/48 [04:19<00:00,  5.41s/it]
100%|██████████| 48/48 [01:39<00:00,  2.07s/it]
100%|██████████| 48/48 [03:19<00:00,  4.16s/it]
100%|██████████| 48/48 [01:49<00:00,  2.28s/it]
0it [00:00, ?it/s]
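Selecting the scores for one model and one selection strategy means filtering on two columns at once. A minimal sketch of doing this with a single combined boolean mask and `.loc` is shown below. The toy DataFrame, the model names, the strategy names, and the helper `select_scores` are all made up for illustration and do not reflect the real results.

```python
import pandas as pd

# Toy data with assumed column values, for illustration only
df = pd.DataFrame(
    {
        "model_name_or_path": ["model-a", "model-a", "model-b", "model-b"],
        "selection_strategy": ["random", "oracle", "random", "oracle"],
        "score": [0.41, 0.43, 0.39, 0.44],
    }
)


def select_scores(frame, model, strategy, metric):
    """Return the metric column for rows matching both model and strategy."""
    # Both conditions are evaluated on the same frame, so the mask's index
    # already matches and pandas has nothing to reindex.
    mask = (frame["model_name_or_path"] == model) & (frame["selection_strategy"] == strategy)
    return frame.loc[mask, metric]


scores = select_scores(df, "model-a", "oracle", "score")
print(scores.tolist())  # [0.43]
```

Building the mask in one expression avoids applying a mask computed on the full frame to an already filtered frame, which is the case where pandas warns that a boolean Series key will be reindexed.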
