In [None]:
from capo.analysis.utils import (
    get_results,
    aggregate_results,
    get_prompt_scores,
    generate_comparison_table,
)
from capo.analysis.visualizations import (
    plot_population_scores,
    plot_population_members,
    plot_population_scores_comparison,
    plot_length_score,
)

In [None]:
OPTIMS = ["CAPO", "EvoPromptGA", "OPRO", "PromptWizard"]
DATASETS = ["sst-5", "agnews", "copa", "gsm8k", "subj"]
MODELS = ["llama", "mistral", "qwen"]

In [None]:
%load_ext autoreload
%autoreload 2

# Population Comparison

In [None]:
for dataset in DATASETS:
    for model in MODELS:
        plot_population_scores_comparison(
            dataset,
            model,
            OPTIMS,
            agg="mean",
            plot_seeds=True,
            plot_stddev=True,
            x_col="input_tokens_cum",
        )

Candidates for the main paper:
- GSM8K (because it's the most relevant dataset)
- Subj using qwen (because it has beautiful curves)

Takeaways:
- PromptWizard's performance is highly dependent on the model used (=> strict templates!)

# Table Results

In [None]:
for model in MODELS:
    print(f"Model: {model}")
    display(generate_comparison_table(model=model, cutoff_tokens=1_000_000))

In [None]:
for model in MODELS:
    print(f"Model: {model}")
    display(generate_comparison_table(model=model, cutoff_tokens=3_000_000))

In [None]:
for model in MODELS:
    print(f"Model: {model}")
    display(generate_comparison_table(model=model))

In [None]:
for dataset in DATASETS:
    for model in MODELS:
        plot_length_score(
            dataset, model, OPTIMS, x_col="prompt_len", score_col="test_score", log_scale=False
        )

=> maybe we are cost-aware in the sense that we evaluate the entire "front" (EvoPrompt and OPRO prompts are very short, PromptWizard prompts are very long)

- PromptWizard has extremely long prompts that can only sometimes compete with the other optimizers

=> interesting for plotting: 
- subj using qwen or gsm8k using mistral => shows that we have a huge range
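
To put a number on the "huge range" remark, here is a minimal sketch that summarizes the prompt-length range and best score per optimizer. It assumes rows with `optim`, `prompt_len` and `test_score` fields (the latter two names are taken from the `plot_length_score` call above). The helper `length_score_summary` and the toy rows are hypothetical placeholders, not part of `capo` and not measured results.

```python
from collections import defaultdict


def length_score_summary(rows):
    """Group rows by optimizer; report prompt-length range and best score."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["optim"]].append(row)

    summary = {}
    for optim, members in grouped.items():
        lengths = [m["prompt_len"] for m in members]
        scores = [m["test_score"] for m in members]
        summary[optim] = {
            "min_len": min(lengths),
            "max_len": max(lengths),
            "best_score": max(scores),
        }
    return summary


# Placeholder rows for illustration only; these are NOT measured results.
toy_rows = [
    {"optim": "A", "prompt_len": 20, "test_score": 0.60},
    {"optim": "A", "prompt_len": 35, "test_score": 0.65},
    {"optim": "B", "prompt_len": 400, "test_score": 0.62},
    {"optim": "B", "prompt_len": 900, "test_score": 0.70},
]

summary = length_score_summary(toy_rows)
for optim, stats in summary.items():
    print(optim, stats)
```

With real data, the rows could come from something like `get_results(...).to_dict("records")`, provided the frame actually carries these columns.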

In [None]:
from pprint import pprint as pp

In [None]:
# Print the best prompt (by test score) per dataset, model, and optimizer
for dataset in DATASETS:
    for model in MODELS:
        for optim in OPTIMS:
            print(f"Dataset: {dataset}, Model: {model}, Optimizer: {optim}")
            df = get_results(
                dataset=dataset,
                model=model,
                optim=optim,
                # sort_by="test_score",
                # ascending=False,
            )

            if df.empty:
                continue
            p, s = df.nlargest(1, "test_score")[["prompt", "test_score"]].values[0]
            print(s)
            print("'''")
            pp(p)
            print("'''")

CAPO can be very repetitive? (SST-5 with mistral) Potentially the crossover meta-prompt has been misinterpreted (merge the two prompts) => however, it still performs better!
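
One rough way to check the repetitiveness impression is to measure how many word n-grams in a prompt occur more than once. The sketch below is a simple heuristic rather than an established metric. The function name `repeated_ngram_ratio` and the choice of `n=3` are assumptions made for illustration.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams in `text` that occur more than once."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text is shorter than n words, so nothing can repeat.
        return 0.0
    counts = Counter(ngrams)
    # Count every occurrence of an n-gram that appears at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


# Toy strings for illustration only (not real optimized prompts).
print(repeated_ngram_ratio("classify the review classify the review"))
print(repeated_ngram_ratio("classify the sentiment of this review"))
```

Applied to the `prompt` column returned by `get_results`, a high ratio would flag prompts worth inspecting by hand.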

Subj for qwen and llama with CAPO has an extreme outlier at the top
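
To check whether such a point is an outlier in a statistical sense and not just visually, one option is the common 1.5 × IQR rule applied to the test scores. This is a minimal sketch under that assumption. `high_outliers` is a hypothetical helper, and the scores below are made-up placeholders, not values from the runs.

```python
import statistics


def high_outliers(scores, k=1.5):
    """Return scores above Q3 + k * IQR (upper-tail outliers only)."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [s for s in scores if s > upper_fence]


# Placeholder scores for illustration only; NOT measured results.
toy_scores = [0.60, 0.61, 0.62, 0.63, 0.64, 0.90]
print(high_outliers(toy_scores))
```

With only a handful of prompts per run the quartile estimates are noisy, so the result is best read as a hint for manual inspection.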
