## Analysis for comparing Top P/K impact on determinism

It is unclear what impact the top_p or top_k parameter has on performance. Whichever value we settle on, someone will very likely claim that the other value is the more deterministic one. This experiment therefore tries both ends of the parameter range, 0.0 and 1.0.

In [None]:
# Run evaluate.py on the directory 'top_p_k_0_vs_1'
import os
import pandas as pd

os.system('python ../../evaluate.py -d top_p_k_0_vs_1 -npp')



## 0-shot Top P/K varied between 0.0 and 1.0

In [16]:
import pandas as pd
import os
from itertools import product
from collections import defaultdict

stability_df = pd.read_csv('stability_eval.csv')

stability_df =\
     stability_df[stability_df['task_config'] == 
                    "{'prompt_type': 'v2', 'shots': 0}"]

stability_df = stability_df.sort_values(by=['task', 'model', 'model_config'])

display(stability_df)

summary_d = defaultdict(list)
for model, task in product(set(stability_df['model']), 
                            set(stability_df['task'])):
    #print("-------------")
    #print(f"Model: {model}, Task: {task}:")
    mod_task_df = stability_df[(stability_df['model'] == model) 
                              & (stability_df['task'] == task)]
    mod_task_df = mod_task_df.sort_values(by=['model_config'])
    if len(mod_task_df.index) != 2:
        print(f"Expecting 2 entries in stability_eval.csv: {model} x {task}")
        print(f"Got {len(mod_task_df.index)}, ignoring results")
        continue
    summary_d['model'].append(model)
    summary_d['task'].append(task)
    for col in ['TARr', 'TARa', 'best_possible_accuracy', 
                'worst_possible_accuracy']:
        # Rows are sorted by model_config, so top_p_k 0.0 comes first.
        low_val = mod_task_df[col].iloc[0]   # top_p_k = 0.0
        high_val = mod_task_df[col].iloc[1]  # top_p_k = 1.0
        summary_d[f'{col}: Top P/K 0.0 - 1.0'].append(
            f"{low_val:.1%} - {high_val:.1%} = {low_val - high_val:.1%}")

# Every cell is already a formatted string, so no Styler format is needed.
display(pd.DataFrame(summary_d))


Unnamed: 0.1,Unnamed: 0,model,model_config,task,task_config,TACr,TARr,TACa,TARa,correct_count_per_run,...,num_questions,N,best_possible_count,best_possible_accuracy,worst_possible_count,worst_possible_accuracy,spread,bootstrap_counts,bootstrap_pcts,date
26,26,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",college_mathematics,"{'prompt_type': 'v2', 'shots': 0}",51,0.51,72,0.72,"[33, 36, 34, 33, 35, 34, 34, 34, 33, 32]",...,100,10,41,0.41,29,0.29,0.12,"[31, 32, 32, 33, 33, 33, 33, 34, 35, 35]","[0.31, 0.32, 0.32, 0.33, 0.33, 0.33, 0.33, 0.3...",2024-12-28_17-04-16
10,10,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",college_mathematics,"{'prompt_type': 'v2', 'shots': 0}",10,0.1,46,0.46,"[28, 29, 32, 34, 35, 32, 34, 30, 34, 32]",...,100,10,54,0.54,15,0.15,0.39,"[29, 30, 31, 32, 33, 33, 34, 34, 35, 37]","[0.29, 0.3, 0.31, 0.32, 0.33, 0.33, 0.34, 0.34...",2025-02-13_10-57-30
58,58,gpt-4o,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",college_mathematics,"{'prompt_type': 'v2', 'shots': 0}",1,0.01,50,0.5,"[60, 56, 61, 60, 58, 56, 54, 57, 56, 60]",...,100,10,79,0.79,38,0.38,0.41,"[52, 54, 55, 56, 56, 58, 58, 59, 60, 63]","[0.52, 0.54, 0.55, 0.56, 0.56, 0.58, 0.58, 0.5...",2024-12-28_15-01-21
42,42,gpt-4o,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",college_mathematics,"{'prompt_type': 'v2', 'shots': 0}",0,0.0,50,0.5,"[59, 59, 55, 59, 52, 61, 64, 56, 60, 55]",...,100,10,85,0.85,41,0.41,0.44,"[56, 56, 57, 57, 57, 58, 58, 59, 59, 60]","[0.56, 0.56, 0.57, 0.57, 0.57, 0.58, 0.58, 0.5...",2025-02-13_14-33-32
20,20,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",geometric_shapes,"{'prompt_type': 'v2', 'shots': 0}",243,0.972,243,0.972,"[31, 31, 31, 31, 31, 32, 32, 31, 31, 32]",...,250,10,33,0.132,30,0.12,0.012,"[30, 30, 31, 31, 32, 32, 32, 32, 32, 33]","[0.12, 0.12, 0.124, 0.124, 0.128, 0.128, 0.128...",2024-12-22_15-49-51
4,4,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",geometric_shapes,"{'prompt_type': 'v2', 'shots': 0}",229,0.916,229,0.916,"[37, 39, 37, 38, 38, 38, 41, 39, 38, 39]",...,250,10,42,0.168,34,0.136,0.032,"[37, 38, 38, 38, 38, 39, 39, 39, 39, 40]","[0.148, 0.152, 0.152, 0.152, 0.152, 0.156, 0.1...",2025-02-13_15-17-24
52,52,gpt-4o,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",geometric_shapes,"{'prompt_type': 'v2', 'shots': 0}",0,0.0,123,0.492,"[131, 139, 137, 145, 148, 131, 137, 138, 142, ...",...,250,10,188,0.752,82,0.328,0.424,"[135, 138, 139, 139, 141, 142, 143, 144, 144, ...","[0.54, 0.552, 0.556, 0.556, 0.564, 0.568, 0.57...",2024-12-22_15-48-27
36,36,gpt-4o,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",geometric_shapes,"{'prompt_type': 'v2', 'shots': 0}",2,0.008,114,0.456,"[133, 138, 135, 145, 140, 144, 141, 143, 143, ...",...,250,10,190,0.76,76,0.304,0.456,"[134, 139, 139, 139, 140, 142, 144, 144, 145, ...","[0.536, 0.556, 0.556, 0.556, 0.56, 0.568, 0.57...",2025-02-14_07-38-25
22,22,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",high_school_european_history,"{'prompt_type': 'v2', 'shots': 0}",150,0.909091,158,0.957576,"[110, 110, 109, 110, 110, 110, 109, 111, 108, ...",...,165,10,111,0.672727,108,0.654545,0.018182,"[108, 109, 109, 110, 110, 110, 110, 110, 110, ...","[0.6545454545454545, 0.6606060606060606, 0.660...",2024-12-29_11-46-23
6,6,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",high_school_european_history,"{'prompt_type': 'v2', 'shots': 0}",156,0.945455,156,0.945455,"[123, 122, 122, 123, 124, 124, 121, 124, 122, ...",...,165,10,124,0.751515,120,0.727273,0.024242,"[122, 122, 122, 122, 122, 123, 123, 123, 124, ...","[0.7393939393939394, 0.7393939393939394, 0.739...",2025-02-13_12-20-59


| | model | task | TARr: Top P/K 0.0 - 1.0 | TARa: Top P/K 0.0 - 1.0 | best_possible_accuracy: Top P/K 0.0 - 1.0 | worst_possible_accuracy: Top P/K 0.0 - 1.0 |
|---|---|---|---|---|---|---|
| 0 | gpt-4o | geometric_shapes | 0.0% - 0.8% = -0.8% | 49.2% - 45.6% = 3.6% | 75.2% - 76.0% = -0.8% | 32.8% - 30.4% = 2.4% |
| 1 | gpt-4o | public_relations | 39.1% - 38.2% = 0.9% | 83.6% - 83.6% = 0.0% | 80.9% - 81.8% = -0.9% | 66.4% - 66.4% = 0.0% |
| 2 | gpt-4o | high_school_european_history | 18.8% - 17.0% = 1.8% | 74.5% - 74.5% = 0.0% | 75.2% - 76.4% = -1.2% | 54.5% - 55.2% = -0.6% |
| 3 | gpt-4o | college_mathematics | 1.0% - 0.0% = 1.0% | 50.0% - 50.0% = 0.0% | 79.0% - 85.0% = -6.0% | 38.0% - 41.0% = -3.0% |
| 4 | gpt-4o | ruin_names | 21.2% - 27.6% = -6.4% | 93.6% - 93.6% = 0.0% | 86.0% - 85.2% = 0.8% | 80.4% - 80.0% = 0.4% |
| 5 | gpt-4o | professional_accounting | 3.5% - 4.3% = -0.7% | 74.5% - 71.3% = 3.2% | 83.0% - 84.0% = -1.1% | 58.5% - 58.5% = 0.0% |
| 6 | gpt-4o | navigate | 7.6% - 15.2% = -7.6% | 88.8% - 91.6% = -2.8% | 96.8% - 94.8% = 2.0% | 86.0% - 88.8% = -2.8% |
| 7 | gpt-4o | logical_deduction | 5.6% - 7.6% = -2.0% | 97.2% - 96.8% = 0.4% | 100.0% - 100.0% = 0.0% | 97.2% - 96.0% = 1.2% |
| 8 | gpt-35-turbo | geometric_shapes | 97.2% - 91.6% = 5.6% | 97.2% - 91.6% = 5.6% | 13.2% - 16.8% = -3.6% | 12.0% - 13.6% = -1.6% |
| 9 | gpt-35-turbo | public_relations | 98.2% - 86.4% = 11.8% | 97.3% - 92.7% = 4.5% | 67.3% - 66.4% = 0.9% | 67.3% - 61.8% = 5.5% |


Analysis of 0-shot:

For GPT-4o the variation is fairly small and looks like ordinary run-to-run variation; the parameter may simply be ignored.

For GPT-3.5 Turbo there is considerable variation for:
- `college_mathematics` has a TARr of 51.0% - 10.0% = 41.0%, which carries over to TARa, 72.0% - 46.0% = 26.0%. The best and worst possible accuracy also shift: 41.0% - 54.0% = -13.0% for `best_possible_accuracy` and 29.0% - 15.0% = 14.0% for `worst_possible_accuracy`.
- `professional_accounting` has a TARr of 84.0% - 49.3% = 34.8%, with less impact on TARa, 95.4% - 81.9% = 13.5%.
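
The large gaps listed above were picked out by eye. A minimal sketch of doing the same selection programmatically follows. The helper name `flag_large_gaps` and the 10-point threshold are assumptions made for illustration, not part of `evaluate.py`. The sample rows reuse the `college_mathematics` TARr values from the table output above.

```python
import ast
from collections import defaultdict


def flag_large_gaps(rows, metric="TARr", threshold=0.10):
    """Return (model, task, gap) where |metric@0.0 - metric@1.0| >= threshold.

    `rows` is a list of dicts with 'model', 'task', 'model_config' and the
    metric column, e.g. from stability_df.to_dict('records').
    The threshold is an arbitrary illustrative choice.
    """
    by_pair = defaultdict(dict)
    for row in rows:
        # model_config is stored in the CSV as a Python-literal string.
        top_p_k = ast.literal_eval(row["model_config"])["top_p_k"]
        by_pair[(row["model"], row["task"])][top_p_k] = float(row[metric])

    flagged = []
    for (model, task), vals in sorted(by_pair.items()):
        if 0.0 in vals and 1.0 in vals:
            gap = vals[0.0] - vals[1.0]
            if abs(gap) >= threshold:
                flagged.append((model, task, round(gap, 3)))
    return flagged


# Sample rows copied from the 0-shot output above (college_mathematics only).
cfg_0 = "{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}"
cfg_1 = "{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}"
rows = [
    {"model": "gpt-35-turbo", "task": "college_mathematics",
     "model_config": cfg_0, "TARr": 0.51},
    {"model": "gpt-35-turbo", "task": "college_mathematics",
     "model_config": cfg_1, "TARr": 0.10},
    {"model": "gpt-4o", "task": "college_mathematics",
     "model_config": cfg_0, "TARr": 0.01},
    {"model": "gpt-4o", "task": "college_mathematics",
     "model_config": cfg_1, "TARr": 0.0},
]

print(flag_large_gaps(rows))
```

With these four rows only the GPT-3.5 Turbo pair is flagged, which matches the reading above. A fixed threshold says nothing about whether a gap is noise, so it is only a first filter.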


## Few-shot Top P/K varied between 0.0 and 1.0

In [15]:
import pandas as pd
import os
from itertools import product
from collections import defaultdict

stability_df = pd.read_csv('stability_eval.csv')

stability_df =\
     stability_df[stability_df['task_config'] == 
                    "{'prompt_type': 'v2', 'shots': 'few'}"]

stability_df = stability_df.sort_values(by=['task', 'model', 'model_config'])

display(stability_df)

summary_d = defaultdict(list)
for model, task in product(set(stability_df['model']), 
                            set(stability_df['task'])):
    #print("-------------")
    #print(f"Model: {model}, Task: {task}:")
    mod_task_df = stability_df[(stability_df['model'] == model) 
                              & (stability_df['task'] == task)]
    mod_task_df = mod_task_df.sort_values(by=['model_config'])
    if len(mod_task_df.index) != 2:
        print(f"Expecting 2 entries in stability_eval.csv: {model} x {task}")
        print(f"Got {len(mod_task_df.index)}, ignoring results")
        continue
    summary_d['model'].append(model)
    summary_d['task'].append(task)
    for col in ['TARr', 'TARa', 'best_possible_accuracy', 
                'worst_possible_accuracy']:
        # Rows are sorted by model_config, so top_p_k 0.0 comes first.
        low_val = mod_task_df[col].iloc[0]   # top_p_k = 0.0
        high_val = mod_task_df[col].iloc[1]  # top_p_k = 1.0
        summary_d[f'{col}: Top P/K 0.0 - 1.0'].append(
            f"{low_val:.1%} - {high_val:.1%} = {low_val - high_val:.1%}")

# Every cell is already a formatted string, so no Styler format is needed.
display(pd.DataFrame(summary_d))


Unnamed: 0.1,Unnamed: 0,model,model_config,task,task_config,TACr,TARr,TACa,TARa,correct_count_per_run,...,num_questions,N,best_possible_count,best_possible_accuracy,worst_possible_count,worst_possible_accuracy,spread,bootstrap_counts,bootstrap_pcts,date
27,27,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",college_mathematics,"{'prompt_type': 'v2', 'shots': 'few'}",80,0.8,89,0.89,"[35, 36, 36, 36, 36, 36, 36, 36, 36, 36]",...,100,10,36,0.36,35,0.35,0.01,"[35, 35, 36, 36, 36, 36, 36, 36, 36, 36]","[0.35, 0.35, 0.36, 0.36, 0.36, 0.36, 0.36, 0.3...",2024-12-28_16-42-30
11,11,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",college_mathematics,"{'prompt_type': 'v2', 'shots': 'few'}",76,0.76,89,0.89,"[37, 38, 39, 37, 38, 39, 37, 38, 36, 39]",...,100,10,39,0.39,34,0.34,0.05,"[36, 37, 37, 37, 37, 38, 38, 38, 38, 39]","[0.36, 0.37, 0.37, 0.37, 0.37, 0.38, 0.38, 0.3...",2025-02-13_10-50-07
59,59,gpt-4o,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",college_mathematics,"{'prompt_type': 'v2', 'shots': 'few'}",0,0.0,53,0.53,"[66, 63, 71, 65, 73, 63, 70, 70, 69, 66]",...,100,10,87,0.87,47,0.47,0.4,"[64, 67, 67, 67, 68, 68, 69, 69, 69, 70]","[0.64, 0.67, 0.67, 0.67, 0.68, 0.68, 0.69, 0.6...",2024-12-28_15-00-31
43,43,gpt-4o,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",college_mathematics,"{'prompt_type': 'v2', 'shots': 'few'}",0,0.0,50,0.5,"[70, 73, 73, 68, 66, 67, 69, 65, 72, 69]",...,100,10,88,0.88,44,0.44,0.44,"[64, 65, 66, 68, 69, 70, 71, 71, 71, 73]","[0.64, 0.65, 0.66, 0.68, 0.69, 0.7, 0.71, 0.71...",2025-02-13_13-25-50
21,21,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",geometric_shapes,"{'prompt_type': 'v2', 'shots': 'few'}",140,0.56,205,0.82,"[148, 151, 149, 153, 150, 154, 151, 157, 153, ...",...,250,10,165,0.66,137,0.548,0.112,"[148, 150, 150, 152, 153, 154, 154, 154, 154, ...","[0.592, 0.6, 0.6, 0.608, 0.612, 0.616, 0.616, ...",2024-12-29_12-39-09
5,5,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",geometric_shapes,"{'prompt_type': 'v2', 'shots': 'few'}",63,0.252,157,0.628,"[155, 141, 150, 147, 138, 149, 150, 149, 147, ...",...,250,10,181,0.724,117,0.468,0.256,"[142, 145, 147, 147, 148, 149, 149, 149, 151, ...","[0.568, 0.58, 0.588, 0.588, 0.592, 0.596, 0.59...",2025-02-13_12-47-04
53,53,gpt-4o,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",geometric_shapes,"{'prompt_type': 'v2', 'shots': 'few'}",1,0.004,158,0.632,"[172, 173, 169, 174, 170, 174, 172, 175, 166, ...",...,250,10,208,0.832,134,0.536,0.296,"[169, 169, 170, 170, 175, 175, 176, 177, 179, ...","[0.676, 0.676, 0.68, 0.68, 0.7, 0.7, 0.704, 0....",2024-12-29_12-20-19
37,37,gpt-4o,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",geometric_shapes,"{'prompt_type': 'v2', 'shots': 'few'}",0,0.0,158,0.632,"[180, 176, 170, 173, 167, 163, 172, 168, 176, ...",...,250,10,206,0.824,134,0.536,0.288,"[170, 171, 171, 172, 172, 172, 173, 175, 176, ...","[0.68, 0.684, 0.684, 0.688, 0.688, 0.688, 0.69...",2025-02-13_19-15-50
23,23,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",high_school_european_history,"{'prompt_type': 'v2', 'shots': 'few'}",160,0.969697,163,0.987879,"[134, 134, 134, 134, 134, 134, 134, 134, 134, ...",...,165,10,134,0.812121,134,0.812121,0.0,"[134, 134, 134, 134, 134, 134, 134, 134, 134, ...","[0.8121212121212121, 0.8121212121212121, 0.812...",2024-12-28_22-36-16
7,7,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0}",high_school_european_history,"{'prompt_type': 'v2', 'shots': 'few'}",117,0.709091,156,0.945455,"[134, 135, 134, 133, 134, 133, 133, 134, 133, ...",...,165,10,138,0.836364,129,0.781818,0.054545,"[132, 133, 134, 134, 134, 135, 135, 135, 135, ...","[0.8, 0.806060606060606, 0.8121212121212121, 0...",2025-02-13_11-37-11


| | model | task | TARr: Top P/K 0.0 - 1.0 | TARa: Top P/K 0.0 - 1.0 | best_possible_accuracy: Top P/K 0.0 - 1.0 | worst_possible_accuracy: Top P/K 0.0 - 1.0 |
|---|---|---|---|---|---|---|
| 0 | gpt-4o | geometric_shapes | 0.4% - 0.0% = 0.4% | 63.2% - 63.2% = 0.0% | 83.2% - 82.4% = 0.8% | 53.6% - 53.6% = 0.0% |
| 1 | gpt-4o | public_relations | 36.4% - 37.3% = -0.9% | 90.0% - 92.7% = -2.7% | 80.0% - 80.0% = 0.0% | 73.6% - 73.6% = 0.0% |
| 2 | gpt-4o | high_school_european_history | 7.9% - 9.1% = -1.2% | 80.0% - 81.2% = -1.2% | 89.7% - 89.1% = 0.6% | 70.9% - 72.1% = -1.2% |
| 3 | gpt-4o | college_mathematics | 0.0% - 0.0% = 0.0% | 53.0% - 50.0% = 3.0% | 87.0% - 88.0% = -1.0% | 47.0% - 44.0% = 3.0% |
| 4 | gpt-4o | ruin_names | 0.4% - 0.0% = 0.4% | 94.4% - 95.2% = -0.8% | 92.4% - 93.2% = -0.8% | 87.2% - 88.4% = -1.2% |
| 5 | gpt-4o | professional_accounting | 6.0% - 4.6% = 1.4% | 70.9% - 66.7% = 4.3% | 85.5% - 89.0% = -3.5% | 58.5% - 57.8% = 0.7% |
| 6 | gpt-4o | navigate | 36.0% - 46.0% = -10.0% | 98.8% - 99.6% = -0.8% | 99.2% - 98.8% = 0.4% | 98.0% - 98.4% = -0.4% |
| 7 | gpt-4o | logical_deduction | 35.2% - 36.8% = -1.6% | 99.6% - 99.6% = 0.0% | 100.0% - 100.0% = 0.0% | 99.6% - 99.6% = 0.0% |
| 8 | gpt-35-turbo | geometric_shapes | 56.0% - 25.2% = 30.8% | 82.0% - 62.8% = 19.2% | 66.0% - 72.4% = -6.4% | 54.8% - 46.8% = 8.0% |
| 9 | gpt-35-turbo | public_relations | 90.0% - 82.7% = 7.3% | 95.5% - 87.3% = 8.2% | 76.4% - 75.5% = 0.9% | 73.6% - 65.5% = 8.2% |


Analysis of few-shot:

For GPT-4o the variation is fairly small, with the exception of `navigate`. We did not call out `navigate` in the 0-shot analysis, but comparing the two:
- Few-shot `navigate` TARr is 36.0% - 46.0% = -10.0% (table immediately above).
- 0-shot `navigate` TARr is 7.6% - 15.2% = -7.6%, which looked like noise in the previous table, but the few-shot case suggests that the Top P/K setting has a robust effect on this task.

For GPT-3.5 Turbo the variation is large enough that it is easier to list the low-variation tasks as the outliers:
- `public_relations` has a TARr of 90.0% - 82.7% = 7.3%.
- `college_mathematics` has a TARr of 80.0% - 76.0% = 4.0% in the few-shot case, but recall that in the 0-shot case it was 51.0% - 10.0% = 41.0%. This discrepancy is very suspect and may point to a bug or flaw, so the experiment should be re-run.
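
The cross-check used above, asking whether a TARr gap keeps its sign and size across the 0-shot and few-shot settings, can be sketched as follows. The function name `robust_gaps` and the 5-point noise floor are assumptions chosen for illustration. The gap values are copied from the summary tables and the analysis text.

```python
def robust_gaps(zero_shot, few_shot, noise_floor=0.05):
    """Return keys whose gap has the same sign in both settings and is at
    least `noise_floor` in magnitude in each.

    Both arguments map (model, task) -> TARr gap (top_p_k 0.0 minus 1.0).
    The noise floor is an illustrative assumption, not a measured value.
    """
    robust = []
    for key in sorted(zero_shot.keys() & few_shot.keys()):
        z, f = zero_shot[key], few_shot[key]
        same_sign = (z > 0) == (f > 0)
        if same_sign and abs(z) >= noise_floor and abs(f) >= noise_floor:
            robust.append(key)
    return robust


# TARr gaps taken from the 0-shot and few-shot results in this notebook.
zero_shot = {
    ("gpt-4o", "navigate"): -0.076,
    ("gpt-4o", "ruin_names"): -0.064,
    ("gpt-35-turbo", "college_mathematics"): 0.41,
}
few_shot = {
    ("gpt-4o", "navigate"): -0.100,
    ("gpt-4o", "ruin_names"): 0.004,
    ("gpt-35-turbo", "college_mathematics"): 0.04,
}

print(robust_gaps(zero_shot, few_shot))
```

Under these assumptions only GPT-4o `navigate` survives. `college_mathematics` for GPT-3.5 Turbo drops out because its few-shot gap is small, which is the inconsistency flagged above.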


## Overall observations

GPT-4o is mostly insensitive to the Top P/K choice, while 0.0 is the more stable choice for GPT-3.5 Turbo, with some exceptions.

The take-home message is that task/configuration volatility for GPT-3.5 Turbo appears quite high on simple inspection, and it should be measured. There is ample opportunity for chance effects to look statistically significant, so any conclusion drawn here should be re-run, which is fairly cheap and easy to do.
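
One possible way to start measuring that volatility is a permutation test on the `correct_count_per_run` lists. This is a rough sketch under stated assumptions, not something `evaluate.py` does. The function `permutation_p_value`, the choice of standard deviation as the statistic, and the number of permutations are all illustrative. The sample data are the 0-shot GPT-3.5 Turbo `college_mathematics` counts from the output above.

```python
import random
import statistics


def permutation_p_value(a, b, stat=statistics.stdev, n_perm=2000, seed=0):
    """Two-sided permutation test on |stat(a) - stat(b)|.

    Returns (observed_difference, p_value). Labels are shuffled n_perm
    times; the p-value uses the usual +1 correction so it is never zero.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same answer
    observed = abs(stat(a) - stat(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(stat(pooled[:len(a)]) - stat(pooled[len(a):]))
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# correct_count_per_run for gpt-35-turbo, college_mathematics, 0-shot.
runs_top_0 = [33, 36, 34, 33, 35, 34, 34, 34, 33, 32]  # top_p_k = 0.0
runs_top_1 = [28, 29, 32, 34, 35, 32, 34, 30, 34, 32]  # top_p_k = 1.0

observed, p_value = permutation_p_value(runs_top_0, runs_top_1)
print(f"stdev difference: {observed:.2f}, permutation p-value: {p_value:.3f}")
```

With only 10 runs per configuration the test has little power, and running it across many model/task pairs invites exactly the chance findings warned about above, so re-running the experiment remains the stronger check.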