## Impact of fine tuning on determinism

In the V2 paper we found that determinism was much higher for fine tuned models, though still not perfect. Below is a partial recreation of the V2 experiment, which we stopped because it was too expensive and was producing the same results as before.

We have also learned a great deal about how LLMs are operated, and we suspected that the fine tuning result was not due to the model being refined to the task. Rather, a fine tuned model cannot be used to run other customers' jobs, so the determinism would come from inputs that are actually equivalent across runs.

Below are partial results for the experiments we ran for V3.

## Fine tuning approach

Running `python run_fine_tuned.py` will do the following:

1. For each task in `TASKS = ['navigate', 'logical_deduction']`, if a fine tuned model has not already been created, as indicated by `model_map.csv`, ChatGPT3.5 is fine tuned with 2-fold cross validation on the first 100 rubrics, split into an even fold and an odd fold. For each fold, e.g., even, the first 40 even numbered rubrics are used to fine tune and the remaining 10 are used as a validation set for the fine tuner.

2. For evaluation we attempted to run each task 10 times, with the odd-finetuned model used to answer even questions and vice versa.

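3. We ran out of funds on both tasks, so there are fewer than 10 runs (7 for `logical_deduction` and 6 for `navigate`), but the results are clear. Fine tuning increases TARr remarkably (from 79.6% to 99.6% on `logical_deduction` and from 86.8% to 100.0% on `navigate`), but it does not achieve determinism on every task.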
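
The fold construction and cross-fold evaluation described in the steps above can be sketched as follows. This is a minimal illustration, not the code in `run_fine_tuned.py`: the function names `make_folds` and `model_for_question` are hypothetical, and it assumes that "even" and "odd" refer to the zero-based position of a rubric in the task's question list.

```python
def make_folds(rubrics, n_total=100, n_train=40):
    """Split the first n_total rubrics into even and odd folds.

    Assumption: "even"/"odd" refer to the zero-based position of each rubric.
    Within a fold, the first n_train rubrics fine tune the model and the
    remainder (10 with the defaults) form the fine tuner's validation set.
    """
    subset = rubrics[:n_total]
    folds = {}
    for name, offset in (("even", 0), ("odd", 1)):
        fold = subset[offset::2]  # every second rubric, starting at offset
        folds[name] = {"train": fold[:n_train], "validation": fold[n_train:]}
    return folds


def model_for_question(index, even_model, odd_model):
    """Pick the model that did not train on this question's fold."""
    # Even questions are answered by the odd-finetuned model, and vice versa.
    return odd_model if index % 2 == 0 else even_model


folds = make_folds(list(range(250)))
print(len(folds["even"]["train"]), len(folds["even"]["validation"]))
print(model_for_question(4, "even-ft", "odd-ft"))
```

Routing each question to the model trained on the opposite fold keeps the questions used for fine tuning out of that model's evaluation set.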

Results below:

In [6]:
import pandas as pd
pd.options.display.max_columns = 250 #(default is 20)
pd.options.display.max_rows = 250 #(default is 60)
pd.options.display.max_colwidth = 250 #(default is 50)

# stability_eval.csv created by running `python ../../evaluate.py runs/`
df = pd.read_csv('stability_eval.csv')
display(df.sort_values(by='task'))

Unnamed: 0.1,Unnamed: 0,model,model_config,task,task_config,TACr,TARr,TACa,TARa,correct_count_per_run,correct_pct_per_run,num_questions,N,best_possible_count,best_possible_accuracy,worst_possible_count,worst_possible_accuracy,spread,bootstrap_counts,bootstrap_pcts,date
0,0,gpt-35_OAI,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0, 'even_model': 'ft:gpt-3.5-turbo-0125:personal::BB4C5N1v', 'odd_model': 'ft:gpt-3.5-turbo-0125:personal::BB4ImGvG'}",logical_deduction,"{'prompt_type': 'v2', 'shots': 'few', 'fine_tuned': True}",249,99.6%,249,99.6%,"[111, 111, 110, 110, 110, 111, 110]","['44.4%', '44.4%', '44.0%', '44.0%', '44.0%', '44.4%', '44.0%']",250,7,111,44.4%,110,44.0%,0.4%,"[110, 110, 110, 110, 110, 110, 110, 110, 111, 111]","['44.0%', '44.0%', '44.0%', '44.0%', '44.0%', '44.0%', '44.0%', '44.0%', '44.4%', '44.4%']",2025-03-14_14-30-21
2,2,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",logical_deduction,"{'prompt_type': 'v2', 'shots': 'few'}",199,79.6%,248,99.2%,"[224, 224, 224, 225, 225, 223, 224, 223, 223, 223]","['89.6%', '89.6%', '89.6%', '90.0%', '90.0%', '89.2%', '89.6%', '89.2%', '89.2%', '89.2%']",250,10,225,90.0%,223,89.2%,0.8%,"[223, 223, 224, 224, 224, 224, 224, 224, 224, 225]","['89.2%', '89.2%', '89.6%', '89.6%', '89.6%', '89.6%', '89.6%', '89.6%', '89.6%', '90.0%']",2024-12-20_19-52-06
1,1,gpt-35_OAI,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0, 'even_model': 'ft:gpt-3.5-turbo-0125:personal::BAyYlUwJ', 'odd_model': 'ft:gpt-3.5-turbo-0125:personal::BAymJYaT'}",navigate,"{'prompt_type': 'v2', 'shots': 'few', 'fine_tuned': True}",250,100.0%,250,100.0%,"[163, 163, 163, 163, 163, 163]","['65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%']",250,6,163,65.2%,163,65.2%,0.0%,"[163, 163, 163, 163, 163, 163, 163, 163, 163, 163]","['65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%']",2025-03-14_09-09-18
3,3,gpt-35-turbo,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 0.0}",navigate,"{'prompt_type': 'v2', 'shots': 'few'}",217,86.8%,250,100.0%,"[240, 240, 240, 240, 240, 240, 240, 240, 240, 240]","['96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%']",250,10,240,96.0%,240,96.0%,0.0%,"[240, 240, 240, 240, 240, 240, 240, 240, 240, 240]","['96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%', '96.0%']",2024-12-15_21-25-41


The `gpt-35_OAI` models are the fine tuned ones. There is actual determinism for `navigate` (TARr of 100.0%) and near determinism for `logical_deduction` (TARr of 99.6%). There is also a considerable drop in accuracy for the fine tuned models (from about 89.6% to about 44% on `logical_deduction` and from 96.0% to 65.2% on `navigate`), so there may be issues with the implementation; this is not expected. In the V2 experiments, the fine tuned models did not show such degradation (they were not fine tuned the same way), but we did see the increase in determinism.
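
To make the agreement numbers concrete, here is a minimal sketch of how a total agreement count and rate could be computed from repeated runs. It assumes that TACr is the number of questions whose raw outputs are identical across all N runs and that TARr is that count divided by `num_questions`, which is consistent with the 249 / 250 = 99.6% rows in the table above. The function name `total_agreement` is hypothetical and is not taken from `evaluate.py`.

```python
def total_agreement(runs):
    """Return (count, rate) of questions answered identically in every run.

    runs: list of N runs, each a list with one output string per question.
    Assumption: agreement means exact string equality across all N runs.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one question")
    num_questions = len(runs[0])
    if any(len(run) != num_questions for run in runs):
        raise ValueError("every run must cover the same questions")
    # zip(*runs) groups the N outputs produced for each question.
    count = sum(1 for outputs in zip(*runs) if len(set(outputs)) == 1)
    return count, count / num_questions


runs = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "D"],
    ["A", "B", "X", "D"],  # third question differs in this run
]
print(total_agreement(runs))
```

Under this definition a single differing output in any one run removes that question from the count, so a rate of 100.0% means every run produced the same outputs for every question.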

## Are we seeing determinism because ours is the only data being processed?

Is it possible that we got increased determinism just by being the only job on the LLM?
To test this we ran logical deduction on the navigation fine tuned model and vice versa.

In [7]:
import pandas as pd
pd.options.display.max_columns = 250 #(default is 20)
pd.options.display.max_rows = 250 #(default is 60)
pd.options.display.max_colwidth = 250 #(default is 50)

# stability_eval_cross_trained.csv created by running `python ../../evaluate.py cross_trained/`
# and then `mv stability_eval.csv stability_eval_cross_trained.csv`
df = pd.read_csv('stability_eval_cross_trained.csv')
display(df.sort_values(by='task'))

Unnamed: 0.1,Unnamed: 0,model,model_config,task,task_config,TACr,TARr,TACa,TARa,correct_count_per_run,correct_pct_per_run,num_questions,N,best_possible_count,best_possible_accuracy,worst_possible_count,worst_possible_accuracy,spread,bootstrap_counts,bootstrap_pcts,date
1,1,gpt-35_OAI,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0, 'even_model': 'ft:gpt-3.5-turbo-0125:personal::BAyYlUwJ', 'odd_model': 'ft:gpt-3.5-turbo-0125:personal::BAymJYaT'}",logical_deduction,"{'prompt_type': 'v2', 'shots': 'few', 'fine_tuned': True}",249,99.6%,249,99.6%,"[118, 118, 118, 117]","['47.2%', '47.2%', '47.2%', '46.8%']",250,4,118,47.2%,117,46.8%,0.4%,"[117, 117, 117, 118, 118, 118, 118, 118, 118, 118]","['46.8%', '46.8%', '46.8%', '47.2%', '47.2%', '47.2%', '47.2%', '47.2%', '47.2%', '47.2%']",2025-03-15_20-37-46
0,0,gpt-35_OAI,"{'temperature': 0.0, 'seed': 12, 'top_p_k': 1.0, 'even_model': 'ft:gpt-3.5-turbo-0125:personal::BB4C5N1v', 'odd_model': 'ft:gpt-3.5-turbo-0125:personal::BB4ImGvG'}",navigate,"{'prompt_type': 'v2', 'shots': 'few', 'fine_tuned': True}",249,99.6%,249,99.6%,"[162, 163, 163, 163, 163]","['64.8%', '65.2%', '65.2%', '65.2%', '65.2%']",250,5,163,65.2%,162,64.8%,0.4%,"[163, 163, 163, 163, 163, 163, 163, 163, 163, 163]","['65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%', '65.2%']",2025-03-15_19-58-38


We see nearly the same level of determinism (TARr of 99.6% on both tasks). Given that a poorly trained LLM is equally deterministic, the increase in determinism looks more likely to come from being the only job on the LLM than from fine tuning itself. However, accuracy did not drop when each task was run on the model fine tuned for the other task, which indicates that there may be bugs or problems with the approach. We are out of time and resources, so we are not pursuing this further.