In [None]:
import pandas as pd
import os

In [None]:
bpath = "../e-cal/gpt-cfa/attempts/"
suffixes = ["l1", "l2", "l1/cot", "l2/cot"] #"l1/cotam", "l2/cotam"]
model_names = ["chatgpt", "gpt4"]
all_evals = {}

for suffix in suffixes:
    all_suffix_results = {
        model_name: [] for model_name in model_names
    }
    suffixpath = os.path.join(bpath, suffix)
    for file in os.listdir(suffixpath):
        filepath = os.path.join(suffixpath, file)
        # Only consider CSV files; skip sub-directories such as "cot".
        if not os.path.isdir(filepath) and file.endswith(".csv"):
            for model_name in all_suffix_results:
                if model_name in file:
                    df = pd.read_csv(filepath)
                    # Map "yes" to 1 and anything else to 0 so failures can be filtered numerically.
                    df["correct"] = (df["correct"] == "yes").astype(int)
                    all_suffix_results[model_name].append(df)
                    break
    for model_name in model_names:
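        try:
            all_suffix_results[model_name] = pd.concat(all_suffix_results[model_name], axis=0)
        except ValueError:
            # pd.concat raises ValueError on an empty list, i.e. no attempts were found for this model.
            del all_suffix_results[model_name]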
    all_evals[suffix] = all_suffix_results

# Error analysis:

## 1. On l1 exams

### a - Understanding CoT performance discrepancies

#### (i) ChatGPT

In [None]:
chatgpt = all_evals["l1"]["chatgpt"]
chatgpt_cot = all_evals["l1/cot"]["chatgpt"]
failures_chatgpt = chatgpt[chatgpt.correct < 1]
failures_chatgpt_cot = chatgpt_cot[chatgpt_cot.correct < 1]

In [None]:
failed_by_chatgpt_cot_but_not_zeroshot = failures_chatgpt_cot[~failures_chatgpt_cot.id.isin(failures_chatgpt.id)]

In [None]:
len(failed_by_chatgpt_cot_but_not_zeroshot)

In [None]:
# k = 28 # 28
# r = failed_by_chatgpt_cot_but_not_zeroshot.iloc[k]
k = 235
r = failed_by_chatgpt_cot_but_not_zeroshot[failed_by_chatgpt_cot_but_not_zeroshot.id == k].iloc[0]
print(f"ID: {r.id}\tQUESTION: {r.question}")
print(f"A. {r.choice_a}\tB. {r.choice_b}\tC. {r.choice_c}")
print(f"ANSWER: {r.answer}\tEXPLANATION: {r.explanation}")
print("#############\n")
print(f"GUESS: {r.guess}")
print(f"THINKING: {r.thinking}")

Mostly computation questions again

- 391: (computations) Apparently unable to use the evidence correctly, hence incorrect reasoning and an incorrect approximation at the end
- 406: (knowledge) The steps are correct, but the final answer selected is not (not enough reasoning)
- 407: (computations) The steps are correct, but the computation is wrong (the answer is rounded to a single significant figure, which may be the cause)
- 408: (computations) The steps are incorrect: it used too much of the evidence, though the computations themselves were correct
- 409: (computations) Uses the correct formula, but the steps that produce the values plugged into the formula are wrong, so it is probably not using the evidence very well
- 410: (knowledge) The reasoning steps are correct, but the model seemingly did not understand that option A (the one it selected) was negated
- 411: (knowledge) Either ChatGPT hallucinated some facts about agents II in GIPS in its definitions, or the question is very tricky
- 412: (knowledge) One of the reasoning steps is incorrect (ChatGPT extrapolates too far from the provided information)
- 413: (knowledge) One of the reasoning steps is incorrect (ChatGPT extrapolated too far from the provided information)
- 782: (computations) Wrong covariance formula
- 752: (computations) Steps, formula, and result are all correct, but it somehow picked the wrong option
- 781: (computations) Steps and formula are correct, but some of the calculations do not return the correct values (see Year 1)
- 769: (computations) Steps, formula, and result are all correct, but it somehow picked the wrong option
- 770: (applied knowledge) Introduced some knowledge, but did NOT read option C correctly (which says permanent =/= temporary), so ChatGPT did not necessarily use the right reasoning/knowledge to make its guess
- 779: (computations & applied knowledge) Considered the right formula and estimated the tax benefit well, but apparently did not conclude correctly
- 735: (knowledge) Analyzed each of the comments given as evidence correctly, but drew the wrong conclusion from that analysis (ended up saying both were violations)
- 474: (knowledge) Reasoning unclear; might have hallucinated some information about DCF
- 507: (computations) Formula incomplete; some computation steps needed to reach the expected final result are missing
- 455: (computations) Considered the right formula and obtained the right calculation results, but apparently did not conclude correctly
- 478: (computations) Correct steps, but incorrect formulas/computations, hence a wrong result
- 224: (computations) Correct steps, formula, and numbers, but the final result of the calculation is incorrect
- 255: (knowledge) Wrong knowledge involved (the steps seem legitimate, but the LLM does not seem to invoke CFA-specific knowledge, hence hallucinations)
- 264: (computations) Used the right formula, did all the right calculations, and got the right result, yet selected the wrong answer
- 235: (computations) Used the right formula, did all the right calculations, and got the right result, yet selected the wrong answer
- 189: (knowledge) Did not invoke any CFA-specific knowledge (standards of practice) to answer, hence incorrect
- 343: (knowledge) Incorrect reasoning (cited the right definition, but did not seem to understand it well enough to select the right answer)
- 290: (computations) Used the right formula, did all the right calculations, and got the right result, yet selected the wrong answer
- 296: (knowledge) Incorrect reasoning; hallucinated definitions/knowledge instead of invoking the correct CFA knowledge
- 69: (computations) Correct reasoning, calculations, and results, but misread the final ratio/fraction it obtained

It seems that, with CoT, ChatGPT sometimes got confused by its own explanations/computations and returned the wrong answer. Zero-shot has a different issue: the model could very well be simply guessing the right answer, so the process is an unclear black box.
=> When using CoT in a domain an LLM is not very comfortable in, we seem to expose ourselves more to extrapolations, exaggerations, and hallucinations.

**Conclusion: Most errors are linked to inaccuracies in computation-related questions. In some cases the model does not rely on the right knowledge to answer. In others, everything is correct almost until the end of the reasoning, but then a mistake is made or the "answer" function is not used correctly.**
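
The weight of each category in the notes above can be checked by tallying the parenthesised labels. This is a minimal sketch, assuming the notes are available as plain text in the `- <id>: (<category>) <comment>` form used above; `tally_categories` and `sample_notes` are hypothetical names, and the sample holds only four of the notes with their comments truncated.

```python
import re
from collections import Counter

# Matches lines such as "- 391: (computations) ..." and captures the id and the label.
NOTE_PATTERN = re.compile(r"^-\s*(\d+):\s*\(([^)]+)\)")

def tally_categories(notes: str) -> Counter:
    """Count how many annotated failures fall under each category label."""
    counts = Counter()
    for line in notes.splitlines():
        match = NOTE_PATTERN.match(line.strip())
        if match:
            counts[match.group(2).strip().lower()] += 1
    return counts

# Illustrative subset of the notes above (comments shortened).
sample_notes = """
- 391: (computations) incorrect use of the evidence
- 406: (knowledge) wrong final answer selected
- 407: (computations) wrong computation
- 770: (applied knowledge) misread option C
"""

category_counts = tally_categories(sample_notes)
print(category_counts.most_common())
```

Combined labels such as `computations & applied knowledge` are counted as their own category here; splitting them on `&` would be a reasonable alternative.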

In [None]:
failed_by_chatgpt_zeroshot_but_not_cot = failures_chatgpt[~failures_chatgpt.id.isin(failures_chatgpt_cot.id)]

In [None]:
len(failed_by_chatgpt_zeroshot_but_not_cot)

Given the strange observation we made initially when comparing performance, it is unsurprising that slightly fewer zero-shot failures were answered correctly by CoT than the reverse (contrary to the common report that CoT outperforms zero-shot).
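
Both directions of this comparison are plain set differences over the question ids, so they can be computed in one place. A minimal sketch, assuming each failure set is reduced to its ids (for instance `failures_chatgpt.id` and `failures_chatgpt_cot.id`); `discordant_ids` is a hypothetical helper and the ids below are made up for illustration.

```python
def discordant_ids(failed_zeroshot, failed_cot):
    """Split failed question ids into CoT-only, zero-shot-only, and shared failures."""
    zs, cot = set(failed_zeroshot), set(failed_cot)
    return {
        "cot_only": sorted(cot - zs),       # failed with CoT, solved zero-shot
        "zeroshot_only": sorted(zs - cot),  # failed zero-shot, solved with CoT
        "both": sorted(zs & cot),           # failed in both settings
    }

# Toy ids, NOT taken from the actual attempts.
toy = discordant_ids([1, 2, 3, 4], [3, 4, 5])
print({name: len(ids) for name, ids in toy.items()})
```

Keeping the shared failures alongside the two discordant sets makes it easier to see how much of each model's error mass is common to both prompting styles.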

#### (ii) GPT4

In [None]:
gpt4 = all_evals["l1"]["gpt4"]
gpt4_cot = all_evals["l1/cot"]["gpt4"]
failures_gpt4 = gpt4[gpt4.correct < 1]
failures_gpt4_cot = gpt4_cot[gpt4_cot.correct < 1]

In [None]:
failed_by_gpt4_zeroshot_but_not_cot = failures_gpt4[~failures_gpt4.id.isin(failures_gpt4_cot.id)]

In [None]:
len(failed_by_gpt4_zeroshot_but_not_cot)

In [None]:
failed_by_gpt4_cot_but_not_zeroshot = failures_gpt4_cot[~failures_gpt4_cot.id.isin(failures_gpt4.id)]

In [None]:
len(failed_by_gpt4_cot_but_not_zeroshot)

Here again, the performance difference between CoT and zero-shot is very small, so the two counts are close, although GPT4 with CoT does come out ahead of GPT4 zero-shot, consistent with the performance comparison.
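
With counts this close, one possible way to judge whether the gap exceeds chance is an exact McNemar (sign) test on the two discordant counts. This is not part of the original analysis: it is a minimal sketch, assuming `b` and `c` are the two `len(...)` values computed above, and the numbers used below are toy values rather than results from the attempts.

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    k = min(b, c)
    # Under the null hypothesis each discordant pair is equally likely to go
    # either way, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy counts, NOT results from the notebook.
p_balanced = mcnemar_exact_p(5, 5)
p_lopsided = mcnemar_exact_p(0, 10)
print(p_balanced, p_lopsided)
```

A large p-value would suggest that the difference between the two prompting styles could plausibly come from noise in which questions happen to flip.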

# Draft

In [None]:
# `correct` was already mapped to 0/1 at load time, so the mean is the accuracy.
all_evals["l1"]["chatgpt"].correct.mean()

In [None]:
# Same here: re-applying the "yes" mapping would turn every value into 0.
all_evals["l1"]["gpt4"].correct.mean()

In [None]:
all_evals["l1"]["chatgpt"]

In [None]:
all_evals["l1"]["gpt4"]