# Fallacy Search Analysis

In [74]:
%load_ext autoreload
%autoreload 2

from src.llms import LLM
from src.mafalda import get_mafalda_df, save_mafalda_df, evaluate_responses, get_llm_metrics
from src.plot import display_llm_table
import seaborn as sns

sns.set_theme()



## Experiment 3.1: Fallacy Search

In [75]:
filename_e31 = 'data/mafalda_e31_v2.csv'
df_mafalda_e31 = get_mafalda_df(filename_e31)
df_mafalda_e31.head()

[2024-12-01 11:37:00] Loaded existing mafalda dataframe from data/mafalda_e31_v2.csv.


Unnamed: 0,text,labels,comments,sentences_with_labels,gpt_4o_mini_response,gpt_4o_response
0,TITLE: Endless Ledge Skip Campaign for Alts PO...,"[[155, 588, slippery slope]]","['Slippery slope: P1 = poster, A = why not jus...",{'TITLE: Endless Ledge Skip Campaign for Alts ...,fallacies=[FallacyEntry(fallacy=<Fallacy.SLIPP...,fallacies=[FallacyEntry(fallacy=<Fallacy.SLIPP...
1,"Two of my best friends are really introverted,...","[[84, 145, hasty generalization]]","[""Based on two people only, you can't draw gen...",{'Two of my best friends are really introverte...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...
2,TITLE: There is a difference between a'smurf' ...,"[[118, 265, false analogy]]","['False Analogy: X: Having an alt , Y: smurfin...",{'TITLE: There is a difference between a'smurf...,fallacies=[FallacyEntry(fallacy=<Fallacy.FALSE...,fallacies=[FallacyEntry(fallacy=<Fallacy.EQUIV...
3,TITLE: Discussion Thread (Part 3): 2020 Presid...,"[[107, 261, guilt by association], [107, 338, ...",['Circular reasoning: X = The status quo in Am...,{'TITLE: Discussion Thread (Part 3): 2020 Pres...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...
4,"America is the best place to live, because it'...","[[0, 78, circular reasoning]]",['Circular reasoning: X=America is the best pl...,"{'America is the best place to live, because i...",fallacies=[FallacyEntry(fallacy=<Fallacy.CIRCU...,fallacies=[FallacyEntry(fallacy=<Fallacy.CIRCU...


### Scoring and Sanity Check

In [76]:
evaluate_responses(df_mafalda_e31, confidence_threshold=0.5, add_uncovered_spans=False)

# save_mafalda_df(df_mafalda_e31, filename_e31)

[2024-12-01 11:37:00] Evaluating responses for gpt_4o_mini ...
span: Lars is responsible for some of the most memorable drum parts in metal... was what made Metallica the greatest metal band of all time, not their chops.
text: TITLE: Slayer's Dave Lombardo Shares Opinion on Metallica's Lars Ulrich
POST: I'm not a musician so bear with me here. Why the constant debate about Lars' skills? If a successful band in any genre was based only on the very best technical players there would only be a handful of bands out there to listen to right?
POST: It is fair to say that Lars, in the grand scheme of heavy metal drummers, ranks lower when it comes to technical skills, especially compared to the likes of Lombardo or Menza, and his drumming has only gotten lazier and sloppier with time. However, the drumming on the first four albums, especially AFJA, was pretty solid. The fact is, Lars is responsible for some of the most memorable drum parts in metal (like the machine gun double bass part in 

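- A small number of predicted fallacy spans cannot be matched to the source text, for example when the model abbreviates the span with an ellipsis, as in the output above. The corresponding fallacies are ignored during scoring. With a reported `mismatch_rate` of about 0.01 per model, the effect on the metrics should be negligible.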
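
The matching problem described above can be illustrated with a minimal sketch. It assumes that a predicted span is located by an exact substring search after whitespace normalization; the helpers `normalize` and `find_span` are hypothetical and are not the functions used inside `evaluate_responses`, whose matching logic is not shown in this notebook. The sample text is a shortened paraphrase of the post above.

```python
import re
from typing import Optional, Tuple


def normalize(s: str) -> str:
    """Collapse runs of whitespace so that line breaks do not block a match."""
    return re.sub(r"\s+", " ", s).strip()


def find_span(text: str, span: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the span in the normalized text, or None if absent."""
    norm_text, norm_span = normalize(text), normalize(span)
    start = norm_text.find(norm_span)
    if start == -1:
        return None
    return start, start + len(norm_span)


# Shortened, illustrative text (hypothetical, not the full post).
text = "POST: The fact is, Lars is responsible for some of the most\nmemorable drum parts in metal."
exact = "Lars is responsible for some of the most memorable drum parts"
elided = "Lars is responsible... drum parts in metal."

print(find_span(text, exact))   # a (start, end) tuple: the span occurs verbatim
print(find_span(text, elided))  # None: the "..." never occurs in the text
```

A span that the model shortens with an ellipsis cannot be found by exact search, which is one plausible way for a mismatch like the one above to arise.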

### Model Performance

In [77]:
ignore_llms = [llm.key for llm in [LLM.GPT_4O_MINI_IDENTIFICATION, LLM.GPT_4O_MINI_CLASSIFICATION]]
df_metrics = get_llm_metrics(df_mafalda_e31, ignore_llms)
df_metrics.sort_values('f1_l2', ascending=False).round(3)

Unnamed: 0,p_l2,r_l2,f1_l2,p_l1,r_l1,f1_l1,p_l0,r_l0,f1_l0,fallacies,confidence,mismatch_rate
gpt_4o,0.532,0.653,0.442,0.61,0.732,0.519,0.675,0.817,0.595,1.28,0.825,0.01
gpt_4o_mini,0.448,0.586,0.374,0.58,0.701,0.491,0.675,0.8,0.587,1.485,0.681,0.01


- The table shows mean precision, recall, F1, fallacy count, and confidence rating per LLM. The scores are averaged over texts, so the mean F1 can be lower than both the mean precision and the mean recall.
- The original scoring metrics and code by Helwe et al. are used here, except that no uncovered spans are added (`add_uncovered_spans=False`); see [here](https://github.com/ChadiHelwe/MAFALDA/issues/2).
- gpt_4o_mini_identification and gpt_4o_mini_classification are fine-tuned models from the fallacy identification task (experiment 1.4) and the fallacy classification task (experiment 2.1). They are excluded from the table above via `ignore_llms`.
- Precision is inflated for models that identify fewer fallacies (e.g. the `fallacies` count for gpt_4o_mini_identification is much lower). 63 out of 200 texts in the gold standard have no annotated fallacies. If a model predicts no fallacies for such a text, precision for that text is 1, which raises the mean F1 score.
- Both GPT-4o and GPT-4o Mini outperform the GPT-3.5 F1 score of 0.138 reported by Helwe et al. (2024) by a large margin.
- Both models also outperform the human F1 score of 0.186 reported by Helwe et al. (2024) by a large margin.
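
To make the effect of texts without annotated fallacies concrete, here is a small sketch with toy data. It scores label sets rather than text spans, so it is a deliberate simplification and not the span-based metric of Helwe et al. The function names `prf` and `mean_scores`, the toy labels, and the conventions for empty sets (precision 1 without predictions, recall 1 without gold labels) are assumptions made for illustration.

```python
from statistics import mean
from typing import List, Set, Tuple

Pair = Tuple[Set[str], Set[str]]  # (gold labels, predicted labels) for one text


def prf(gold: Set[str], pred: Set[str]) -> Tuple[float, float, float]:
    """Per-text precision, recall, and F1 over label sets (simplified)."""
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 1.0  # assumption: no predictions -> 1
    recall = hits / len(gold) if gold else 1.0     # assumption: no gold labels -> 1
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mean_scores(pairs: List[Pair]) -> Tuple[float, float, float]:
    """Average the per-text scores, as in a per-text mean metric."""
    p, r, f = zip(*(prf(gold, pred) for gold, pred in pairs))
    return mean(p), mean(r), mean(f)


# Toy data (hypothetical): three texts with gold labels, two without.
pairs: List[Pair] = [
    ({"slippery slope"}, {"slippery slope"}),
    ({"false analogy"}, {"equivocation", "false analogy"}),
    ({"circular reasoning"}, set()),
    (set(), set()),
    (set(), {"hasty generalization"}),
]

full = mean_scores(pairs)
subset = mean_scores([(g, p) for g, p in pairs if g])  # only texts with gold labels

print("full  :", [round(x, 3) for x in full])
print("subset:", [round(x, 3) for x in subset])
```

In this toy data the mean F1 on the full set falls below both the mean precision and the mean recall, and the mean recall drops once texts without gold labels are removed. Under the stated assumptions, this is the same qualitative pattern as in the tables of this notebook.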

In [78]:
# Number of texts with at least one labelled fallacy.
sum(df_mafalda_e31['labels'].map(len) > 0)

137

In [79]:
df_metrics_subset = get_llm_metrics(df_mafalda_e31[df_mafalda_e31['labels'].map(len) > 0], ignore_llms)
df_metrics_subset.sort_values('f1_l2', ascending=False).round(3)

Unnamed: 0,p_l2,r_l2,f1_l2,p_l1,r_l1,f1_l1,p_l0,r_l0,f1_l0,fallacies,confidence,mismatch_rate
gpt_4o,0.653,0.493,0.521,0.767,0.609,0.633,0.861,0.733,0.744,1.401,0.841,0.015
gpt_4o_mini,0.531,0.396,0.422,0.723,0.564,0.592,0.861,0.708,0.733,1.577,0.688,0.007


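- Restricting the evaluation to the 137 texts with at least one annotated fallacy removes the inflation from empty texts and gives more meaningful scores. F1 rises at every level (e.g. level 2 F1 for GPT-4o goes from 0.442 to 0.521), while recall drops.
- GPT-4o outperforms GPT-4o Mini through better precision and recall at levels 1 and 2. At level 0, both models have the same precision (0.861) and GPT-4o has slightly higher recall.
- GPT-4o identifies fewer fallacies per text than GPT-4o Mini on average (1.401 vs. 1.577), but with higher confidence (0.841 vs. 0.688).
- The fine-tuned models (excluded from the tables here via `ignore_llms`) perform worse, especially gpt_4o_mini_identification. This suggests that the fine-tuning overfits on the FALLACIES dataset and does not generalize to the MAFALDA task.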

In [80]:
df_latex = df_metrics.join(df_metrics_subset, rsuffix='_subset').sort_values('f1_l2', ascending=False)
col_labels = {
    'f1_l0': 'F1 Level 0',
    'f1_l1': 'F1 Level 1',
    'f1_l2': 'F1 Level 2',
    'f1_l0_subset': 'F1 Level 0 (Subset)',
    'f1_l1_subset': 'F1 Level 1 (Subset)',
    'f1_l2_subset': 'F1 Level 2 (Subset)',
}

# Keep only the F1 columns, in the order given above, and apply the display labels.
df_latex = df_latex[list(col_labels)].rename(columns=col_labels)
df_latex = display_llm_table(df_latex, digits=3)
df_latex.index.name = 'Model'
df_latex

Model,F1 Level 0,F1 Level 1,F1 Level 2,F1 Level 0 (Subset),F1 Level 1 (Subset),F1 Level 2 (Subset)
GPT-4o,0.595,0.519,0.442,0.744,0.633,0.521
GPT-4o Mini,0.587,0.491,0.374,0.733,0.592,0.422


In [81]:
print(df_latex.to_latex(float_format="%.3f", column_format='l'+'c'*6,
                        caption='MAFALDA Performance Metrics for Fallacy Search'))

\begin{table}
\caption{MAFALDA Performance Metrics for Fallacy Search}
\begin{tabular}{lcccccc}
\toprule
 & F1 Level 0 & F1 Level 1 & F1 Level 2 & F1 Level 0 (Subset) & F1 Level 1 (Subset) & F1 Level 2 (Subset) \\
Model &  &  &  &  &  &  \\
\midrule
GPT-4o & 0.595 & 0.519 & 0.442 & 0.744 & 0.633 & 0.521 \\
GPT-4o Mini & 0.587 & 0.491 & 0.374 & 0.733 & 0.592 & 0.422 \\
\bottomrule
\end{tabular}
\end{table}

