# Fallacy Search Analysis

In [1]:
%load_ext autoreload
%autoreload 2

from src.mafalda import get_mafalda_df, save_mafalda_df, evaluate_responses, get_llm_metrics
import seaborn as sns

sns.set_theme()

## Experiment 3.1: Fallacy Search

In [2]:
filename_e31 = 'data/mafalda_e31.csv'
df_mafalda_e31 = get_mafalda_df(filename_e31)
df_mafalda_e31.head()

[2024-11-17 12:13:00] Loaded existing mafalda dataframe from data/mafalda_e31.csv.


Unnamed: 0,text,labels,comments,sentences_with_labels,gpt_4o_mini_response,gpt_4o_response,gpt_4o_mini_classification_response,gpt_4o_mini_identification_response
0,TITLE: Endless Ledge Skip Campaign for Alts PO...,"[[155, 588, slippery slope]]","['Slippery slope: P1 = poster, A = why not jus...",{'TITLE: Endless Ledge Skip Campaign for Alts ...,fallacies=[FallacyEntry(fallacy=<Fallacy.SLIPP...,fallacies=[FallacyEntry(fallacy=<Fallacy.SLIPP...,fallacies=[FallacyEntry(fallacy=<Fallacy.SLIPP...,fallacies=[]
1,"Two of my best friends are really introverted,...","[[84, 145, hasty generalization]]","[""Based on two people only, you can't draw gen...",{'Two of my best friends are really introverte...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...
2,TITLE: There is a difference between a'smurf' ...,"[[118, 265, false analogy]]","['False Analogy: X: Having an alt , Y: smurfin...",{'TITLE: There is a difference between a'smurf...,fallacies=[FallacyEntry(fallacy=<Fallacy.FALSE...,fallacies=[FallacyEntry(fallacy=<Fallacy.EQUIV...,fallacies=[FallacyEntry(fallacy=<Fallacy.FALSE...,fallacies=[FallacyEntry(fallacy=<Fallacy.FALSE...
3,TITLE: Discussion Thread (Part 3): 2020 Presid...,"[[107, 261, guilt by association], [107, 338, ...",['Circular reasoning: X = The status quo in Am...,{'TITLE: Discussion Thread (Part 3): 2020 Pres...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...,fallacies=[FallacyEntry(fallacy=<Fallacy.HASTY...,fallacies=[FallacyEntry(fallacy=<Fallacy.APPEA...,fallacies=[]
4,"America is the best place to live, because it'...","[[0, 78, circular reasoning]]",['Circular reasoning: X=America is the best pl...,"{'America is the best place to live, because i...",fallacies=[FallacyEntry(fallacy=<Fallacy.CIRCU...,fallacies=[FallacyEntry(fallacy=<Fallacy.CIRCU...,fallacies=[FallacyEntry(fallacy=<Fallacy.CIRCU...,fallacies=[FallacyEntry(fallacy=<Fallacy.CIRCU...


### Scoring and Sanity Check

In [9]:
evaluate_responses(df_mafalda_e31, confidence_threshold=0.5, add_uncovered_spans=False)

save_mafalda_df(df_mafalda_e31, filename_e31)

[2024-11-17 12:13:15] Evaluating responses for gpt_4o_mini ...
span: Check [fightthenewdrug.org] for studies on the harmful effects of pornography.
text: TITLE: Should I [23F] urge my husband [26M] to stop watching porn? POST: Yes. Porn has gained the support of popular culture, but it is unhealthy. It has been shown to make men more sexually aggressive, dissatisfied with their own sex lives, and contributes to the global sex trade. Viewing porn also causes prolonged dopamine exposure, which the brain will build resistance against - making it more difficult to feel joy from other daily activities. Porn is also highly addictive, and has diminishing returns over time, so a user will have to watch more porn, or more extreme porn, to get the same amount of dopamine. Check [fightthenewdrug.org](<URL>) for studies on the harmful effects of pornography. POST: Yeah, I always go to religious organizations to supply me with clear and objective scientific studies...

span: If we observe the incre

- Only a few predicted fallacy spans cannot be matched to the source text. The corresponding fallacies are ignored in scoring, which should have little effect on the results.
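
To illustrate why a span can fail to match, here is a minimal sketch that assumes matching is an exact substring lookup. `locate_span` is a hypothetical helper, not the function used in `src.mafalda`. The first unmatched span in the log above omits the `(<URL>)` part of the original text, so an exact lookup cannot find it.

```python
from typing import Optional, Tuple


def locate_span(text: str, span: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of span in text, or None if absent.

    Hypothetical helper: assumes exact substring matching.
    """
    start = text.find(span)
    if start == -1:
        return None
    return start, start + len(span)


# Excerpt of the text from the evaluation log above.
text = "Check [fightthenewdrug.org](<URL>) for studies on the harmful effects of pornography."
# Span as returned by the model: the "(<URL>)" part is missing.
span = "Check [fightthenewdrug.org] for studies on the harmful effects of pornography."

print(locate_span(text, span))       # None: the span is not a substring of the text
print(locate_span(text, "Check ["))  # (0, 7)
```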

In [4]:
metrics = get_llm_metrics(df_mafalda_e31)
metrics.sort_values('f1', ascending=False).round(3)

| model | precision | recall | f1 | fallacy_count | confidence | mismatch_rate |
|---|---|---|---|---|---|---|
| gpt_4o | 0.464 | 0.661 | 0.401 | 1.485 | 0.836 | 0.005 |
| gpt_4o_mini_identification | 0.974 | 0.399 | 0.399 | 0.24 | 0.469 | 0.0 |
| gpt_4o_mini | 0.433 | 0.578 | 0.365 | 1.64 | 0.685 | 0.01 |
| gpt_4o_mini_classification | 0.399 | 0.525 | 0.275 | 1.475 | 0.724 | 0.01 |


- The table shows the mean precision, recall, F1 score, fallacy count, and confidence rating per LLM.
- Scoring uses the original metrics and code by Helwe et al. (2024). Uncovered spans are not added (`add_uncovered_spans=False`), because adding them underestimates performance; see [this issue](https://github.com/ChadiHelwe/MAFALDA/issues/2).
- gpt_4o_mini_identification and gpt_4o_mini_classification are the models fine-tuned for the fallacy identification task (experiment 1.4) and the fallacy classification task (experiment 2.1), respectively.
- Precision is inflated for models that identify fewer fallacies (e.g., the fallacy_count of 0.24 for gpt_4o_mini_identification is far below that of the other models). 63 of the 200 gold-standard texts have no annotated fallacies. If a model predicts no fallacies for such a text, precision is 1, which raises the mean F1 score.
- Both GPT-4o and GPT-4o Mini outperform the GPT-3.5 F1 score of 0.138 reported by Helwe et al. (2024) by a large margin.
- They also outperform the human F1 score of 0.186 reported by Helwe et al. (2024) by a large margin.
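
The inflation on texts without annotated fallacies can be reproduced with a simplified, label-level scorer. This is a sketch, not the span-based MAFALDA metric: `score_text` is a hypothetical function, and the conventions that precision is 1 when nothing is predicted and recall is 1 when nothing is annotated are assumptions chosen to match the behaviour described in the list above.

```python
from typing import Set, Tuple


def score_text(gold: Set[str], predicted: Set[str]) -> Tuple[float, float, float]:
    """Label-level precision, recall, and F1 for a single text (simplified)."""
    hits = len(gold & predicted)
    # Assumed conventions: an empty prediction has precision 1,
    # and a text without annotations has recall 1.
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# No annotation and no prediction: perfect score on every metric.
print(score_text(set(), set()))               # (1.0, 1.0, 1.0)
# An annotated fallacy missed by an empty prediction: precision stays at 1.
print(score_text({"slippery slope"}, set()))  # (1.0, 0.0, 0.0)
# A spurious prediction on a text without annotations.
print(score_text(set(), {"false analogy"}))   # (0.0, 1.0, 0.0)
```

Under these assumed conventions, a model that rarely predicts anything collects perfect scores on most of the 63 unannotated texts, which lifts its mean precision and F1.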

In [5]:
# Number of texts with at least one labelled fallacy.
sum(df_mafalda_e31['labels'].map(len) > 0)

137

In [6]:
metrics = get_llm_metrics(df_mafalda_e31[df_mafalda_e31['labels'].map(len) > 0])
metrics.sort_values('f1', ascending=False).round(3)

| model | precision | recall | f1 | fallacy_count | confidence | mismatch_rate |
|---|---|---|---|---|---|---|
| gpt_4o | 0.612 | 0.505 | 0.52 | 1.591 | 0.854 | 0.007 |
| gpt_4o_mini | 0.508 | 0.385 | 0.409 | 1.737 | 0.695 | 0.007 |
| gpt_4o_mini_classification | 0.539 | 0.307 | 0.358 | 1.394 | 0.722 | 0.007 |
| gpt_4o_mini_identification | 0.962 | 0.122 | 0.122 | 0.35 | 0.469 | 0.0 |


- Filtering out texts with no annotated fallacies (137 texts remain) gives more reasonable results.
- GPT-4o outperforms GPT-4o Mini on both precision and recall.
- GPT-4o identifies fewer fallacies per text than GPT-4o Mini on average (1.591 vs. 1.737), but with higher confidence (0.854 vs. 0.695).
- The fine-tuned models perform worse, especially gpt_4o_mini_identification. This suggests that the fine-tuning overfits to the FALLACIES dataset and does not generalize to the MAFALDA task.
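
The contribution of the unannotated texts can be estimated from the two metric tables alone. The sketch below assumes each reported value is an unweighted mean over texts, so the mean over all 200 texts is a weighted average of the mean over the 137 annotated texts and the mean over the 63 unannotated ones. `implied_mean_on_unannotated` is a hypothetical helper, and the results are only as precise as the rounded table values.

```python
N_TOTAL = 200                       # texts in the gold standard
N_ANNOTATED = 137                   # texts with at least one labelled fallacy
N_UNANNOTATED = N_TOTAL - N_ANNOTATED

# Mean F1 per model, copied from the two tables above.
f1_all = {
    "gpt_4o": 0.401,
    "gpt_4o_mini_identification": 0.399,
    "gpt_4o_mini": 0.365,
    "gpt_4o_mini_classification": 0.275,
}
f1_annotated = {
    "gpt_4o": 0.52,
    "gpt_4o_mini": 0.409,
    "gpt_4o_mini_classification": 0.358,
    "gpt_4o_mini_identification": 0.122,
}


def implied_mean_on_unannotated(mean_all: float, mean_annotated: float) -> float:
    """Solve N_TOTAL * mean_all = N_ANNOTATED * mean_annotated + N_UNANNOTATED * x for x."""
    return (N_TOTAL * mean_all - N_ANNOTATED * mean_annotated) / N_UNANNOTATED


implied = {
    model: round(implied_mean_on_unannotated(f1_all[model], f1_annotated[model]), 3)
    for model in f1_all
}
for model, value in implied.items():
    print(f"{model}: {value}")
```

Under this assumption, the implied mean F1 on unannotated texts is about 1.0 for gpt_4o_mini_identification but only about 0.14 for gpt_4o, which is consistent with the inflation noted for models that predict few fallacies.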

In [7]:
mafalda = df_mafalda_e31.to_dict(orient='records')