# Simulated Argumentative Discourse for Improved Fact Verification: A Multi-Agent Reasoning Approach

## Milestone

### Config Environment

In [None]:
!pip install transformers accelerate sentencepiece --quiet

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m83.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m61.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m37.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m13.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 MB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

### Config Model

We chose to use Google's FLAN-T5-Large, a 780M parameter instruction-tuned language model based on T5.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def generate_response(prompt, max_tokens=128, temperature=0.9):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(
        input_ids,
        max_length=max_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=0.95,
        repetition_penalty=1.2
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

### Debate Pipeline

In [None]:
def create_agent_prompt(agent, claim, evidence, history):
    """Creates a structured prompt for debate agents to either support or refute a claim"""

    role = "support" if agent == "A" else "refute"
    opponent = "B" if agent == "A" else "A"

    instructions = f"""Your role is to {role} the claim using the evidence and respond thoughtfully to your opponent's arguments.
Reference specific points made by Agent {opponent} where appropriate, and try to persuade a neutral judge.
Limit your response to 3 sentences."""

    prompt = f"""Claim: "{claim}"
Evidence: "{evidence}"

{instructions}

Debate so far:
"""
    for i, (speaker, utterance) in enumerate(history):
        prompt += f"Agent {speaker}: {utterance}\n"

    prompt += f"Agent {agent}:"
    return prompt


def simulate_debate(claim, evidence, rounds=3, starting_agent="A"):
    """Simulates a structured debate between two agents over multiple rounds"""

    history = []
    current_agent = starting_agent

    for i in range(rounds * 2):   # in each round, each agent will speak
        prompt = create_agent_prompt(current_agent, claim, evidence, history)
        response = generate_response(prompt)
        history.append((current_agent, response.strip()))
        current_agent = "B" if current_agent == "A" else "A"

    return history

def create_final_judgment_prompt(claim, evidence, history):
    """Creates a prompt for final judgment on whether evidence supports the claim"""

    prompt = f"""Claim: "{claim}"
Evidence: "{evidence}"

Here is a debate about whether the evidence supports the claim:\n"""

    for speaker, utterance in history:
        prompt += f"Agent {speaker}: {utterance}\n"

    prompt += """\nFinal Task: Based on the above, classify the claim as one of the following:
- SUPPORTS
- REFUTES
- NOT ENOUGH INFO

Answer:"""
    return prompt


This example demonstrates a structured debate system where two AI agents argue opposing sides of a claim, followed by a final judgment.

In [None]:
claim = "The Eiffel Tower is located in Berlin."
evidence = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France"

debate_history = simulate_debate(claim, evidence)
for turn in debate_history:
    print(f"Agent {turn[0]}: {turn[1]}\n")

judgment_prompt = create_final_judgment_prompt(claim, evidence, debate_history)
final_decision = generate_response(judgment_prompt)

print("\nFinal Decision:", final_decision.strip())


Agent A: The Eiffel Tower has no historical or geographical context and is located at the Champ de Mars in Paris, France.

Agent B: The Eiffel Tower was built on the Champ de Mars, in Paris, France.

Agent A: The Eiffel Tower is a tower that was built in Paris, France. It had no historical or geographical context.

Agent B: The Eiffel Tower is located on the Champ de Mars, in Paris, France. It has no historical or geographical context.

Agent A: I think it is a reasonable point.

Agent B: The Eiffel Tower has no geographical context either.


🧑‍⚖️ Final Decision: REFUTES


We took a few random examples from the dataset to test on the model.



In [None]:
fever_examples = [
    {
        "claim": "Gabrielle Union was in a movie.",
        "label": "SUPPORTS",
        "evidence": "She co-starred in film The Birth of a Nation (2016), and next appeared in Almost Christmas (2016) and Sleepless (2017)."
    },
    {
        "claim": "Eleveneleven was founded by a chef.",
        "label": "REFUTES",
        "evidence": "eleveneleven is a record label founded in 2010 by Mike Hamlin, Ellen DeGeneres and her production company, in association with Warner Bros."
    },
    {
        "claim": "Cosmos: A Spacetime Odyssey secured studio support thanks to Seth MacFarlane.",
        "label": "NOT ENOUGH INFO",
        "evidence": "MacFarlane served as executive producer of Cosmos: A Spacetime Odyssey, hosted by Neil deGrasse Tyson."
    },
    {
        "claim": "Chokher Bali was nominated for a Golden Leopard in 2003.",
        "label": "SUPPORTS",
        "evidence": "Chokher Bali was nominated for the Golden Leopard (Best Film) at the Locarno International Film Festival in 2003."
    },
    {
        "claim": "Usain Bolt won at the Olympics.",
        "label": "SUPPORTS",
        "evidence": "Bolt won Olympic 100m and 200m titles at three consecutive Olympics: 2008, 2012, and 2016."
    },
    {
        "claim": "Tennis is not a sport.",
        "label": "REFUTES",
        "evidence": "Tennis is played by millions of recreational players and is also a popular worldwide spectator sport."
    }
]


In [None]:
def simulate_on_fever(data):
   """Processes FEVER samples through debate simulation and prediction"""
   results = []

   for sample in tqdm(data):
       claim = sample["claim"]
       evidence = sample["evidence"]
       label = sample["label"]

       try:
           debate = simulate_debate(claim, evidence, rounds=3)
           final_prompt = create_final_judgment_prompt(claim, evidence, debate)
           prediction = generate_response(final_prompt).strip().upper()

           if "SUPPORTS" in prediction:
               predicted_label = "SUPPORTS"
           elif "REFUTES" in prediction:
               predicted_label = "REFUTES"
           else:
               predicted_label = "NOT ENOUGH INFO"

           results.append({
               "claim": claim,
               "label": label,
               "predicted": predicted_label,
               "match": predicted_label == label
           })

       except Exception as e:
           print("Error with claim:", claim)
           print(e)

   return results


100%|██████████| 6/6 [03:13<00:00, 32.18s/it]


In [None]:
results = simulate_on_fever(fever_examples)
correct = sum(1 for r in results if r["match"])
accuracy = correct / len(results)

print(f"Accuracy: {accuracy:.2%}\n")

# הצגת שגיאות
print("Mistakes:")
for r in results:
    if not r["match"]:
        print(f"- Claim: {r['claim']}\n  Label: {r['label']} | Predicted: {r['predicted']}\n")



🎯 Accuracy: 66.67%

❌ Mistakes:
- Claim: Cosmos: A Spacetime Odyssey secured studio support thanks to Seth MacFarlane.
  Gold: NOT ENOUGH INFO | Predicted: SUPPORTS

- Claim: Tennis is not a sport.
  Gold: REFUTES | Predicted: SUPPORTS



### Full dataset

Loading the dataset from Hugging Face


In [None]:
import pandas as pd

splits = {
    'train': 'train.jsonl',
    'validation': 'valid.jsonl',
    'test': 'test.jsonl'
}

df = pd.read_json("hf://datasets/copenlu/fever_gold_evidence/" + splits["train"], lines=True)


In [None]:
def extract_evidence_text(evidence_entry):
    """Extracts and concatenates text content from nested evidence structure"""
    texts = []
    for group in evidence_entry:
        for sentence in group:
            if isinstance(sentence, list) and len(sentence) > 2:
                texts.append(sentence[2])  # texts shape [page, sent_id, text]
    return " ".join(texts)

df["evidence_text"] = df["evidence"].apply(extract_evidence_text)


Running example on 50 verifiable claims (meaning with a "SUPPORTS or "REFUTES" label).

In [None]:
df_filtered = df[df["label"].isin(["SUPPORTS", "REFUTES"])].copy()
df_sample = df_filtered.sample(n=50, random_state=42)

First running with the debate method

In [None]:
results = []

for i, row in df_sample.iterrows():
    claim = row["claim"]
    evidence = row["evidence_text"]
    gold_label = row["label"]

    try:
        debate = simulate_debate(claim, evidence, rounds=3)
        final_prompt = create_final_judgment_prompt(claim, evidence, debate)
        prediction = generate_response(final_prompt).strip().upper()

        if "SUPPORTS" in prediction:
            predicted_label = "SUPPORTS"
        elif "REFUTES" in prediction:
            predicted_label = "REFUTES"
        else:
            predicted_label = "NOT ENOUGH INFO"

        results.append({
            "claim": claim,
            "gold": gold_label,
            "predicted": predicted_label,
            "match": predicted_label == gold_label
        })

    except Exception as e:
        print(f"Error with claim: {claim}\n{e}")


In [None]:
results_df = pd.DataFrame(results)
accuracy = results_df["match"].mean()

print(f"Accuracy on sample: {accuracy:.2%}")
print("\nMistakes:")
print(results_df[~results_df["match"]][["claim", "gold", "predicted"]].head())



🎯 Accuracy on sample: 80.00%

❌ Mistakes:
                                                claim      gold  \
1   Michael Keaton has avoided film and TV product...   REFUTES   
4                        Red is a action comedy poem.   REFUTES   
7                   Charles II of England was a king.  SUPPORTS   
9                   T-Pain has yet to found anything.   REFUTES   
11       Amazon Web Services lacked deployment tools.   REFUTES   

          predicted  
1   NOT ENOUGH INFO  
4          SUPPORTS  
7           REFUTES  
9   NOT ENOUGH INFO  
11         SUPPORTS  


Then running with direct response

In [None]:
def generate_direct_prediction(claim, evidence):
    prompt = f"""Claim: "{claim}"
Evidence: "{evidence}"

Task: Based on the evidence, classify the claim as one of the following:
- SUPPORTS
- REFUTES
- NOT ENOUGH INFO

Answer:"""

    prediction = generate_response(prompt).strip().upper()

    if "SUPPORTS" in prediction:
        return "SUPPORTS"
    elif "REFUTES" in prediction:
        return "REFUTES"
    else:
        return "NOT ENOUGH INFO"


In [None]:
results_direct = []

for i, row in df_sample.iterrows():
    claim = row["claim"]
    evidence = row["evidence_text"]
    label = row["label"]

    try:
        predicted_label = generate_direct_prediction(claim, evidence)

        results_direct.append({
            "claim": claim,
            "label": label,
            "predicted": predicted_label,
            "match": predicted_label == gold_label
        })

    except Exception as e:
        print(f"Error in direct prediction: {claim}\n{e}")


In [None]:
# comparing results
df_debate = pd.DataFrame(results).rename(columns={"match": "match_debate", "predicted": "predicted_debate"})
df_direct = pd.DataFrame(results_direct).rename(columns={"match": "match_direct", "predicted": "predicted_direct"})

# merge by claim
merged = df_debate.merge(df_direct, on="claim")

# acc calculation
acc_debate = merged["match_debate"].mean()
acc_direct = merged["match_direct"].mean()

print(f"Debate Accuracy: {acc_debate:.2%}")
print(f"Direct Accuracy: {acc_direct:.2%}")

# comparing disagreements
diffs = merged[merged["predicted_debate"] != merged["predicted_direct"]]
print("\nDisagreements between methods:")
print(diffs[["claim", "label", "predicted_debate", "predicted_direct"]].head())


🎯 Debate Accuracy: 80.00%
🎯 Direct Accuracy: 66.00%

🔍 Disagreements between methods:
                                                claim gold_label  \
1   Michael Keaton has avoided film and TV product...    REFUTES   
7                   Charles II of England was a king.   SUPPORTS   
9                   T-Pain has yet to found anything.    REFUTES   
10         Maggie Gyllenhaal is exclusively a singer.    REFUTES   
11       Amazon Web Services lacked deployment tools.    REFUTES   

   predicted_debate predicted_direct  
1   NOT ENOUGH INFO          REFUTES  
7           REFUTES         SUPPORTS  
9   NOT ENOUGH INFO         SUPPORTS  
10          REFUTES         SUPPORTS  
11         SUPPORTS  NOT ENOUGH INFO  


## Next Steps

*   Fine Tuning:
*   Ablations:
*   Analysis:

In [2]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive/University/ANLP/Project')

!pip install -U datasets

In [5]:
!python run_experiments.py --sample-size 3

INFO:datasets:PyTorch version 2.6.0+cu124 available.
INFO:datasets:Polars version 1.21.0 available.
INFO:datasets:Duckdb version 1.2.2 available.
INFO:datasets:TensorFlow version 2.18.0 available.
INFO:datasets:JAX version 0.5.2 available.
INFO:__main__:Starting experiment at 20250629_151953
INFO:__main__:Sample size: 3, Model: google/flan-t5-large
INFO:__main__:Loading FEVER evaluation data...
INFO:fever_loader:Loading FEVER validation split...
README.md: 5.19kB [00:00, 13.3MB/s]
train.jsonl: 100% 96.9M/96.9M [00:01<00:00, 56.5MB/s]
valid.jsonl: 6.47MB [00:00, 20.5MB/s]
test.jsonl: 6.63MB [00:00, 19.6MB/s]
Generating train split: 100% 228277/228277 [00:01<00:00, 158010.05 examples/s]
Generating validation split: 100% 15935/15935 [00:00<00:00, 174264.17 examples/s]
Generating test split: 100% 16039/16039 [00:00<00:00, 122346.68 examples/s]
INFO:fever_loader:Loaded 15935 examples from validation
INFO:fever_loader:Processed 15873 examples with evidence
INFO:fever_loader:Created balanced 