# ImpPres with LLM

You have to implement in this notebook a better ImpPres classifier using an LLM.
This classifier must be implemented using DSPy.


## Load ImpPres Dataset

In [30]:
from datasets import load_from_disk
from collections import defaultdict

unified_pres = load_from_disk("unified_presupposition.hf")

section_paradigm_rows = defaultdict(lambda: defaultdict(list))
for item in unified_pres:
    section_paradigm_rows[item['section']][item['paradigmID']].append(item)

train_dataset = [
    paradigm_rows
    for section_paradigms in section_paradigm_rows.values()
    for paradigm_id, paradigm_rows in section_paradigms.items()
    if paradigm_id == 0
]
test_dataset = [
    paradigm_rows
    for section_paradigms in section_paradigm_rows.values()
    for paradigm_id, paradigm_rows in section_paradigms.items()
    if paradigm_id != 0
]

## Implement DsPy Programs

In [3]:
# Configure the DSPy environment with the language model - for grok the parameters must be:
# env variable should be in os.environ['XAI_API_KEY']
# "xai/grok-3-mini"
import os
import dspy
from dotenv import load_dotenv

load_dotenv()

lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'])
# for ollama 
# lm = dspy.LM('ollama_chat/devstral', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

In [5]:
from typing import Literal

class PresuppositionParadigm(dspy.Signature):
    """
    Given a list of premises and hypotheses that are multiple variants of the same (premise, hypothesis) pair, identify whether each premise entails, contradicts, or is neutral with respect to the corresponding hypothesis.
    """
    premises_hypotheses: list[tuple[str, str]] = dspy.InputField(desc="A list of tuples where each tuple contains a premise that is assumed to be true and a hypothesis that is evaluated with relation to the premise.")
    presupposes: list[Literal['entailment', 'contradiction', 'neutral']] = dspy.OutputField(desc="The relationship between each premise and hypothesis in each tuple, indicating whether the premise entails, contradicts, or is neutral with respect to the hypothesis.")


In [6]:
pres = dspy.Predict(PresuppositionParadigm)
ans = pres(premises_hypotheses=[
    ("Colleen was only biking to that library.", "Colleen was biking to that library."),
    ("Colleen was only biking to that library.", "Colleen wasn't biking to that library."),
    ("Colleen was only biking to that library.", "Tanya was biking to that library."),
    ("Colleen wasn't only biking to that library.", "Colleen was biking to that library."),
    ("Colleen wasn't only biking to that library.", "Colleen wasn't biking to that library."),
    ("Colleen wasn't only biking to that library.", "Tanya was biking to that library."),
    ("Was Colleen only biking to that library?", "Colleen was biking to that library."),
    ("Was Colleen only biking to that library?", "Colleen wasn't biking to that library."),
    ("Was Colleen only biking to that library?", "Tanya was biking to that library."),
    ("Colleen might have been only biking to that library.", "Colleen was biking to that library."),
    ("Colleen might have been only biking to that library.", "Colleen wasn't biking to that library."),
    ("Colleen might have been only biking to that library.", "Tanya was biking to that library."),
    ("If Colleen was only biking to that library, it's okay.", "Colleen was biking to that library."),
    ("If Colleen was only biking to that library, it's okay.", "Colleen wasn't biking to that library."),
    ("If Colleen was only biking to that library, it's okay.", "Tanya was biking to that library."),
    ("Colleen wasn't only biking to that library.", "Colleen was only biking to that library."),
    ("Was Colleen only biking to that library?", "Colleen was only biking to that library."),
    ("Colleen might have been only biking to that library.", "Colleen was only biking to that library."),
    ("If Colleen was only biking to that library, it's okay.", "Colleen was only biking to that library.")
])
print(ans.presupposes)

['entailment', 'contradiction', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'contradiction', 'neutral', 'neutral', 'neutral']


In [32]:
import random
from dspy.teleprompt import BootstrapFewShot
from collections import defaultdict

label_names = ['entailment', 'contradiction', 'neutral']
random.seed(42)  # For reproducibility

def validate_answer(example, pred, trace=None):
    """Validation function for DSPy optimization"""
    acc = 0
    consistency_buckets = [0, 0, 0]
    for gold, pred in zip(example.presupposes, pred.presupposes):
        if gold == pred:
            acc += 1
        consistency_buckets[label_names.index(pred)] += 1
    
    acc /= len(example.presupposes)
    consistency = max(consistency_buckets) / len(example.presupposes)

    return 0.8 * acc + 0.2 * consistency  # Weighted average of accuracy and consistency

# Configure the teleprompter for optimization
teleprompter = BootstrapFewShot(
    metric=validate_answer,
    metric_threshold=0.8,
    max_bootstrapped_demos=1,  # Number of examples to bootstrap
    max_labeled_demos=2,       # Maximum number of labeled demonstrations
    max_rounds=3               # Number of optimization rounds
)

def cut_data_premise_round_robin(dataset, max_paradigms, complete_round=False, shuffle=True):
    section_paradigms_rows = defaultdict(list)
    for paradigm_rows in dataset:
        section_paradigms_rows[paradigm_rows[0]['section']].append(paradigm_rows)

    subset = []
    i = 0
    added = True
    while added and len(subset) < max_paradigms:
        added = False
        for paradigms_rows in section_paradigms_rows.values():
            if not complete_round and len(subset) == max_paradigms:
                break
            added = True
            if shuffle:
                subset.append(random.sample(paradigms_rows[i], len(paradigms_rows[i])))
            else:
                subset.append(paradigms_rows[i])
        i += 1
    return subset

def prepare_dspy_examples(max_paradigms: int = 3):
    """Round robin sampling of examples from diffrent sections of the training dataset."""
    subset = cut_data_premise_round_robin(train_dataset, max_paradigms)
    return [
        dspy.Example(
            premises_hypotheses=[(item['premise'], item['hypothesis']) for item in paradigm_items],
            presupposes=[label_names[item['gold_label']] for item in paradigm_items]
        ).with_inputs('premises_hypotheses')
        for paradigm_items in subset
    ]

pres_fewshot = teleprompter.compile(pres, trainset=prepare_dspy_examples())
ans = pres_fewshot(premises_hypotheses=[
    ("Colleen was only biking to that library.", "Colleen was biking to that library."),
    ("Colleen was only biking to that library.", "Colleen wasn't biking to that library."),
    ("Colleen was only biking to that library.", "Tanya was biking to that library."),
    ("Colleen wasn't only biking to that library.", "Colleen was biking to that library."),
    ("Colleen wasn't only biking to that library.", "Colleen wasn't biking to that library."),
    ("Colleen wasn't only biking to that library.", "Tanya was biking to that library."),
    ("Was Colleen only biking to that library?", "Colleen was biking to that library."),
    ("Was Colleen only biking to that library?", "Colleen wasn't biking to that library."),
    ("Was Colleen only biking to that library?", "Tanya was biking to that library."),
    ("Colleen might have been only biking to that library.", "Colleen was biking to that library."),
    ("Colleen might have been only biking to that library.", "Colleen wasn't biking to that library."),
    ("Colleen might have been only biking to that library.", "Tanya was biking to that library."),
    ("If Colleen was only biking to that library, it's okay.", "Colleen was biking to that library."),
    ("If Colleen was only biking to that library, it's okay.", "Colleen wasn't biking to that library."),
    ("If Colleen was only biking to that library, it's okay.", "Tanya was biking to that library."),
    ("Colleen wasn't only biking to that library.", "Colleen was only biking to that library."),
    ("Was Colleen only biking to that library?", "Colleen was only biking to that library."),
    ("Colleen might have been only biking to that library.", "Colleen was only biking to that library."),
    ("If Colleen was only biking to that library, it's okay.", "Colleen was only biking to that library.")
])
print(ans.presupposes)

100%|██████████| 3/3 [01:33<00:00, 31.11s/it]


Bootstrapped 0 full traces after 2 examples for up to 3 rounds, amounting to 9 attempts.
['entailment', 'contradiction', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'contradiction', 'neutral', 'neutral', 'neutral']


In [39]:
from typing import Literal
from dspy.teleprompt import BootstrapFewShot
from collections import defaultdict


class Presupposition(dspy.Signature):
    """
    Identify whether the premise entails, contradicts, or is neutral with respect to the hypothesis.
    """
    premise: str = dspy.InputField(desc="A statement that is assumed to be true.")
    hypothesis: str = dspy.InputField(desc="The statement that is being evaluated in relation to the premise.")
    presupposes: Literal['entailment', 'contradiction', 'neutral'] = dspy.OutputField(desc="The relationship between the premise and hypothesis, indicating whether the premise entails, contradicts, or is neutral with respect to the hypothesis.")


label_names = ['entailment', 'contradiction', 'neutral']

def validate_answer(example, pred, trace=None):
    """Validation function for DSPy optimization"""
    return example.presupposes == pred.presupposes

# Configure the teleprompter for optimization
teleprompter = BootstrapFewShot(
    metric=validate_answer,
    max_bootstrapped_demos=4,  # Number of examples to bootstrap
    max_labeled_demos=8,       # Maximum number of labeled demonstrations
    max_rounds=3               # Number of optimization rounds
)

def cut_data_round_robin(dataset, max_examples, complete_round=False):
    section_to_items = defaultdict(list)
    for premise_items in dataset:
        for item in premise_items:
            section_to_items[item['section']].append(item)

    subset = []
    i = 0
    added = True
    while added and len(subset) < max_examples:
        added = False
        for items in section_to_items.values():
            if not complete_round and len(subset) == max_examples:
                break
            added = True
            subset.append(items[i])
        i += 1
    return subset

def prepare_dspy_examples(max_examples: int = 20):
    """Round robin sampling of examples from diffrent sections of the training dataset."""
    subset = cut_data_round_robin(train_dataset, max_examples)
    return [dspy.Example(
        premise=item['premise'],
        hypothesis=item['hypothesis'],
        presupposes=label_names[item['gold_label']]
    ).with_inputs('premise', 'hypothesis') for item in subset]



class PresuppositionIncremental(dspy.Module):
    def __init__(self, callbacks=None, fewshot=False):
        super().__init__(callbacks)
        pres_tmp = dspy.Predict(Presupposition)
        if fewshot:
            self.pres = teleprompter.compile(pres_tmp, trainset=prepare_dspy_examples())
        else:
            self.pres = pres_tmp

    def forward(self, premises_hypotheses: list[tuple[str, str]]) -> PresuppositionParadigm:
        """
        Forward method to process a list of premises and hypotheses.
        """
        return PresuppositionParadigm(premises_hypotheses=premises_hypotheses, presupposes=[self.pres(premise=p, hypothesis=h).presupposes for p, h in premises_hypotheses])

# Example usage
pres_inc = PresuppositionIncremental(fewshot=False)
ans = pres_inc.forward(
    premises_hypotheses=[
        ("Colleen was only biking to that library.", "Colleen was biking to that library."),
        ("Colleen was only biking to that library.", "Colleen wasn't biking to that library."),
        ("Colleen was only biking to that library.", "Tanya was biking to that library."),
        ("Colleen wasn't only biking to that library.", "Colleen was biking to that library."),
        ("Colleen wasn't only biking to that library.", "Colleen wasn't biking to that library."),
        ("Colleen wasn't only biking to that library.", "Tanya was biking to that library."),
        ("Was Colleen only biking to that library?", "Colleen was biking to that library."),
        ("Was Colleen only biking to that library?", "Colleen wasn't biking to that library."),
        ("Was Colleen only biking to that library?", "Tanya was biking to that library."),
        ("Colleen might have been only biking to that library.", "Colleen was biking to that library."),
        ("Colleen might have been only biking to that library.", "Colleen wasn't biking to that library."),
        ("Colleen might have been only biking to that library.", "Tanya was biking to that library."),
        ("If Colleen was only biking to that library, it's okay.", "Colleen was biking to that library."),
        ("If Colleen was only biking to that library, it's okay.", "Colleen wasn't biking to that library."),
        ("If Colleen was only biking to that library, it's okay.", "Tanya was biking to that library."),
        ("Colleen wasn't only biking to that library.", "Colleen was only biking to that library."),
        ("Was Colleen only biking to that library?", "Colleen was only biking to that library."),
        ("Colleen might have been only biking to that library.", "Colleen was only biking to that library."),
        ("If Colleen was only biking to that library, it's okay.", "Colleen was only biking to that library.")
    ]
)
print(ans.presupposes)

['entailment', 'contradiction', 'neutral', 'entailment', 'contradiction', 'neutral', 'entailment', 'neutral', 'neutral', 'neutral', 'contradiction', 'neutral', 'neutral', 'neutral', 'neutral', 'contradiction', 'neutral', 'neutral', 'neutral']


## Evaluation

In [27]:
from tqdm import tqdm
def evaluate_on_dataset(dataset, program):
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    for premise_rows in tqdm(dataset):
        premises_hypotheses = [(row['premise'], row['hypothesis']) for row in premise_rows]
        prediction = program(premises_hypotheses=premises_hypotheses)
        for example, pred in zip(premise_rows, prediction.presupposes):
            results.append({
                'premise': example['premise'],
                'hypothesis':  example['hypothesis'],
                'pred_label': pred,
                'gold_label': label_names[example['gold_label']],
                'section': example['section']
            })
    return results

In [48]:
narrowed_test = cut_data_premise_round_robin(test_dataset, max_paradigms=9)
one_prompt_results = evaluate_on_dataset(narrowed_test, pres)
one_prompt_fewshot_results = evaluate_on_dataset(narrowed_test, pres_fewshot)
multiple_prompts_results = evaluate_on_dataset(narrowed_test, PresuppositionIncremental(fewshot=False))
multiple_prompts_fewshot_results = evaluate_on_dataset(narrowed_test, PresuppositionIncremental(fewshot=True))

100%|██████████| 9/9 [01:59<00:00, 13.23s/it]
100%|██████████| 9/9 [01:47<00:00, 11.90s/it]
100%|██████████| 9/9 [00:00<00:00, 131.00it/s]
 20%|██        | 4/20 [00:00<00:00, 1113.21it/s]


Bootstrapped 4 full traces after 4 examples for up to 3 rounds, amounting to 4 attempts.


100%|██████████| 9/9 [00:00<00:00, 69.64it/s]


In [49]:
from evaluate import load
from collections import defaultdict

accuracy = load("accuracy")
macro_f1 = load("f1")
macro_precision = load("precision")
macro_recall = load("recall")

preds = defaultdict(list)
refs = defaultdict(list)

model_classifications = []
for results in [one_prompt_results, one_prompt_fewshot_results, multiple_prompts_results, multiple_prompts_fewshot_results]:
    for res in results:
        preds[res['section']].append(label_names.index(res['pred_label']))
        refs[res['section']].append(label_names.index(res['gold_label']))


    classification_results = {}
    for section in preds:
        classification_results[section] = (
            accuracy.compute(predictions=preds[section], references=refs[section]) |
            macro_f1.compute(predictions=preds[section], references=refs[section], average='macro') |
            macro_precision.compute(predictions=preds[section], references=refs[section], average='macro') |
            macro_recall.compute(predictions=preds[section], references=refs[section], average='macro')
        )

    classification_results['total'] = (
            accuracy.compute(predictions=[p for section in preds.values() for p in section], 
                            references=[r for section in refs.values() for r in section]) |
            macro_f1.compute(predictions=[p for section in preds.values() for p in section], 
                            references=[r for section in refs.values() for r in section], average='macro') |
            macro_precision.compute(predictions=[p for section in preds.values() for p in section], 
                                    references=[r for section in refs.values() for r in section], average='macro') |
            macro_recall.compute(predictions=[p for section in preds.values() for p in section], 
                                references=[r for section in refs.values() for r in section], average='macro')
        )

    model_classifications.append(classification_results)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


In [50]:
import pandas as pd

dfs = []
for classification_results in model_classifications:
    df = pd.DataFrame(classification_results).T
    df = df.rename(columns={
        'accuracy': 'Accuracy',
        'f1': 'Macro F1',
        'precision': 'Macro Precision',
        'recall': 'Macro Recall'
    })
    df.index.name = 'Section'
    df.reset_index(inplace=True)
    dfs.append(df)

In [51]:
# Single prompt
dfs[0]

Unnamed: 0,Section,Accuracy,Macro F1,Macro Precision,Macro Recall
0,all_n_presupposition,0.736842,0.72619,0.871795,0.7
1,both_presupposition,0.473684,0.300366,0.481481,0.388889
2,change_of_state,0.526316,0.464803,0.655556,0.469444
3,cleft_existence,0.631579,0.565217,0.844444,0.566667
4,cleft_uniqueness,0.473684,0.300366,0.481481,0.388889
5,only_presupposition,0.578947,0.5,0.833333,0.511111
6,possessed_definites_existence,1.0,1.0,1.0,1.0
7,possessed_definites_uniqueness,0.368421,0.248485,0.22619,0.305556
8,question_presupposition,0.789474,0.783333,0.888889,0.755556
9,total,0.619883,0.573528,0.764833,0.565123


In [52]:
# Single prompt + Fewshot
dfs[1]

Unnamed: 0,Section,Accuracy,Macro F1,Macro Precision,Macro Recall
0,all_n_presupposition,0.631579,0.619264,0.747222,0.598611
1,both_presupposition,0.447368,0.446581,0.602657,0.426389
2,change_of_state,0.552632,0.481997,0.716846,0.490278
3,cleft_existence,0.631579,0.565217,0.844444,0.566667
4,cleft_uniqueness,0.473684,0.300366,0.481481,0.388889
5,only_presupposition,0.605263,0.53414,0.83871,0.538889
6,possessed_definites_existence,1.0,1.0,1.0,1.0
7,possessed_definites_uniqueness,0.394737,0.258581,0.239683,0.326389
8,question_presupposition,0.894737,0.895623,0.933333,0.877778
9,total,0.625731,0.593887,0.752742,0.579321


In [53]:
# Multiple prompts
dfs[2]

Unnamed: 0,Section,Accuracy,Macro F1,Macro Precision,Macro Recall
0,all_n_presupposition,0.754386,0.754902,0.822917,0.732407
1,both_presupposition,0.631579,0.641648,0.719833,0.617593
2,change_of_state,0.561404,0.494223,0.763121,0.500926
3,cleft_existence,0.666667,0.617252,0.852713,0.607407
4,cleft_uniqueness,0.473684,0.300366,0.481481,0.388889
5,only_presupposition,0.631579,0.583492,0.811628,0.575
6,possessed_definites_existence,1.0,1.0,1.0,1.0
7,possessed_definites_uniqueness,0.421053,0.271368,0.270833,0.347222
8,question_presupposition,0.894737,0.895623,0.933333,0.877778
9,total,0.670565,0.64822,0.793733,0.627469


In [54]:
# Multiple prompts + Fewshot
dfs[3]

Unnamed: 0,Section,Accuracy,Macro F1,Macro Precision,Macro Recall
0,all_n_presupposition,0.815789,0.818391,0.862879,0.799306
1,both_presupposition,0.710526,0.721058,0.773551,0.699306
2,change_of_state,0.592105,0.539427,0.794399,0.536806
3,cleft_existence,0.723684,0.697867,0.867925,0.675
4,cleft_uniqueness,0.473684,0.317557,0.480952,0.392361
5,only_presupposition,0.631579,0.587179,0.81705,0.575694
6,possessed_definites_existence,1.0,1.0,1.0,1.0
7,possessed_definites_uniqueness,0.513158,0.436389,0.605556,0.454861
8,question_presupposition,0.907895,0.907877,0.940171,0.891667
9,total,0.707602,0.693838,0.817575,0.669444


## Analysis

### Recap

In this study, we compared two main inference strategies for presupposition identification using the ImpPres dataset:

1. Single Prompt: All paradigm samples are provided to the LLM in one prompt.
2. Multiple Prompts: Each paradigm sample is presented separately in its own prompt, with results aggregated.

We evaluated both strategies with and without few-shot demonstrations.

### Findings

* Multiple prompts consistently outperformed the single prompt approach, suggesting that the model reasons more accurately and consistently when focused on one transformation at a time, rather than processing the entire paradigm jointly.
* Few-shot learning provided additional improvements, particularly when used with multiple prompts, by clarifying the labeling task and reducing ambiguity for difficult cases.
Some presupposition types remained challenging in all settings, indicating potential limitations in the model’s handling of certain linguistic transformations.

### Conclusion
Breaking paradigms into individual prompts with few-shot demonstrations yields the best overall performance and consistency. This approach minimizes confusion from sample interactions within a batch and leverages the LLM’s strengths in single-instance reasoning.