# ImpPres LLM Baseline

You have to implement in this notebook a baseline for ImpPres classification using an LLM.
This baseline must be implemented using DSPy.



## Load Dataset

In [1]:
from datasets import load_from_disk

unified_pres = load_from_disk("unified_presupposition.hf")

train_dataset = []
test_dataset = []
for item in unified_pres:
    if item['paradigmID'] == 0:
        train_dataset.append(item)
    else:
        test_dataset.append(item)

## Implement DsPy Programs

In [2]:
# Configure the DSPy environment with the language model - for grok the parameters must be:
# env variable should be in os.environ['XAI_API_KEY']
# "xai/grok-3-mini"
import os
import dspy
from dotenv import load_dotenv

load_dotenv()

lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'])
# for ollama 
# lm = dspy.LM('ollama_chat/devstral', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

In [3]:
from typing import Literal

class Presupposition(dspy.Signature):
    """
    Identify whether the premise entails, contradicts, or is neutral with respect to the hypothesis.
    """
    premise: str = dspy.InputField(desc="A statement that is assumed to be true.")
    hypothesis: str = dspy.InputField(desc="The statement that is being evaluated in relation to the premise.")
    presupposes: Literal['entailment', 'contradiction', 'neutral'] = dspy.OutputField(desc="The relationship between the premise and hypothesis, indicating whether the premise entails, contradicts, or is neutral with respect to the hypothesis.")


In [4]:
pres = dspy.Predict(Presupposition)
ans = pres(premise="The guest had found John.", hypothesis="John used to be in an unknown location.")
print(ans.presupposes)

entailment


In [5]:
pres_cot = dspy.ChainOfThought(Presupposition)
ans = pres_cot(premise="The guest had found John.", hypothesis="John used to be in an unknown location.")
print(ans.reasoning)
print(ans.presupposes)

The premise states that "The guest had found John," which implies that John was previously in a location or state that was not known to the guest, as the act of finding someone typically involves discovering them after they were unknown or inaccessible. This directly supports the hypothesis that "John used to be in an unknown location," making the premise entail the hypothesis.
entailment


In [60]:
from dspy.teleprompt import BootstrapFewShot
from collections import defaultdict

label_names = ['entailment', 'contradiction', 'neutral']

def validate_answer(example, pred, trace=None):
    """Validation function for DSPy optimization"""
    return example.presupposes == pred.presupposes

# Configure the teleprompter for optimization
teleprompter = BootstrapFewShot(
    metric=validate_answer,
    max_bootstrapped_demos=4,  # Number of examples to bootstrap
    max_labeled_demos=8,       # Maximum number of labeled demonstrations
    max_rounds=3               # Number of optimization rounds
)

def cut_data_round_robin(dataset, max_examples, complete_round=False):
    section_to_items = defaultdict(list)
    for item in dataset:
        section_to_items[item['section']].append(item)

    subset = []
    i = 0
    added = True
    while added and len(subset) < max_examples:
        added = False
        for items in section_to_items.values():
            if not complete_round and len(subset) == max_examples:
                break
            added = True
            subset.append(items[i])
        i += 1
    return subset

def prepare_dspy_examples(max_examples: int = 20):
    """Round robin sampling of examples from diffrent sections of the training dataset."""
    subset = cut_data_round_robin(train_dataset, max_examples)
    return [dspy.Example(
        premise=item['premise'],
        hypothesis=item['hypothesis'],
        presupposes=label_names[item['gold_label']]
    ).with_inputs('premise', 'hypothesis') for item in subset]

pres_fewshot = teleprompter.compile(pres, trainset=prepare_dspy_examples())
ans = pres_fewshot(premise="The guest had found John.", hypothesis="John used to be in an unknown location.")
print(ans.presupposes)

 20%|██        | 4/20 [00:00<00:00, 859.31it/s]


Bootstrapped 4 full traces after 4 examples for up to 3 rounds, amounting to 4 attempts.
entailment


## Your Turn

Compute the classification metrics on the baseline LLM model on each test section of the ANLI dataset for samples that have a non-empty 'reason' field.

You also must show a comparison between the DeBERTa baseline model and this LLM baseline model. The comparison metric should compute the agreement between the two models:
* On how many samples they are both correct [Correct]
* On how many samples Model1 is correct and Model2 is incorrect [Correct1]
* On how many samples Model1 is incorrect and Model2 is correct [Correct2]
* On how many samples both are incorrect [Incorrect]

### Standalone Evaluation

In [None]:
from tqdm import tqdm
def evaluate_on_dataset(dataset, program):
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    for example in tqdm(dataset):
        premise = example['premise']
        hypothesis = example['hypothesis']
        prediction = program(premise=premise, hypothesis=hypothesis)
        results.append({
            'premise': premise,
            'hypothesis': hypothesis,
            'reasoning': prediction.reasoning if hasattr(prediction, 'reasoning') else None,
            'pred_label': prediction.presupposes,
            'gold_label': label_names[example['gold_label']],
            'section': example['section']
        })
    return results

In [45]:
# Evaluation on a subset of the test dataset (3 paradigms and all triggers for each section) to save time and money
narrowed_test = cut_data_round_robin(test_dataset, 513, complete_round=True)
results_vanilla = evaluate_on_dataset(narrowed_test, pres)
results_cot = evaluate_on_dataset(narrowed_test, pres_cot)
results_fewshot = evaluate_on_dataset(narrowed_test, pres_fewshot)

100%|██████████| 513/513 [00:00<00:00, 3862.54it/s]
100%|██████████| 513/513 [00:00<00:00, 3807.90it/s]
100%|██████████| 513/513 [00:00<00:00, 1518.86it/s]


In [35]:
from evaluate import load
from collections import defaultdict

accuracy = load("accuracy")
macro_f1 = load("f1")
macro_precision = load("precision")
macro_recall = load("recall")

preds = defaultdict(list)
refs = defaultdict(list)

model_classifications = []
for results in [results_vanilla, results_cot, results_fewshot]:
    for res in results:
        preds[res['section']].append(label_names.index(res['pred_label']))
        refs[res['section']].append(label_names.index(res['gold_label']))


    classification_results = {}
    for section in preds:
        classification_results[section] = (
            accuracy.compute(predictions=preds[section], references=refs[section]) |
            macro_f1.compute(predictions=preds[section], references=refs[section], average='macro') |
            macro_precision.compute(predictions=preds[section], references=refs[section], average='macro') |
            macro_recall.compute(predictions=preds[section], references=refs[section], average='macro')
        )

    classification_results['total'] = (
            accuracy.compute(predictions=[p for section in preds.values() for p in section], 
                            references=[r for section in refs.values() for r in section]) |
            macro_f1.compute(predictions=[p for section in preds.values() for p in section], 
                            references=[r for section in refs.values() for r in section], average='macro') |
            macro_precision.compute(predictions=[p for section in preds.values() for p in section], 
                                    references=[r for section in refs.values() for r in section], average='macro') |
            macro_recall.compute(predictions=[p for section in preds.values() for p in section], 
                                references=[r for section in refs.values() for r in section], average='macro')
        )

    model_classifications.append(classification_results)

In [36]:
import pandas as pd

dfs = []
for classification_results in model_classifications:
    df = pd.DataFrame(classification_results).T
    df = df.rename(columns={
        'accuracy': 'Accuracy',
        'f1': 'Macro F1',
        'precision': 'Macro Precision',
        'recall': 'Macro Recall'
    })
    df.index.name = 'Section'
    df.reset_index(inplace=True)
    dfs.append(df)

In [37]:
# Vanilla
dfs[0]

Unnamed: 0,Section,Accuracy,Macro F1,Macro Precision,Macro Recall
0,all_n_presupposition,0.947368,0.94929,0.962963,0.940741
1,both_presupposition,0.982456,0.983673,0.986667,0.981481
2,change_of_state,0.561404,0.469504,0.697991,0.492593
3,cleft_existence,0.719298,0.683333,0.866667,0.666667
4,cleft_uniqueness,0.491228,0.344697,0.81761,0.411111
5,only_presupposition,0.614035,0.585417,0.758333,0.569444
6,possessed_definites_existence,0.982456,0.981703,0.986667,0.977778
7,possessed_definites_uniqueness,0.473684,0.30303,0.484277,0.388889
8,question_presupposition,0.912281,0.907748,0.942529,0.892593
9,total,0.74269,0.727621,0.853762,0.702366


In [38]:
# CoT
dfs[1]

Unnamed: 0,Section,Accuracy,Macro F1,Macro Precision,Macro Recall
0,all_n_presupposition,0.929825,0.931939,0.952381,0.92037
1,both_presupposition,0.964912,0.967059,0.974359,0.962963
2,change_of_state,0.561404,0.465102,0.690359,0.490741
3,cleft_existence,0.710526,0.668575,0.864198,0.655556
4,cleft_uniqueness,0.5,0.359344,0.819048,0.42037
5,only_presupposition,0.622807,0.592279,0.771158,0.576389
6,possessed_definites_existence,0.973684,0.972355,0.980392,0.966667
7,possessed_definites_uniqueness,0.473684,0.30169,0.482866,0.388889
8,question_presupposition,0.929825,0.925639,0.952381,0.912963
9,total,0.740741,0.724796,0.856436,0.699434


In [39]:
# Few-shot
dfs[2]

Unnamed: 0,Section,Accuracy,Macro F1,Macro Precision,Macro Recall
0,all_n_presupposition,0.953216,0.954781,0.966667,0.946914
1,both_presupposition,0.959064,0.961445,0.970464,0.95679
2,change_of_state,0.573099,0.485904,0.705995,0.504938
3,cleft_existence,0.736842,0.711111,0.871795,0.688889
4,cleft_uniqueness,0.508772,0.38762,0.764791,0.433642
5,only_presupposition,0.619883,0.590884,0.769902,0.573765
6,possessed_definites_existence,0.976608,0.976381,0.982456,0.971605
7,possessed_definites_uniqueness,0.561404,0.479327,0.805409,0.496296
8,question_presupposition,0.929825,0.92483,0.952381,0.912346
9,total,0.757635,0.746891,0.862468,0.720576


### Comparison to Baseline

In [40]:
def model_mistakes(results):
    mistakes = []
    for res in results:
        if res['pred_label'] != res['gold_label']:
            mistakes.append(res)
    return mistakes

mistakes_vanilla = model_mistakes(results_vanilla)
mistakes_cot = model_mistakes(results_cot)
mistakes_fewshot = model_mistakes(results_fewshot)

In [41]:
import json
with open('mistakes_vanilla.json', 'w') as f:
    json.dump(mistakes_vanilla, f, indent=2)
with open('mistakes_cot.json', 'w') as f:
    json.dump(mistakes_cot, f, indent=2)
with open('mistakes_fewshot.json', 'w') as f:
    json.dump(mistakes_fewshot, f, indent=2)

In [46]:
with open('imppres_deberta_mistakes.json', 'r') as f:
    mistakes_deberta = json.load(f)
narrowed_test_lookup = {(m['premise'], m['hypothesis'], m['section']) for m in narrowed_test}
narrowed_mistakes_deberta = [m for m in mistakes_deberta if (m['premise'], m['hypothesis'], m['section']) in narrowed_test_lookup]

In [59]:
def get_lookup(mistakes):
    """Create a lookup for mistakes based on premise, hypothesis, and section."""
    return {(m['premise'], m['hypothesis'], m['section']) for m in mistakes}

mistakes_vanilla_lookup = get_lookup(mistakes_vanilla)
mistakes_cot_lookup = get_lookup(mistakes_cot)
mistakes_fewshot_lookup = get_lookup(mistakes_fewshot)
mistakes_deberta_lookup = get_lookup(narrowed_mistakes_deberta)

for program in ['vanilla', 'cot', 'fewshot']:
    mistakes_prog_lookup = globals()[f'mistakes_{program}_lookup']
    agreement_matrix = [[0, 0], [0, 0]]
    for item in narrowed_test:
        key = (item['premise'], item['hypothesis'], item['section'])
        agreement_matrix[int(key in mistakes_deberta_lookup)][int(key in mistakes_prog_lookup)] += 1
    print(f"Agreement matrix for {program} vs DeBERTa:")
    print(pd.DataFrame(agreement_matrix, 
          index=['DeBERTa Correct', 'DeBERTa Mistakes'], 
          columns=[f'{program} Correct', f'{program} Mistakes']))
    print()

Agreement matrix for vanilla vs DeBERTa:
                  vanilla Correct  vanilla Mistakes
DeBERTa Correct               238                58
DeBERTa Mistakes              143                74

Agreement matrix for cot vs DeBERTa:
                  cot Correct  cot Mistakes
DeBERTa Correct           235            61
DeBERTa Mistakes          144            73

Agreement matrix for fewshot vs DeBERTa:
                  fewshot Correct  fewshot Mistakes
DeBERTa Correct               243                53
DeBERTa Mistakes              163                54

