# DSPy RAG Experiment

Short DSPy run using the LiteLLM proxy and the existing RAG dataset.

**Prerequisite:** `dataset` has already been created (e.g. in `01_rag_baseline.ipynb`).
If not, load or create it first, then run this notebook.


## Create the dataset (CSV + Qdrant)

Loads questions and answers from the CSV, retrieves contexts from Qdrant, and builds a `dataset`.
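
As a rough illustration of the record shape this step produces, the sketch below parses a semicolon-separated CSV with `Frage`/`Antwort` columns and attaches contexts from a stand-in retriever. `fake_retrieve`, `build_records`, and the sample row are hypothetical placeholders; the notebook itself queries Qdrant with real embeddings.

```python
import csv
import io

# Hypothetical sample in the same shape as the CSV (semicolon-separated, BOM-prefixed).
SAMPLE_CSV = "\ufeffFrage;Antwort\nWas ist ein Baustein?;Ein Baustein beschreibt Anforderungen.\n"

def fake_retrieve(question: str, k: int) -> list[str]:
    """Stand-in for the Qdrant lookup: returns k dummy context strings."""
    return [f"context {i} for: {question}" for i in range(k)]

def build_records(csv_text: str, top_k: int = 5) -> list[dict]:
    # Strip the BOM that 'utf-8-sig' would otherwise remove when reading a file.
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("\ufeff")), delimiter=";")
    return [
        {
            "question": row["Frage"],
            "contexts": fake_retrieve(row["Frage"], top_k),
            "ground_truth": row["Antwort"],
        }
        for row in reader
    ]

records = build_records(SAMPLE_CSV, top_k=2)
print(records[0]["question"], len(records[0]["contexts"]))
```

Each record carries the three fields that later appear as dataset features: `question`, `contexts`, and `ground_truth`.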


In [10]:
import pandas as pd
from pathlib import Path
from datasets import Dataset
from litellm_client import (
    load_llm_config,
    load_vectordb_config,
    get_qdrant_client,
    get_embeddings,
)

def _retrieve_contexts(question: str, k: int, client, collection_name: str, llm_cfg):
    query_emb = get_embeddings([question], llm_cfg, batch_size=1)[0]
    results = client.query_points(
        collection_name=collection_name,
        query=query_emb,
        limit=k,
    ).points
    return [res.payload.get('text', '') for res in results]

def build_eval_dataset(
    csv_path: str = '../GrundschutzKI_Fragen-Antworten-Fundstellen.csv',
    top_k: int = 5,
) -> Dataset:
    llm_cfg = load_llm_config()
    vec_cfg = load_vectordb_config()
    qdrant_client = get_qdrant_client(vec_cfg)
    collection_name = vec_cfg.collection or 'grundschutz_xml'

    df = pd.read_csv(Path(csv_path), sep=';', encoding='utf-8-sig')
    records = []

    for _, row in df.iterrows():
        question = row['Frage']
        ground_truth = row['Antwort']
        contexts = _retrieve_contexts(question, top_k, qdrant_client, collection_name, llm_cfg)

        records.append({
            'question': question,
            'contexts': contexts,
            'ground_truth': ground_truth,
        })

    return Dataset.from_list(records)

dataset = build_eval_dataset(top_k=5)
dataset




Processing embeddings 0 to 1 / 1

Dataset({
    features: ['question', 'contexts', 'ground_truth'],
    num_rows: 42
})

In [11]:
import dspy
from litellm_client import load_llm_config

llm_cfg = load_llm_config()

# LiteLLM proxy (OpenAI-compatible)
dspy_llm = dspy.LM(
    model=llm_cfg.model,
    api_base=llm_cfg.api_base,
    api_key=llm_cfg.api_key,
    temperature=0.2,
)

dspy.configure(lm=dspy_llm)


In [12]:
class RAGAnswer(dspy.Signature):
    """Answer using only the provided context."""
    question: str = dspy.InputField()
    context: str = dspy.InputField()
    response: str = dspy.OutputField(desc="Antwort auf Deutsch, kurz und präzise, maximal 2–3 Sätze.")

class RAGModule(dspy.Module):
    def __init__(self):
        super().__init__()
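        # Note: the optimizer log further down lists the signature docstring as
        # instruction 0, so this keyword does not appear to replace the
        # signature's instructions.
        self.predict = dspy.Predict(
            RAGAnswer,
            instructions="Antworte auf Deutsch, kurz und präzise, max. 2–3 Sätze. Nutze nur den Kontext.",
        )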

    def forward(self, question, context):
        return self.predict(question=question, context=context)

rag = RAGModule()


In [15]:
# Sample outputs
for i in range(3):
    row = dataset[i]
    context = "\n\n".join(row["contexts"])
    pred = rag(question=row["question"], context=context)
    print(f"\n--- SAMPLE {i} ---")
    print("QUESTION:", row["question"])
    print("PREDICTED ANSWER:", pred.response)
    print("GROUND TRUTH:", row["ground_truth"])



--- SAMPLE 0 ---
QUESTION: Was ist der Unterschied zwischen Prozess- und Systembausteinen?
PREDICTED ANSWER: Prozess‑Bausteine beschreiben sicherheitsrelevante Vorgänge, organisatorische und betriebliche Maßnahmen und gelten in der Regel für den gesamten Informationsverbund oder große Teile davon. System‑Bausteine hingegen werden auf konkrete Zielobjekte wie Anwendungen, IT‑Systeme, Geräte oder Gebäude angewendet und behandeln deren spezifische Sicherheitsaspekte.
GROUND TRUTH: Prozess-Bausteine gelten in der Regel für sämtliche oder große Teile des Informationsverbunds gleichermaßen, System-Bausteine lassen sich in der Regel auf einzelne Objekte oder Gruppen von Objekten anwenden. Die Prozess- und System-Bausteine bestehen wiederum aus weiteren Teilschichten. In den Hinweisen zum Schichtenmodell und zur Modellierung wird beschrieben, wann ein einzelner Baustein sinnvollerweise eingesetzt werden soll und auf welche Zielobjekte er anzuwenden ist. 

--- SAMPLE 1 ---
QUESTION: Welche gru

## DSPy Optimizer (MIPROv2)

Optimizes the prompt instructions for the RAG program. This can be expensive to run.
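
To get a feel for the cost before starting a run, a back-of-the-envelope count of program evaluations can help. The sketch below assumes one full pass over the validation set per trial plus one pass for the default program, and it ignores caching as well as the bootstrapping and instruction-proposal calls. `estimate_eval_calls` and `calls_per_example` are hypothetical names, the latter standing for the extra LLM calls an LLM-based metric makes per example.

```python
def estimate_eval_calls(num_trials: int, valset_size: int, calls_per_example: int = 1) -> int:
    """Rough upper bound on LLM calls spent on evaluation during optimization."""
    full_passes = num_trials + 1  # assumed: every trial plus the default program
    return full_passes * valset_size * calls_per_example

# Values as reported by the 'light' run later in this notebook: 10 trials, 6 validation examples.
print(estimate_eval_calls(num_trials=10, valset_size=6))
# With a metric that itself calls an LLM, e.g. 3 extra calls per example (assumed):
print(estimate_eval_calls(num_trials=10, valset_size=6, calls_per_example=1 + 3))
```

In practice the real number tends to be lower, because repeated parameter combinations are served from the cache.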


In [None]:
import asyncio
from ragas.metrics.collections import AnswerCorrectness
from ragas.embeddings.litellm_provider import LiteLLMEmbeddings
from ragas.llms import llm_factory
import instructor, litellm

# RAGAS LLM
litellm.api_base = llm_cfg.api_base
litellm.api_key = llm_cfg.api_key
client = instructor.from_litellm(litellm.acompletion, mode=instructor.Mode.MD_JSON)
ragas_llm = llm_factory(llm_cfg.model, client=client, adapter="litellm", model_args={"temperature": 0.2})

# Embeddings (for similarity)
embeddings = LiteLLMEmbeddings(
    model=llm_cfg.embedding_model,
    api_key=llm_cfg.api_key,
    api_base=llm_cfg.api_base,
    encoding_format="float",
)

ac = AnswerCorrectness(llm=ragas_llm, embeddings=embeddings)

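def ragas_ac_metric(example, pred, trace=None):
    # DSPy metrics take an optional `trace` argument; it is unused here.
    result = asyncio.run(ac.ascore(
        user_input=example.question,
        response=pred.response,
        reference=example.response,
    ))
    # Return the numeric score, as the evaluation cell does via `.value`.
    return result.value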
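
Because every RAGAS call costs LLM tokens, it can be useful to smoke-test the optimizer wiring with a cheap local metric first. The following sketch is a stand-in, not the metric used in this notebook: a token-overlap F1 with the `(example, pred, trace=None)` signature that DSPy metrics conventionally take. `token_f1_metric` is a hypothetical name, and `SimpleNamespace` merely mimics objects with a `response` attribute.

```python
from collections import Counter
from types import SimpleNamespace

def token_f1_metric(example, pred, trace=None) -> float:
    """Token-overlap F1 between the reference and predicted response (stand-in metric)."""
    ref_tokens = example.response.lower().split()
    pred_tokens = pred.response.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

example = SimpleNamespace(response="Prozess-Bausteine gelten für den gesamten Informationsverbund")
pred = SimpleNamespace(response="Prozess-Bausteine gelten für den Informationsverbund")
print(round(token_f1_metric(example, pred), 3))
```

Such a metric says nothing about factual correctness; it only checks that the optimizer loop runs end to end before the LLM-based metric is swapped in.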



In [None]:
import dspy
from dspy.evaluate import SemanticF1

# DSPy examples from the existing dataset
examples = []
for row in dataset:
    context = "\n\n".join(row["contexts"])
    examples.append(
        dspy.Example(question=row["question"], context=context, response=row["ground_truth"])
            .with_inputs("question", "context")
    )


# Simple split: first fifth for training, the rest for development
trainset = examples[: max(1, len(examples)//5)]
devset = examples[max(1, len(examples)//5):]

# metric = SemanticF1(decompositional=True)

# Optimizer (few threads to start with)
tp = dspy.MIPROv2(metric=ragas_ac_metric, auto='light', num_threads=4)
optimized_rag = tp.compile(rag, trainset=trainset)


2026/01/29 11:37:18 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 10
minibatch: False
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 6

2026/01/29 11:37:18 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2026/01/29 11:37:18 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2026/01/29 11:37:18 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


100%|██████████| 2/2 [00:16<00:00,  8.13s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 4/6


100%|██████████| 2/2 [00:00<00:00, 13.09it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 5/6


 50%|█████     | 1/2 [00:00<00:00, 22.81it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/6


 50%|█████     | 1/2 [00:00<00:00, 23.78it/s]
2026/01/29 11:37:35 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2026/01/29 11:37:35 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2026/01/29 11:37:35 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...



Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.


2026/01/29 11:38:03 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2026/01/29 11:38:03 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Answer using only the provided context.

2026/01/29 11:38:03 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Formuliere eine **kurze** (maximal 2 – 3 Sätze) **präzise** Antwort **auf Deutsch**, die **ausschließlich** die im bereitgestellten Kontext enthaltene Anforderung wiedergibt. Nutze die dort vorkommenden **normativen Formulierungen** (z. B. „MUSS“, „SOLLTEN“, „DÜRFEN“) und **verzichte** auf Zitate von Abschnitts‑ oder Bausteinnummern, Tabellen oder anderen Metadaten. Gib **nur** die relevante Anforderung wieder, ohne zusätzliche Erläuterungen oder eigene Inhalte.

2026/01/29 11:38:03 INFO dspy.teleprompt.mipro_optimizer_v2: 2: Antworte **ausschließlich** auf Basis des angegebenen Kontextes und **nur** in deutscher Sprache. Formuliere die Antwort **präzise und kompakt**: maximal **zwei‑bis‑drei Sätze**, wobei du die zentra

Average Metric: 4.62 / 6 (77.1%): 100%|██████████| 6/6 [00:12<00:00,  2.08s/it]

2026/01/29 11:38:16 INFO dspy.evaluate.evaluate: Average Metric: 4.623809523809523 / 6 (77.1%)
2026/01/29 11:38:16 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 77.06

2026/01/29 11:38:16 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 10 =====



Average Metric: 4.66 / 6 (77.7%): 100%|██████████| 6/6 [00:18<00:00,  3.07s/it] 

2026/01/29 11:38:34 INFO dspy.evaluate.evaluate: Average Metric: 4.663216321632163 / 6 (77.7%)
2026/01/29 11:38:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 77.72
2026/01/29 11:38:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.72 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2026/01/29 11:38:34 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72]
2026/01/29 11:38:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 77.72


2026/01/29 11:38:34 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 3 / 10 =====



Average Metric: 4.70 / 6 (78.3%): 100%|██████████| 6/6 [00:14<00:00,  2.39s/it] 

2026/01/29 11:38:48 INFO dspy.evaluate.evaluate: Average Metric: 4.700757575757576 / 6 (78.3%)
2026/01/29 11:38:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 78.35
2026/01/29 11:38:48 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2026/01/29 11:38:48 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72, 78.35]
2026/01/29 11:38:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.35


2026/01/29 11:38:48 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 4 / 10 =====



Average Metric: 4.66 / 6 (77.7%): 100%|██████████| 6/6 [00:00<00:00, 1712.78it/s]

2026/01/29 11:38:49 INFO dspy.evaluate.evaluate: Average Metric: 4.663216321632163 / 6 (77.7%)
2026/01/29 11:38:49 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.72 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5'].
2026/01/29 11:38:49 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72, 78.35, 77.72]
2026/01/29 11:38:49 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 78.35


2026/01/29 11:38:49 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 5 / 10 =====



Average Metric: 4.74 / 6 (79.0%): 100%|██████████| 6/6 [00:17<00:00,  2.97s/it] 

2026/01/29 11:39:06 INFO dspy.evaluate.evaluate: Average Metric: 4.742424242424242 / 6 (79.0%)
2026/01/29 11:39:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 79.04
2026/01/29 11:39:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.04 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2'].
2026/01/29 11:39:06 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72, 78.35, 77.72, 79.04]
2026/01/29 11:39:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 79.04


2026/01/29 11:39:06 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 10 =====



Average Metric: 4.79 / 6 (79.8%): 100%|██████████| 6/6 [00:16<00:00,  2.79s/it] 

2026/01/29 11:39:23 INFO dspy.evaluate.evaluate: Average Metric: 4.79047619047619 / 6 (79.8%)
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 79.84
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.84 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5'].
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72, 78.35, 77.72, 79.04, 79.84]
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 79.84


2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 10 =====



Average Metric: 4.70 / 6 (78.3%): 100%|██████████| 6/6 [00:00<00:00, 3361.27it/s] 

2026/01/29 11:39:23 INFO dspy.evaluate.evaluate: Average Metric: 4.700757575757576 / 6 (78.3%)
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72, 78.35, 77.72, 79.04, 79.84, 78.35]
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 79.84


2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 8 / 10 =====



Average Metric: 4.74 / 6 (79.0%): 100%|██████████| 6/6 [00:00<00:00, 625.92it/s] 

2026/01/29 11:39:23 INFO dspy.evaluate.evaluate: Average Metric: 4.742424242424242 / 6 (79.0%)
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.04 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72, 78.35, 77.72, 79.04, 79.84, 78.35, 79.04]
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 79.84


2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 9 / 10 =====



Average Metric: 4.66 / 6 (77.7%): 100%|██████████| 6/6 [00:00<00:00, 1926.79it/s]

2026/01/29 11:39:23 INFO dspy.evaluate.evaluate: Average Metric: 4.663216321632163 / 6 (77.7%)
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.72 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4'].
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72, 78.35, 77.72, 79.04, 79.84, 78.35, 79.04, 77.72]
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 79.84


2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 10 / 10 =====



Average Metric: 4.74 / 6 (79.0%): 100%|██████████| 6/6 [00:00<00:00, 3005.95it/s]

2026/01/29 11:39:23 INFO dspy.evaluate.evaluate: Average Metric: 4.742424242424242 / 6 (79.0%)
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.04 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72, 78.35, 77.72, 79.04, 79.84, 78.35, 79.04, 77.72, 79.04]
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 79.84


2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 10 =====



Average Metric: 4.79 / 6 (79.8%): 100%|██████████| 6/6 [00:00<00:00, 3327.49it/s]

2026/01/29 11:39:23 INFO dspy.evaluate.evaluate: Average Metric: 4.79047619047619 / 6 (79.8%)
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.84 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5'].
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [77.06, 77.72, 78.35, 77.72, 79.04, 79.84, 78.35, 79.04, 77.72, 79.04, 79.84]
2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 79.84


2026/01/29 11:39:23 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 79.84!
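
The dataset sizes in this log can be reproduced from the split logic. The sketch below applies the same `max(1, n // 5)` rule to the 42 rows. The further 80 % validation holdout is an assumption about what the optimizer does when no validation set is passed, chosen because it matches the reported `valset size: 6` and the two-example bootstrapping bars. Both function names are hypothetical.

```python
def split_sizes(n_examples: int) -> tuple[int, int]:
    """Sizes of the train/dev split used in the notebook (first fifth vs. rest)."""
    n_train = max(1, n_examples // 5)
    return n_train, n_examples - n_train

def assumed_holdout(n_train: int, val_fraction: float = 0.8) -> tuple[int, int]:
    """Assumed internal split of the training set into (bootstrap, validation) parts."""
    n_val = max(1, int(n_train * val_fraction))
    return n_train - n_val, n_val

n_train, n_dev = split_sizes(42)
n_boot, n_val = assumed_holdout(n_train)
print(n_train, n_dev, n_boot, n_val)
```

With only six validation examples, the score differences between trials (roughly 77 to 80) rest on very few data points and should be read with caution.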





In [18]:
split = max(1, len(dataset) // 5)
dev_rows = list(dataset)[split:]
dspy_dev_answers = []
for row in dev_rows:
    context = "\n\n".join(row["contexts"])
    pred = optimized_rag(question=row["question"], context=context)
    dspy_dev_answers.append(pred.response)

# New dataset for scoring
from datasets import Dataset

dev_dataset = Dataset.from_dict({
    "question": [r["question"] for r in dev_rows],
    "contexts": [r["contexts"] for r in dev_rows],
    "ground_truth": [r["ground_truth"] for r in dev_rows],
    "answer": dspy_dev_answers,   # <- DSPy answers
})


## RAGAS Evaluation (DSPy Answers)


In [20]:
import asyncio
from ragas.llms import llm_factory
from ragas.embeddings.litellm_provider import LiteLLMEmbeddings
from ragas.metrics.collections import ContextPrecision, ContextRecall, Faithfulness, AnswerCorrectness
import instructor
import litellm

# RAGAS LLM (LiteLLM proxy)
litellm.api_base = llm_cfg.api_base
litellm.api_key = llm_cfg.api_key
client = instructor.from_litellm(litellm.acompletion, mode=instructor.Mode.MD_JSON)
ragas_llm = llm_factory(llm_cfg.model, client=client, adapter='litellm', model_args={'temperature': 0.2})

embeddings = LiteLLMEmbeddings(
    model=llm_cfg.embedding_model,
    api_key=llm_cfg.api_key,
    api_base=llm_cfg.api_base,
    encoding_format='float',
)

scorers = {
    'context_precision': ContextPrecision(llm=ragas_llm),
    'context_recall': ContextRecall(llm=ragas_llm),
    'faithfulness': Faithfulness(llm=ragas_llm),
    'answer_correctness': AnswerCorrectness(llm=ragas_llm, embeddings=embeddings),
}

async def _score_row(row, sem):
    async with sem:
        return {
            'context_precision': (await scorers['context_precision'].ascore(
                user_input=row['question'],
                reference=row['ground_truth'],
                retrieved_contexts=row['contexts'],
            )).value,
            'context_recall': (await scorers['context_recall'].ascore(
                user_input=row['question'],
                reference=row['ground_truth'],
                retrieved_contexts=row['contexts'],
            )).value,
            'faithfulness': (await scorers['faithfulness'].ascore(
                user_input=row['question'],
                response=row['answer'],
                retrieved_contexts=row['contexts'],
            )).value,
            'answer_correctness': (await scorers['answer_correctness'].ascore(
                user_input=row['question'],
                response=row['answer'],
                reference=row['ground_truth'],
            )).value,
        }

async def score_dataset_batched(ds, batch_size=10, concurrency=5):
    sem = asyncio.Semaphore(concurrency)
    rows = list(ds)
    results = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i : i + batch_size]
        tasks = [asyncio.create_task(_score_row(r, sem)) for r in batch]
        results.extend(await asyncio.gather(*tasks))
    return results

scores = await score_dataset_batched(dev_dataset, batch_size=16, concurrency=10)
stats = {
    k: {
        'avg': sum(s[k] for s in scores) / len(scores),
        'min': min(s[k] for s in scores),
        'max': max(s[k] for s in scores),
    }
    for k in scores[0].keys()
}
print(stats)


{'context_precision': {'avg': 0.9203022875520346, 'min': 0.4499999999775, 'max': 0.99999999998}, 'context_recall': {'avg': 0.9509803921568628, 'min': 0.5, 'max': 1.0}, 'faithfulness': {'avg': 0.6060677884207296, 'min': 0.0, 'max': 1.0}, 'answer_correctness': {'avg': 0.5572924247809161, 'min': 0.12253877440565702, 'max': 0.9855250378946347}}
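
The concurrency pattern above, a semaphore that caps in-flight requests while rows are processed in batches, can be tried without any LLM. In this sketch `fake_score` is a hypothetical placeholder for the RAGAS calls, and the peak-concurrency counter exists only to show that the cap holds.

```python
import asyncio

async def fake_score(row: dict) -> dict:
    """Hypothetical scorer: pretends to call an LLM and returns one metric."""
    await asyncio.sleep(0.01)
    return {"answer_length": float(len(row["answer"]))}

async def score_batched(rows, batch_size=4, concurrency=2):
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def guarded(row):
        nonlocal in_flight, peak
        async with sem:  # at most `concurrency` rows are scored at once
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_score(row)
            finally:
                in_flight -= 1

    results = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i : i + batch_size]
        # gather preserves input order, so results line up with rows
        results.extend(await asyncio.gather(*(guarded(r) for r in batch)))
    return results, peak

def summarize(scores):
    """avg/min/max per metric, as in the stats dictionary above."""
    return {
        k: {
            "avg": sum(s[k] for s in scores) / len(scores),
            "min": min(s[k] for s in scores),
            "max": max(s[k] for s in scores),
        }
        for k in scores[0]
    }

rows = [{"answer": "a" * n} for n in range(1, 11)]
results, peak = asyncio.run(score_batched(rows))
print(summarize(results), "peak concurrency:", peak)
```

Inside a notebook, where an event loop is already running, `asyncio.run(...)` would be replaced by a plain `await`, as the cell above does.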


## Results DataFrame


In [23]:
import pandas as pd

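# Generate DSPy answers (uses optimized_rag if available)
# Note: dspy_answers covers the full dataset; the DataFrame below is built
# from dev_dataset and dspy_dev_answers.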
_rag_model = optimized_rag if 'optimized_rag' in globals() else rag
dspy_answers = []
for row in dataset:
    context = "\n\n".join(row['contexts'])
    pred = _rag_model(question=row['question'], context=context)
    dspy_answers.append(pred.response)


# Assemble the DataFrame
df = pd.DataFrame({
    'question': [r['question'] for r in dev_dataset],
    'contexts': ["\n\n".join(r['contexts']) for r in dev_dataset],
    'answer_dspy': dspy_dev_answers,
    'ground_truth': [r['ground_truth'] for r in dev_dataset],
    'context_precision': [s['context_precision'] for s in scores],
    'context_recall': [s['context_recall'] for s in scores],
    'faithfulness': [s['faithfulness'] for s in scores],
    'answer_correctness': [s['answer_correctness'] for s in scores],
})


df.head()


| | question | contexts | answer_dspy | ground_truth | context_precision | context_recall | faithfulness | answer_correctness |
|---|---|---|---|---|---|---|---|---|
| 0 | Was ist bei der Auswahl eines externen Webhost... | gement (B)\nFür Prozesse, die potenziell ausge... | Bei der Auswahl eines externen Webhosters muss... | Bei der Nutzung externer Webhosting-Dienste so... | 1.0 | 1.0 | 0.0 | 0.981953 |
| 1 | Wie sollten Fehlermeldungen auf einem Webserve... | lständig dargestellt oder Sicherheitsmechanism... | Fehlermeldungen dürfen weder Produkt‑ noch Ver... | Aus HTTP-Antworten und Fehlermeldungen dürfen ... | 0.5 | 1.0 | 1.0 | 0.951677 |
| 2 | Welche Maßnahmen sind bei erhöhtem Schutzbedar... | \n...\n\nStrict-Transport-Security,\n\n...\n\n... | Bei erhöhtem Schutzbedarf sollten Sie eine Web... | Bei erhöhtem Schutzbedarf sollten Webserver re... | 0.45 | 1.0 | 1.0 | 0.506447 |
| 3 | Wer trägt die Verantwortung für Informationssi... | rungskatalog Cloud Computing (C5)“ Kriterien z... | Die Verantwortung für die Informationssicherhe... | Die Verantwortung für Informationssicherheit v... | 1.0 | 1.0 | 1.0 | 0.722545 |
| 4 | Welche Prozesse dürfen grundsätzlich ausgelage... | gement (B)\nFür Prozesse, die potenziell ausge... | Grundsätzlich dürfen nur solche Prozesse ausge... | Nur Prozesse, die risikoorientiert bewertet wu... | 0.804167 | 1.0 | 0.0 | 0.967802 |
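
One use of this table is to spot rows where the metrics disagree, for example answers rated as correct but not faithful to the retrieved context. The sketch below uses the five rows shown above as plain dictionaries; `flag_disagreements` is a hypothetical helper, and the thresholds `0.5` and `0.9` are arbitrary illustrative choices.

```python
# Scores copied from the five rows displayed above.
rows = [
    {"idx": 0, "faithfulness": 0.0, "answer_correctness": 0.981953},
    {"idx": 1, "faithfulness": 1.0, "answer_correctness": 0.951677},
    {"idx": 2, "faithfulness": 1.0, "answer_correctness": 0.506447},
    {"idx": 3, "faithfulness": 1.0, "answer_correctness": 0.722545},
    {"idx": 4, "faithfulness": 0.0, "answer_correctness": 0.967802},
]

def flag_disagreements(rows, max_faithfulness=0.5, min_correctness=0.9):
    """Return indices of rows with low faithfulness but high answer correctness."""
    return [
        r["idx"]
        for r in rows
        if r["faithfulness"] < max_faithfulness and r["answer_correctness"] >= min_correctness
    ]

print(flag_disagreements(rows))  # rows worth inspecting by hand
```

Rows flagged this way are candidates for manual review, since a faithfulness of 0.0 next to a near-perfect correctness score may point to a scoring problem rather than a bad answer.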
