## DSPy Demo
This notebook demonstrates the use of the DSPy framework for automatic prompt optimization.

In [1]:
%cd ../

/Users/jakubzovak/Desktop/dspy-intro


In [2]:
%load_ext autoreload
%autoreload 2

## Setup

In [28]:
import json
import os
import re
from pathlib import Path
from typing import Literal, Any, cast

import dspy
from dotenv import load_dotenv
from dspy.adapters.baml_adapter import BAMLAdapter
from pydantic import BaseModel, Field
from sklearn.model_selection import train_test_split

from notebooks.utils.dtos import Merger, Acquisition, Other
from notebooks.utils.evaluation import validate_answer
from notebooks.utils.helpers import extract_first_n_sentences, read_data, get_gt_pydantic_model


Load the environment variables that hold the MLflow and OpenAI keys.

In [4]:
load_dotenv()

True

## DSPy

### Adapters
Adapters are the middleware between DSPy and the LLM. They control how a signature is rendered into a prompt and how the response is parsed, which enables different ways of interacting with the LLM, such as structured output.

In [5]:
predictor_lm = dspy.LM(
    "openai/gpt-4.1-mini",
    max_tokens=16_000,
)

dspy.configure(lm=predictor_lm, adapter=BAMLAdapter()) # type: ignore

### Signatures
Signatures are declarative specifications of the behavior we expect from the LLM.

In [6]:
class ClassifyArticle(dspy.Signature):
    """
    Analyze the news article and classify it as a merger or acquisition deal.
    - A merger is when two companies combine to form a new entity.
    - An acquisition is when one company takes over another.
    - If it mentions a potential, rumored or failed deal, classify it as "other".
    """

    text: str = dspy.InputField()
    article_type: Literal["merger", "acquisition", "other"] = dspy.OutputField()


class ExtractMergerInfo(dspy.Signature):
    """
    Extract information about companies involved in a merger deal.
    - If the currency symbol is just "$", assume USD.
    """

    text: str = dspy.InputField()
    merger_info: Merger = dspy.OutputField()


class ExtractAcquisitionInfo(dspy.Signature):
    """
    Extract information about companies involved in an acquisition deal.
    - If the currency symbol is just "$", assume USD.
    """

    text: str = dspy.InputField()
    acquisition_info: Acquisition = dspy.OutputField()

### Modules
Modules are the core building blocks in DSPy; they combine signatures with logic. Modules are composable, so they can be combined to form more complex systems.

In [None]:
class Extract(dspy.Module):
    num_sentences: int = 5

    def __init__(self):
        super().__init__()
        # One predictor per signature: a classifier plus two specialized extractors.
        self.classifier = dspy.Predict(ClassifyArticle)
        self.merger_extractor = dspy.Predict(ExtractMergerInfo)
        self.acquisition_extractor = dspy.Predict(ExtractAcquisitionInfo)

    def classify(self, text: str) -> str:
        # Classify on the first few sentences only, which keeps the prompt short.
        text = extract_first_n_sentences(text, num_sentences=self.num_sentences)

        result = self.classifier(text=text)
        article_type: str = result.article_type  # type: ignore

        return article_type

    def forward(self, text: str, article_id: int) -> Merger | Acquisition | Other:
        article_type = self.classify(text)

        # Route the article to the extractor that matches its predicted type.
        if article_type == "merger":
            extracted_result = self.merger_extractor(text=text)
            merger_info: Merger = extracted_result.merger_info
            merger_info.article_id = article_id
            return merger_info
        elif article_type == "acquisition":
            extracted_result = self.acquisition_extractor(text=text)
            acquisition_info: Acquisition = extracted_result.acquisition_info
            acquisition_info.article_id = article_id
            return acquisition_info
        else:
            # No extraction is needed for potential, rumored or failed deals.
            return Other(article_id=article_id, article_type="other")
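
The `classify` step relies on `extract_first_n_sentences`, which is imported from `notebooks.utils.helpers` and not shown in this notebook. As a rough idea of what such a helper could look like, here is a minimal sketch. The name `first_n_sentences` and the regex-based splitting are assumptions for illustration; the real helper may split sentences differently.

```python
import re

# Hypothetical stand-in for notebooks.utils.helpers.extract_first_n_sentences.
# Splits on whitespace that follows ".", "!" or "?".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def first_n_sentences(text: str, num_sentences: int = 5) -> str:
    """Return the first `num_sentences` sentences of `text`, joined by spaces."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    return " ".join(sentences[:num_sentences])


sample = "Sayona and Piedmont will merge. The deal closes in 2025. A board was named."
print(first_n_sentences(sample, num_sentences=2))
# Sayona and Piedmont will merge. The deal closes in 2025.
```

A naive splitter like this also breaks on abbreviations such as "Inc." or "U.S.", so a production helper would likely need a proper sentence tokenizer.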

## Data Preparation
Load articles about gold mining mergers and acquisitions.

In [8]:
articles = read_data(Path("data/articles.json"))
print(f"Example of loaded article with id {articles[0]['id']}: \n{json.dumps(articles[0], indent=2)}")

Example of loaded article with id 1: 
{
  "id": 1,
  "url": "https://www.australianmining.com.au/elevra-lithium-a-defining-moment-in-north-american-lithium/",
  "title": "Elevra: \u2018A defining moment in North American lithium\u2019",
  "date": "2025-04-09",
  "content": "The company to be created through the merger of ASX-listed Sayona Mining and Piedmont Lithium will be named Elevra Lithium, subject to Sayona shareholder approval. Originally announced in November 2024, the merger is set to create the largest hard-rock lithium producer in the US. The transaction will result in an approximate 50:50 equity holding for Sayona and Piedmont shareholders on a fully diluted basis following the deal\u2019s closing, which is expected in the first half of 2025. Now the merged entity has been named, nominees for the board have been announced. The Elevra Lithium board will initially comprise eight members, including four directors to be appointed by Sayona and four directors to be appointed by 

Load the ground truth for the articles, which contains each article's classification and the deal information that should be extracted from it.

In [9]:
ground_truths: list[dict[str, Any]] = read_data(Path("data/ground_truth.json"))
print(f"Example of loaded ground truth with id {ground_truths[0]['article_id']}: \n{json.dumps(ground_truths[0], indent=2)}")


Example of loaded ground truth with id 1: 
{
  "article_id": 1,
  "company_1": "Sayona Mining",
  "company_1_ticker": null,
  "company_2": "Piedmont Lithium",
  "company_2_ticker": null,
  "merged_entity": "Elevra Lithium",
  "deal_amount": null,
  "deal_currency": "Unknown",
  "article_type": "merger"
}


Convert data to DSPy examples.

In [10]:
NUM_SENTENCES = 5
dspy_examples: list[dspy.Example] = []
for article in articles:
    # Assumes article ids are 1-based and ground_truths is ordered by article_id.
    ground_truth: dict[str, Any] = ground_truths[article["id"] - 1]  # type: ignore
    gt_pydantic_model = get_gt_pydantic_model(ground_truth)
    example = dspy.Example(
        # The model input is the title followed by the first few sentences of the body.
        text=(
            article["title"] + "\n" + extract_first_n_sentences(article["content"], NUM_SENTENCES)
        ),
        article_id=article["id"],
        expected_output=gt_pydantic_model,
    ).with_inputs("text", "article_id")  # type: ignore
    dspy_examples.append(example)
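
The loop above pairs each article with its ground truth by list position (`article["id"] - 1`), which only works while ids are contiguous and both files share the same order. A possible alternative is to key the ground truth by `article_id`. The sketch below uses made-up records (`sample_articles`, `sample_ground_truths`) shaped like the notebook's data.

```python
from typing import Any

# Illustrative records only; the values are invented for this sketch.
sample_articles: list[dict[str, Any]] = [
    {"id": 2, "title": "B", "content": "..."},
    {"id": 1, "title": "A", "content": "..."},
]
sample_ground_truths: list[dict[str, Any]] = [
    {"article_id": 1, "article_type": "merger"},
    {"article_id": 2, "article_type": "other"},
]

# Key ground truth by article_id so the pairing does not depend on list order.
gt_by_id = {gt["article_id"]: gt for gt in sample_ground_truths}

pairs = [(a["id"], gt_by_id[a["id"]]["article_type"]) for a in sample_articles]
print(pairs)  # [(2, 'other'), (1, 'merger')]
```

With a dictionary lookup, a missing ground-truth record raises a `KeyError` instead of silently pairing an article with the wrong label.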

Perform the train-test split. DSPy does not need a lot of training data.  <br>
In fact, **a train-test split of around 20-80 is quite standard**; the cell below uses 25-75 (`test_size=0.75`).

In [11]:
train_set, test_set = cast(
    tuple[list[dspy.Example], list[dspy.Example]],
    train_test_split(dspy_examples, test_size=0.75, random_state=39)
)

print(f"Train set size: {len(train_set)} \nTest set size:  {len(test_set)}")

Train set size: 10 
Test set size:  30
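
The sizes printed above follow from the split fraction. As a quick check of the arithmetic, the sketch below assumes the test count is the fraction of the dataset rounded up and the remainder goes to training, which is how `train_test_split` is understood to behave for a fractional `test_size`. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(40, 0.75))  # (10, 30)
```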


## Run Baseline DSPy Program

Instantiate and test the baseline DSPy program without any optimization.

In [12]:
extract_baseline = Extract()

Execute the baseline module.

In [17]:
extracted_article_1 = extract_baseline(text=train_set[0].text, article_id=train_set[0].article_id)

In [22]:
print(f"Extracted article:\n{extracted_article_1.model_dump_json(indent=2)}")

Extracted article:
{
  "article_id": 31,
  "company_1": "Baltic Iron Ore AB",
  "company_1_ticker": null,
  "company_2": "FerroNordic Resources ASA",
  "company_2_ticker": null,
  "merged_entity": null,
  "deal_amount": null,
  "deal_currency": "Unknown",
  "article_type": "merger"
}


Explore the underlying prompt. With `n=1` you will see the most recent call, which is the extraction prompt. With `n=2` you will also see the call before it, which is the classification prompt.

In [None]:
dspy.inspect_history(n=1)

Save the baseline extractor as a DSPy program in JSON format.

In [26]:
extract_baseline.save("programs/baseline_extractor.json")

## Optimizations

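### Labeled Few-Shot Optimizer
With `max_bootstrapped_demos=0`, `BootstrapFewShot` generates no new demonstrations and only attaches labeled examples from the training set as few-shot demos (up to `max_labeled_demos`).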

In [29]:
label_fewshot_optimizer = dspy.BootstrapFewShot(
    metric=validate_answer,
    max_bootstrapped_demos=0,
    max_labeled_demos=5,
    max_rounds=1,
    max_errors=10,
)
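
The optimizer scores candidates with `validate_answer`, which is imported from `notebooks.utils.evaluation` and not shown here. DSPy metrics are plain callables that take an example, a prediction and an optional trace. The sketch below is a hypothetical metric in that shape, not the notebook's implementation; `exact_match_metric` and the compared fields are assumptions, and `SimpleNamespace` stands in for `dspy.Example` and the Pydantic records.

```python
from types import SimpleNamespace


def exact_match_metric(example, pred, trace=None) -> bool:
    """Hypothetical metric: True when type and company names match the expected record."""
    expected = example.expected_output
    if pred.article_type != expected.article_type:
        return False
    # Records of type "other" carry no company fields; getattr falls back to None.
    fields = ("company_1", "company_2")
    return all(getattr(pred, f, None) == getattr(expected, f, None) for f in fields)


expected = SimpleNamespace(
    article_type="merger", company_1="Sayona Mining", company_2="Piedmont Lithium"
)
example = SimpleNamespace(expected_output=expected)
good = SimpleNamespace(
    article_type="merger", company_1="Sayona Mining", company_2="Piedmont Lithium"
)
bad = SimpleNamespace(
    article_type="acquisition", company_1="Sayona Mining", company_2="Piedmont Lithium"
)
print(exact_match_metric(example, good), exact_match_metric(example, bad))  # True False
```

A stricter metric could also compare tickers, deal amount and currency, or return a float for partial credit instead of a boolean.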

In [None]:
# Compile the baseline program (the student), not the optimizer itself.
optimized_extract = label_fewshot_optimizer.compile(extract_baseline, trainset=train_set)