# Setup

Select the model to use as teacher and student during prompt optimization.

DSPy uses LiteLLM model strings (e.g. **ollama_chat/...**).

For optimization, you would typically choose the stronger model as the teacher, which proposes instructions and generates bootstrapped samples. The student is the model you intend to use for the task itself. The two can also be the same model.

In [1]:
import os
from dotenv import load_dotenv
import dspy
import logging

logging.getLogger().setLevel(logging.INFO)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("LiteLLM").setLevel(logging.WARNING)

# set your api key (if needed)
load_dotenv("../../.env")
APIKEY = os.getenv("APIKEY")

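# set your model (litellm model strings)
model_id = "openrouter/deepseek/deepseek-chat"
lm = dspy.LM(model_id, api_key=APIKEY, cache=True)
dspy.configure(lm=lm)

# teacher model used later during optimization; reusing the same model here
# is an assumption, so swap in a stronger model if you have one
teacher_lm = lm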

# Signatures

Signatures are like DSPy's pydantic models. Describe the fields and docstrings as though they are prompts (they are).

They will likely reflect the data in your table schema, but could also include additional intermediate data structures in multi-hop patterns.

### Initial prototype
```python
from typing import Literal, Optional


class NewsAppSignatureExample(dspy.Signature):
    text: str = dspy.InputField(desc="Text from an article for analysis")
    category: Literal["world", "entertainment", "science", "health", "business", "sports", "politics", "tech"] = dspy.OutputField(desc="Article content category")
    title: str = dspy.OutputField(desc="Article title, when available. Otherwise create one")
    tags: list[str] = dspy.OutputField(desc="Tags for search and classification")
    notable_people: Optional[list[str]] = dspy.OutputField(desc="Names of notable people in the article")
    notable_organizations: Optional[list[str]] = dspy.OutputField(desc="Names of notable organizations in the article")


# system prompt goes in the docstring
NewsAppSignatureExample.__doc__ = """
You are provided with the text of a news article. Help provide the requested information for cataloging.
"""
```

With some good examples in hand, I refined an expanded list of fields with ChatGPT.

In [2]:
from news_app import NewsAppSignature

# Run the program

I like the natural code style of writing a DSPy signature. A pydantic model becomes the prompt.

`Literal` type + LLM = classifier (cool!)
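To make the classifier idea concrete, here is a minimal sketch of how a `Literal` annotation doubles as a closed label set that an answer can be checked against. This is not DSPy's actual parsing code; `Category` and `parse_category` are hypothetical names, and the label list is shortened for illustration.

```python
from typing import Literal, get_args

# Hypothetical, shortened label set for illustration only.
Category = Literal["world", "business", "sports", "tech"]


def parse_category(raw: str) -> str:
    """Normalize a raw model answer and check it against the Literal's labels."""
    allowed = get_args(Category)  # tuple of the permitted label strings
    label = raw.strip().lower()
    if label not in allowed:
        raise ValueError(f"{raw!r} is not one of {allowed}")
    return label


print(parse_category(" Business "))  # business
```

Anything outside the label set raises an error, which is what turns a free-text generator into a classifier over a fixed set of classes.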

We can already try it out, using the ChainOfThought predictor to run the program.

In [7]:
text = """
Business Briefing Dec. 2, 2015
Nokia shareholders overwhelmingly approved the acquisition of the ailing French telecom Alcatel-Lucent, removing one of the last hurdles to a 15.6 billion euro ($16.5 billion) deal that will make Nokia a market leader in networks.
In October, Nokia said it would pay 4 billion to shareholders as the company raised its outlook for the year.
Rajeev Suri, Nokias chief executive, said he was delighted by shareholders recognizing the long-term value creation opportunity of the deal, which is expected to close during the first quarter of 2016.
"""

In [23]:
catalog = dspy.ChainOfThought(NewsAppSignature)
catalog_item = catalog(article_text=text)
print(catalog_item)

Prediction(
    reasoning="The article provides a business update on the acquisition of Alcatel-Lucent by Nokia, focusing on the approval from shareholders and the expected impact on the company's market position.",
    generated_title='Nokia Shareholders Approve Alcatel-Lucent Acquisition',
    publication_date=datetime.date(2015, 12, 2),
    primary_category='business',
    content_type='press_release',
    keywords=['Nokia', 'Alcatel-Lucent', 'acquisition', 'telecom', 'market leader', 'networks', 'shareholders', 'long-term value creation', 'first quarter of 2016'],
    mentioned_people=['Rajeev Suri'],
    mentioned_organizations=['Nokia', 'Alcatel-Lucent'],
    mentioned_legislation=None,
    mentioned_locations=None,
    sentiment_tone='positive',
    extracted_quotes=['Rajeev Suri, Nokias chief executive, said he was delighted by shareholders recognizing the long-term value creation opportunity of the deal.']
)


# Generating training data

We'll rely on "best of n" scaling to help create synthetic data for our application. Then we'll manually review ~100 examples we created for training.


## Basic test-time scaling

I'll generate some training data using a simplistic best-of-n style of test-time scaling. Aggregating all of the types is a bit more challenging, so I've done that in the `aggregate/` folder as a module that I can work on further.

Depending on where you are running your LLM calls, you might choose the serial or parallel methods below.

In [5]:
from dspy import Parallel, ChainOfThought
from typing import List, Literal
import tqdm

def generate_candidates_serial(text, n=8):
    """ Run in serial """
    return [catalog(article_text=text) for _ in range(n)]


def generate_candidates_parallel(text, n=8, num_threads=2):
    """ Run in parallel """
    parallel_executor = dspy.Parallel(num_threads=num_threads)
    exec_pairs = [(catalog, {'article_text': text}) for _ in range(n)]
    results = parallel_executor.forward(exec_pairs)
    
    return results
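
The serial versus parallel trade-off can be sketched without DSPy, using only the standard library. In this sketch `fake_catalog` is a hypothetical stand-in for the predictor, and `ThreadPoolExecutor` is swapped in for `dspy.Parallel`, so it illustrates the pattern rather than DSPy's own executor.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def fake_catalog(article_text: str) -> dict:
    """Hypothetical stand-in for the DSPy predictor; returns a noisy label."""
    return {"primary_category": random.choice(["business", "business", "tech"])}


def candidates_serial(text: str, n: int = 8) -> list:
    """Sample n candidates one after another."""
    return [fake_catalog(article_text=text) for _ in range(n)]


def candidates_threaded(text: str, n: int = 8, num_threads: int = 2) -> list:
    """Sample n candidates using a small thread pool."""
    # Threads suit I/O-bound work such as waiting on a remote LLM API.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(fake_catalog, article_text=text) for _ in range(n)]
        return [f.result() for f in futures]


print(len(candidates_serial("some article")), len(candidates_threaded("some article")))
```

A local model served from a single GPU may gain little from extra threads, while a hosted API usually tolerates a few concurrent requests, which is the practical reason to keep both variants around.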

### Aggregation

We need to aggregate by each field to obtain consensus results. For lists, we fuzzy deduplicate and then set a threshold of N minimum occurrences for acceptance. We are targeting aggregation from 8 outputs.
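
As a small worked example of the rule described above, the sketch below clusters list items by token-level Jaccard similarity and keeps only clusters that occur often enough, with a plain majority vote for single-valued fields. It uses only the standard library; `jaccard`, `consensus_list`, and the sample `runs` are hypothetical and are not the code in the `aggregate/` module.

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def consensus_list(outputs, min_count=3, similarity=0.6):
    """Keep items whose fuzzy cluster has at least `min_count` members."""
    clusters = []
    for item in (x for out in outputs for x in out):
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if jaccard(item, cluster[0]) >= similarity:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return [c[0] for c in clusters if len(c) >= min_count]


# Hypothetical outputs from four runs (the notebook targets eight).
runs = [
    ["Nokia", "Alcatel-Lucent"],
    ["nokia", "Alcatel-Lucent", "Reuters"],
    ["Nokia", "alcatel-lucent"],
    ["Nokia"],
]
print(consensus_list(runs, min_count=3))  # ['Nokia', 'Alcatel-Lucent']

# Single-valued fields use a plain majority vote.
votes = ["business", "business", "tech", "business"]
print(Counter(votes).most_common(1)[0][0])  # business
```

"Reuters" appears in only one run, so it falls below the threshold and is dropped, which is the point of requiring a minimum number of occurrences.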

I've modularized the code and imported it here, since it's a bit long and not especially interesting.

In [6]:
import sys
sys.path.append("..")

from aggregate.aggregate import LLMOutputAggregator

```python
from typing import List, Optional, Literal, Dict, Any, Union, get_origin, get_args
from collections import Counter
import textdistance
import itertools
from pydantic import ValidationError

def is_optional_field(type_hint) -> bool:
    """
    Determines if a type hint is Optional, i.e., Union[X, None].
    """
    return get_origin(type_hint) is Union and type(None) in get_args(type_hint)

def aggregate_signatures(text, predictions: List[Any], threshold: int = 2, debug: bool = False) -> NewsAppSignature:
    """
    Aggregates multiple Prediction objects into a single NewsAppSignature.
    
    Args:
        predictions (List[Any]): A list of Prediction objects.
        threshold (int): Minimum number of occurrences for a cluster to be included.
    
    Returns:
        NewsAppSignature: The aggregated signature.
    
    Raises:
        ValueError: If required fields are missing or validation fails.
    """
    if not predictions:
        raise ValueError("No predictions to aggregate.")

    aggregated_fields: Dict[str, Any] = {}
    
    # Helper function for majority voting
    def majority_vote(values: List[Any]) -> Any:
        counter = Counter(values)
        if debug:
            print(counter)
        most_common, count = counter.most_common(1)[0]
        return most_common

    # Helper function for clustering similar strings with frequency threshold
    def cluster_strings_with_threshold(strings: List[str], threshold: int, similarity_threshold: float = 0.6) -> List[str]:
        """
        Clusters similar strings based on Jaccard similarity and filters clusters based on frequency threshold.
        
        Args:
            strings (List[str]): List of strings to cluster.
            threshold (int): Minimum number of occurrences for a cluster to be included.
            similarity_threshold (float): Jaccard similarity threshold for clustering.
        
        Returns:
            List[str]: List of representative strings from clusters that meet the threshold.
        """
        clusters = []
        for string in strings:
            added = False
            for cluster in clusters:
                # Compare with the first item in the cluster
                similarity = textdistance.jaccard.normalized_similarity(
                    set(string.lower().split()), set(cluster[0].lower().split())
                )
                if similarity >= similarity_threshold:
                    cluster.append(string)
                    added = True
                    break
            if not added:
                clusters.append([string])
        
        # Filter clusters based on threshold
        if debug:
            print(clusters)
        filtered_clusters = [cluster for cluster in clusters if len(cluster) >= threshold]
        
        # Return one representative from each filtered cluster
        return [cluster[0] for cluster in filtered_clusters]
    
    # Iterate over each field in the NewsAppSignature
    for field_name, field_type in NewsAppSignature.__annotations__.items():
        # 'article_text' is the input field, not a model output, so there is
        # nothing to aggregate: copy the source text through unchanged
        if field_name == "article_text":
            aggregated_fields[field_name] = text
            continue

        # Collect all non-None values for the current field
        field_values = [getattr(pred, field_name, None) for pred in predictions]
        field_values = [val for val in field_values if val is not None]

        if not field_values:
            # Determine if the field is optional
            if is_optional_field(field_type):
                aggregated_fields[field_name] = None
            else:
                # For required fields with no values, raise an error
                raise ValueError(f"No values found for required field '{field_name}' during aggregation.")
            continue

        # Determine the field type for aggregation
        origin_type = get_origin(field_type)
        args_type = get_args(field_type)

        # Handle Literal types
        if origin_type is Literal:
            # Majority voting for Literal fields
            aggregated_fields[field_name] = majority_vote(field_values)
        elif isinstance(field_values[0], str):
            # Majority voting for single-string fields
            aggregated_fields[field_name] = majority_vote(field_values)
        elif isinstance(field_values[0], list):
            # Flatten all lists
            flattened = list(itertools.chain.from_iterable(field_values))
            # Cluster similar strings with frequency threshold
            clustered = cluster_strings_with_threshold(flattened, threshold=threshold)
            aggregated_fields[field_name] = clustered
        else:
            # Handle other types if necessary
            aggregated_fields[field_name] = majority_vote(field_values)

    # Instantiate the aggregated NewsAppSignature with all fields
    try:
        aggregated_signature = NewsAppSignature(**aggregated_fields)
    except ValidationError as ve:
        # Extract detailed validation errors
        raise ValueError(f"Error creating aggregated NewsAppSignature: {ve}")

    return aggregated_signature
```

```python
consensus = aggregate_signatures(text, results, debug=True, threshold=4)
consensus
```

## Process a bunch of data

We can load a news dataset (here `valurank/News_Articles_Categorization`; `ag_news` would be an alternative) to create our synthetic training data, and process ~100 rows.

I'll save the results as I go. Quick and dirty, just restart if it fails.

In [7]:
from datasets import load_dataset

# Load a diverse news dataset ("ag_news" would be an alternative)
dataset = load_dataset("valurank/News_Articles_Categorization", split="train")

### Utilities for tracking the dataset offset

In [8]:
import json
import hashlib
import os
import tqdm

# Define the number of articles and samples
num_articles = 100
samples_per_article = 8

# Define the output directory
output_dir = "training_data"

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Define a file to keep track of progress (offset)
progress_file = os.path.join(output_dir, "progress.txt")

# Function to generate a content hash of a JSON string (MD5, used only for file naming, not for security)
def generate_hash(json_str: str) -> str:
    return hashlib.md5(json_str.encode('utf-8')).hexdigest()

# Function to load the current offset
def load_offset() -> int:
    if os.path.exists(progress_file):
        with open(progress_file, 'r') as f:
            try:
                offset = int(f.read().strip())
                return offset
            except ValueError:
                return 0
    return 0

# Function to save the current offset
def save_offset(offset: int):
    with open(progress_file, 'w') as f:
        f.write(str(offset))
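
The restart behavior these utilities are meant to support can be sketched as follows. This is a self-contained illustration under assumptions: it writes to a temporary directory rather than `training_data/`, and `run` with its `fail_at` parameter is a hypothetical helper that simulates a crash.

```python
import hashlib
import os
import tempfile


def load_offset(path: str) -> int:
    """Return the saved offset, or 0 if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return 0


def save_offset(path: str, offset: int) -> None:
    with open(path, "w") as f:
        f.write(str(offset))


def run(path: str, total: int, fail_at=None) -> list:
    """Process items from the saved offset; optionally simulate a crash."""
    done = []
    for i in range(load_offset(path), total):
        if i == fail_at:
            raise RuntimeError(f"simulated failure at item {i}")
        done.append(i)
        save_offset(path, i + 1)  # persist only after the item succeeds
    return done


with tempfile.TemporaryDirectory() as tmp:
    progress = os.path.join(tmp, "progress.txt")
    try:
        run(progress, total=5, fail_at=3)
    except RuntimeError:
        pass
    resumed = run(progress, total=5)  # picks up at item 3
    print(resumed)  # [3, 4]

# Content-hash filenames are deterministic: identical JSON maps to one file.
name = hashlib.md5(b'{"a": 1}').hexdigest() + ".json"
```

Because the filename is derived from the content, re-running an article that yields the same consensus JSON overwrites the same file instead of creating a duplicate.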


### Best-Of-N Processing Loop

In [None]:
# Initialize the starting offset
start_offset = load_offset()

# Iterate over the specified number of articles starting from the offset
for i in tqdm.tqdm(range(start_offset, num_articles), desc="Processing Articles", total=num_articles - start_offset):
    try:
        article = dataset[i]
        text = article['Text']
        
        # Generate multiple predictions
        # candidates = generate_candidates_serial(text, n=samples_per_article)
        candidates = generate_candidates_parallel(text, n=samples_per_article)
        
        # Attach the source text to each candidate before aggregating
        candidates_with_text = []
        for c in [c.toDict() for c in candidates]:
            c.update({"article_text": text})
            candidates_with_text.append(c)

        # Aggregate predictions to form consensus
        consensus = LLMOutputAggregator.aggregate(
            NewsAppSignature, candidates_with_text, threshold=3
        )
        
        # Convert consensus to JSON string
        consensus_json = consensus.model_dump_json()
        
        # Generate filename using hash of JSON string
        filename_hash = generate_hash(consensus_json)
        filename = f"{filename_hash}.json"
        file_path = os.path.join(output_dir, filename)
        
        # Save the JSON string to the file
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(consensus_json)
        
        # Update the progress offset
        save_offset(i + 1)
        
    except Exception as e:
        print(f"Error processing article {i}: {e}")
        # Optionally, log the error to a file
        error_log = os.path.join(output_dir, "error_log.txt")
        with open(error_log, 'a') as f:
            f.write(f"Article {i}: {e}\n")
        # Continue with the next article
        continue


## Review data

(This is done in the review tool.)

## Load data

In [3]:
import glob
import json
from datetime import datetime

data = []
for file in glob.glob("training_data/accepted/*.json"):
    with open(file, "r") as fh:
        tmp = json.load(fh)

        # convert to date
        tmp["publication_date"] = datetime.strptime(tmp["publication_date"], "%Y-%m-%d").date() if tmp["publication_date"] else None

        # remove reasoning from example
        if "reasoning" in tmp:
            del tmp["reasoning"]

        e = dspy.Example(tmp).with_inputs("article_text")
        data.append(e)
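
The date handling in the loader can be checked in isolation. The sketch below assumes the saved JSON stores `publication_date` as an ISO `YYYY-MM-DD` string or `null`, which is what the loader's format string implies; `parse_publication_date` and the sample record are hypothetical.

```python
import json
from datetime import date, datetime


def parse_publication_date(value):
    """Convert an ISO 'YYYY-MM-DD' string to a date; pass None through."""
    return datetime.strptime(value, "%Y-%m-%d").date() if value else None


# Hypothetical saved record, shaped like the reviewed JSON files.
record = json.loads('{"publication_date": "2015-12-02", "reasoning": "..."}')
record["publication_date"] = parse_publication_date(record["publication_date"])
record.pop("reasoning", None)  # drop the field if present, no error if absent

print(record)  # {'publication_date': datetime.date(2015, 12, 2)}
```

Converting back to `date` keeps the loaded examples type-compatible with the `datetime.date` values the predictor returns.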

In [4]:
data[0]

Example({'article_text': 'Elon Musk, Amber Heard Something\'s Fishy On Wrapped-Up Sushi Last we heard, Elon Musk and Amber Heard were "not back together" even though they kiss goodbye and dance real close ... sorry, we\'re not buying that now. Amber and Elon went on a sushi date Monday in WeHo, and looked like the full-blown hand-holding couple that\'s definitely on again. But that\'s only because that\'s exactly what they are -- no matter how many times they try to say they\'re not reunited. We broke the story ... Elon and Amber started hanging out again this past fall ... after announcing their split in the summer. Since then, they\'ve smooched and gone dancing together. If it looks like a reunited duck, walks like a reunited duck ... TMZ.com', 'generated_title': 'Elon Musk and Amber Heard Spark Reunion Rumors with Sushi Date in WeHo', 'publication_date': None, 'primary_category': 'entertainment', 'content_type': 'reporting', 'keywords': ['Elon Musk', 'Amber Heard', 'sushi date', 'We

In [5]:
import sys
sys.path.append("../scorer")
from scorer import WordLlamaScorer
from dspy.evaluate import Evaluate
from dspy.teleprompt import MIPROv2


scorer = WordLlamaScorer.from_signature(NewsAppSignature, skip_fields=["article_text", "reasoning"])


teleprompter = MIPROv2(
    metric=scorer,
    auto="medium",
    teacher_settings=dict(lm=teacher_lm),
    num_threads=2
)

catalog = dspy.ChainOfThought(NewsAppSignature)
optimized_program = teleprompter.compile(
    student=catalog.deepcopy(),
    teacher=catalog.deepcopy(),
    trainset=data,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    requires_permission_to_run=False,
)


2025/01/11 08:53:11 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 25
minibatch: True
num_candidates: 19
valset size: 80

2025/01/11 08:53:11 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/01/11 08:53:11 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/01/11 08:53:11 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=19 sets of demonstrations...


Bootstrapping set 1/19
Bootstrapping set 2/19
Bootstrapping set 3/19


 10%|█████████████                                                                                                                      | 2/20 [00:00<00:01, 12.60it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 4/19


 10%|█████████████                                                                                                                      | 2/20 [00:00<00:00, 37.28it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 5/19


  5%|██████▌                                                                                                                            | 1/20 [00:00<00:00, 33.14it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/19


  5%|██████▌                                                                                                                            | 1/20 [00:00<00:00, 53.18it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 7/19


  5%|██████▌                                                                                                                            | 1/20 [00:00<00:00, 50.79it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 8/19


  5%|██████▌                                                                                                                            | 1/20 [00:00<00:00, 58.61it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 9/19


  5%|██████▌                                                                                                                            | 1/20 [00:00<00:00, 35.12it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 10/19


 10%|█████████████                                                                                                                      | 2/20 [00:00<00:00, 49.15it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 11/19


  5%|██████▌                                                                                                                            | 1/20 [00:00<00:00, 36.42it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 12/19


 10%|█████████████                                                                                                                      | 2/20 [00:00<00:00, 48.42it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 13/19


 10%|█████████████                                                                                                                      | 2/20 [00:00<00:00, 62.46it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 14/19


  5%|██████▌                                                                                                                            | 1/20 [00:00<00:00, 53.61it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 15/19


 10%|█████████████                                                                                                                      | 2/20 [00:00<00:00, 64.21it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 16/19


  5%|██████▌                                                                                                                            | 1/20 [00:00<00:00, 59.53it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 17/19


 10%|█████████████                                                                                                                      | 2/20 [00:00<00:00, 56.56it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 18/19


  5%|██████▌                                                                                                                            | 1/20 [00:00<00:00, 52.92it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 19/19


 10%|█████████████                                                                                                                      | 2/20 [00:00<00:00, 52.30it/s]
2025/01/11 08:53:12 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/01/11 08:53:12 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.


2025/01/11 08:53:12 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2025/01/11 08:53:12 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/01/11 08:53:12 INFO dspy.teleprompt.mipro_optimizer_v2: 0: You are provided with the text of a news article.
Help provide the requested information for cataloging and retrieval.
Ensure information is focused on quality retrieval results -- accuracy, specificity, disambiguation.
Correct simple grammar or formatting mistakes.

2025/01/11 08:53:12 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are provided with the text of a news article and are tasked with categorizing and extracting relevant information to improve its discoverability and accuracy, including title, publication date, primary category, content type, keywords, mentioned people, organizations, legislation, locations, sentiment tone, and extracted quotes, while ensuring the output is accurate, specific, and disambiguated.

2025/01/11

Average Metric: 3.18 / 10 (31.8%):  12%|███████████▉                                                                                   | 10/80 [00:00<00:05, 12.96it/s]

2025/01/11 08:53:13 ERROR dspy.utils.parallelizer: Error processing item Example({'article_text': 'SharedElizabeth Wolf lives with her 81-year-old father and 65-year-old mother who both have dementia. Here, Elizabeth helps her mother, Nancy.Credit...Mark Makela for The New York TimesSlide 1 of 15 Elizabeth Wolf lives with her 81-year-old father and 65-year-old mother who both have dementia. Here, Elizabeth helps her mother, Nancy.Credit...Mark Makela for The New York TimesMarch 4, 2016In 2010, Elizabeth Wolf, then 30, was living in Vermont, working for a nonprofit and happily exploring new pursuits, from raising chickens to contra dancing.But after several disturbing phone calls from and about her parents, Louis and Nancy Brood, she moved back into the split-level in Mt. Laurel, N.J., where she and her siblings had grown up, with her now husband, Casey Wolf. She expected to arrange caregiving help for her parents, then return to Vermont. Five years later, she is still taking care of he

Average Metric: 11.18 / 33 (33.9%):  41%|██████████████████████████████████████▊                                                       | 33/80 [00:02<00:02, 16.37it/s]

2025/01/11 08:53:14 ERROR dspy.utils.parallelizer: Error processing item Example({'article_text': 'Feb. 6, 2014Credit...Associated PressRalph Kiner, baseballs vastly undersung slugger, who belted more home runs than anyone else over his 10-year career but whose achievements in the batters box were obscured by his decades in the broadcast booth, where he was one of the games most recognizable personalities, died on Thursday at home in Rancho Mirage, Calif. He was 91.The Baseball Hall of Fame, which inducted him in 1975, announced the death.Baseball fans who are short of retirement, especially those in New York, are familiar with Kiner as an announcer who spent half a century with the Mets, enlivening their broadcasts with shrewd analysis, amiable storytelling and memorable malapropisms beginning with their woeful first season in 1962.His genial, well-informed and occasionally tongue-twisted presence accompanied all of Mets history, from the verbal high jinks of Casey Stengel and the fie

Average Metric: 18.09 / 57 (31.7%):  72%|████████████████████████████████████████████████████████████████████▏                         | 58/80 [00:03<00:01, 14.18it/s]

2025/01/11 08:53:16 ERROR dspy.utils.parallelizer: Error processing item Example({'article_text': 'Buying PowerCredit...Left: Patrick McMullan Company; Right: Doug KuntzDec. 29, 2015WASHINGTON The hedge fund magnates Daniel S. Loeb, Louis Moore Bacon and Steven A. Cohen have much in common. They have managed billions of dollars in capital, earning vast fortunes. They have invested large sums in art and millions more in political candidates.Moreover, each has exploited an esoteric tax loophole that saved them millions in taxes. The trick? Route the money to Bermuda and back.With inequality at its highest levels in nearly a century and public debate rising over whether the government should respond to it through higher taxes on the wealthy, the very richest Americans have financed a sophisticated and astonishingly effective apparatus for shielding their fortunes. Some call it the income defense industry, consisting of a high-priced phalanx of lawyers, estate planners, lobbyists and anti-

Average Metric: 22.91 / 74 (31.0%):  96%|██████████████████████████████████████████████████████████████████████████████████████████▍   | 77/80 [00:05<00:00, 13.37it/s]

2025/01/11 08:53:17 ERROR dspy.utils.parallelizer: Error processing item Example({'article_text': 'VideotranscripttranscriptWho Is the Real Melania Trump? Who Knows?Melania Trump prizes her privacy. But the first ladys absence from the public eye has led to very different narratives about who she is.Melania Trump. Is she trapped? A reluctant first lady? Or is she poised, polished and a quiet, but supportive, first lady? Who is the real Melania? Very few, at least among those who are speaking publicly, will say. Im very strong. People, they dont really know me. People think and talk about me like, Oh, Melania. Oh, poor Melania. Dont feel sorry for me. Dont feel sorry for me. I can handle everything. But as Donald Trumps presidency has progressed, the first lady has become a versatile avatar, her image split sharply down political lines. In one version of the public imagination, Melania is unhappy. She dislikes her husband and disagrees with his policies. She may even be mulling a divorc

Average Metric: 23.55 / 76 (31.0%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:05<00:00, 14.72it/s]

2025/01/11 08:53:17 INFO dspy.evaluate.evaluate: Average Metric: 23.545454545454543 / 80 (29.4%)
2025/01/11 08:53:17 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 29.43

2025/01/11 08:53:17 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
2025/01/11 08:53:17 INFO dspy.teleprompt.mipro_optimizer_v2: We will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination using Bayesian Optimization.

2025/01/11 08:53:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 1 / 25 ==



Average Metric: 1.55 / 5 (30.9%):  20%|███████████████████▍                                                                             | 5/25 [00:00<00:01, 18.66it/s]

2025/01/11 08:53:18 ERROR dspy.utils.parallelizer: Error processing item Example({'article_text': 'Feb. 6, 2014Credit...Associated PressRalph Kiner, baseballs vastly undersung slugger, who belted more home runs than anyone else over his 10-year career but whose achievements in the batters box were obscured by his decades in the broadcast booth, where he was one of the games most recognizable personalities, died on Thursday at home in Rancho Mirage, Calif. He was 91.The Baseball Hall of Fame, which inducted him in 1975, announced the death.Baseball fans who are short of retirement, especially those in New York, are familiar with Kiner as an announcer who spent half a century with the Mets, enlivening their broadcasts with shrewd analysis, amiable storytelling and memorable malapropisms beginning with their woeful first season in 1962.His genial, well-informed and occasionally tongue-twisted presence accompanied all of Mets history, from the verbal high jinks of Casey Stengel and the fie

Average Metric: 10.36 / 24 (43.2%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 28.21it/s]

2025/01/11 08:53:18 INFO dspy.evaluate.evaluate: Average Metric: 10.363636363636363 / 25 (41.5%)
2025/01/11 08:53:18 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 41.45 on minibatch of size 25 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 7'].





2025/01/11 08:53:18 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45]
2025/01/11 08:53:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:18 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 2 / 25 ==


Average Metric: 10.00 / 25 (40.0%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 57.54it/s]

2025/01/11 08:53:19 INFO dspy.evaluate.evaluate: Average Metric: 10.0 / 25 (40.0%)
2025/01/11 08:53:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7'].
2025/01/11 08:53:19 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0]
2025/01/11 08:53:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:19 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 3 / 25 ==



Average Metric: 10.73 / 25 (42.9%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 40.33it/s]

2025/01/11 08:53:20 INFO dspy.evaluate.evaluate: Average Metric: 10.727272727272727 / 25 (42.9%)
2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.91 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91]
2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 4 / 25 ==



Average Metric: 11.36 / 25 (45.5%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 28.30it/s]

2025/01/11 08:53:20 INFO dspy.evaluate.evaluate: Average Metric: 11.363636363636363 / 25 (45.5%)
2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.45 on minibatch of size 25 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 2'].





2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45]
2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:20 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 5 / 25 ==


Average Metric: 10.64 / 25 (42.5%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 27.44it/s]

2025/01/11 08:53:21 INFO dspy.evaluate.evaluate: Average Metric: 10.636363636363637 / 25 (42.5%)
2025/01/11 08:53:21 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.55 on minibatch of size 25 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 18'].
2025/01/11 08:53:21 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55]
2025/01/11 08:53:21 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:21 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 6 / 25 ==



Average Metric: 9.82 / 25 (39.3%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 30.59it/s]


2025/01/11 08:53:22 INFO dspy.evaluate.evaluate: Average Metric: 9.818181818181818 / 25 (39.3%)
2025/01/11 08:53:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 39.27 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 1'].
2025/01/11 08:53:22 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27]
2025/01/11 08:53:22 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:22 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:22 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 7 / 25 ==


Average Metric: 9.00 / 25 (36.0%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 22.42it/s]

2025/01/11 08:53:23 INFO dspy.evaluate.evaluate: Average Metric: 9.0 / 25 (36.0%)
2025/01/11 08:53:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 36.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 12'].
2025/01/11 08:53:23 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0]
2025/01/11 08:53:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 8 / 25 ==



Average Metric: 6.91 / 23 (30.0%):  92%|███████████████████████████████████████████████████████████████████████████████████████▍       | 23/25 [00:00<00:00, 46.78it/s]

2025/01/11 08:53:24 ERROR dspy.utils.parallelizer: Error processing item Example({'article_text': 'Feb. 6, 2014Credit...Associated PressRalph Kiner, baseballs vastly undersung slugger, who belted more home runs than anyone else over his 10-year career but whose achievements in the batters box were obscured by his decades in the broadcast booth, where he was one of the games most recognizable personalities, died on Thursday at home in Rancho Mirage, Calif. He was 91.The Baseball Hall of Fame, which inducted him in 1975, announced the death.Baseball fans who are short of retirement, especially those in New York, are familiar with Kiner as an announcer who spent half a century with the Mets, enlivening their broadcasts with shrewd analysis, amiable storytelling and memorable malapropisms beginning with their woeful first season in 1962.His genial, well-informed and occasionally tongue-twisted presence accompanied all of Mets history, from the verbal high jinks of Casey Stengel and the fie

Average Metric: 7.36 / 24 (30.7%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 49.35it/s]

2025/01/11 08:53:24 INFO dspy.evaluate.evaluate: Average Metric: 7.363636363636363 / 25 (29.5%)





2025/01/11 08:53:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 29.45 on minibatch of size 25 with parameters ['Predictor 0: Instruction 11', 'Predictor 0: Few-Shot Set 13'].
2025/01/11 08:53:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45]
2025/01/11 08:53:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 9 / 25 ==


Average Metric: 10.27 / 25 (41.1%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 45.92it/s]

2025/01/11 08:53:25 INFO dspy.evaluate.evaluate: Average Metric: 10.272727272727273 / 25 (41.1%)
2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 41.09 on minibatch of size 25 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 4'].





2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09]
2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 10 / 25 ==


Average Metric: 9.82 / 25 (39.3%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 45.16it/s]

2025/01/11 08:53:25 INFO dspy.evaluate.evaluate: Average Metric: 9.818181818181818 / 25 (39.3%)
2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 39.27 on minibatch of size 25 with parameters ['Predictor 0: Instruction 14', 'Predictor 0: Few-Shot Set 1'].
2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27]
2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43]
2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.43


2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Full Eval 1 =====
2025/01/11 08:53:25 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 45.45) from minibatch trials...



Average Metric: 33.73 / 80 (42.2%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:02<00:00, 35.34it/s]

2025/01/11 08:53:27 INFO dspy.evaluate.evaluate: Average Metric: 33.72727272727273 / 80 (42.2%)
2025/01/11 08:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 42.16
2025/01/11 08:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]
2025/01/11 08:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16
2025/01/11 08:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/01/11 08:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 11 / 25 ==



Average Metric: 9.45 / 25 (37.8%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 23.07it/s]

2025/01/11 08:53:29 INFO dspy.evaluate.evaluate: Average Metric: 9.454545454545455 / 25 (37.8%)
2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 37.82 on minibatch of size 25 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 2'].





2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82]
2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]
2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 12 / 25 ==


Average Metric: 9.82 / 25 (39.3%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 38.65it/s]

2025/01/11 08:53:29 INFO dspy.evaluate.evaluate: Average Metric: 9.818181818181818 / 25 (39.3%)
2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 39.27 on minibatch of size 25 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 2'].
2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27]
2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]





2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:29 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 13 / 25 ==


Average Metric: 8.45 / 25 (33.8%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 43.41it/s]

2025/01/11 08:53:30 INFO dspy.evaluate.evaluate: Average Metric: 8.454545454545455 / 25 (33.8%)
2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 33.82 on minibatch of size 25 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 9'].
2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82]
2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]
2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 14 / 25 ==



Average Metric: 11.36 / 25 (45.5%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 46.21it/s]

2025/01/11 08:53:30 INFO dspy.evaluate.evaluate: Average Metric: 11.363636363636363 / 25 (45.5%)
2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.45 on minibatch of size 25 with parameters ['Predictor 0: Instruction 9', 'Predictor 0: Few-Shot Set 18'].
2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45]
2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]
2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 15 / 25 ==



Average Metric: 2.55 / 7 (36.4%):  24%|███████████████████████▎                                                                         | 6/25 [00:00<00:00, 25.99it/s]

2025/01/11 08:53:31 ERROR dspy.utils.parallelizer: Error processing item Example({'article_text': 'Feb. 6, 2014Credit...Associated PressRalph Kiner, baseballs vastly undersung slugger, who belted more home runs than anyone else over his 10-year career but whose achievements in the batters box were obscured by his decades in the broadcast booth, where he was one of the games most recognizable personalities, died on Thursday at home in Rancho Mirage, Calif. He was 91.The Baseball Hall of Fame, which inducted him in 1975, announced the death.Baseball fans who are short of retirement, especially those in New York, are familiar with Kiner as an announcer who spent half a century with the Mets, enlivening their broadcasts with shrewd analysis, amiable storytelling and memorable malapropisms beginning with their woeful first season in 1962.His genial, well-informed and occasionally tongue-twisted presence accompanied all of Mets history, from the verbal high jinks of Casey Stengel and the fie

Average Metric: 8.82 / 24 (36.7%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 31.95it/s]

2025/01/11 08:53:31 INFO dspy.evaluate.evaluate: Average Metric: 8.818181818181818 / 25 (35.3%)
2025/01/11 08:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 35.27 on minibatch of size 25 with parameters ['Predictor 0: Instruction 9', 'Predictor 0: Few-Shot Set 6'].
2025/01/11 08:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27]
2025/01/11 08:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]
2025/01/11 08:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 16 / 25 ==



Average Metric: 8.82 / 24 (36.7%):  96%|███████████████████████████████████████████████████████████████████████████████████████████▏   | 24/25 [00:00<00:00, 39.26it/s]

2025/01/11 08:53:32 ERROR dspy.utils.parallelizer: Error processing item Example({'article_text': 'Scientists are developing more than 100 coronavirus vaccines using a range of techniques, some of which are well-established and some of which have never been approved for medical use before. Most of these vaccines target the so-called spike proteins that cover the virus and help it invade human cells. The immune system can develop antibodies that latch onto spike proteins and stop the virus. A successful vaccine for the SARS-CoV-2 coronavirus would teach peoples immune systems to make antibodies against the virus without causing disease. Whole-Virus Vaccines Vaccines that modify the entire coronavirus to provoke an immune response. Inactivated and Live Attenuated Vaccines Most vaccines in use today incorporate an inactivated or weakened form of a virus that is not able to cause disease. When immune cells encounter them, they make antibodies. Making these vaccines means growing viruses an

Average Metric: 8.82 / 24 (36.7%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 37.00it/s]

2025/01/11 08:53:32 INFO dspy.evaluate.evaluate: Average Metric: 8.818181818181818 / 25 (35.3%)
2025/01/11 08:53:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 35.27 on minibatch of size 25 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 10'].
2025/01/11 08:53:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27]
2025/01/11 08:53:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]





2025/01/11 08:53:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 17 / 25 ==


Average Metric: 9.45 / 25 (37.8%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 22.59it/s]

2025/01/11 08:53:33 INFO dspy.evaluate.evaluate: Average Metric: 9.454545454545455 / 25 (37.8%)
2025/01/11 08:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 37.82 on minibatch of size 25 with parameters ['Predictor 0: Instruction 9', 'Predictor 0: Few-Shot Set 16'].
2025/01/11 08:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27, 37.82]
2025/01/11 08:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]
2025/01/11 08:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 18 / 25 ==



Average Metric: 9.55 / 25 (38.2%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 34.12it/s]

2025/01/11 08:53:34 INFO dspy.evaluate.evaluate: Average Metric: 9.545454545454545 / 25 (38.2%)
2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 38.18 on minibatch of size 25 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 11'].





2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27, 37.82, 38.18]
2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]
2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 19 / 25 ==


Average Metric: 9.82 / 25 (39.3%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 56.46it/s]

2025/01/11 08:53:34 INFO dspy.evaluate.evaluate: Average Metric: 9.818181818181818 / 25 (39.3%)
2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 39.27 on minibatch of size 25 with parameters ['Predictor 0: Instruction 18', 'Predictor 0: Few-Shot Set 3'].
2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27, 37.82, 38.18, 39.27]
2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]
2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 20 / 25 ==



Average Metric: 10.73 / 25 (42.9%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 52.31it/s]

2025/01/11 08:53:35 INFO dspy.evaluate.evaluate: Average Metric: 10.727272727272727 / 25 (42.9%)
2025/01/11 08:53:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.91 on minibatch of size 25 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 15'].





2025/01/11 08:53:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27, 37.82, 38.18, 39.27, 42.91]
2025/01/11 08:53:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16]
2025/01/11 08:53:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:35 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Full Eval 2 =====
2025/01/11 08:53:35 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 45.45) from minibatch trials...


Average Metric: 33.09 / 80 (41.4%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:02<00:00, 38.99it/s]

2025/01/11 08:53:37 INFO dspy.evaluate.evaluate: Average Metric: 33.09090909090909 / 80 (41.4%)
2025/01/11 08:53:37 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16, 41.36]
2025/01/11 08:53:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16
2025/01/11 08:53:37 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/01/11 08:53:37 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 21 / 25 ==



Average Metric: 10.73 / 25 (42.9%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 46.70it/s]

2025/01/11 08:53:38 INFO dspy.evaluate.evaluate: Average Metric: 10.727272727272727 / 25 (42.9%)
2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.91 on minibatch of size 25 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27, 37.82, 38.18, 39.27, 42.91, 42.91]
2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16, 41.36]
2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16







2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 22 / 25 ==


Average Metric: 9.82 / 25 (39.3%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 32.95it/s]

2025/01/11 08:53:38 INFO dspy.evaluate.evaluate: Average Metric: 9.818181818181818 / 25 (39.3%)
2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 39.27 on minibatch of size 25 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 18'].
2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27, 37.82, 38.18, 39.27, 42.91, 42.91, 39.27]
2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16, 41.36]
2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:38 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 23 / 25 ==



Average Metric: 11.73 / 25 (46.9%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 41.46it/s]

2025/01/11 08:53:39 INFO dspy.evaluate.evaluate: Average Metric: 11.727272727272727 / 25 (46.9%)
2025/01/11 08:53:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 46.91 on minibatch of size 25 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 18'].
2025/01/11 08:53:39 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27, 37.82, 38.18, 39.27, 42.91, 42.91, 39.27, 46.91]
2025/01/11 08:53:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16, 41.36]
2025/01/11 08:53:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16







2025/01/11 08:53:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 24 / 25 ==


Average Metric: 11.18 / 25 (44.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 55.84it/s]

2025/01/11 08:53:40 INFO dspy.evaluate.evaluate: Average Metric: 11.181818181818182 / 25 (44.7%)
2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 44.73 on minibatch of size 25 with parameters ['Predictor 0: Instruction 9', 'Predictor 0: Few-Shot Set 18'].
2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27, 37.82, 38.18, 39.27, 42.91, 42.91, 39.27, 46.91, 44.73]
2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16, 41.36]
2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Minibatch Trial 25 / 25 ==



Average Metric: 11.09 / 25 (44.4%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 39.48it/s]

2025/01/11 08:53:40 INFO dspy.evaluate.evaluate: Average Metric: 11.09090909090909 / 25 (44.4%)
2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 44.36 on minibatch of size 25 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 18'].
2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [41.45, 40.0, 42.91, 45.45, 42.55, 39.27, 36.0, 29.45, 41.09, 39.27, 37.82, 39.27, 33.82, 45.45, 35.27, 35.27, 37.82, 38.18, 39.27, 42.91, 42.91, 39.27, 46.91, 44.73, 44.36]
2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16, 41.36]
2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16


2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Full Eval 3 =====
2025/01/11 08:53:40 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 45.635) from minibatch trials...



Average Metric: 21.64 / 47 (46.0%):  57%|██████████████████████████████████████████████████████                                        | 46/80 [00:01<00:00, 46.34it/s]

2025/01/11 08:53:41 ERROR dspy.utils.parallelizer: Error processing item Example({'article_text': 'VideotranscripttranscriptTrump Claims Justice Dept. Report Totally Exonerates MeSpeaking to reporters, President Trump answered questions about investigations, North Korea and immigration.Look, if you see what Ive done with North Korea, and with the State Department Mike Pompeo its running so well, I have this running so well. I have purposely, because of this ridiculous witch hunt, I have said, Im going to stay away from the Justice Department until its completed. So I wanted to stay away, now that doesnt mean I have to because I dont have to I can get involved. But I dont want you people to say that Im interfering, that Im doing anything. I think that the report yesterday maybe, more importantly than anything, it totally exonerates me. There was no collusion. There was no obstruction. And if you read the report, youll see that the But sir, the report What would you do with that? Why sir

Average Metric: 33.73 / 79 (42.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:02<00:00, 39.60it/s]

2025/01/11 08:53:42 INFO dspy.evaluate.evaluate: Average Metric: 33.72727272727273 / 80 (42.2%)
2025/01/11 08:53:42 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.43, 42.16, 41.36, 42.16]
2025/01/11 08:53:42 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 42.16
2025/01/11 08:53:42 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/01/11 08:53:42 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 42.16!
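
The "top averaging program" lines in the log suggest that minibatch scores are averaged per instruction and few-shot combination before a candidate is picked for a full evaluation. The sketch below reproduces that arithmetic for a subset of the trials above. It illustrates what the log reports and is not DSPy's actual implementation; the variable names are made up for this example.

```python
from collections import defaultdict

# (instruction index, few-shot set index, minibatch score), copied from the log above.
# Only a subset of the 25 trials is listed here.
trials = [
    (15, 2, 45.45),   # trial 4
    (15, 2, 39.27),   # trial 12
    (9, 18, 45.45),   # trial 14
    (6, 18, 46.91),   # trial 23
    (9, 18, 44.73),   # trial 24
    (6, 18, 44.36),   # trial 25
]

# Collect every minibatch score seen for each parameter combination.
scores_by_combo = defaultdict(list)
for instruction, few_shot_set, score in trials:
    scores_by_combo[(instruction, few_shot_set)].append(score)

# Average per combination, then pick the highest average.
averages = {combo: sum(s) / len(s) for combo, s in scores_by_combo.items()}
best_combo = max(averages, key=averages.get)

print(best_combo, round(averages[best_combo], 3))
```

Instruction 6 with Few-Shot Set 18 averages (46.91 + 44.36) / 2 = 45.635, which matches the average reported before Full Eval 3. A single minibatch of 25 examples is noisy, which is presumably why the full evaluation of that candidate only tied the earlier best of 42.16.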





In [10]:
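from datetime import date

# Convert date objects in the demos to ISO 8601 strings so they can be written to JSON
for demo in optimized_program.demos:
    if 'publication_date' in demo and isinstance(demo['publication_date'], date):
        demo['publication_date'] = demo['publication_date'].isoformat()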

# Save the state to a JSON file
optimized_program.save("miprov2_command_r7b.json", save_program=False)
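
The date conversion before saving is presumably needed because Python's standard `json` encoder rejects `datetime.date` objects. A minimal sketch of the problem and the ISO 8601 round trip, using a hypothetical demo record rather than the real demos:

```python
import json
from datetime import date

# Hypothetical demo record; the field names mirror the prediction fields in this notebook.
demo = {"generated_title": "Example", "publication_date": date(2015, 12, 2)}

try:
    json.dumps(demo)
    serializable = True
except TypeError:
    # date objects are not JSON serializable by default
    serializable = False

# Convert dates to ISO 8601 strings before encoding.
fixed = {k: (v.isoformat() if isinstance(v, date) else v) for k, v in demo.items()}
encoded = json.dumps(fixed)

# Round trip: parse the string back into a date after loading.
restored = date.fromisoformat(json.loads(encoded)["publication_date"])
```

After loading a saved state, the dates in the demos are plain strings, so anything that compares them to `date` objects needs the `fromisoformat` step.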


In [10]:
optimized_program(article_text=text)


Prediction(
    reasoning="This article reports on Nokia's acquisition of Alcatel-Lucent, a significant deal for the company and a move towards becoming a market leader in networks. The article provides a positive outlook on the deal, highlighting the approval of shareholders and the optimism of Nokia's CEO.",
    generated_title='Nokia Shares Soar as Alcatel-Lucent Acquisition Gets Greenlight',
    publication_date=datetime.date(2015, 12, 2),
    primary_category='business',
    content_type='reporting',
    keywords=['Nokia', 'Alcatel-Lucent', 'acquisition', 'telecom', 'networks', 'shareholders', 'stock market', 'Rajeev Suri'],
    mentioned_people=['Rajeev Suri'],
    mentioned_organizations=['Nokia', 'Alcatel-Lucent', 'Nokia Corporation'],
    mentioned_legislation=None,
    mentioned_locations=None,
    sentiment_tone='positive',
    extracted_quotes=None
)

# Extra

In [11]:
from wordllama import WordLlama

wl = WordLlama.load()

In [13]:
import glob
import json

data = []
for file in glob.glob("training_data/accepted/*.json"):
    with open(file, "r") as fh:
        data.append(json.load(fh))

In [16]:
texts = [x["article_text"] for x in data]

embeds = wl.embed(texts)
sim_matrix = wl.vector_similarity(embeds, embeds)

In [44]:
import numpy as np

# Index pairs (i, j) whose similarity exceeds the 0.9 threshold
idx = np.argwhere(sim_matrix > 0.9)
# Keep only i < j (upper triangle): drops self-matches and mirrored (j, i) pairs
deduplicate = [pair for pair in idx if pair[0] < pair[1]]
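
The pair selection above keeps entries whose row index is smaller than the column index, which is the strict upper triangle of the similarity matrix. A small sketch of the same idea with `np.triu` on a toy matrix (the values are invented for illustration):

```python
import numpy as np

# Toy symmetric similarity matrix for three documents (made-up values).
sim = np.array([
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.20],
    [0.10, 0.20, 1.00],
])

# k=1 zeroes the diagonal and everything below it, so self-matches
# and mirrored (j, i) pairs cannot pass the threshold.
upper = np.triu(sim, k=1)
pairs = np.argwhere(upper > 0.9)

print(pairs.tolist())
```

This only works as written for a positive threshold, because the masked entries become 0 and would pass a threshold at or below zero.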


```python
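import glob
import os


def remove_older_file(file1_path: str, file2_path: str) -> str:
    """
    Compare two files and remove the older one.
    Returns the path of the removed file.
    """
    # Get modification timestamps for both files
    time1 = os.path.getmtime(file1_path)
    time2 = os.path.getmtime(file2_path)

    # Compare and remove the older file (on a tie, the second file is removed)
    if time1 < time2:
        os.remove(file1_path)
        return file1_path
    else:
        os.remove(file2_path)
        return file2_path


# NOTE: indices in `deduplicate` refer to the order in which `data` was loaded,
# so this listing must return the files in that same order.
files = list(glob.glob("training_data/accepted/*.json"))
for pair in deduplicate:
    path_a, path_b = files[pair[0]], files[pair[1]]
    # A file can appear in several pairs; skip pairs where one was already removed
    if not (os.path.exists(path_a) and os.path.exists(path_b)):
        continue
    older_file = remove_older_file(path_a, path_b)
    print(f"Removed older file: {older_file}")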
```
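
Because one file can appear in several duplicate pairs, an alternative is to group the pairs into clusters first and keep exactly one file per cluster. The sketch below does this with a small union-find. `cluster_duplicates` and the `mtimes` list are hypothetical names for this example, and treating similarity as transitive is an assumption that may merge more files than the pairwise approach would.

```python
def cluster_duplicates(pairs, n):
    """Group indices 0..n-1 into clusters connected by duplicate pairs."""
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    # Only clusters with more than one member contain duplicates.
    return [members for members in clusters.values() if len(members) > 1]


# Toy input: files 0, 1, 2 are mutual near-duplicates, as are files 4 and 5.
pairs = [(0, 1), (1, 2), (4, 5)]
clusters = cluster_duplicates(pairs, n=6)

# Hypothetical modification times; keep the newest file in each cluster.
mtimes = [10, 30, 20, 5, 7, 7]
to_remove = []
for members in clusters:
    keep = max(members, key=lambda i: mtimes[i])
    to_remove.extend(i for i in members if i != keep)

print(clusters, to_remove)
```

In the notebook's setting, `mtimes` would come from `os.path.getmtime` on each path, and the indices in `to_remove` would map back to the file list.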