<div align="center">
  <a target="_blank" href="https://colab.research.google.com/github/crowdcent/centimators/blob/main/docs/tutorials/dspymator.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>
</div>

# DSPyMator Tutorial

This tutorial demonstrates how to use `DSPyMator`, a scikit-learn compatible wrapper that brings Large Language Models (LLMs) to tabular prediction tasks through [DSPy](https://dspy.ai/). You'll learn to build a basic text classifier with DSPyMator, use `predict()` vs `transform()` to get different outputs, extract reasoning with `ChainOfThought`, optimize prompts automatically with GEPA and `fit()`, and build advanced pipelines with embeddings and dimensionality reduction.

### Overview

`DSPyMator` wraps any DSPy module (e.g., `dspy.Predict`, `dspy.ChainOfThought`) and exposes it through the familiar scikit-learn API. It enables LLM-based predictions that work with sklearn pipelines, cross-validation, and other ML tooling. Unlike traditional ML models, which learn their parameters from the training data, DSPyMator relies on pre-trained LLMs and natural language reasoning, which makes it a good fit for tasks where domain knowledge, explainability, and few-shot learning matter.

### Prerequisites

To run this tutorial, you'll need:

- An OpenAI API key. **Warning:** This tutorial calls an LLM through a paid API (such as OpenAI's). Running the code, especially the prompt optimization section, makes calls to this API and may incur costs.
- The `dspy` library for LLM orchestration
- The `datasets` library from Hugging Face for loading the Rotten Tomatoes dataset
- `cluestar` for interactive text visualization
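
Since every later cell depends on the API key, it can help to fail fast when it is missing. The following is a minimal sketch, assuming the key is supplied through the `OPENAI_API_KEY` environment variable (as in the commented-out lines of the dataset cell); `has_api_key` is a hypothetical helper, not part of `centimators` or `dspy`.

```python
import os


def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True if the named variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; LLM calls in this tutorial will fail.")
```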


In [1]:
# !pip install "centimators[all]" datasets cluestar

## 1. Load the Dataset

We'll use the **Rotten Tomatoes** dataset from Hugging Face, a popular movie review dataset for sentiment analysis. We'll load a subset for faster execution in this tutorial.


In [2]:
# import os
# os.environ["OPENAI_API_KEY"] = "sk-proj-..."
import polars as pl
from datasets import load_dataset

# Load the Rotten Tomatoes movie review dataset from Hugging Face
print("Loading Rotten Tomatoes dataset...")
dataset = load_dataset("rotten_tomatoes")

# Shuffle and take a subset for faster execution
# Using 300 samples for training and 100 for testing
train_data = dataset["train"].shuffle(seed=42).select(range(300))
test_data = dataset["test"].shuffle(seed=42).select(range(100))

# Convert to polars DataFrames
train_df = pl.DataFrame(
    {
        "review": train_data["text"],
        "sentiment": [
            "positive" if label == 1 else "negative" for label in train_data["label"]
        ],
    }
)

test_df = pl.DataFrame(
    {
        "review": test_data["text"],
        "sentiment": [
            "positive" if label == 1 else "negative" for label in test_data["label"]
        ],
    }
)

# Prepare X and y
X_train = train_df[["review"]]
y_train = train_df["sentiment"]

X_test = test_df[["review"]]
y_test = test_df["sentiment"]

print("\n✓ Dataset loaded successfully!")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print("\nSentiment distribution (train):")
display(y_train.value_counts())
print("\nSample reviews:")
train_df.head(3)

Loading Rotten Tomatoes dataset...

✓ Dataset loaded successfully!
Training samples: 300
Test samples: 100

Sentiment distribution (train):


| sentiment | count |
| --- | --- |
| str | u32 |
| "negative" | 146 |
| "positive" | 154 |



Sample reviews:


| review | sentiment |
| --- | --- |
| str | str |
| ". . . plays like somebody spli…" | "negative" |
| "michael moore has perfected th…" | "positive" |
| ". . . too gory to be a comedy …" | "negative" |


## 2. Basic Usage: Sentiment Classification Predictions

Let's start with a simple sentiment classifier using `dspy.Predict`. We'll define a signature that maps review text to sentiment.
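
The string signature used in the next cell follows the pattern `input_field: type -> output_field: type`. To make that format concrete, here is a small illustrative parser. It is a sketch for reading the notation only; `parse_signature` is a hypothetical helper and is not how DSPy processes signatures internally.

```python
def parse_signature(signature):
    """Split 'a: str, b: int -> c: str' into input and output field dicts."""
    inputs_part, outputs_part = signature.split("->")

    def parse_fields(part):
        fields = {}
        for item in part.split(","):
            name, _, type_name = item.partition(":")
            # Fields without an explicit type are treated as strings here.
            fields[name.strip()] = type_name.strip() or "str"
        return fields

    return parse_fields(inputs_part), parse_fields(outputs_part)


inputs, outputs = parse_signature("review: str -> sentiment: str")
print(inputs)   # {'review': 'str'}
print(outputs)  # {'sentiment': 'str'}
```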


In [3]:
import dspy
from centimators.model_estimators import DSPyMator

# Define a DSPy program with input -> output signature
# Format: "input_field: type -> output_field: type"
sentiment_program = dspy.Predict("review: str -> sentiment: str")

# Create the DSPyMator estimator
classifier = DSPyMator(
    program=sentiment_program,
    target_names="sentiment",  # Which output field to use as prediction
)

classifier.fit(X_train, y_train)  # establishes LM configuration

# Predict sentiments
predictions = classifier.predict(X_test)

# Create a results dataframe
results = pl.DataFrame(
    {"review": X_test["review"], "predicted": predictions, "actual": y_test}
)

print("\nPrediction Results:")
display(results)

# Calculate accuracy
accuracy = (results["actual"] == results["predicted"]).sum() / len(results)
print(f"\nAccuracy: {accuracy:.2%}")

DSPyMator predicting: 100%|██████████| 100/100 [00:00<00:00, 379.11it/s]


Prediction Results:





| review | predicted | actual |
| --- | --- | --- |
| str | str | str |
| "unpretentious , charming , qui…" | "positive" | "positive" |
| "a film really has to be except…" | "negative" | "negative" |
| "working from a surprisingly se…" | "positive" | "positive" |
| "it may not be particularly inn…" | "positive" | "positive" |
| "such a premise is ripe for all…" | "negative" | "negative" |
| … | … | … |
| "ice age is the first computer-…" | "negative" | "negative" |
| "there's no denying that burns …" | "Positive" | "positive" |
| "it collapses when mr . taylor …" | "negative" | "negative" |
| "there's a great deal of corny …" | "positive" | "positive" |



Accuracy: 79.00%
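
Note that one prediction in the table above is spelled `Positive`, which exact string comparison counts as a mismatch with `positive`. Below is a minimal sketch of normalizing labels before scoring, using plain Python lists rather than the polars columns from the cell above; `normalized_accuracy` is a hypothetical helper.

```python
def normalized_accuracy(actual, predicted):
    """Accuracy after lowercasing and trimming whitespace on both sides."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(
        a.strip().lower() == p.strip().lower() for a, p in zip(actual, predicted)
    )
    return matches / len(actual)


actual = ["positive", "negative", "positive"]
predicted = ["Positive", "negative", "negative"]
print(f"{normalized_accuracy(actual, predicted):.2%}")  # 66.67%
```

If the predictions are held in a polars Series, the same idea could be applied with `str.to_lowercase()` before comparing.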


## 3. Adding Reasoning with ChainOfThought

One of the most powerful features of LLMs is their ability to explain their reasoning. Let's use `dspy.ChainOfThought` to get not just predictions, but also the reasoning behind them.


In [4]:
# Wrap the same signature as before but with a ChainOfThought (adds reasoning step)
cot_program = dspy.ChainOfThought("review: str -> sentiment: str")

cot_classifier = DSPyMator(program=cot_program, target_names="sentiment")

# Fit and transform to get reasoning
outputs_with_reasoning = cot_classifier.fit_transform(X_test)

# Create a results dataframe
results = pl.DataFrame(
    {
        "review": X_test["review"],
        "reasoning": outputs_with_reasoning["reasoning"],
        "predicted": outputs_with_reasoning["sentiment"],
        "actual": y_test,
    }
)

print("\nPrediction Results:")
display(results)

# Calculate accuracy
accuracy = (results["actual"] == results["predicted"]).sum() / len(results)
print(f"\nAccuracy: {accuracy:.2%}")

DSPyMator predicting: 100%|██████████| 100/100 [00:00<00:00, 1715.87it/s]


Prediction Results:





| review | reasoning | predicted | actual |
| --- | --- | --- | --- |
| str | str | str | str |
| "unpretentious , charming , qui…" | "This short review uses multipl…" | "positive" | "positive" |
| "a film really has to be except…" | "This review claims the film is…" | "negative" | "negative" |
| "working from a surprisingly se…" | "The reviewer describes the scr…" | "positive" | "positive" |
| "it may not be particularly inn…" | "The reviewer notes a lack of i…" | "positive" | "positive" |
| "such a premise is ripe for all…" | "The review expresses a negativ…" | "negative" | "negative" |
| … | … | … | … |
| "ice age is the first computer-…" | "The review criticizes the paci…" | "negative" | "negative" |
| "there's no denying that burns …" | "The review expresses praise an…" | "positive" | "positive" |
| "it collapses when mr . taylor …" | "The reviewer criticizes the sh…" | "negative" | "negative" |
| "there's a great deal of corny …" | "The reviewer acknowledges flaw…" | "Positive" | "positive" |



Accuracy: 77.00%
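
The two-point gap between 79% and 77% is measured on only 100 test reviews, so it is worth checking how much of it could be sampling noise. Here is a rough sketch using the binomial standard error of an accuracy estimate (a standard approximation, not something computed by `DSPyMator`):

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


for name, acc in [("Predict", 0.79), ("ChainOfThought", 0.77)]:
    se = accuracy_standard_error(acc, 100)
    print(f"{name}: {acc:.0%} +/- {se:.1%}")
# Predict: 79% +/- 4.1%
# ChainOfThought: 77% +/- 4.2%
```

Each estimate carries roughly four points of uncertainty, so the difference between the two programs on this subset should not be read as one being better than the other.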


## 4. Prompt Optimization with GEPA

DSPyMator supports automatic prompt optimization using DSPy optimizers. Let's use **GEPA** (Genetic-Pareto), a reflective prompt optimizer, to improve our prompt automatically based on the training data.

GEPA iteratively refines prompts: it scores the current instructions on small batches of examples, has a reflection LM analyze the errors, and proposes better instructions.

**Warning:** Running a full GEPA optimization can require a significant number of API calls and credits.

In [5]:
# Define a metric function for optimization
def sentiment_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """
    GEPA-compatible metric that returns score and textual feedback.

    Args:
        gold: The ground truth example
        pred: The predicted output
        trace: Optional full program trace
        pred_name: Optional name of predictor being optimized
        pred_trace: Optional trace of specific predictor

    Returns:
        float score or dspy.Prediction(score=float, feedback=str)
    """
    y_pred = pred.sentiment
    y_true = gold.sentiment
    is_correct = y_pred == y_true
    score = 1.0 if is_correct else 0.0

    # If GEPA is requesting predictor-level feedback, provide rich guidance
    if pred_name:
        if is_correct:
            feedback = f"Correctly classified as {y_pred}."
        else:
            feedback = (
                f"Incorrect prediction. Predicted '{y_pred}' but actual was '{y_true}'. "
                f"Review text: '{gold.review}'"
            )

        # Add reasoning context if available
        if hasattr(pred, "reasoning"):
            feedback += f" Reasoning: {pred.reasoning}"

        return dspy.Prediction(score=score, feedback=feedback)

    return score


# Create a light/constrained GEPA optimizer for faster results and demo
gepa_optimizer = dspy.teleprompt.GEPA(
    metric=sentiment_metric,
    auto="light",
    reflection_minibatch_size=10,
    reflection_lm=dspy.LM(model="openai/gpt-5-nano", temperature=1.0, max_tokens=16000),
)

# Fit with optimization
print(
    "Starting GEPA Optimization (this may take a long time and cost a lot of money)..."
)
preoptimized_instructions = cot_classifier.signature_.instructions
cot_classifier.fit(
    X_train,
    y_train,
    optimizer=gepa_optimizer,
    validation_data=0.3,  # Use 30% of training data for validation
)

print("\n✓ Optimization complete!")

2025/11/11 15:34:11 INFO dspy.teleprompt.gepa.gepa: Running GEPA for approx 740 metric calls of the program. This amounts to 2.47 full evals on the train+val set.
2025/11/11 15:34:11 INFO dspy.teleprompt.gepa.gepa: Using 90 examples for tracking Pareto scores. You can consider using a smaller sample of the valset to allow GEPA to explore more diverse solutions within the same budget.


Starting GEPA Optimization (this may take a long time and cost a lot of money)...


GEPA Optimization:   0%|          | 0/740 [00:00<?, ?rollouts/s]2025/11/11 15:34:12 INFO dspy.evaluate.evaluate: Average Metric: 76.0 / 90 (84.4%)
2025/11/11 15:34:12 INFO dspy.teleprompt.gepa.gepa: Iteration 0: Base program full valset score: 0.8444444444444444
GEPA Optimization:  12%|█▏        | 90/740 [00:01<00:08, 72.93rollouts/s]2025/11/11 15:34:12 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Selected program 0 score: 0.8444444444444444


Average Metric: 6.00 / 10 (60.0%): 100%|██████████| 10/10 [00:00<00:00, 161.25it/s]

2025/11/11 15:34:12 INFO dspy.evaluate.evaluate: Average Metric: 6.0 / 10 (60.0%)
2025/11/11 15:34:12 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Proposed new text for predict: New task: Binary sentiment classification for film reviews.

Input
- A single field named review containing a text string that evaluates a film.

Output
- A single field named sentiment with value either "positive" or "negative" (lowercase only).
- Do not include any additional fields, reasoning, or explanations.

Guidelines
- Determine the overall sentiment of the review toward the film as a whole.
- If the review expresses clear praise or a positive overall assessment, output "positive".
- If the review expresses clear criticism or a negative overall assessment, output "negative".
- Do not use a "mixed" label; for this task the ground-truth labels are strictly positive or negative.
- Do not provide reasoning or justification—only the sentiment field.
- Keep to the exact field name and value format; do not add




2025/11/11 15:34:13 INFO dspy.evaluate.evaluate: Average Metric: 81.0 / 90 (90.0%)
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 1: New program is on the linear pareto front
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Full valset score for new program: 0.9
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Full train_val score for new program: 0.9
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Individual valset scores for new program: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
2025/11/11 15:34:13 INFO dspy.teleprompt.

Average Metric: 10.00 / 10 (100.0%): 100%|██████████| 10/10 [00:00<00:00, 158.55it/s]

2025/11/11 15:34:13 INFO dspy.evaluate.evaluate: Average Metric: 10.0 / 10 (100.0%)
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 2: All subsample scores perfect. Skipping.
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 2: Reflective mutation did not propose a new candidate
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 3: Selected program 1 score: 0.9



Average Metric: 9.00 / 10 (90.0%): 100%|██████████| 10/10 [00:00<00:00, 152.99it/s]

2025/11/11 15:34:13 INFO dspy.evaluate.evaluate: Average Metric: 9.0 / 10 (90.0%)
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 3: Proposed new text for predict: New task: Binary sentiment classification for film reviews.

Input
- A single field named review containing a text string that evaluates a film.

Output
- A single field named sentiment with value either "positive" or "negative" (lowercase only).
- Do not include any additional fields, reasoning, or explanations.

Data handling
- The review may include varying length, punctuation, sarcasm, quotes, or informal language.
- The sentiment should reflect the overall evaluation of the film as a whole, not just individual scenes or elements.

Guidelines
- If the review expresses a clear positive evaluation of the film, output "positive".
- If the review expresses a clear negative evaluation of the film, output "negative".
- Do not use a "mixed" label; for this task ground-truth labels are strictly positive or negative




2025/11/11 15:34:13 INFO dspy.evaluate.evaluate: Average Metric: 9.0 / 10 (90.0%)
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 3: New subsample score is not better, skipping
GEPA Optimization:  31%|███       | 230/740 [00:02<00:05, 100.35rollouts/s]2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 4: Selected program 1 score: 0.9


Average Metric: 10.00 / 10 (100.0%): 100%|██████████| 10/10 [00:00<00:00, 152.77it/s]

2025/11/11 15:34:13 INFO dspy.evaluate.evaluate: Average Metric: 10.0 / 10 (100.0%)
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 4: All subsample scores perfect. Skipping.
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 4: Reflective mutation did not propose a new candidate
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 5: Selected program 1 score: 0.9



Average Metric: 7.00 / 10 (70.0%): 100%|██████████| 10/10 [00:00<00:00, 140.05it/s]

2025/11/11 15:34:13 INFO dspy.evaluate.evaluate: Average Metric: 7.0 / 10 (70.0%)
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 5: Proposed new text for predict: Task: Binary sentiment classification for film reviews.

Input
- review: a single field containing an English text review of a film.

Output
- sentiment: a single field with value "positive" or "negative" (lowercase only). Do not include any other fields, text, or explanations.

Guidelines
- Determine the overall sentiment toward the film as a whole.
- If the review expresses clear praise or a positive overall assessment, output sentiment = "positive".
- If the review expresses clear criticism or a negative overall assessment, output sentiment = "negative".
- Do not use any "mixed" label; ground-truth labels are strictly positive or negative.
- Do not provide reasoning or justification; only the sentiment field.
- Exact formatting: sentiment: positive OR sentiment: negative (no quotes, lowercase).

Interpretati




2025/11/11 15:34:13 INFO dspy.evaluate.evaluate: Average Metric: 6.0 / 10 (60.0%)
2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 5: New subsample score is not better, skipping
GEPA Optimization:  35%|███▌      | 260/740 [00:02<00:04, 98.34rollouts/s] 2025/11/11 15:34:13 INFO dspy.teleprompt.gepa.gepa: Iteration 6: Selected program 1 score: 0.9


Average Metric: 9.00 / 10 (90.0%): 100%|██████████| 10/10 [00:00<00:00, 147.35it/s]

2025/11/11 15:34:14 INFO dspy.evaluate.evaluate: Average Metric: 9.0 / 10 (90.0%)
2025/11/11 15:34:14 INFO dspy.teleprompt.gepa.gepa: Iteration 6: Proposed new text for predict: 
New task: Binary sentiment classification for film reviews.

Input
- A single field named review containing a text string that evaluates a film.

Output
- A single field named sentiment with value either "positive" or "negative" (lowercase only).
- Do not include any additional fields, reasoning, or explanations.

Core requirement
- Determine the overall sentiment toward the film as a whole. The ground truth for this task is strictly positive or negative; do not output a mixed or neutral label.

Output format rules
- Exactly one field: sentiment: "<positive|negative>"
- Use lowercase and include the value in quotes, e.g., sentiment: "positive".
- Do not add any other text, fields, or annotations.

Classification strategy (deterministic and reproducible)
- Use a simple polarity scoring approach to decide the fi


Average Metric: 9.00 / 10 (90.0%): 100%|██████████| 10/10 [00:00<00:00, 133.00it/s]

2025/11/11 15:34:14 INFO dspy.evaluate.evaluate: Average Metric: 9.0 / 10 (90.0%)
2025/11/11 15:34:14 INFO dspy.teleprompt.gepa.gepa: Iteration 7: Proposed new text for predict: New task: Binary sentiment classification for film reviews.

Input
- A single field named review containing a text string that evaluates a film.

Output
- A single field named sentiment with value either "positive" or "negative" (lowercase only).
- Do not include any additional fields, reasoning, or explanations.

Guidelines
- Determine the overall sentiment of the review toward the film as a whole. Do not output a "mixed" label.
- Ground-truth labels are strictly positive or negative; if the review contains both praise and criticism, determine which sentiment dominates and label accordingly. If the review is truly ambiguous with no clear leaning, infer the dominant overall stance from the concluding tone or the strongest sentiment cues.
- Do not provide reasoning or justification—only the sentiment field.
- Ke


Average Metric: 9.00 / 10 (90.0%): 100%|██████████| 10/10 [00:00<00:00, 141.98it/s]

2025/11/11 15:34:14 INFO dspy.evaluate.evaluate: Average Metric: 9.0 / 10 (90.0%)
2025/11/11 15:34:14 INFO dspy.teleprompt.gepa.gepa: Iteration 8: Proposed new text for predict: New task: Binary sentiment classification for film reviews.

Input
- review: a single text field containing a film review (in English; may include punctuation, quotes, etc.).

Output
- sentiment: a single field with value either "positive" or "negative" (lowercase). Do not include any other fields, text, or explanations.

Task description
- Determine the overall sentiment toward the film as a whole.
- If the review expresses clear praise or a positive overall assessment, return "positive".
- If the review expresses clear criticism or a negative overall assessment, return "negative".
- Do not use a "mixed" label; the ground-truth labels are strictly positive or negative.
- Do not provide any reasoning, justification, or extra text; only the sentiment field.

Guidelines for decision
- Look for explicit positive c




2025/11/11 15:34:14 INFO dspy.evaluate.evaluate: Average Metric: 10.0 / 10 (100.0%)
2025/11/11 15:34:15 INFO dspy.evaluate.evaluate: Average Metric: 84.0 / 90 (93.3%)
2025/11/11 15:34:15 INFO dspy.teleprompt.gepa.gepa: Iteration 8: New program is on the linear pareto front
2025/11/11 15:34:15 INFO dspy.teleprompt.gepa.gepa: Iteration 8: Full valset score for new program: 0.9333333333333333
2025/11/11 15:34:15 INFO dspy.teleprompt.gepa.gepa: Iteration 8: Full train_val score for new program: 0.9333333333333333
2025/11/11 15:34:15 INFO dspy.teleprompt.gepa.gepa: Iteration 8: Individual valset scores for new program: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.

Average Metric: 10.00 / 10 (100.0%): 100%|██████████| 10/10 [00:00<00:00, 161.23it/s]

2025/11/11 15:34:15 INFO dspy.evaluate.evaluate: Average Metric: 10.0 / 10 (100.0%)
2025/11/11 15:34:15 INFO dspy.teleprompt.gepa.gepa: Iteration 9: All subsample scores perfect. Skipping.
2025/11/11 15:34:15 INFO dspy.teleprompt.gepa.gepa: Iteration 9: Reflective mutation did not propose a new candidate
2025/11/11 15:34:15 INFO dspy.teleprompt.gepa.gepa: Iteration 10: Selected program 2 score: 0.9333333333333333



Average Metric: 9.00 / 10 (90.0%): 100%|██████████| 10/10 [00:00<00:00, 167.81it/s]

2025/11/11 15:34:15 INFO dspy.evaluate.evaluate: Average Metric: 9.0 / 10 (90.0%)
2025/11/11 15:34:15 INFO dspy.teleprompt.gepa.gepa: Iteration 10: Proposed new text for predict: New instruction for binary sentiment classification of film reviews

Task
- Determine the overall sentiment toward the film described in a single English review.
- Output exactly one field:
  sentiment: "positive" or "negative" (all lowercase, no quotes beyond the field value).
- Do not include any other fields, text, or explanations.

Input
- review: A single English text review of a film. It may contain punctuation, quotes, references to acting, directing, plot, cinematography, etc.

Output format
- Only the line: sentiment: positive
  or: sentiment: negative
- Do not prepend, append, or include any reasoning, justification, or extraneous text.

Decision rules
1) Overall sentiment
   - If the review expresses clear praise or a positive overall assessment of the film, return "positive".
   - If the review exp




2025/11/11 15:34:15 INFO dspy.evaluate.evaluate: Average Metric: 10.0 / 10 (100.0%)
2025/11/11 15:34:16 INFO dspy.evaluate.evaluate: Average Metric: 85.0 / 90 (94.4%)
2025/11/11 15:34:16 INFO dspy.teleprompt.gepa.gepa: Iteration 10: New program is on the linear pareto front
2025/11/11 15:34:16 INFO dspy.teleprompt.gepa.gepa: Iteration 10: Full valset score for new program: 0.9444444444444444
2025/11/11 15:34:16 INFO dspy.teleprompt.gepa.gepa: Iteration 10: Full train_val score for new program: 0.9444444444444444
2025/11/11 15:34:16 INFO dspy.teleprompt.gepa.gepa: Iteration 10: Individual valset scores for new program: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0

Average Metric: 9.00 / 10 (90.0%): 100%|██████████| 10/10 [00:00<00:00, 120.49it/s]

2025/11/11 15:34:16 INFO dspy.evaluate.evaluate: Average Metric: 9.0 / 10 (90.0%)
2025/11/11 15:34:16 INFO dspy.teleprompt.gepa.gepa: Iteration 11: Proposed new text for predict: New instruction for binary sentiment classification of film reviews

Task
- Determine the overall sentiment toward the film described in a single English review.
- Output exactly one field:
  sentiment: "positive" or "negative" (all lowercase, with the value in quotes only as part of the field, but the output line must be sentiment: positive or sentiment: negative without extra quotes around the field value).
- Do not include any other fields, text, or explanations.

Input
- review: A single English text review of a film. It may contain punctuation, quotes, references to acting, directing, plot, cinematography, etc.

Output format
- Only the line:
  sentiment: positive
  or
  sentiment: negative
- Do not prepend, append, or include any reasoning, justification, or extraneous text.

Decision rules
1) Overall se




2025/11/11 15:34:16 INFO dspy.evaluate.evaluate: Average Metric: 10.0 / 10 (100.0%)
2025/11/11 15:34:17 INFO dspy.evaluate.evaluate: Average Metric: 84.0 / 90 (93.3%)
2025/11/11 15:34:17 INFO dspy.teleprompt.gepa.gepa: Iteration 11: Full valset score for new program: 0.9333333333333333
2025/11/11 15:34:17 INFO dspy.teleprompt.gepa.gepa: Iteration 11: Full train_val score for new program: 0.9333333333333333
2025/11/11 15:34:17 INFO dspy.teleprompt.gepa.gepa: Iteration 11: Individual valset scores for new program: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
2025/11/11 15:34:17 INFO dspy.t

Average Metric: 7.00 / 10 (70.0%): 100%|██████████| 10/10 [00:00<00:00, 215.86it/s]

2025/11/11 15:34:17 INFO dspy.evaluate.evaluate: Average Metric: 7.0 / 10 (70.0%)
2025/11/11 15:34:17 INFO dspy.teleprompt.gepa.gepa: Iteration 12: Proposed new text for predict: New task instruction: Binary sentiment classification for film reviews

Objective
- Given a single English-language film review, output a single field:
  sentiment: "positive" or "negative" (lowercase only).
- Do not include any other text, fields, or explanations.

Input
- review: a single text field containing a film review (English; may include punctuation, quotes, etc.).

Output format
- Exactly one line containing either:
  positive
  negative

Decision rules (how to decide)
- Determine the overall sentiment toward the film as a whole; do not output "mixed".
- If the review clearly praises or endorses the film, output "positive".
- If the review clearly criticizes or diminishes the film, output "negative".
- When the review contains both favorable and unfavorable elements, weigh the overall tone and choos




2025/11/11 15:34:18 INFO dspy.evaluate.evaluate: Average Metric: 85.0 / 90 (94.4%)
2025/11/11 15:34:18 INFO dspy.teleprompt.gepa.gepa: Iteration 12: Full valset score for new program: 0.9444444444444444
2025/11/11 15:34:18 INFO dspy.teleprompt.gepa.gepa: Iteration 12: Full train_val score for new program: 0.9444444444444444
2025/11/11 15:34:18 INFO dspy.teleprompt.gepa.gepa: Iteration 12: Individual valset scores for new program: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
2025/11/11 15:34:18 INFO dspy.teleprompt.gepa.gepa: Iteration 12: New valset pareto front scores: [1.0, 1.0, 1.0, 1


✓ Optimization complete!
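
The numbers in the log can be reproduced with a little arithmetic. The sketch below assumes that `validation_data=0.3` holds out 30% of the 300 training rows (the log reports 90 validation examples) and that the remainder is used for training; the exact split logic inside `DSPyMator` is not shown in this tutorial.

```python
n_total = 300           # training rows passed to fit()
val_fraction = 0.3      # validation_data=0.3
metric_budget = 740     # "approx 740 metric calls" from the GEPA log

n_val = round(n_total * val_fraction)   # 90, matches the log
n_train = n_total - n_val               # 210 (assumed remainder)

full_evals = metric_budget / (n_train + n_val)
print(f"validation examples: {n_val}")
print(f"full evals on train+val: {full_evals:.2f}")   # 2.47

# Validation scores reported in the log are correct counts over 90 examples.
for correct in (76, 81, 84, 85):
    print(f"{correct}/90 = {correct / 90:.1%}")
```

The last loop reproduces the 84.4%, 90.0%, 93.3%, and 94.4% validation scores that appear as the base program and the accepted candidates in the log.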





In [6]:
optimized_predictions = cot_classifier.predict(X_test)
optimized_results = pl.DataFrame(
    {"review": X_test["review"], "actual": y_test, "predicted": optimized_predictions}
)

optimized_accuracy = (
    optimized_results["actual"] == optimized_results["predicted"]
).sum() / len(optimized_results)

print(f"Preoptimized accuracy: {accuracy:.2%}")
print(f"Preoptimized instructions: {preoptimized_instructions} \n")
print(f"Optimized accuracy: {optimized_accuracy:.2%}")
print(
    f"Optimized instructions (first 750 characters): {cot_classifier.signature_.instructions[:750]}"
)

DSPyMator predicting: 100%|██████████| 100/100 [00:00<00:00, 551.67it/s]


Preoptimized accuracy: 77.00%
Preoptimized instructions: Given the fields `review`, produce the fields `sentiment`. 

Optimized accuracy: 89.00%
Optimized instructions (first 750 characters): New instruction for binary sentiment classification of film reviews

Task
- Determine the overall sentiment toward the film described in a single English review.
- Output exactly one field:
  sentiment: "positive" or "negative" (all lowercase, no quotes beyond the field value).
- Do not include any other fields, text, or explanations.

Input
- review: A single English text review of a film. It may contain punctuation, quotes, references to acting, directing, plot, cinematography, etc.

Output format
- Only the line: sentiment: positive
  or: sentiment: negative
- Do not prepend, append, or include any reasoning, justification, or extraneous text.

Decision rules
1) Overall sentiment
   - If the review expresses clear praise or a positive ov


## 5. Compose DSPyMator with other feature transformers in scikit-learn pipelines

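While `predict()` returns only the pre-specified `target_names`, `transform()` returns **all** output fields from the DSPy program. This is useful when you want access to intermediate outputs or additional fields, like reasoning traces. It also matters for composition: a scikit-learn pipeline calls `transform()` on its intermediate steps, so downstream transformers receive the full set of output fields rather than only the prediction.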
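
The difference between the two methods can be illustrated with a toy stand-in. `ToyEstimator` below is hypothetical and returns canned outputs; it only mimics the shape of what `predict()` and `transform()` return as described above, not DSPyMator's actual implementation or return types.

```python
class ToyEstimator:
    """Hypothetical stand-in: each row yields several named output fields."""

    def __init__(self, target_name):
        self.target_name = target_name

    def _run(self, rows):
        # Pretend program output: every row gets a reasoning and a sentiment.
        return [
            {"reasoning": f"Saw the text: {text!r}", "sentiment": "positive"}
            for text in rows
        ]

    def transform(self, rows):
        # All output fields, one dict per row.
        return self._run(rows)

    def predict(self, rows):
        # Only the configured target field.
        return [out[self.target_name] for out in self._run(rows)]


toy = ToyEstimator(target_name="sentiment")
print(toy.predict(["great film"]))
print(toy.transform(["great film"]))
```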

### Pipeline Integration Example: From Text to Embeddings to Visualization

One of DSPyMator's strengths is its compatibility with scikit-learn pipelines. Let's build a pipeline that:

1. Uses DSPyMator to generate reasoning about each review
2. Embeds the reasoning using `EmbeddingTransformer`
3. Reduces dimensionality with `DimReducer`

This creates a feature extraction pipeline where LLM reasoning becomes structured numerical features.
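
To make the data flow concrete, here is a dependency-free sketch of the same three-stage idea. All three functions are hypothetical stand-ins: the real pipeline uses an LLM, an embedding model, and UMAP, whereas this toy uses a template string, letter frequencies, and a fixed two-number summary.

```python
import string


def fake_reasoning(review):
    # Stage 1 stand-in: an LLM would write free-text reasoning here.
    return f"The review says: {review}"


def fake_embed(text):
    # Stage 2 stand-in: a 26-dimensional letter-frequency vector.
    lowered = text.lower()
    total = max(sum(ch in string.ascii_lowercase for ch in lowered), 1)
    return [lowered.count(letter) / total for letter in string.ascii_lowercase]


def fake_reduce(vector):
    # Stage 3 stand-in: collapse to 2 numbers (vowel mass, consonant mass).
    vowels = {0, 4, 8, 14, 20}  # indices of a, e, i, o, u
    dim_0 = sum(v for i, v in enumerate(vector) if i in vowels)
    dim_1 = sum(v for i, v in enumerate(vector) if i not in vowels)
    return (dim_0, dim_1)


def run_pipeline(reviews):
    # Each stage consumes the previous stage's output, as in a pipeline.
    return [fake_reduce(fake_embed(fake_reasoning(r))) for r in reviews]


points = run_pipeline(["a charming film", "too gory to be a comedy"])
print(points)  # two (dim_0, dim_1) pairs
```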

In [7]:
from sklearn.pipeline import make_pipeline
from centimators.feature_transformers import EmbeddingTransformer, DimReducer

dspymator = DSPyMator(
    program=dspy.ChainOfThought("review: str -> sentiment: str"),
    target_names="sentiment",
)

# Create an embedder that embeds the reasoning text
embedder = EmbeddingTransformer(
    model="openai/text-embedding-3-small",
    feature_names=["reasoning"],  # Embed the reasoning field
)

# Create a dimensionality reducer
dim_reducer = DimReducer(
    method="umap",
    n_components=2,  # Reduce embeddings to 2D for visualization
)

# Build the pipeline
pipeline = make_pipeline(dspymator, embedder, dim_reducer)

print("Pipeline created:")
display(pipeline)

Pipeline created:


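**Pipeline**

| Parameter | Value |
| --- | --- |
| steps | `[('dspymator', ...), ('embeddingtransformer', ...), ...]` |
| transform_input | |
| memory | |
| verbose | False |

**DSPyMator**

| Parameter | Value |
| --- | --- |
| program | `predict = Pre...ntiment}'}) ))` |
| target_names | 'sentiment' |
| feature_names | |
| lm | 'openai/gpt-5-nano' |
| temperature | 1.0 |
| max_tokens | 16000 |
| use_async | True |
| max_concurrent | 50 |
| verbose | True |

**EmbeddingTransformer**

| Parameter | Value |
| --- | --- |
| model | 'openai/text-embedding-3-small' |
| feature_names | ['reasoning'] |
| categorical_mapping | {} |
| batch_size | 200 |
| caching | True |

**DimReducer**

| Parameter | Value |
| --- | --- |
| method | 'umap' |
| n_components | 2 |
| feature_names | |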


In [8]:
# Run the full pipeline on the 300-sample training subset (this makes LLM and embedding API calls)
print("\nRunning pipeline: DSPyMator → Embeddings → UMAP...\n")

# Fit and transform
reduced_features = pipeline.fit_transform(X_train, y_train)
print("\nFirst few rows:")
display(reduced_features.head())


Running pipeline: DSPyMator → Embeddings → UMAP...



DSPyMator predicting: 100%|██████████| 300/300 [00:00<00:00, 618.77it/s]



First few rows:


| dim_0 | dim_1 |
| --- | --- |
| f32 | f32 |
| 0.286903 | 5.589123 |
| 7.863194 | 4.594091 |
| 0.060558 | 5.437039 |
| 8.057485 | 5.643534 |
| 1.149364 | 4.60095 |


### Visualize the Reasoning Embeddings

Let's visualize how the LLM's reasoning separates the two sentiments in 2D space. Each point is a review, positioned by the embedding of the model's reasoning and colored by its true label. The gap between the two clusters gives an approximate picture of the decision boundary, and points that sit in the cluster of the opposite label are the ones to inspect when asking why certain examples were classified incorrectly.
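
One way to shortlist reviews worth inspecting is to flag points that lie closer to the other label's centroid than to their own. This is a hedged sketch on made-up 2D coordinates; `flag_outliers` is a hypothetical helper, and the real `reduced_features` from the pipeline would need to be converted to plain lists first.

```python
import math
from collections import defaultdict


def centroids(points, labels):
    """Mean (x, y) position per label."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: (
            sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts),
        )
        for label, pts in groups.items()
    }


def flag_outliers(points, labels):
    """Indices of points whose nearest centroid belongs to another label."""
    centers = centroids(points, labels)
    flagged = []
    for index, (point, label) in enumerate(zip(points, labels)):
        nearest = min(centers, key=lambda name: math.dist(point, centers[name]))
        if nearest != label:
            flagged.append(index)
    return flagged


# Made-up coordinates: the last "negative" point sits in the positive cluster.
points = [(0.1, 5.5), (0.3, 5.4), (0.2, 5.6), (8.0, 4.6), (7.9, 5.6), (7.8, 5.0), (0.2, 5.5)]
labels = ["positive"] * 3 + ["negative"] * 3 + ["negative"]
print(flag_outliers(points, labels))  # [6]
```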


In [9]:
import cluestar

# Create an interactive visualization with cluestar
cluestar.plot_text(
    X=reduced_features,
    texts=X_train["review"].to_list(),
    color_array=y_train.to_list(),
)

## 6. Key Takeaways

In this tutorial, you learned:

✅ **Basic Usage**: How to wrap DSPy programs with `DSPyMator` for sklearn compatibility

✅ **Prediction Methods**: 
- `predict()` returns only target field(s)
- `transform()` returns all output fields (including reasoning)

✅ **Chain of Thought**: Using `dspy.ChainOfThought` to get explainable predictions

✅ **Optimization**: Leveraging GEPA to automatically improve prompts based on training data

✅ **Pipeline Integration**: Building end-to-end pipelines combining LLM reasoning, embeddings, and dimensionality reduction

### Next Steps

- Experiment with other DSPy optimizers like `MIPROv2` or `BootstrapFewShot`
- Try different LLM providers (Anthropic, local models, etc.)
- Combine DSPyMator with traditional ML models in ensemble pipelines
- Explore multi-output predictions for richer feature extraction

### Learn More

- [DSPyMator Documentation](https://centimators.readthedocs.io/)
- [DSPy Official Docs](https://dspy.ai/)
- [Centimators GitHub](https://github.com/crowdcent/centimators)
- [Hugging Face Datasets](https://huggingface.co/docs/datasets/)
