Skip to content

[FEATURE] crewai[dspy] — Algorithmic prompt optimization via the existing LLM hooks #5818

@MoShiha

Description

@MoShiha

Feature Area

Core functionality

Is your feature request related to a an existing bug? Please link it here.

NA

Describe the solution you'd like

Problem

Developers building production CrewAI applications spend significant manual effort tuning role, goal, backstory, and task description fields by hand — iterating on prompts without a systematic method to measure improvement. This is the classic "prompt engineering treadmill": changes are based on intuition, results are hard to reproduce, and there is no objective signal for when a crew is actually better.

The community has been solving this with workarounds for over a year:

Every one of these workarounds monkey-patches the internal LLM call chain because there is no stable, documented seam to hook into — meaning they break on every CrewAI release.


Proposed Solution

Ship a crewai[dspy] optional extra that provides a DSPyOptimizer class — a thin adapter between CrewAI's existing infrastructure and DSPy's optimization algorithms (MIPROv2, BootstrapFewShot, GEPA, etc.).

Key insight: the infrastructure already exists

The LLM hooks system introduced in #1875 provides almost everything needed:

# crewai/hooks/llm_hooks.py — already shipped
from crewai.hooks.llm_hooks import (
    register_before_llm_call_hook,
    register_after_llm_call_hook,
    LLMCallHookContext,
)

# LLMCallHookContext already exposes:
# context.messages    — the full composed prompt (mutable in-place)
# context.agent       — the agent (role, goal, backstory, system_template)
# context.task        — the task (description, expected_output)
# context.crew        — the crew instance
# context.response    — the LLM's response (in after hooks)

A crewai[dspy] adapter would use these hooks to:

  1. During optimization runs: capture (messages, response) pairs and score them against a developer-provided metric function
  2. After convergence: write optimized instructions back to agent.role, agent.goal, agent.backstory, or agent.system_template
  3. At inference time: inject optimized few-shot examples into context.messages via a before hook

This is structurally identical to how crewai[mem0] plugs into the memory system — a framework-level optional extra, not a runtime tool.

Developer experience: before vs. after

Before (current workaround — breaks on CrewAI updates):

import dspy
from crewai import Crew, Agent, Task

# Monkey-patch the internal LLM method (version-coupled, fragile)
original_call = crew.agents[0].llm._call
def patched_call(prompt, **kwargs):
    response = original_call(prompt, **kwargs)
    dspy_module.update(prompt, response)
    return response
crew.agents[0].llm._call = patched_call

# Run optimizer separately with no awareness of crew structure
optimizer = dspy.MIPROv2(metric=my_metric)
# ... glue code to connect DSPy signatures to CrewAI agents ...

After (proposed crewai[dspy]):

from crewai import Crew, Agent, Task
from crewai.optimizers.dspy import DSPyOptimizer  # crewai[dspy]

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
)

def quality_metric(example, prediction) -> float:
    """Score output quality — any callable returning 0.0–1.0."""
    judge = dspy.ChainOfThought("output -> score: float")
    return float(judge(output=prediction.final_output).score)

optimizer = DSPyOptimizer(
    crew=crew,
    metric=quality_metric,
    algorithm="MIPROv2",           # or "BootstrapFewShot", "GEPA"
)

# Run optimization against labeled examples
result = optimizer.compile(trainset=my_examples, num_trials=20)

# Optimized crew — same interface, better prompts
optimized_crew = result.crew
optimized_crew.kickoff(inputs={"topic": "climate change"})

# Inspect what changed
print(result.score_delta)           # +0.18
print(result.optimized_instructions) # dict of agent_role -> new instructions

Implementation Scope

This proposal is scoped to three independently-reviewable PRs to stay within the contributing guide's size/XL threshold:

PR 1 — Stable read/write access to agent instructions (in core)

Confirm or add a documented, public path to read and write the effective instructions of an agent after construction. Today agent.role, agent.goal, agent.backstory, agent.system_template, and agent.prompt_template are all writable Pydantic fields, but this is undocumented as a public API.

Ask: Add a doc comment confirming these fields are stable and intended for programmatic rewrite, and add a helper agent.get_effective_system_prompt() -> str if one doesn't already exist.

Files: lib/crewai/src/crewai/agent/core.py
Size: < 50 lines

PR 2 — DSPyOptimizer as crewai[dspy] optional extra (in core)

Add lib/crewai/src/crewai/optimizers/dspy_optimizer.py and declare the optional extra:

# lib/crewai/pyproject.toml
[project.optional-dependencies]
dspy = ["dspy>=2.5,<3"]

The DSPyOptimizer class:

  • Registers before/after LLM call hooks during the optimization loop
  • Uses before_kickoff_callbacks / after_kickoff_callbacks on Crew to demarcate runs
  • Delegates to crew.train() mechanics for the outer training loop
  • Writes optimized instructions back via the documented agent fields from PR 1
  • Returns an OptimizationResult dataclass with crew, score_delta, optimized_instructions, version_id

Files: lib/crewai/src/crewai/optimizers/__init__.py, lib/crewai/src/crewai/optimizers/dspy_optimizer.py, lib/crewai/pyproject.toml
Size: ~300 lines

PR 3 — Example in crewai-examples

End-to-end notebook: email-drafting crew optimized with MIPROv2 against an LLM-judge metric. Adapts the working monkey-patch tutorial at Ronoh4/dspy_crewai_course into the clean API.


What This Is NOT

To be explicit about scope, given the history of related closed issues:

  • Not an "auto-improve" feature: DSPyOptimizer does not run automatically or connect to any hosted service. It is an offline, developer-invoked, local optimization loop — no different in kind from crew.train().
  • Not a replacement for manual prompt crafting: It is a tool for developers who want to measure and improve their crews against a metric they define.
  • Not a hosted prompt-management product: It stores optimized configs locally. It does not touch CrewAI Enterprise, AMP, or any cloud observability feature.
  • Not a new dependency in the default install: dspy is only installed when a developer explicitly runs pip install crewai[dspy].

Acceptance Criteria

  • pip install crewai[dspy] succeeds without errors
  • DSPyOptimizer(crew, metric).compile(trainset) runs an optimization loop and returns an OptimizationResult
  • The optimized crew returned by result.crew produces measurably better outputs on the training metric than the baseline
  • No change in behavior when crewai[dspy] is not installed (no import at module level)
  • The before/after LLM hook registration is cleaned up after compile() completes (no global state leak)
  • Works with any LLM provider supported by CrewAI (tested with at least OpenAI and Anthropic)
  • Example notebook runs end-to-end in crewai-examples

Additional Context

Prior art in CrewAI:

  • #1875 — DSPy-style callbacks accepted and shipped → the contribution shape this proposal follows
  • crew.train() / TaskEvaluator — the existing training loop this optimizer extends
  • crewai[mem0] — the optional-extra packaging pattern this follows

Prior art elsewhere:

Willingness to contribute: Yes — happy to submit PR 1 and PR 2 if the maintainer team signals interest in the approach. Would appreciate a comment confirming the proposed file locations and optional-extra name before starting.


Filed against: crewAIInc/crewAI main branch
Related: #1875, #3280, #3015
Label suggestion: feature-request, integration, Core functionality

Describe alternatives you've considered

Alternatives Considered

1. Leave it to the community (status quo)
The workaround ecosystem (courses, monkey-patch tutorials, third-party optimizers) handles it. Rejected: the workarounds break on every CrewAI release because they patch internals. A stable hook surface in core prevents this fragility — even if CrewAI never ships DSPyOptimizer itself, the hooks protect the community from churn.

2. Standalone crewai-dspy package (not in core)
Ship the adapter entirely outside the monorepo. Considered: this is viable and has precedent (LangChain's approach). Not preferred: the crewai[mem0] pattern (optional extra in core) gives tighter CI coupling — changes to hooks or agent internals are caught in the same test suite that tests the optimizer, rather than silently breaking a downstream package. The path for crewai-tools (external → absorbed into monorepo) suggests the maintainers prefer eventual consolidation.

3. Only add the hook documentation (PR 1 only)
Document the existing LLM hooks as the official stable seam without shipping an adapter. Also acceptable as a first step if the maintainer team prefers to let the community build the adapter first.

Additional context

No response

Willingness to Contribute

Yes, I'd be happy to submit a pull request

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions