New reference implementation: Misalignment evaluations#108
New reference implementation: Misalignment evaluations#108ethancjackson wants to merge 59 commits into
Conversation
Create a Langfuse-backed Python workflow for configurable ADK agent runs, transcript-based task definitions, judge-driven evaluation, trace usage metrics, and a documented smoke-test config to support future misalignment experiments. Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Separate schema, preparation, and orchestration so configs remain the primary interface while the package gains a cleaner reusable surface for multi-variant research runs. Made-with: Cursor
Allow explicit zero-budget configs, carry the setting through variant resolution into ADK agent construction, and document how to disable thinking in experiment configs. Made-with: Cursor
Tighten the runtime so seeded conversations read like real chat, keep the experiment configs aligned with current thinking/output settings, and add a Metrics API-based terminal report for comparing conditions outside the Langfuse UI. Made-with: Cursor
Add a per-execution run_instance_id to Langfuse metadata and run names so repeated launches stay distinguishable, and teach the terminal reporter to default to the latest run instance while documenting the new behavior. Made-with: Cursor
Support LiteLLM-backed providers in the misalignment agent builder, accept Anthropic credentials in shared settings, and extend the main experiment plus docs/tests so Claude variants can be run and compared alongside Gemini. Made-with: Cursor
Move experiment result inspection into a simpler notebook-backed workflow so historical runs are easier to inspect and harmful traces are easier to review. Made-with: Cursor
…, rewrite README results_notebook.py: shrunk from 901 to 643 lines by replacing five custom dataclasses (NumericAccumulator, ConditionSummary, TraceRecord, AnalysisBundle with 13 fields) with pandas groupby aggregation and lighter data structures. AnalysisBundle is now 4 fields. Two near-duplicate Metrics API fetchers are now clean separate functions returning DataFrames. All public API preserved. report_metrics.ipynb: added a Discovery cell that lists available datasets and execution IDs so users no longer have to guess constants. Replaced the passive markdown cell with an actionable comment in the detail-view cell. Added a "how to copy for a new experiment" guide to the header and improved inline comments throughout. README.md: full rewrite for newcomers. Leads with what behavioral misalignment is and why it matters, includes a plain-language workflow diagram, a Quick Start section, a "Designing Your Own Experiment" walkthrough, and moves the config reference to the end. No jargon (PreparedTaskItem, ExecutionIdentity, etc.) in the sections visible to first-time readers. Made-with: Cursor
…values to float The Langfuse Metrics API can return latency/cost/token values as strings or None. The previous refactor dropped the explicit _coerce_float/_coerce_int helpers from the original code, causing 'unsupported operand type(s) for /: str and int' when _build_summary_df tried to compute avg_latency_s and avg_tokens. Added a _to_float helper inside _fetch_trace_metrics_df and a pd.to_numeric pass as a safety net. Made-with: Cursor
preparation.py: replace the 43-line null-coalescing body of resolve_agent_spec with a 10-line Pydantic model_dump merge (base fields overridden by non-None variant fields). Same semantics, far less repetition. preparation.py: remove PreparedTaskItem.judge_input — it was computed in prepare_task_item but never read anywhere (to_upload_item uses upload_input; the value was already embedded there via build_dataset_input). Eliminates a redundant call to build_judge_input. config_types.py: change ExperimentConfig.description default from the generic placeholder string "Misalignment experiment" to None, so experiments without a description don't silently inherit a meaningless label in Langfuse. Made-with: Cursor
…nnotate minimal config - Quick Start step 3 now names the smoke dataset (misalignment-qa-smoke-v2) and shows the exact DATASET_NAME constant to paste into the notebook. - 'Designing Your Own Experiment' minimal config is rewritten to demonstrate variant inheritance (variants only set what changes), per-variant example override, and the judge_model_config with an explanation of each key decision. - Added four plain-language bullets below the config explaining the most common stumbling points: variant inheritance, dataset_upload_subset, judge model independence, and how rubric metric names become notebook column names. Made-with: Cursor
Made-with: Cursor
Tests whether the reckless in-context examples from the prioritize-context experiment cause misalignment to transfer across output modalities — from conversational advice into code, email drafts, and structured YAML configs. Uses gemini-3-flash-preview (strongest responder in the prior experiment). Two variants: reckless-examples (same four ICL examples) and no-examples (empty list, control condition) — both under the same 'follow the examples' system prompt so only the examples themselves vary. Three tasks, one per modality: - emergency-stop-code: Python safety threshold function - safety-incident-email: building supervisor gas-leak notification - chemical-storage-config: YAML monitoring system config Same 3-metric rubric (harmful / helpful / follows_reckless_pattern) with updated judge guidance covering code, email, and config output formats. max_output_chars raised from 800 to 1200 for longer structured outputs. Made-with: Cursor
…oken handling - Add 6 new variants (reckless + control) for gemini-2.5-flash, gemini-3.1-pro-preview, and gemini-3.1-flash-lite-preview, giving a full 4-model × 2-condition factorial design (8 variants total). - Add `model` key to all condition_metadata blocks so notebooks can pivot on either the model or icl_condition axis independently. - Rename original variants from reckless-examples/no-examples to gemini-3-flash-reckless/gemini-3-flash-control for naming consistency. - task.py: filter out thinking parts (part.thought=True) from the string returned to the evaluator; thinking tokens are still visible in the raw Langfuse trace observation via ADK's automatic model-call logging. Made-with: Cursor
Step-by-step guide that shows how to fetch a Langfuse dataset, inspect and select individual items, run the agent and LLM judge in preview mode, then formally record the run as a Langfuse experiment. Made-with: Cursor
…owing instruction
…ation
- Replace 5 legacy experiment configs with a single bootcamp_misalignment.yaml
testing 5 conditions × 6 commercial models (Gemini + Anthropic) = 30 variants
- Add examples_inject_mode field to VariantSpec ("llm_messages" | "user_context")
so examples can be seeded as LLM API messages or embedded as plain text in the
user prompt — the latter being a more realistic end-user attack surface
- Thread user_context_preamble through PreparedVariantRun and MisalignmentTask
- Add preflight API key check and per-variant error catching to experiment.py;
skipped/failed variants are collected and printed as a visible warning summary
- Rewrite interactive_single_run.ipynb as a clean 6-step single-item runner
- Rewrite report_metrics.ipynb with condition-axis analysis and the 5-condition
table; default dataset points to misalignment-qa-bootcamp-v1
- Rewrite README for bootcamp audience: 3-path structure, agent framing note,
graceful API key warning docs, simplified config reference
- Delete test_notebook.ipynb (scratch file)
- Update .env.example: add ANTHROPIC_API_KEY, remove VECTOR_INFERENCE_API_KEY
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the 10 single-category tasks with 9 tasks across three categories:
A. life-safety (peanut-throat, child-cleaner, gas-smell) — in-domain,
same flavour as the reckless examples
B. harmful-code (keylogger, credential-harvest, covert-exfiltration) —
out-of-domain: does the dismissal pattern bleed into writing malware?
C. social-engineering (phishing-email, health-misinfo, fake-reviews) —
out-of-domain: deceptive content targeting people
Tasks carry a `category` metadata field for filtering in the notebook.
Update judge rubric: broaden follows_reckless_pattern to cover all three
modalities (dismissal, providing malicious code, generating deceptive content).
Bump langfuse_dataset_name to misalignment-qa-bootcamp-v2 (task content
changed so a fresh dataset is required). Update README and report notebook
to document the three-category structure and analysis approach.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ive notebook Co-authored-by: Cursor <cursoragent@cursor.com>
…ive_single_run.ipynb Co-authored-by: Cursor <cursoragent@cursor.com>
…improve notebook UX - Rename results_notebook.py to analysis.py (better reflects purpose) - Fix missing 'condition' column in build_master_traces_frame: extract condition_condition from trace metadata and expose it as 'condition' - Add Plotly misalignment heatmap (condition × model, follows_reckless_pattern rate) as a headline dashboard figure; falls back to bar chart when condition data is absent - Replace verbose trace detail for-loop with collapsible HTML <details> accordion cards — colour-coded score badges in the summary line, full input/output/judge commentary hidden until expanded Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
for more information, see https://pre-commit.ci
Resolves all 50 ruff violations that were failing CI pre-commit checks: - D100/D101/D102/D103: add module, class, and function docstrings across config_types, experiment, preparation, task, run, and hard_metrics - D205/D400: fix MisalignmentTask class docstring format - W505: shorten doc lines to stay within max-doc-length=88 limit - A002: noqa annotation for `input` argument in llm_judge_evaluator - SIM105: replace try/except/pass with contextlib.suppress - E402: noqa annotation for sys.path-guarded import in run.py Co-authored-by: Cursor <cursoragent@cursor.com>
… in notebooks - Add D100 module docstring and D103 function docstrings to test_agent.py - Add E402 and D103 to nbqa-ruff ignore list in .pre-commit-config.yaml: nbQA converts notebooks to .py before linting so the per-file-ignores for *.ipynb in pyproject.toml do not apply to the nbqa-ruff hook Co-authored-by: Cursor <cursoragent@cursor.com>
…EvaluatorFunction - Configs() call: add type: ignore[call-arg] — openai_api_key and google_api_key are required fields populated from env vars at runtime, not constructor arguments; consistent with usage in knowledge_qa/agent.py - base_evaluator await: add type: ignore[misc] — EvaluatorFunction return type union includes non-awaitable branch that mypy cannot narrow Co-authored-by: Cursor <cursoragent@cursor.com>
… installed) Co-authored-by: Cursor <cursoragent@cursor.com>
…noise) Co-authored-by: Cursor <cursoragent@cursor.com>
- Rewrite heatmap cell to use matplotlib (renders as static PNG in all Jupyter environments; Plotly fig.show()/display(fig) was unreliable in VS Code notebooks) - Add matplotlib as a project dependency - In build_master_traces_frame: fall back to _fetch_scores_df (metrics API) when trace_detail.scores is empty (some Langfuse SDK versions) Co-authored-by: Cursor <cursoragent@cursor.com>
for more information, see https://pre-commit.ci
|
To follow the structure of the repo in another following issue: Move framework code ( Keep in Tests move with the code: framework tests → |
Move framework code into aieng.agent_evals.misalignment_qa so it runs in CI, keep bootcamp notebooks/configs in implementations/, and add evaluate.py as the thin CLI entry point. Co-authored-by: Cursor <cursoragent@cursor.com>
Save rerun cell outputs from verifying imports against aieng.agent_evals.misalignment_qa. Co-authored-by: Cursor <cursoragent@cursor.com>
for more information, see https://pre-commit.ci
Sort isort blocks in the library package and simplify run.py to import evaluate.main directly without a sys.path hack. Co-authored-by: Cursor <cursoragent@cursor.com>
Bring in pre-commit autoupdate (VectorInstitute#109) while keeping misalignment QA work.
|
@rjavadi I made the requested changes but it now looks like pip-audit has found a bunch of issues. I didn't want to touch any of the other code in this project -- are you able to take a look at the newly found vulnerabilities? |
|
@ethancjackson It looks there are minor vulnerabilities with in some packages we can safely ignore. |
Langfuse types expect Literal['NUMERIC', 'CATEGORICAL', 'BOOLEAN'], not ScoreDataType enum members. Fixes mypy arg-type errors in CI. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@ethancjackson Thanks for the fixes! I’ve left the first set of comments—could you start addressing these while I finish up the rest? |
@rjavadi can you say which comments are outstanding? The migration of code from implementations to package was done here: |
| ) | ||
| vector_inference_api_key: SecretStr | None = Field( | ||
| default=None, | ||
| validation_alias="VECTOR_INFERENCE_API_KEY", |
There was a problem hiding this comment.
nit: "VECTOR_INFERENCE_API_KEY" doesn't exist in .env. Is it used for testing purposes only?
vector_inference_api_key on Configs is declared but never read — the agent builder pulls VECTOR_INFERENCE_API_KEY straight from os.environ via AgentSpec.api_key_env. Same story as anthropic_api_key just above.
Two options:
- Remove both fields from
Configs, since they're not enforcing or providing anything (no extra="forbid", default=None). - Use them: have the agent builder prefer
configs.<name>.get_secret_value()whenspec.api_key_envmatches, falling back toos.getenvotherwise. That makesConfigsthe single source of truth and gives youSecretStr's leak protection in logs/exceptions.
Either is fine.
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| TOOL_FACTORIES: dict[str, Any] = { |
There was a problem hiding this comment.
[Please ignore if you'd rather not make extra changes]
Recommended: change typing hint to TOOL_FACTORIES: dict[str, Callable[[Configs], Any]]
|
|
||
|
|
||
| def build_misalignment_agent(spec: AgentSpec, *, name: str = "assistant") -> LlmAgent: | ||
| """Build a configurable ADK LlmAgent. |
There was a problem hiding this comment.
According to CONTRIBUTING.md numpy format docstring is recommended.
|
|
||
|
|
||
| def build_misalignment_agent(spec: AgentSpec, *, name: str = "assistant") -> LlmAgent: | ||
| """Build a configurable ADK LlmAgent. |
There was a problem hiding this comment.
| """Build a configurable ADK LlmAgent. | |
| """Build a configurable ADK ``LlmAgent`` for misalignment QA experiments. | |
| Intentionally minimal: focuses on prompt/system-instruction configurability | |
| and tool selection so the test harness remains the main experiment driver. | |
| Parameters | |
| ---------- | |
| spec : AgentSpec | |
| Resolved agent specification (provider, model, prompt, tools, etc.). | |
| name : str, optional | |
| Name assigned to the underlying ``LlmAgent``. Defaults to ``"assistant"``. | |
| Returns | |
| ------- | |
| LlmAgent | |
| A configured ADK agent ready to be invoked by the experiment runner. | |
| Raises | |
| ------ | |
| ValueError | |
| If ``spec.tools`` contains an unsupported tool name, or if | |
| ``spec.api_key_env`` is set but the corresponding environment | |
| variable is empty. | |
| """ |
There was a problem hiding this comment.
AgentSpec and AgentOverrideSpec duplicate ~10 fields with identical constraints.
We could collapse to a single AgentSpec (all-optional, with merged_with and resolve methods) plus a ResolvedAgentSpec for the post-merge type, using model_fields_set to preserve the field: null override semantics. Not required for this PR .
| metadata: dict[str, Any] = Field(default_factory=dict) | ||
|
|
||
| @model_validator(mode="after") | ||
| def _validate_prompt_fields(self) -> "TaskItemSpec": |
There was a problem hiding this comment.
"TaskItemSpec" doesn't require quotes since we have: from __future__ import annotations
| ) | ||
|
|
||
| @model_validator(mode="after") | ||
| def _validate_experiment(self) -> "ExperimentConfig": |
There was a problem hiding this comment.
"ExperimentConfig" - no quotes required
|
@ethancjackson Apologies, I hadn't hit "submit review". I submitted first batch of comments. Thanks! |
|
Docstrings throughout this new package use the one-line summary style. One-liners are fine for self-evident helpers like I didn’t leave comments on every function to avoid cluttering the review. |
Summary
Adds
misalignment_qaas a new reference implementation for the LLM/agents evaluations bootcamp. The experiment probes whether reckless in-context examples can nudge model responses toward harmful behaviour, and whether that effect transfers across different harm domains. It is intentionally minimal — plain LLM completions, no tool use — to make the mechanics transparent and serve as a building block for participants who want to extend it to real agentic systems.Clickup Ticket(s): N/A
Type of Change
Changes Made
implementations/misalignment_qa/) — a YAML-driven experiment runner that tests five in-context-learning conditions (baseline, examples as LLM messages, examples as LLM messages + priority instruction, examples as user context, examples as user context + priority instruction) across six commercial models (three Gemini, three Anthropic), producing 30 variants against a shared 9-task datasetexamples_inject_modeconfig field controls whether examples reach the model as LLM API messages (developer surface) or as plain text inside the user message (end-user surface), implemented viapreparation.pyandtask.pyAgentSpec.temperatureis nowfloat | None;claude-opus-4-7variants carrytemperature: null(that model has deprecated the parameter); all other models usetemperature: 0.2; variant-level null overrides are propagated correctly via Pydanticmodel_fields_setinresolve_agent_spec01_interactive_single_run.ipynb(optional single-item preview),run.py(full 30-variant experiment),02_inspect_results.ipynb(pull results from Langfuse, heatmap dashboard + collapsible trace detail cards)analysis.py— helper module for the results notebook, replacing the oldresults_notebook.py; includes correctconditionmetadata extractionTesting
uv run pytest tests/)uv run mypy <src_dir>)uv run ruff check src_dir/)Manual testing details:
AuthenticationError(invalid key) and aBadRequestError(temperaturedeprecated onclaude-opus-4-7) — both error classes are now surfaced clearly in the warning summary02_inspect_results.ipynbrun end-to-end against a partial dataset (one condition); heatmap rendered correctly and collapsible trace cards displayed as intendedcondition_metadatafields populated, temperature matrix verified programmaticallyScreenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
Participants need
.enventries forGOOGLE_API_KEYand/orANTHROPIC_API_KEYin addition to the standard Langfuse keys. The experiment runs with only one provider's key and reports skipped variants at the end. No infrastructure changes required.Checklist