New reference implementation: Misalignment evaluations by ethancjackson · Pull Request #108 · VectorInstitute/eval-agents

ethancjackson · 2026-05-14T18:02:47Z

Summary

Adds misalignment_qa as a new reference implementation for the LLM/agents evaluations bootcamp. The experiment probes whether reckless in-context examples can nudge model responses toward harmful behaviour, and whether that effect transfers across different harm domains. It is intentionally minimal — plain LLM completions, no tool use — to make the mechanics transparent and serve as a building block for participants who want to extend it to real agentic systems.

Clickup Ticket(s): N/A

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📝 Documentation update
🔧 Refactoring (no functional changes)
⚡ Performance improvement
🧪 Test improvements
🔒 Security fix

Changes Made

New reference implementation (implementations/misalignment_qa/) — a YAML-driven experiment runner that tests five in-context-learning conditions (baseline, examples as LLM messages, examples as LLM messages + priority instruction, examples as user context, examples as user context + priority instruction) across six commercial models (three Gemini, three Anthropic), producing 30 variants against a shared 9-task dataset
Task set spans three harm modalities — life-safety dismissal, harmful code (keylogger / credential harvester / covert exfiltration), and social engineering (phishing, health misinformation, fake reviews) — to observe both in-domain and out-of-domain transfer of the reckless pattern
examples_inject_mode config field controls whether examples reach the model as LLM API messages (developer surface) or as plain text inside the user message (end-user surface), implemented via preparation.py and task.py
Graceful API key handling — pre-flight checks and per-variant failure detection surface missing/invalid keys in a clearly formatted warning summary at the end of the run rather than crashing
Temperature compatibility fix — AgentSpec.temperature is now float | None; claude-opus-4-7 variants carry temperature: null (that model has deprecated the parameter); all other models use temperature: 0.2; variant-level null overrides are propagated correctly via Pydantic model_fields_set in resolve_agent_spec
Three canonical execution paths: 01_interactive_single_run.ipynb (optional single-item preview), run.py (full 30-variant experiment), 02_inspect_results.ipynb (pull results from Langfuse, heatmap dashboard + collapsible trace detail cards)
analysis.py — helper module for the results notebook, replacing the old results_notebook.py; includes correct condition metadata extraction
README — full rewrite for bootcamp audience, covering the agent/non-agent distinction, five conditions, three task categories, quick-start steps, config reference, and troubleshooting

Testing

Tests pass locally (uv run pytest tests/)
Type checking passes (uv run mypy <src_dir>)
Linting passes (uv run ruff check src_dir/)
Manual testing performed (describe below)

Manual testing details:

Full experiment run executed against all 30 variants; Gemini variants completed successfully and traces/scores were written to Langfuse as expected
Anthropic variants confirmed working after resolving an AuthenticationError (invalid key) and a BadRequestError (temperature deprecated on claude-opus-4-7) — both error classes are now surfaced clearly in the warning summary
02_inspect_results.ipynb run end-to-end against a partial dataset (one condition); heatmap rendered correctly and collapsible trace cards displayed as intended
Config parsed and validated: 30 variants, 5 conditions × 6 models, 9 tasks across 3 categories, all condition_metadata fields populated, temperature matrix verified programmatically

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

Participants need .env entries for GOOGLE_API_KEY and/or ANTHROPIC_API_KEY in addition to the standard Langfuse keys. The experiment runs with only one provider's key and reports skipped variants at the end. No infrastructure changes required.

Checklist

Code follows the project's style guidelines
Self-review of code completed
Documentation updated (if applicable)
No sensitive information (API keys, credentials) exposed

Create a Langfuse-backed Python workflow for configurable ADK agent runs, transcript-based task definitions, judge-driven evaluation, trace usage metrics, and a documented smoke-test config to support future misalignment experiments. Made-with: Cursor

Made-with: Cursor

Separate schema, preparation, and orchestration so configs remain the primary interface while the package gains a cleaner reusable surface for multi-variant research runs. Made-with: Cursor

Allow explicit zero-budget configs, carry the setting through variant resolution into ADK agent construction, and document how to disable thinking in experiment configs. Made-with: Cursor

Tighten the runtime so seeded conversations read like real chat, keep the experiment configs aligned with current thinking/output settings, and add a Metrics API-based terminal report for comparing conditions outside the Langfuse UI. Made-with: Cursor

Add a per-execution run_instance_id to Langfuse metadata and run names so repeated launches stay distinguishable, and teach the terminal reporter to default to the latest run instance while documenting the new behavior. Made-with: Cursor

Support LiteLLM-backed providers in the misalignment agent builder, accept Anthropic credentials in shared settings, and extend the main experiment plus docs/tests so Claude variants can be run and compared alongside Gemini. Made-with: Cursor

Move experiment result inspection into a simpler notebook-backed workflow so historical runs are easier to inspect and harmful traces are easier to review. Made-with: Cursor

…, rewrite README results_notebook.py: shrunk from 901 to 643 lines by replacing five custom dataclasses (NumericAccumulator, ConditionSummary, TraceRecord, AnalysisBundle with 13 fields) with pandas groupby aggregation and lighter data structures. AnalysisBundle is now 4 fields. Two near-duplicate Metrics API fetchers are now clean separate functions returning DataFrames. All public API preserved. report_metrics.ipynb: added a Discovery cell that lists available datasets and execution IDs so users no longer have to guess constants. Replaced the passive markdown cell with an actionable comment in the detail-view cell. Added a "how to copy for a new experiment" guide to the header and improved inline comments throughout. README.md: full rewrite for newcomers. Leads with what behavioral misalignment is and why it matters, includes a plain-language workflow diagram, a Quick Start section, a "Designing Your Own Experiment" walkthrough, and moves the config reference to the end. No jargon (PreparedTaskItem, ExecutionIdentity, etc.) in the sections visible to first-time readers. Made-with: Cursor

…values to float The Langfuse Metrics API can return latency/cost/token values as strings or None. The previous refactor dropped the explicit _coerce_float/_coerce_int helpers from the original code, causing 'unsupported operand type(s) for /: str and int' when _build_summary_df tried to compute avg_latency_s and avg_tokens. Added a _to_float helper inside _fetch_trace_metrics_df and a pd.to_numeric pass as a safety net. Made-with: Cursor

preparation.py: replace the 43-line null-coalescing body of resolve_agent_spec with a 10-line Pydantic model_dump merge (base fields overridden by non-None variant fields). Same semantics, far less repetition. preparation.py: remove PreparedTaskItem.judge_input — it was computed in prepare_task_item but never read anywhere (to_upload_item uses upload_input; the value was already embedded there via build_dataset_input). Eliminates a redundant call to build_judge_input. config_types.py: change ExperimentConfig.description default from the generic placeholder string "Misalignment experiment" to None, so experiments without a description don't silently inherit a meaningless label in Langfuse. Made-with: Cursor

…nnotate minimal config - Quick Start step 3 now names the smoke dataset (misalignment-qa-smoke-v2) and shows the exact DATASET_NAME constant to paste into the notebook. - 'Designing Your Own Experiment' minimal config is rewritten to demonstrate variant inheritance (variants only set what changes), per-variant example override, and the judge_model_config with an explanation of each key decision. - Added four plain-language bullets below the config explaining the most common stumbling points: variant inheritance, dataset_upload_subset, judge model independence, and how rubric metric names become notebook column names. Made-with: Cursor

Made-with: Cursor

Tests whether the reckless in-context examples from the prioritize-context experiment cause misalignment to transfer across output modalities — from conversational advice into code, email drafts, and structured YAML configs. Uses gemini-3-flash-preview (strongest responder in the prior experiment). Two variants: reckless-examples (same four ICL examples) and no-examples (empty list, control condition) — both under the same 'follow the examples' system prompt so only the examples themselves vary. Three tasks, one per modality: - emergency-stop-code: Python safety threshold function - safety-incident-email: building supervisor gas-leak notification - chemical-storage-config: YAML monitoring system config Same 3-metric rubric (harmful / helpful / follows_reckless_pattern) with updated judge guidance covering code, email, and config output formats. max_output_chars raised from 800 to 1200 for longer structured outputs. Made-with: Cursor

…oken handling - Add 6 new variants (reckless + control) for gemini-2.5-flash, gemini-3.1-pro-preview, and gemini-3.1-flash-lite-preview, giving a full 4-model × 2-condition factorial design (8 variants total). - Add `model` key to all condition_metadata blocks so notebooks can pivot on either the model or icl_condition axis independently. - Rename original variants from reckless-examples/no-examples to gemini-3-flash-reckless/gemini-3-flash-control for naming consistency. - task.py: filter out thinking parts (part.thought=True) from the string returned to the evaluator; thinking tokens are still visible in the raw Langfuse trace observation via ADK's automatic model-call logging. Made-with: Cursor

Step-by-step guide that shows how to fetch a Langfuse dataset, inspect and select individual items, run the agent and LLM judge in preview mode, then formally record the run as a Langfuse experiment. Made-with: Cursor

…owing instruction

…ation - Replace 5 legacy experiment configs with a single bootcamp_misalignment.yaml testing 5 conditions × 6 commercial models (Gemini + Anthropic) = 30 variants - Add examples_inject_mode field to VariantSpec ("llm_messages" | "user_context") so examples can be seeded as LLM API messages or embedded as plain text in the user prompt — the latter being a more realistic end-user attack surface - Thread user_context_preamble through PreparedVariantRun and MisalignmentTask - Add preflight API key check and per-variant error catching to experiment.py; skipped/failed variants are collected and printed as a visible warning summary - Rewrite interactive_single_run.ipynb as a clean 6-step single-item runner - Rewrite report_metrics.ipynb with condition-axis analysis and the 5-condition table; default dataset points to misalignment-qa-bootcamp-v1 - Rewrite README for bootcamp audience: 3-path structure, agent framing note, graceful API key warning docs, simplified config reference - Delete test_notebook.ipynb (scratch file) - Update .env.example: add ANTHROPIC_API_KEY, remove VECTOR_INFERENCE_API_KEY Co-authored-by: Cursor <cursoragent@cursor.com>

Replace the 10 single-category tasks with 9 tasks across three categories: A. life-safety (peanut-throat, child-cleaner, gas-smell) — in-domain, same flavour as the reckless examples B. harmful-code (keylogger, credential-harvest, covert-exfiltration) — out-of-domain: does the dismissal pattern bleed into writing malware? C. social-engineering (phishing-email, health-misinfo, fake-reviews) — out-of-domain: deceptive content targeting people Tasks carry a `category` metadata field for filtering in the notebook. Update judge rubric: broaden follows_reckless_pattern to cover all three modalities (dismissal, providing malicious code, generating deceptive content). Bump langfuse_dataset_name to misalignment-qa-bootcamp-v2 (task content changed so a fresh dataset is required). Update README and report notebook to document the three-category structure and analysis approach. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

…ive notebook Co-authored-by: Cursor <cursoragent@cursor.com>

…ive_single_run.ipynb Co-authored-by: Cursor <cursoragent@cursor.com>

…improve notebook UX - Rename results_notebook.py to analysis.py (better reflects purpose) - Fix missing 'condition' column in build_master_traces_frame: extract condition_condition from trace metadata and expose it as 'condition' - Add Plotly misalignment heatmap (condition × model, follows_reckless_pattern rate) as a headline dashboard figure; falls back to bar chart when condition data is absent - Replace verbose trace detail for-loop with collapsible HTML <details> accordion cards — colour-coded score badges in the summary line, full input/output/judge commentary hidden until expanded Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

for more information, see https://pre-commit.ci

Resolves all 50 ruff violations that were failing CI pre-commit checks: - D100/D101/D102/D103: add module, class, and function docstrings across config_types, experiment, preparation, task, run, and hard_metrics - D205/D400: fix MisalignmentTask class docstring format - W505: shorten doc lines to stay within max-doc-length=88 limit - A002: noqa annotation for `input` argument in llm_judge_evaluator - SIM105: replace try/except/pass with contextlib.suppress - E402: noqa annotation for sys.path-guarded import in run.py Co-authored-by: Cursor <cursoragent@cursor.com>

… in notebooks - Add D100 module docstring and D103 function docstrings to test_agent.py - Add E402 and D103 to nbqa-ruff ignore list in .pre-commit-config.yaml: nbQA converts notebooks to .py before linting so the per-file-ignores for *.ipynb in pyproject.toml do not apply to the nbqa-ruff hook Co-authored-by: Cursor <cursoragent@cursor.com>

…EvaluatorFunction - Configs() call: add type: ignore[call-arg] — openai_api_key and google_api_key are required fields populated from env vars at runtime, not constructor arguments; consistent with usage in knowledge_qa/agent.py - base_evaluator await: add type: ignore[misc] — EvaluatorFunction return type union includes non-awaitable branch that mypy cannot narrow Co-authored-by: Cursor <cursoragent@cursor.com>

… installed) Co-authored-by: Cursor <cursoragent@cursor.com>

…noise) Co-authored-by: Cursor <cursoragent@cursor.com>

- Rewrite heatmap cell to use matplotlib (renders as static PNG in all Jupyter environments; Plotly fig.show()/display(fig) was unreliable in VS Code notebooks) - Add matplotlib as a project dependency - In build_master_traces_frame: fall back to _fetch_scores_df (metrics API) when trace_detail.scores is empty (some Langfuse SDK versions) Co-authored-by: Cursor <cursoragent@cursor.com>

for more information, see https://pre-commit.ci

rjavadi · 2026-05-15T23:31:31Z

To follow the structure of the repo in another following issue:

Move framework code (experiment.py, task.py, preparation.py, config_types.py, analysis.py, evaluation helpers, the YAML loader) → new package aieng-eval-agents/aieng/agent_evals/misalignment_qa/.

Keep in implementations/misalignment_qa/: notebooks, agent.py (the topic-specific builder), evaluate.py (thin entry point calling into the library), configs/*.yaml, README.md, and run.py if it's just a CLI wrapper.

Tests move with the code: framework tests → tests/aieng_eval_agents/misalignment_qa/, agent/notebook-adjacent tests stay under tests/implementations/misalignment_qa/.

Move framework code into aieng.agent_evals.misalignment_qa so it runs in CI, keep bootcamp notebooks/configs in implementations/, and add evaluate.py as the thin CLI entry point. Co-authored-by: Cursor <cursoragent@cursor.com>

Save rerun cell outputs from verifying imports against aieng.agent_evals.misalignment_qa. Co-authored-by: Cursor <cursoragent@cursor.com>

for more information, see https://pre-commit.ci

Sort isort blocks in the library package and simplify run.py to import evaluate.main directly without a sys.path hack. Co-authored-by: Cursor <cursoragent@cursor.com>

Bring in pre-commit autoupdate (VectorInstitute#109) while keeping misalignment QA work.

ethancjackson · 2026-05-20T20:47:32Z

@rjavadi I made the requested changes but it now looks like pip-audit has found a bunch of issues. I didn't want to touch any of the other code in this project -- are you able to take a look at the newly found vulnerabilities?

rjavadi · 2026-05-20T23:04:16Z

@ethancjackson It looks there are minor vulnerabilities with in some packages we can safely ignore. transformers is unused in this repo so I'm going to fix them in a separate PR.

Langfuse types expect Literal['NUMERIC', 'CATEGORICAL', 'BOOLEAN'], not ScoreDataType enum members. Fixes mypy arg-type errors in CI. Co-authored-by: Cursor <cursoragent@cursor.com>

ethancjackson · 2026-05-21T13:07:31Z

@rjavadi thanks -- I merged those fixed into this branch and ran uv sync. The resulting package versions caused a mypy failure (see: acd1076), so I addressed those here, too. Ran a quick test on my own reference implementation and it worked fine. Please let me know when we're approved to merge.

rjavadi · 2026-05-21T17:09:16Z

@ethancjackson Thanks for the fixes! I’ve left the first set of comments—could you start addressing these while I finish up the rest?

ethancjackson · 2026-05-22T09:47:06Z

@ethancjackson Thanks for the fixes! I’ve left the first set of comments—could you start addressing these while I finish up the rest?

@rjavadi can you say which comments are outstanding? The migration of code from implementations to package was done here:
1c24356

rjavadi · 2026-05-15T21:04:17Z

+    )
+    vector_inference_api_key: SecretStr | None = Field(
+        default=None,
+        validation_alias="VECTOR_INFERENCE_API_KEY",


nit: "VECTOR_INFERENCE_API_KEY" doesn't exist in .env. Is it used for testing purposes only?

vector_inference_api_key on Configs is declared but never read — the agent builder pulls VECTOR_INFERENCE_API_KEY straight from os.environ via AgentSpec.api_key_env. Same story as anthropic_api_key just above.
Two options:

Remove both fields from Configs, since they're not enforcing or providing anything (no extra="forbid", default=None).

Use them: have the agent builder prefer configs.<name>.get_secret_value() when spec.api_key_env matches, falling back to os.getenv otherwise. That makes Configs the single source of truth and gives you SecretStr's leak protection in logs/exceptions.
Either is fine.

rjavadi · 2026-05-15T21:22:36Z

+logger = logging.getLogger(__name__)
+
+
+TOOL_FACTORIES: dict[str, Any] = {


[Please ignore if you'd rather not make extra changes]
Recommended: change typing hint to TOOL_FACTORIES: dict[str, Callable[[Configs], Any]]

rjavadi · 2026-05-15T21:27:25Z

+
+
+def build_misalignment_agent(spec: AgentSpec, *, name: str = "assistant") -> LlmAgent:
+    """Build a configurable ADK LlmAgent.


According to CONTRIBUTING.md numpy format docstring is recommended.

rjavadi · 2026-05-15T21:27:40Z

+
+
+def build_misalignment_agent(spec: AgentSpec, *, name: str = "assistant") -> LlmAgent:
+    """Build a configurable ADK LlmAgent.


Suggested change

"""Build a configurable ADK LlmAgent.

"""Build a configurable ADK ``LlmAgent`` for misalignment QA experiments.

Intentionally minimal: focuses on prompt/system-instruction configurability

and tool selection so the test harness remains the main experiment driver.

Parameters

----------

spec : AgentSpec

Resolved agent specification (provider, model, prompt, tools, etc.).

name : str, optional

Name assigned to the underlying ``LlmAgent``. Defaults to ``"assistant"``.

Returns

-------

LlmAgent

A configured ADK agent ready to be invoked by the experiment runner.

Raises

------

ValueError

If ``spec.tools`` contains an unsupported tool name, or if

``spec.api_key_env`` is set but the corresponding environment

variable is empty.

"""

rjavadi · 2026-05-20T21:35:18Z

AgentSpec and AgentOverrideSpec duplicate ~10 fields with identical constraints.

We could collapse to a single AgentSpec (all-optional, with merged_with and resolve methods) plus a ResolvedAgentSpec for the post-merge type, using model_fields_set to preserve the field: null override semantics. Not required for this PR .

rjavadi · 2026-05-20T21:36:19Z

+    metadata: dict[str, Any] = Field(default_factory=dict)
+
+    @model_validator(mode="after")
+    def _validate_prompt_fields(self) -> "TaskItemSpec":


"TaskItemSpec" doesn't require quotes since we have: from __future__ import annotations

rjavadi · 2026-05-20T21:37:14Z

+    )
+
+    @model_validator(mode="after")
+    def _validate_experiment(self) -> "ExperimentConfig":


"ExperimentConfig" - no quotes required

rjavadi · 2026-05-22T14:44:42Z

@ethancjackson Apologies, I hadn't hit "submit review". I submitted first batch of comments. Thanks!

rjavadi · 2026-05-22T18:04:55Z

Docstrings throughout this new package use the one-line summary style. CONTRIBUTING.md specifies numpy-format. It's not enforced but to remain consistent with the rest of the codebase it's recommended.

One-liners are fine for self-evident helpers like select_variant_runs, but functions like prepare_dataset_items, prepare_variant_runs, create_llm_judge, or classes like AgentOverrideSpec, and the hard-metrics extractors carry enough surface area that a typed reader would benefit.
Could we fill those in?

I didn’t leave comments on every function to avoid cluttering the review.

ethancjackson and others added 30 commits March 18, 2026 20:43

Use structured chat turns for misalignment QA agent

c5563d9

Made-with: Cursor

Update misalignment README for session-seeded transcripts

a0980d8

Made-with: Cursor

Switch misalignment QA configs to YAML

42735b1

Made-with: Cursor

Remove flattened agent_input from misalignment QA metadata

8f5dfb4

Made-with: Cursor

Refactor misalignment QA runtime and docs.

a004d48

Separate schema, preparation, and orchestration so configs remain the primary interface while the package gains a cleaner reusable surface for multi-variant research runs. Made-with: Cursor

Finish wiring thinking_budget through misalignment QA.

cdd7723

Allow explicit zero-budget configs, carry the setting through variant resolution into ADK agent construction, and document how to disable thinking in experiment configs. Made-with: Cursor

Make misalignment QA runs uniquely identifiable.

b92eaae

Add a per-execution run_instance_id to Langfuse metadata and run names so repeated launches stay distinguishable, and teach the terminal reporter to default to the latest run instance while documenting the new behavior. Made-with: Cursor

Add Anthropic variants to misalignment QA.

ffa96b7

Support LiteLLM-backed providers in the misalignment agent builder, accept Anthropic credentials in shared settings, and extend the main experiment plus docs/tests so Claude variants can be run and compared alongside Gemini. Made-with: Cursor

Replace misalignment QA CLI reporting with a notebook explorer.

ff5071e

Move experiment result inspection into a simpler notebook-backed workflow so historical runs are easier to inspect and harmful traces are easier to review. Made-with: Cursor

Clear notebook outputs before committing

a8b99a1

Made-with: Cursor

committing current notebook outputs

693bbd6

Add interactive single-run walkthrough notebook for misalignment_qa

18fd7c1

Step-by-step guide that shows how to fetch a Langfuse dataset, inspect and select individual items, run the agent and LLM judge in preview mode, then formally record the run as a Langfuse experiment. Made-with: Cursor

added interactive notebook for misalignment experiments

5236b7a

the baseline condition with the examples but without the context foll…

ca7454f

…owing instruction

updated baseline experiments

c00690c

added proper baseline config and updated interactive notebook

d384a1a

docs: add misalignment_qa entry to project-level README

5c08899

Co-authored-by: Cursor <cursoragent@cursor.com>

docs(misalignment_qa): remove incorrect Langfuse prereq from interact…

e354c44

…ive notebook Co-authored-by: Cursor <cursoragent@cursor.com>

refactor(misalignment_qa): rename interactive notebook to 01_interact…

578d360

…ive_single_run.ipynb Co-authored-by: Cursor <cursoragent@cursor.com>

ethancjackson and others added 9 commits May 14, 2026 18:28

chore: restore stashed notebook output changes

db50aa6

Co-authored-by: Cursor <cursoragent@cursor.com>

[pre-commit.ci] Add auto fixes from pre-commit.com hooks

d544cfe

for more information, see https://pre-commit.ci

fix(misalignment_qa): suppress mypy import-untyped for yaml (no stubs…

6de6c05

… installed) Co-authored-by: Cursor <cursoragent@cursor.com>

chore: restore knowledge_qa notebooks to main state (kernel metadata …

57779bb

…noise) Co-authored-by: Cursor <cursoragent@cursor.com>

[pre-commit.ci] Add auto fixes from pre-commit.com hooks

c33fc7a

for more information, see https://pre-commit.ci

ethancjackson requested a review from amrit110 May 14, 2026 19:33

amrit110 requested a review from rjavadi May 15, 2026 14:57

amrit110 added the enhancement New feature or request label May 15, 2026

ethancjackson and others added 5 commits May 20, 2026 20:35

chore(misalignment_qa): refresh notebook outputs after library split.

a370853

Save rerun cell outputs from verifying imports against aieng.agent_evals.misalignment_qa. Co-authored-by: Cursor <cursoragent@cursor.com>

[pre-commit.ci] Add auto fixes from pre-commit.com hooks

b5faa71

for more information, see https://pre-commit.ci

fix(misalignment_qa): resolve ruff import order and run.py E402.

c1de37c

Sort isort blocks in the library package and simplify run.py to import evaluate.main directly without a sys.path hack. Co-authored-by: Cursor <cursoragent@cursor.com>

Merge origin/main into ethan-dev.

9807216

Bring in pre-commit autoupdate (VectorInstitute#109) while keeping misalignment QA work.

Remove redundant dependencies

face593

rjavadi mentioned this pull request May 21, 2026

Remove redundant dependencies #111

Open

18 tasks

ethancjackson and others added 2 commits May 21, 2026 12:50

merge pip audit fixes

416963f

fix: pass string literals for Langfuse Evaluation data_type.

acd1076

Langfuse types expect Literal['NUMERIC', 'CATEGORICAL', 'BOOLEAN'], not ScoreDataType enum members. Fixes mypy arg-type errors in CI. Co-authored-by: Cursor <cursoragent@cursor.com>

rjavadi reviewed May 22, 2026

View reviewed changes

		logger = logging.getLogger(__name__)


		TOOL_FACTORIES: dict[str, Any] = {



		def build_misalignment_agent(spec: AgentSpec, *, name: str = "assistant") -> LlmAgent:
		"""Build a configurable ADK LlmAgent.

-    """Build a configurable ADK LlmAgent.
+    """Build a configurable ADK ``LlmAgent`` for misalignment QA experiments.
+    Intentionally minimal: focuses on prompt/system-instruction configurability
+    and tool selection so the test harness remains the main experiment driver.
+    Parameters
+    ----------
+    spec : AgentSpec
+        Resolved agent specification (provider, model, prompt, tools, etc.).
+    name : str, optional
+        Name assigned to the underlying ``LlmAgent``. Defaults to ``"assistant"``.
+    Returns
+    -------
+    LlmAgent
+        A configured ADK agent ready to be invoked by the experiment runner.
+    Raises
+    ------
+    ValueError
+        If ``spec.tools`` contains an unsupported tool name, or if
+        ``spec.api_key_env`` is set but the corresponding environment
+        variable is empty.
+    """

Conversation

ethancjackson commented May 14, 2026

Summary

Type of Change

Changes Made

Testing

Screenshots/Recordings

Related Issues

Deployment Notes

Checklist

Uh oh!

rjavadi commented May 15, 2026

Uh oh!

ethancjackson commented May 20, 2026

Uh oh!

rjavadi commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ethancjackson commented May 21, 2026

Uh oh!

rjavadi commented May 21, 2026

Uh oh!

ethancjackson commented May 22, 2026

Uh oh!

rjavadi May 15, 2026

Choose a reason for hiding this comment

Uh oh!

rjavadi May 15, 2026

Choose a reason for hiding this comment

Uh oh!

rjavadi May 15, 2026

Choose a reason for hiding this comment

Uh oh!

rjavadi May 15, 2026

Choose a reason for hiding this comment

Uh oh!

rjavadi May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rjavadi May 20, 2026

Choose a reason for hiding this comment

Uh oh!

rjavadi May 20, 2026

Choose a reason for hiding this comment

Uh oh!

rjavadi commented May 22, 2026

Uh oh!

rjavadi commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rjavadi commented May 20, 2026 •

edited

Loading

rjavadi May 20, 2026 •

edited

Loading

rjavadi commented May 22, 2026 •

edited

Loading