
raise errors by default #320

Merged

xzrderek merged 12 commits into main from derekx/raise-on-assert on Nov 7, 2025

Conversation

@xzrderek (Contributor) commented Nov 7, 2025

Note

Introduce execute_pytest_with_exception_handling and wire it into evaluation_test to standardize evaluator error handling (configurable via EP_RAISE_EVAL_EXCEPTIONS), with comprehensive tests.

  • Pytest evaluation flow:
    • Replace inline try/except blocks in eval_protocol/pytest/evaluation_test.py with calls to execute_pytest_with_exception_handling for pointwise, groupwise, and all-mode execution.
  • Execution helper:
    • Add execute_pytest_with_exception_handling in eval_protocol/pytest/execution.py to wrap execute_pytest and, when EP_RAISE_EVAL_EXCEPTIONS is false, convert exceptions into evaluation_result (score=0, invalid) and Status.error on affected EvaluationRow(s); otherwise re-raise.
    • Update imports/types to support error annotation (EvaluateResult, Status, Any, os).
  • Tests:
    • Add tests/pytest/test_pytest_evaluator_error_handling.py covering pointwise/groupwise paths, multiple runs, custom/empty errors, input_rows, status codes, and reason formatting (with fixture forcing exception capture).

Written by Cursor Bugbot for commit 5fce8ab.
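As a rough sketch of the behavior the note describes — not the repository's actual code; the function name, row shape, and result fields here are simplified assumptions — the wrapper might look like:

```python
import os
from typing import Any, Callable, Dict, List


def _should_raise() -> bool:
    # Default to raising; suppress only when EP_RAISE_EVAL_EXCEPTIONS is explicitly off.
    return os.environ.get("EP_RAISE_EVAL_EXCEPTIONS", "true").lower() not in ("false", "0")


def execute_with_exception_handling(
    execute: Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]],
    rows: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run the evaluator; convert exceptions into error results when raising is disabled."""
    try:
        return execute(rows)
    except Exception as exc:
        if _should_raise():
            raise
        # Annotate each affected row with an invalid score-0 result and an error status.
        for row in rows:
            row["evaluation_result"] = {"score": 0, "is_valid": False, "reason": str(exc)}
            row["status"] = "error"
        return rows
```

In the real helper the rows are `EvaluationRow` objects and the result is an `EvaluateResult`; dicts stand in for both here.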


@cursor bot left a comment


Bug: Inconsistent Exception Handling Across Evaluation Modes

The "all" mode (batch mode) lacks the exception handling added to pointwise and groupwise modes. While pointwise and groupwise modes now capture non-assert exceptions and convert them to evaluation results with score=0 when EP_CAPTURE_EVAL_EXCEPTIONS=1, the "all" mode has no try-except block around execute_pytest, causing non-assert exceptions to propagate and fail the entire test. This creates inconsistent exception handling behavior across the three evaluation modes.

eval_protocol/pytest/evaluation_test.py#L567-L600

```python
    all_results[run_idx] = results
else:
    # Batch mode: collect all results first, then evaluate (no pipelining)
    input_dataset = []
    async for row in rollout_processor_with_retry(
        rollout_processor, fresh_dataset, config, run_idx
    ):
        input_dataset.append(row)
    # NOTE: we will still evaluate errored rows (give users control over this)
    # i.e., they can choose to give EvaluateResult.score = 0 for errored rows in their test_func
    primary_rollout_id = (
        input_dataset[0].execution_metadata.rollout_id if input_dataset else None
    )
    group_rollout_ids = [
        r.execution_metadata.rollout_id
        for r in input_dataset
        if r.execution_metadata.rollout_id
    ]
    async with rollout_logging_context(
        primary_rollout_id or "",
        experiment_id=experiment_id,
        run_id=run_id,
        rollout_ids=group_rollout_ids or None,
    ):
        results = await execute_pytest(
            test_func,
            processed_dataset=input_dataset,
            evaluation_test_kwargs=kwargs.get("evaluation_test_kwargs") or {},
        )
        if (
            results is None
            or not isinstance(results, list)
            or not all(isinstance(r, EvaluationRow) for r in results)
        ):
```
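One way the batch ("all") path could be aligned with the other modes — a hedged sketch of the bot's suggestion, not the repository's code; `run_batch_mode` and the dict-based rows are hypothetical — is to guard the call the same way the pointwise and groupwise paths do:

```python
from typing import Any, Callable, Dict, List


def run_batch_mode(
    execute: Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]],
    input_dataset: List[Dict[str, Any]],
    capture_exceptions: bool,
) -> List[Dict[str, Any]]:
    """Batch-mode evaluation with the same exception policy as the other modes."""
    try:
        return execute(input_dataset)
    except AssertionError:
        # Assertion failures always propagate; pytest threshold checks rely on this.
        raise
    except Exception as exc:
        if not capture_exceptions:
            raise
        # Convert the failure into score-0 error results on every row in the batch.
        for row in input_dataset:
            row["evaluation_result"] = {"score": 0, "is_valid": False, "reason": str(exc)}
            row["status"] = "error"
        return input_dataset
```

The key point is that all three modes then share one policy, instead of batch mode letting every non-assert exception fail the whole test.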




@cursor bot left a comment


Bug: Inconsistent Exception Handling Across Evaluation Modes

The "all" mode (batch mode) lacks the exception handling logic added to "pointwise" and "groupwise" modes. When execute_pytest is called in "all" mode, AssertionError and other exceptions aren't handled consistently with the other modes: there is no explicit AssertionError re-raise and no conditional exception capture based on EP_CAPTURE_EVAL_EXCEPTIONS. This creates inconsistent behavior where assertion failures and exception handling differ depending on the evaluation mode.

eval_protocol/pytest/evaluation_test.py#L594-L615

```python
        ):
            results = await execute_pytest(
                test_func,
                processed_dataset=input_dataset,
                evaluation_test_kwargs=kwargs.get("evaluation_test_kwargs") or {},
            )
            if (
                results is None
                or not isinstance(results, list)
                or not all(isinstance(r, EvaluationRow) for r in results)
            ):
                raise ValueError(
                    f"Test function {test_func.__name__} did not return a list of EvaluationRow instances. You must return a list of EvaluationRow instances from your test function decorated with @evaluation_test."
                )
            if not results:
                raise ValueError(
                    f"Test function {test_func.__name__} returned an empty list. You must return a non-empty list of EvaluationRow instances from your test function decorated with @evaluation_test."
                )
            all_results[run_idx] = results
            for r in results:
                add_cost_metrics(r)
```



```python
raise ValueError("Neither processed_row nor processed_dataset was provided")
# Default: raise exceptions unless explicitly disabled
else:
    raise
```

Bug: Assertion Errors Must Always Raise

AssertionError is caught and converted to error results when EP_RAISE_EVAL_EXCEPTIONS is "false", breaking pytest assertions and threshold checks. The PR title "raise on assert failure" and existing tests (e.g., test_assertion_error_no_new_rollouts.py) expect AssertionError to always be raised. The exception handler should check for AssertionError and re-raise it unconditionally before checking the environment variable.
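The ordering the comment asks for can be sketched as follows — a hypothetical helper, not the PR's actual code; `handle_evaluator_exception` and its parameters are illustrative names:

```python
from typing import Any, Dict


def handle_evaluator_exception(exc: BaseException, capture_enabled: bool) -> Dict[str, Any]:
    """Decide whether an evaluator exception is re-raised or converted to a result."""
    if isinstance(exc, AssertionError):
        # Re-raise unconditionally, BEFORE consulting the env-var flag:
        # pytest assertions and threshold checks must always fail the test.
        raise exc
    if not capture_enabled:
        raise exc
    # Only non-assert exceptions, with capture enabled, become score-0 error results.
    return {"score": 0, "is_valid": False, "reason": str(exc)}
```

With this ordering, setting the capture flag can never swallow an assertion failure, which is what tests like test_assertion_error_no_new_rollouts.py appear to depend on.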


@xzrderek xzrderek changed the title from "raise on assert failure" to "raise errors by default" on Nov 7, 2025
@xzrderek xzrderek merged commit cf59d13 into main Nov 7, 2025
9 checks passed
@xzrderek xzrderek deleted the derekx/raise-on-assert branch November 7, 2025 23:09

3 participants