
raise errors by default #320

Merged

xzrderek merged 12 commits into main from derekx/raise-on-assert on Nov 7, 2025

Conversation

@xzrderek (Contributor) commented Nov 7, 2025

Note

Introduce execute_pytest_with_exception_handling and wire it into evaluation_test to standardize evaluator error handling (configurable via EP_RAISE_EVAL_EXCEPTIONS), with comprehensive tests.

  • Pytest evaluation flow:
    • Replace inline try/except blocks in eval_protocol/pytest/evaluation_test.py with calls to execute_pytest_with_exception_handling for pointwise, groupwise, and all-mode execution.
  • Execution helper:
    • Add execute_pytest_with_exception_handling in eval_protocol/pytest/execution.py to wrap execute_pytest and, when EP_RAISE_EVAL_EXCEPTIONS is false, convert exceptions into evaluation_result (score=0, invalid) and Status.error on affected EvaluationRow(s); otherwise re-raise.
    • Update imports/types to support error annotation (EvaluateResult, Status, Any, os).
  • Tests:
    • Add tests/pytest/test_pytest_evaluator_error_handling.py covering pointwise/groupwise paths, multiple runs, custom/empty errors, input_rows, status codes, and reason formatting (with fixture forcing exception capture).

Written by Cursor Bugbot for commit 5fce8ab.
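As a rough sketch of the behavior the note describes — not the repository's actual code; the function name, row shape, and result fields here are simplified assumptions — the wrapper might look like:

```python
import os
from typing import Any, Callable, Dict, List


def _should_raise() -> bool:
    # Default to raising; suppress only when EP_RAISE_EVAL_EXCEPTIONS is explicitly off.
    return os.environ.get("EP_RAISE_EVAL_EXCEPTIONS", "true").lower() not in ("false", "0")


def execute_with_exception_handling(
    execute: Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]],
    rows: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run the evaluator; convert exceptions into error results when raising is disabled."""
    try:
        return execute(rows)
    except Exception as exc:
        if _should_raise():
            raise
        # Annotate each affected row with an invalid score-0 result and an error status.
        for row in rows:
            row["evaluation_result"] = {"score": 0, "is_valid": False, "reason": str(exc)}
            row["status"] = "error"
        return rows
```

In the real helper the rows are `EvaluationRow` objects and the result is an `EvaluateResult`; dicts stand in for both here.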


@cursor bot left a comment


Bug: Inconsistent Exception Handling Across Evaluation Modes

The "all" mode (batch mode) lacks the exception handling added to pointwise and groupwise modes. While pointwise and groupwise modes now capture non-assert exceptions and convert them to evaluation results with score=0 when EP_CAPTURE_EVAL_EXCEPTIONS=1, the "all" mode has no try-except block around execute_pytest, causing non-assert exceptions to propagate and fail the entire test. This creates inconsistent exception handling behavior across the three evaluation modes.

eval_protocol/pytest/evaluation_test.py#L567-L600

```python
    all_results[run_idx] = results
else:
    # Batch mode: collect all results first, then evaluate (no pipelining)
    input_dataset = []
    async for row in rollout_processor_with_retry(
        rollout_processor, fresh_dataset, config, run_idx
    ):
        input_dataset.append(row)
    # NOTE: we will still evaluate errored rows (give users control over this)
    # i.e., they can choose to give EvaluateResult.score = 0 for errored rows in their test_func
    primary_rollout_id = (
        input_dataset[0].execution_metadata.rollout_id if input_dataset else None
    )
    group_rollout_ids = [
        r.execution_metadata.rollout_id
        for r in input_dataset
        if r.execution_metadata.rollout_id
    ]
    async with rollout_logging_context(
        primary_rollout_id or "",
        experiment_id=experiment_id,
        run_id=run_id,
        rollout_ids=group_rollout_ids or None,
    ):
        results = await execute_pytest(
            test_func,
            processed_dataset=input_dataset,
            evaluation_test_kwargs=kwargs.get("evaluation_test_kwargs") or {},
        )
        if (
            results is None
            or not isinstance(results, list)
            or not all(isinstance(r, EvaluationRow) for r in results)
        ):
```
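One way the batch ("all") path could be aligned with the other modes — a hedged sketch of the bot's suggestion, not the repository's code; `run_batch_mode` and the dict-based rows are hypothetical — is to guard the call the same way the pointwise and groupwise paths do:

```python
from typing import Any, Callable, Dict, List


def run_batch_mode(
    execute: Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]],
    input_dataset: List[Dict[str, Any]],
    capture_exceptions: bool,
) -> List[Dict[str, Any]]:
    """Batch-mode evaluation with the same exception policy as the other modes."""
    try:
        return execute(input_dataset)
    except AssertionError:
        # Assertion failures always propagate; pytest threshold checks rely on this.
        raise
    except Exception as exc:
        if not capture_exceptions:
            raise
        # Convert the failure into score-0 error results on every row in the batch.
        for row in input_dataset:
            row["evaluation_result"] = {"score": 0, "is_valid": False, "reason": str(exc)}
            row["status"] = "error"
        return input_dataset
```

The key point is that all three modes then share one policy, instead of batch mode letting every non-assert exception fail the whole test.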




@cursor bot left a comment


Bug: Inconsistent Exception Handling Across Evaluation Modes

The "all" mode (batch mode) lacks the exception handling logic added to "pointwise" and "groupwise" modes. When execute_pytest is called in "all" mode, AssertionError and other exceptions aren't handled consistently with the other modes: there is no explicit AssertionError re-raise and no conditional exception capture based on EP_CAPTURE_EVAL_EXCEPTIONS. This creates inconsistent behavior where assertion failures and exception handling differ depending on the evaluation mode.

eval_protocol/pytest/evaluation_test.py#L594-L615

```python
        ):
            results = await execute_pytest(
                test_func,
                processed_dataset=input_dataset,
                evaluation_test_kwargs=kwargs.get("evaluation_test_kwargs") or {},
            )
            if (
                results is None
                or not isinstance(results, list)
                or not all(isinstance(r, EvaluationRow) for r in results)
            ):
                raise ValueError(
                    f"Test function {test_func.__name__} did not return a list of EvaluationRow instances. You must return a list of EvaluationRow instances from your test function decorated with @evaluation_test."
                )
            if not results:
                raise ValueError(
                    f"Test function {test_func.__name__} returned an empty list. You must return a non-empty list of EvaluationRow instances from your test function decorated with @evaluation_test."
                )
            all_results[run_idx] = results
            for r in results:
                add_cost_metrics(r)
```



```python
raise ValueError("Neither processed_row nor processed_dataset was provided")
# Default: raise exceptions unless explicitly disabled
else:
    raise
```

Bug: Assertion Errors Must Always Raise

AssertionError is caught and converted to error results when EP_RAISE_EVAL_EXCEPTIONS is "false", breaking pytest assertions and threshold checks. The PR title "raise on assert failure" and existing tests (e.g., test_assertion_error_no_new_rollouts.py) expect AssertionError to always be raised. The exception handler should check for AssertionError and re-raise it unconditionally before checking the environment variable.
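The ordering the comment asks for can be sketched as follows — a hypothetical helper, not the PR's actual code; `handle_evaluator_exception` and its parameters are illustrative names:

```python
from typing import Any, Dict


def handle_evaluator_exception(exc: BaseException, capture_enabled: bool) -> Dict[str, Any]:
    """Decide whether an evaluator exception is re-raised or converted to a result."""
    if isinstance(exc, AssertionError):
        # Re-raise unconditionally, BEFORE consulting the env-var flag:
        # pytest assertions and threshold checks must always fail the test.
        raise exc
    if not capture_enabled:
        raise exc
    # Only non-assert exceptions, with capture enabled, become score-0 error results.
    return {"score": 0, "is_valid": False, "reason": str(exc)}
```

With this ordering, setting the capture flag can never swallow an assertion failure, which is what tests like test_assertion_error_no_new_rollouts.py appear to depend on.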


@xzrderek xzrderek changed the title from "raise on assert failure" to "raise errors by default" on Nov 7, 2025
@xzrderek xzrderek merged commit cf59d13 into main Nov 7, 2025
9 checks passed
@xzrderek xzrderek deleted the derekx/raise-on-assert branch November 7, 2025 23:09

3 participants