Checkpointing + Error Retry for Rollout Processor #80
    r.rollout_status.status = "error"
    r.rollout_status.termination_reason = str(e)
I actually think this try/except logic shouldn't live inside the rollouts. The retry logic and error handling should all happen outside of the rollout, to separate concerns.
we want to eventually implement this ticket, right? https://linear.app/fireworks/issue/FIR-4431/implement-exception-handling-by-using-python-backoff-library
That means the native Python exception handling should happen outside of the rollout processor. Right now, if you just catch all exceptions and use rollout_status to communicate that an error happened, the caller no longer has the original exception to act on.
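A minimal sketch of what "handling outside the rollout" could look like. The retry loop below is hand-rolled with the standard library as a stand-in for the `backoff` library named in the ticket; it is not that library's API. `TransientRolloutError`, `retry_with_backoff`, and `make_flaky_rollout` are hypothetical names used only for illustration. The point is that the rollout raises natively and the caller decides whether to retry:

```python
import asyncio


class TransientRolloutError(Exception):
    """Hypothetical error type standing in for a retryable failure."""


async def retry_with_backoff(make_call, max_tries=3, base_delay=0.01):
    """Retry make_call() on TransientRolloutError, doubling the delay each time."""
    for attempt in range(1, max_tries + 1):
        try:
            return await make_call()
        except TransientRolloutError:
            if attempt == max_tries:
                raise  # out of tries: the caller sees the native exception
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_rollout(failures_before_success):
    """Build a hypothetical rollout that raises instead of writing rollout_status."""
    calls = {"n": 0}

    async def rollout():
        calls["n"] += 1
        if calls["n"] <= failures_before_success:
            raise TransientRolloutError(f"attempt {calls['n']} failed")
        return f"ok after {calls['n']} attempts"

    return rollout


def run_demo():
    # Two failures, then success on the third try.
    return asyncio.run(retry_with_backoff(make_flaky_rollout(2)))
```

Because the rollout itself contains no try/except, the exception type is still available to the outer layer, which is what a retry decorator would need to decide between retrying and giving up.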
we should ideally remove the try/except altogether in the rollouts and have the outer evaluation_test handle everything
gotcha. the intention here is that, after yinghan's changes, default_mcp_gym_rollout_processor.py is what handles the rollout_status, so i thought the same should apply to single turn and agent. given that, should evaluation_test still handle it?
I think both evaluation test and rollout processors should update rollout status but the error handling should all be in evaluation test
the issue here is if we let the error handling be done in evaluation test, we lose per-row error tracking, and we will get a batch-level failure, which defeats the purpose of pipelining. in other words the entire rollout_processor fails, and there's no way to distinguish which row failed if we try to handle it in evaluation test.
what if you return a list of asyncio tasks from the rollout
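A rough sketch of that suggestion, assuming one task per row. `rollout_row`, `rollout`, and `evaluate` are hypothetical names, not the real processor API. The rollout only starts tasks and returns them; the outer layer awaits each task on its own, so a failure stays attached to its row instead of failing the whole batch:

```python
import asyncio


async def rollout_row(row):
    """Hypothetical single-row rollout; raises natively on failure."""
    await asyncio.sleep(0)
    if row < 0:
        raise ValueError(f"bad row {row}")
    return row * 2


def rollout(rows):
    """Start one task per row and return the tasks without awaiting them.

    Must be called from inside a running event loop.
    """
    return [asyncio.create_task(rollout_row(r)) for r in rows]


async def evaluate(rows):
    """Outer layer: owns all error handling, one row at a time."""
    tasks = rollout(rows)
    results = {}
    for row, task in zip(rows, tasks):
        try:
            results[row] = ("finished", await task)
        except Exception as e:  # the native exception is still available here
            results[row] = ("error", repr(e))
    return results


def run_demo():
    return asyncio.run(evaluate([1, -1, 3]))
```

Here the tasks are awaited in input order for simplicity; `asyncio.as_completed` could be used instead if results should be consumed as soon as each row finishes.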
i think that works for my case, but might mess with yinghan's case. lemme make the changes then ping him.
high level changes:
- --ep-max-retry, --ep-fail-on-permanent-failure
- RolloutProcessor. per convo below, treat rollout processors as "strategies". an additional benefit here is that if we want to do more sophisticated examples in the future, like dialog rollout, or for the SVG pelican case, the chrome rendering can be cached and kept in the processor.

let me know your thoughts.
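To illustrate the "strategy" idea, here is a minimal sketch of a processor as an object that keeps an expensive resource between calls. `CachingRolloutProcessor` and its methods are hypothetical, and the renderer is a plain function standing in for something like a browser session; none of this is the actual `RolloutProcessor` interface:

```python
import asyncio


class CachingRolloutProcessor:
    """Hypothetical strategy object that holds state across rollout calls."""

    def __init__(self):
        self._renderer = None
        self.setup_count = 0  # how many times the expensive setup ran

    def _get_renderer(self):
        # Build the (pretend) expensive renderer once, then reuse it.
        if self._renderer is None:
            self.setup_count += 1
            self._renderer = lambda svg: f"rendered:{svg}"
        return self._renderer

    async def _rollout_row(self, row):
        render = self._get_renderer()
        await asyncio.sleep(0)
        return render(row)

    def __call__(self, rows):
        # Return one task per row; the caller owns awaiting and error handling.
        return [asyncio.create_task(self._rollout_row(r)) for r in rows]


async def main():
    processor = CachingRolloutProcessor()
    first = await asyncio.gather(*processor(["a", "b"]))
    second = await asyncio.gather(*processor(["c"]))
    return first, second, processor.setup_count
```

With a plain function as the processor there is no natural place to keep the renderer; an instance gives it one, and the setup runs only once across both batches in `main`.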