Skip to content

Checkpointing + Error Retry for Rollout Processor#80

Merged
xzrderek merged 10 commits intomainfrom
derekx/new-checkpointing
Aug 14, 2025
Merged

Checkpointing + Error Retry for Rollout Processor#80
xzrderek merged 10 commits intomainfrom
derekx/new-checkpointing

Conversation

@xzrderek
Copy link
Contributor

@xzrderek xzrderek commented Aug 14, 2025

high level changes:

  • added retry / option to ignore "permanent failures" by adding flags --ep-max-retry, --ep-fail-on-permanent-failure
  • changed rollout processors to now return a list of async tasks for separation of concerns of all error handling happening within evaluation_test.py, checkout conversation below with dylan
  • also made some large changes to rollout processor abstractions. all rollout processors inherit from ABC RolloutProcessor. per convo below, treat rollout processors as "strategies". additional benefit here is that if we want to do more sophisticated examples in the future like dialog rollout, or for the SVG pelican case, the chrome rendering can be cached and kept in the processor. let me know your thoughts.

@xzrderek xzrderek requested review from benjibc and dphuang2 August 14, 2025 04:27
Comment on lines +115 to +116
r.rollout_status.status = "error"
r.rollout_status.termination_reason = str(e)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this try/except logic shouldn't live inside the rollouts. The retry logic and handling should happen all outside of the rollout to separate concerns

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to eventually implement this ticket right: https://linear.app/fireworks/issue/FIR-4431/implement-exception-handling-by-using-python-backoff-library

That means the native Python exception handling should happen outside of the rollout processor. Right now we don't have that information if you just catch all exceptions and use rollout_status to communicate the error happened

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should ideally remove the try/except altogether in the rollouts and have the outer evaluation_test handle everything

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, the intention here is because from yinghan's changes, the default_mcp_gym_rollout_processor.py is what handles the rollout_status, so i thought same should apply to single turn and agent. given that, should evaluation_test still handle?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both evaluation test and rollout processors should update rollout status but the error handling should all be in evaluation test

Copy link
Contributor Author

@xzrderek xzrderek Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue here is if we let the error handling be done in evaluation test, we lose per-row error tracking, and we will get a batch-level failure, which defeats the purpose of pipelining. in other words the entire rollout_processor fails, and there's no way to distinguish which row failed if we try to handle it in evaluation test.

Copy link
Collaborator

@dphuang2 dphuang2 Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if you return a list of asyncio tasks from the rollout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think that works for my case, but might mess with yinghan's case. lemme make the changes then ping him.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you were right!

@xzrderek xzrderek requested a review from dphuang2 August 14, 2025 09:03
@xzrderek xzrderek requested a review from mayinghan August 14, 2025 17:16
@xzrderek xzrderek merged commit 3318f22 into main Aug 14, 2025
1 check passed
@xzrderek xzrderek deleted the derekx/new-checkpointing branch August 14, 2025 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants