-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add job to cancel stalled retrievals #1233
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the comprehensive description 🤩
I left a couple of small suggestions, feel free to merge when ready.
add warning if logs duration is less than stalled retrieval timeout
@benjaminh83 thanks for sharing this, did you update the config at all for the new By default this checks deals that have not been updated in 30minutes. I don't think any of these cancels are actually coming from this update as they're happening too soon and we're still seeing valid transfers, but it would be helpful for us to try and set the message on the transfer that the job timed it out so it's easier to see when those are actually happening. (If you haven't changed your config, I suspect the As part of diagnosing this we also discovered that some Lassie which is being used be some of the larger retrieval systems (such as Autoretrieve) weren't gracefully canceling retrievals they no longer needed, such as during parallel fetches. That was fixed and deployed yesterday, which explains some of what we're seeing here with the "client cancelled retrieval", and requests for the same CID showing cancels followed by a successful retrieval. I'm not clear why we're not seeing "client cancelled retrieval" on more of these, I suspect it's a timing issue but will dig into that. |
FYI I created #1238 to make it easier to quickly see if the PeerIDs are different. I'd like to know how many of these are coming from different peers. |
* feat: add job to clean up stalled retrievals * chore: run configs doc gen * chore: refactor stalled retrieval name add warning if logs duration is less than stalled retrieval timeout
Summary
As part of debugging #1125 we determined that stalled retrievals were never being cleaned up by the go-fil-markets code. This was originally intended to allow resumption of retrieval, however, there was never a cap put on the duration and retrievals would require manual termination. This behavior doesn't scale.
There is refactoring that is needed for the go-fil-markets retrieval codepaths as a separate effort, but in the interim, this will provide a short term fix for canceling stalled retrieval.
This new job runs every 5 minutes and will cancel all retrievals in the Retrieval Log that have not been updated in the past 30 minutes (configurable). 30 minutes should be a reasonable default for clients to attempt to reconnect and resume transfer if desired.
Caveats of this approach
I have noted in the config type docs that the RetrievalLog duration must exceed the stalled retrieval duration in order for this to function properly. The reason the RetrievalLog is being used here for pruning is that it is significantly more efficient than querying the go-fil-markets statemachines, as the latter loads everything into memory. Once we have refactored the legacy retrieval code, we can migrate to using a more thorough check for stalled retrieval than the logs db.
Devnet Testing
I set up my local devnet to prune stalled retrievals older than 5 minutes and executed 100 retrievals and abruptly terminated their connection from the client. After the 5 minute stall window was passed, all deals were executed.