
feat: add job to cancel stalled retrievals #1233

Merged: 3 commits merged into main on Feb 28, 2023
Conversation

jacobheun (Contributor)

Summary

As part of debugging #1125 we determined that stalled retrievals were never being cleaned up by the go-fil-markets code. This was originally intended to allow retrievals to be resumed; however, no cap was ever put on the duration, so stalled retrievals required manual termination. This behavior doesn't scale.

The go-fil-markets retrieval code paths need refactoring as a separate effort, but in the interim this provides a short-term fix for canceling stalled retrievals.

This new job runs every 5 minutes and cancels all retrievals in the Retrieval Log that have not been updated in the past 30 minutes (configurable). 30 minutes should be a reasonable default window for clients to attempt to reconnect and resume the transfer if desired.
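For illustration, here is a minimal sketch of the shape of that job, not the actual boost code: the `retrievalLog` interface, the `ListActive`/`Cancel` methods, and the `cancelStalledRetrievals` function are assumptions made for this example.

```go
package rtvlcleanup

import (
	"context"
	"log"
	"time"
)

// retrievalRow is a minimal stand-in for a Retrieval Log entry.
type retrievalRow struct {
	DealID    string
	UpdatedAt time.Time
}

// retrievalLog is an assumed interface over the Retrieval Log DB; the real
// boost code queries its rtvllog database rather than this interface.
type retrievalLog interface {
	ListActive(ctx context.Context) ([]retrievalRow, error)
	Cancel(ctx context.Context, dealID string) error
}

// cancelStalledRetrievals wakes up every interval and cancels any active
// retrieval whose log entry has not been updated within stalledTimeout
// (the PR defaults are a 5 minute interval and a 30 minute timeout).
func cancelStalledRetrievals(ctx context.Context, db retrievalLog, interval, stalledTimeout time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cutoff := time.Now().Add(-stalledTimeout)
			rows, err := db.ListActive(ctx)
			if err != nil {
				log.Printf("listing active retrievals: %v", err)
				continue
			}
			for _, r := range rows {
				if r.UpdatedAt.Before(cutoff) {
					if err := db.Cancel(ctx, r.DealID); err != nil {
						log.Printf("canceling retrieval %s: %v", r.DealID, err)
					}
				}
			}
		}
	}
}
```

In the real job the interval and timeout would come from node config rather than being passed in directly, as noted above (the timeout is configurable).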

Caveats of this approach

I have noted in the config type docs that the RetrievalLog duration must exceed the stalled retrieval duration for this to function properly. The reason the RetrievalLog is used here for pruning is that it is significantly more efficient than querying the go-fil-markets state machines, as the latter loads everything into memory. Once we have refactored the legacy retrieval code, we can migrate to a more thorough check for stalled retrievals than the logs DB.
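To make that constraint concrete, here is a rough sketch of the kind of startup check it implies. The `retrievalConfig` struct, the `RetrievalLogDuration` field name, and `warnIfMisconfigured` are assumptions for this example; `StalledRetrievalTimeout` is the config name referenced later in this thread, and the actual config fields live in node/config/types.go.

```go
package rtvlcleanup

import (
	"log"
	"time"
)

// retrievalConfig ties together the two durations the caveat is about;
// the actual field names in node/config/types.go may differ.
type retrievalConfig struct {
	// How long entries are kept in the Retrieval Log before being pruned.
	RetrievalLogDuration time.Duration
	// How long a retrieval may go without an update before being canceled.
	StalledRetrievalTimeout time.Duration
}

// warnIfMisconfigured emits the kind of warning the follow-up commit adds:
// if log entries expire before the stalled timeout fires, stalled retrievals
// can never be detected from the logs DB.
func warnIfMisconfigured(cfg retrievalConfig) {
	if cfg.RetrievalLogDuration <= cfg.StalledRetrievalTimeout {
		log.Printf("warning: RetrievalLogDuration (%s) should exceed StalledRetrievalTimeout (%s); stalled retrievals may never be cleaned up",
			cfg.RetrievalLogDuration, cfg.StalledRetrievalTimeout)
	}
}
```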

Devnet Testing

I set up my local devnet to prune stalled retrievals older than 5 minutes, executed 100 retrievals, and abruptly terminated their connections from the client side. Once the 5-minute stall window had passed, the stalled deals were all canceled.

[Screenshot: 2023-02-28 at 2:12:54 PM]

[Screenshot: 2023-02-28 at 2:20:10 PM]

@jacobheun self-assigned this Feb 28, 2023
@dirkmc (Contributor) left a comment


Love the comprehensive description 🤩

I left a couple of small suggestions, feel free to merge when ready.

Review comments on node/config/types.go (outdated, resolved) and retrievalmarket/rtvllog/retrieval_log.go (resolved)
Commit: add warning if logs duration is less than stalled retrieval timeout
@jacobheun merged commit e31ceda into main Feb 28, 2023
@jacobheun deleted the fix/stalled-retrieval branch February 28, 2023 14:27
@benjaminh83

Just checked out latest master. It's reporting: boostd version 1.6.0-rc1+git.b0de635

I wonder if this version of boost isn't a bit too trigger-happy on the cancels? It seems like a lot of deals are uploading, but then just getting canceled.


@jacobheun (Contributor, Author)

@benjaminh83 thanks for sharing this. Did you update the config at all for the new StalledRetrievalTimeout?

By default this checks deals that have not been updated in 30 minutes. I don't think any of these cancels are actually coming from this update, as they're happening too soon and we're still seeing valid transfers. It would also help if the job set a message on the transfer when it times one out, so it's easier to see when that is actually happening. (If you haven't changed your config, I suspect the ResponderPaused retrievals will time out in 30 minutes if something else doesn't terminate or complete them before then.)

As part of diagnosing this we also discovered that Lassie, which is used by some of the larger retrieval systems (such as Autoretrieve), wasn't gracefully canceling retrievals it no longer needed, for example during parallel fetches. That was fixed and deployed yesterday, which explains some of what we're seeing here with the "client cancelled retrieval" messages and with requests for the same CID showing cancels followed by a successful retrieval.

I'm not clear on why we're not seeing "client cancelled retrieval" on more of these; I suspect it's a timing issue, but I'll dig into that.

@jacobheun (Contributor, Author)

FYI I created #1238 to make it easier to quickly see if the PeerIDs are different. I'd like to know how many of these are coming from different peers.

jacobheun added a commit that referenced this pull request Mar 1, 2023
* feat: add job to clean up stalled retrievals

* chore: run configs doc gen

* chore: refactor stalled retrieval name

add warning if logs duration is less than stalled retrieval timeout