
feat: add job to cancel stalled retrievals #1233

Merged: 3 commits merged into main on Feb 28, 2023
Conversation

jacobheun (Contributor)

Summary

As part of debugging #1125 we determined that stalled retrievals were never being cleaned up by the go-fil-markets code. This was originally intended to allow retrievals to be resumed; however, no cap was ever put on the duration, so stalled retrievals required manual termination. This behavior doesn't scale.

The go-fil-markets retrieval code paths need refactoring as a separate effort, but in the interim this provides a short-term fix for canceling stalled retrievals.

This new job runs every 5 minutes and cancels all retrievals in the Retrieval Log that have not been updated in the past 30 minutes (configurable). 30 minutes should be a reasonable default window for clients to attempt to reconnect and resume the transfer if desired.
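For illustration, here is a minimal sketch of the shape of that job, not the actual boost code: the `retrievalLog` interface, the `ListActive`/`Cancel` methods, and the `cancelStalledRetrievals` function are assumptions made for this example.

```go
package rtvlcleanup

import (
	"context"
	"log"
	"time"
)

// retrievalRow is a minimal stand-in for a Retrieval Log entry.
type retrievalRow struct {
	DealID    string
	UpdatedAt time.Time
}

// retrievalLog is an assumed interface over the Retrieval Log DB; the real
// boost code queries its rtvllog database rather than this interface.
type retrievalLog interface {
	ListActive(ctx context.Context) ([]retrievalRow, error)
	Cancel(ctx context.Context, dealID string) error
}

// cancelStalledRetrievals wakes up every interval and cancels any active
// retrieval whose log entry has not been updated within stalledTimeout
// (the PR defaults are a 5 minute interval and a 30 minute timeout).
func cancelStalledRetrievals(ctx context.Context, db retrievalLog, interval, stalledTimeout time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cutoff := time.Now().Add(-stalledTimeout)
			rows, err := db.ListActive(ctx)
			if err != nil {
				log.Printf("listing active retrievals: %v", err)
				continue
			}
			for _, r := range rows {
				if r.UpdatedAt.Before(cutoff) {
					if err := db.Cancel(ctx, r.DealID); err != nil {
						log.Printf("canceling retrieval %s: %v", r.DealID, err)
					}
				}
			}
		}
	}
}
```

In the real job the interval and timeout would come from node config rather than being passed in directly, as noted above (the timeout is configurable).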

Caveats of this approach

I have noted in the config type docs that the RetrievalLog duration must exceed the stalled retrieval duration for this to function properly. The reason the RetrievalLog is used here for pruning is that it is significantly more efficient than querying the go-fil-markets state machines, as the latter loads everything into memory. Once we have refactored the legacy retrieval code, we can migrate to a more thorough check for stalled retrievals than the logs DB.
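To make that constraint concrete, here is a rough sketch of the kind of startup check it implies. The `retrievalConfig` struct, the `RetrievalLogDuration` field name, and `warnIfMisconfigured` are assumptions for this example; `StalledRetrievalTimeout` is the config name referenced later in this thread, and the actual config fields live in node/config/types.go.

```go
package rtvlcleanup

import (
	"log"
	"time"
)

// retrievalConfig ties together the two durations the caveat is about;
// the actual field names in node/config/types.go may differ.
type retrievalConfig struct {
	// How long entries are kept in the Retrieval Log before being pruned.
	RetrievalLogDuration time.Duration
	// How long a retrieval may go without an update before being canceled.
	StalledRetrievalTimeout time.Duration
}

// warnIfMisconfigured emits the kind of warning the follow-up commit adds:
// if log entries expire before the stalled timeout fires, stalled retrievals
// can never be detected from the logs DB.
func warnIfMisconfigured(cfg retrievalConfig) {
	if cfg.RetrievalLogDuration <= cfg.StalledRetrievalTimeout {
		log.Printf("warning: RetrievalLogDuration (%s) should exceed StalledRetrievalTimeout (%s); stalled retrievals may never be cleaned up",
			cfg.RetrievalLogDuration, cfg.StalledRetrievalTimeout)
	}
}
```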

Devnet Testing

I set up my local devnet to prune stalled retrievals older than 5 minutes, executed 100 retrievals, and abruptly terminated their connections from the client side. Once the 5-minute stall window had passed, the stalled deals were all canceled.

[Screenshot: 2023-02-28 at 2:12:54 PM]

[Screenshot: 2023-02-28 at 2:20:10 PM]

@jacobheun self-assigned this Feb 28, 2023
@dirkmc (Contributor) left a comment


Love the comprehensive description 🤩

I left a couple of small suggestions, feel free to merge when ready.

Review comments on node/config/types.go (outdated, resolved) and retrievalmarket/rtvllog/retrieval_log.go (resolved)
Commit: add warning if logs duration is less than stalled retrieval timeout
@jacobheun merged commit e31ceda into main Feb 28, 2023
@jacobheun deleted the fix/stalled-retrieval branch February 28, 2023 14:27
@benjaminh83

Just checked out latest master. It's reporting: boostd version 1.6.0-rc1+git.b0de635

I wonder if this version of boost isn't a bit too trigger-happy on the cancels? It seems like a lot of deals are uploading, but then just getting canceled.


@jacobheun (Contributor, Author)

@benjaminh83 thanks for sharing this. Did you update the config at all for the new StalledRetrievalTimeout?

By default this checks deals that have not been updated in 30 minutes. I don't think any of these cancels are actually coming from this update, as they're happening too soon and we're still seeing valid transfers. It would also help if the job set a message on the transfer when it times one out, so it's easier to see when that is actually happening. (If you haven't changed your config, I suspect the ResponderPaused retrievals will time out in 30 minutes if something else doesn't terminate or complete them before then.)

As part of diagnosing this we also discovered that Lassie, which is used by some of the larger retrieval systems (such as Autoretrieve), wasn't gracefully canceling retrievals it no longer needed, for example during parallel fetches. That was fixed and deployed yesterday, which explains some of what we're seeing here with the "client cancelled retrieval" messages and with requests for the same CID showing cancels followed by a successful retrieval.

I'm not clear on why we're not seeing "client cancelled retrieval" on more of these; I suspect it's a timing issue, but I'll dig into that.

@jacobheun (Contributor, Author)

FYI I created #1238 to make it easier to quickly see if the PeerIDs are different. I'd like to know how many of these are coming from different peers.

jacobheun added a commit that referenced this pull request Mar 1, 2023
* feat: add job to clean up stalled retrievals

* chore: run configs doc gen

* chore: refactor stalled retrieval name

add warning if logs duration is less than stalled retrieval timeout