Fix Binance snapshot race condition #673
Conversation
@nirvana-msu does this fix your issue? Seems reasonable to me, but I haven't tested it |
@jinusean no, it definitely does not fix it. It may make the exception go away, but the effect is even worse, as it would instead silently generate incorrect order book state. Actually, it doesn't even make the exception go away - but that's not important. The fundamental issue here is that, due to feed restarts, there may be multiple of these snapshot tasks running at the same time. TL;DR is that there is no way for this to work correctly when there is a possibility of more than one running concurrently. |
Ah understood. So a cache of the running tasks and a cancellation of those tasks should suffice? |
Something of this sort. You'll probably also need to await the cancelled tasks, to avoid messages from asyncio that an exception was never retrieved. So possibly something along the lines of:

```python
async def _reset(self):
    self._l2_book = {}
    self.last_update_id = {}
    if self.concurrent_http:
        for task in self._concurrent_snapshot_task_cache:
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass
    # buffer 'depthUpdate' book msgs until snapshot is fetched
    self._book_buffer: Dict[str, Deque[Tuple[dict, str, float]]] = {}
```

But even then, I think there'd still be a possibility that a previous stream (the one currently running, preceding the reset) just hasn't started its snapshot task yet, so it wouldn't be in the cache - i.e. the issue here is that we have two connections running / processing updates at the same time. Perhaps a better way to deal with it would be to ensure a connection, along with all its pending tasks, is cancelled at a higher level. |
maybe just use an asyncio mutex/semaphore so only one task can run for a symbol at a time? |
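For illustration, a minimal sketch of that mutex idea, assuming one lock per symbol (the class and names here - SnapshotGuard, _snapshot_locks, _fetch_snapshot - are hypothetical, not cryptofeed's actual API):

```python
import asyncio
from collections import defaultdict

class SnapshotGuard:
    def __init__(self) -> None:
        # one asyncio.Lock per symbol: at most one snapshot task can be
        # inside the critical section for a given symbol at any moment
        self._snapshot_locks = defaultdict(asyncio.Lock)

    async def concurrent_snapshot(self, pair: str) -> None:
        async with self._snapshot_locks[pair]:
            await self._fetch_snapshot(pair)

    async def _fetch_snapshot(self, pair: str) -> None:
        ...  # placeholder: REST snapshot fetch + buffered-update replay
```

As the next comment points out, though, a lock only serializes tasks; it cannot by itself guarantee that a task from an old connection never runs after a task from a newer one.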
But it's not just a matter of only having one run at a time. We have to ensure a task from an older connection is never run after a task from a newer connection - if it is, it can corrupt the order book state. We somehow need to ensure that an old connection is completely cancelled, with all corresponding tasks and pending coroutines, before we proceed with handling book state in a new connection. |
Perhaps this should just be removed then; it seems like it's adding a lot of unnecessary complexity. |
What should be removed? |
The parallel snapshot code. It's kind of unnecessary - the user can get the same behavior by creating a feed per symbol. |
So some way of doing this concurrently is definitely needed - it's a great boost to performance. I'm not sure starting a feed per symbol is feasible when e.g. subscribing to a large number of symbols.

Maybe the fix isn't so hard. I'm not sure if what I described above (the possibility of a task not yet being in the cache when we reset, but added later) is actually possible - I'll need to check what exactly happens when a connection is reset. Let me dig into the internals a bit to understand it better. |
Ok, so looking at the code, I don't think the more complicated scenario I was describing is possible. By the time the new connection is created (and state is being reset), there should be nothing else running from the old connection, except for those dangling snapshot tasks. So unless I'm mistaken, it should be sufficient to simply keep track of the running snapshot tasks and cancel them on reset. @jinusean would you take a go at it? |
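A hedged sketch of what "keep track of the running snapshot tasks and cancel them on reset" could look like (class and method names are my own, not the PR's actual code):

```python
import asyncio
from typing import Set

class SnapshotManager:
    def __init__(self) -> None:
        self._snapshot_tasks: Set[asyncio.Task] = set()

    def start_snapshot(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._snapshot_tasks.add(task)
        # drop the task from the cache once it finishes, however it finishes
        task.add_done_callback(self._snapshot_tasks.discard)
        return task

    async def reset(self) -> None:
        for task in list(self._snapshot_tasks):
            task.cancel()
            try:
                await task  # retrieve the CancelledError so asyncio won't warn
            except asyncio.CancelledError:
                pass
        self._snapshot_tasks.clear()
```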
@nirvana-msu I've added the recommended changes but I'm still getting a KeyError on the initial cancellation. Could you take a look? |
Seems like a tough concurrency bug - I was trying to think of good solutions but couldn't come up with any. It feels like the biggest problem here is how book_buffer is shared between coroutines, even though it's used as a local store for buffered messages. If each coroutine had its own book_buffer, I don't think having multiple ones running at the same time would really matter, since the sequence numbers would help us filter out duplicate updates. Of course, if you made book_buffer local to each coroutine, we would still need a way to append new book messages to them, and I don't know if there's a neat way to do that in Python.

I'll definitely check whether this bug exists in the refresh snapshot logic in #606 when I have time, but since those snapshots are refreshed at most once per minute I think we should be fine on that front. |
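To make the "local book_buffer per coroutine" idea concrete, one hedged sketch (my illustration, with hypothetical names): each snapshot coroutine owns a private deque and registers it so the message handler can append to it while the snapshot is in flight.

```python
import asyncio
from collections import deque
from typing import Dict

class BookSync:
    def __init__(self) -> None:
        # symbol -> buffer owned by the snapshot coroutine currently in flight
        self._buffers: Dict[str, deque] = {}

    def on_depth_update(self, pair: str, msg: dict, timestamp: float) -> None:
        buf = self._buffers.get(pair)
        if buf is not None:
            buf.append((msg, pair, timestamp))  # snapshot in flight: buffer it
        else:
            self._apply(pair, msg)              # normal path

    async def snapshot(self, pair: str) -> None:
        buf: deque = deque()
        self._buffers[pair] = buf  # expose our private buffer to the handler
        try:
            snap = await self._fetch_snapshot(pair)
            self._apply(pair, snap)
            while buf:  # replay updates buffered during the REST call;
                msg, _, _ = buf.popleft()  # sequence numbers filter stale ones
                self._apply(pair, msg)
        finally:
            # only deregister if the buffer is still ours - a newer coroutine
            # may already have replaced it after a reset
            if self._buffers.get(pair) is buf:
                del self._buffers[pair]

    async def _fetch_snapshot(self, pair: str) -> dict: ...
    def _apply(self, pair: str, msg) -> None: ...
```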
One thing I can't figure out is why the snapshot coroutine continues to run even after it's cancelled. On successive resets the KeyError is not raised. |
So I've done some debugging. I can explain why cancellation isn't working, and I've also found an even bigger flaw in the current concurrent_http logic. So here's exactly what's happening:

So let's continue with what happens (see cryptofeed/exchanges/binance.py, lines 267 to 276 at 8a899e2):

4) Recall that we are still within the task wrapping _concurrent_snapshot (that's important!). While doing the reset, we cancel the task and await the cancellation: https://github.com/jinusean/cryptofeed/blob/fix-binance-buffer/cryptofeed/exchanges/binance.py#L168-L171. We then suppress CancelledError. The confusing part here is that the error we've just caught is not the one we wanted to catch when awaiting the task result - instead, it is the CancelledError thrown inside the task itself. It's done that way to give a task the chance to clean up. So we're suppressing the cancellation error within the task, and hence the task is never actually cancelled. And because the coroutine is never cancelled while self._book_buffer gets reset, a KeyError is raised.
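This pitfall is easy to reproduce outside cryptofeed. In the toy example below (my own, not the project's code), the coroutine swallows the CancelledError that asyncio throws into it, so task.cancel() never actually terminates it and the task runs to completion:

```python
import asyncio

async def stubborn_worker():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        # Swallowing the error *inside* the task defeats the cancellation:
        # execution simply continues past the except block.
        print("cancellation suppressed, carrying on")
    return "finished normally"

async def main():
    task = asyncio.create_task(stubborn_worker())
    await asyncio.sleep(0)   # let the worker reach its sleep
    task.cancel()            # injects CancelledError into the sleep
    print(await task)        # prints "finished normally" - no CancelledError

asyncio.run(main())
```

A try/except suppressing CancelledError is only safe in the code that awaits the task from outside; placed inside the task, it silently defeats cancellation.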
So there are two takeaways: first, the CancelledError is being suppressed inside the task itself, so the task is never actually cancelled; second, the reset (and hence the cancellation) can be triggered from within the very task being cancelled.

```python
if std_pair in self._l2_book and std_pair not in self._book_buffer:
    return await self._handle_book_msg(msg, pair, timestamp)
```
NB: Once you fix the first issue, the code will start working - because when the task is cancelled from outside, it behaves properly. But the second issue should still not be ignored: if in the future the code changes such that task cancellation happens from the inside, the issue will reappear. |
I can suggest one way to tackle the second issue. We need to refactor the code to avoid ever having to cancel a task from within itself. It means we cannot simply cancel and await the snapshot tasks directly from the reset path, since the reset may be running inside one of those very tasks; instead, the cancellation needs to be scheduled so that it always runs outside the tasks being cancelled. |
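One hedged way to implement this suggestion (hypothetical names again, extending the SnapshotManager sketch above; not necessarily the fix that eventually landed): have the reset path merely schedule the cancellation as a fresh task on the event loop, so the cancel/await logic always runs outside every snapshot task, including the one that triggered the reset.

```python
import asyncio

class SafeResetSnapshotManager(SnapshotManager):
    def request_reset(self) -> None:
        # Safe to call from anywhere, even from inside a snapshot task:
        # the actual cancellation runs later, in its own task on the loop,
        # so no snapshot task ever has to cancel (and then await) itself.
        asyncio.get_running_loop().create_task(self.reset())
```

Because the reset now runs in its own task, the snapshot task that triggered it is cancelled from outside, which - as noted above - behaves properly.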
@jinusean - you'll need to rebase with the latest changes; then I think the suggestion from @nirvana-msu can be implemented and we'll be good to go here. |
@jinusean - I reverted most of the changes for the concurrent snapshot, so once you have it fixed in your codebase you'll need to re-integrate based off of what I have (and reapply the changes I removed). There were too many bugs in the binance codebase, so I had to remove that large set of changes so I could push out a release while waiting for these issues to be resolved. |
you can see what I reverted here: b90fa9b |
It's been over 3 weeks with no response, so I'm going to consider this matter closed. If you wish to revisit the parallel snapshots, please open a new PR. |
Fixes #671