[repro, don't merge]: peer comms going crazy on random fake network error #4250
Conversation
Force-pushed from 7794872 to e2838f0
Make client side state machines go brrrr:
* no busy looping/polling anything (good for phone batteries)
* no canceling anything
* parallel state transitions
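(For illustration only - a minimal sketch of the event-driven shape the commit message describes: every state machine awaits its own trigger concurrently instead of the executor polling on a timer. The `StateMachine` trait and `run_executor` below are hypothetical, not fedimint's actual executor API.)

```rust
use std::future::Future;
use std::pin::Pin;
use tokio::task::JoinSet;

/// Hypothetical state-machine shape; fedimint's real executor API differs.
/// `next_transition` returns a future that resolves only once a transition
/// is actually possible, so nothing polls or sleeps in a loop.
trait StateMachine: Send + 'static {
    fn next_transition(
        self: Box<Self>,
    ) -> Pin<Box<dyn Future<Output = Option<Box<dyn StateMachine>>> + Send>>;
}

/// Drive all active state machines concurrently: each awaits its own trigger,
/// and transitions from different machines proceed in parallel.
async fn run_executor(machines: Vec<Box<dyn StateMachine>>) {
    let mut tasks = JoinSet::new();
    for sm in machines {
        tasks.spawn(sm.next_transition());
    }
    while let Some(finished) = tasks.join_next().await {
        // A completed transition may hand back a follow-up state to drive.
        if let Ok(Some(next)) = finished {
            tasks.spawn(next.next_transition());
        }
    }
}
```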
Force-pushed from e2838f0 to 36231f7
@joschisan @elsirion Any thoughts? It seems to me like an alephbft issue, and I don't know where to take it from here.
Hmmm....
Ohhh.... I think I know... :D
Ignore this comment.
So I decreased the exponential_slowdown_offs somewhat, and that makes the check there overflow. I see. I guess we should submit a PR: the overflow should be checked, and we should also put some cap on that exponentiation.
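For reference, a minimal sketch of the kind of guard being suggested - checked exponentiation plus a hard cap. The names (`slowdown_delay`, `MAX_DELAY`, `offset`) are illustrative, not aleph-bft's actual config fields:

```rust
use std::time::Duration;

/// Illustrative hard cap on how far the exponential slowdown may grow.
const MAX_DELAY: Duration = Duration::from_secs(600);

/// Hypothetical delay schedule: rounds past `offset` slow down exponentially,
/// but the exponentiation is overflow-checked and the result clamped, instead
/// of wrapping (or panicking in debug builds) like a bare `pow` would.
fn slowdown_delay(base_delay: Duration, offset: usize, round: usize) -> Duration {
    let exponent = u32::try_from(round.saturating_sub(offset)).unwrap_or(u32::MAX);
    2u32.checked_pow(exponent)
        .and_then(|factor| base_delay.checked_mul(factor))
        .map(|delay| delay.min(MAX_DELAY))
        .unwrap_or(MAX_DELAY)
}
```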
I tried increasing the outgoing channel by a ton to make sure messages are not being dropped, to check if alephbft somehow doesn't like that. Didn't help. There's still a message that's dropped on a simulated disconnection, but then... a BFT protocol can't assume the absence of failures. Are we supposed to be running some kind of acknowledgement protocol and keeping messages in the buffer until the peer responds that it has seen them? Or is the fire-and-forget we do now OK? Just grasping at straws and double checking.
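To make the question concrete, here's a rough sketch of the acknowledgement scheme I mean (hypothetical types, not what fedimint or alephbft currently do): the sender keeps a copy of each message until the peer confirms it, so a message dropped on disconnection can be replayed instead of lost.

```rust
use std::collections::VecDeque;

/// Sketch of an ack-and-resend outbox. Outgoing messages get a sequence
/// number and stay buffered until the peer acknowledges them.
struct ReliableOutbox<M: Clone> {
    next_seq: u64,
    unacked: VecDeque<(u64, M)>,
}

impl<M: Clone> ReliableOutbox<M> {
    fn new() -> Self {
        Self { next_seq: 0, unacked: VecDeque::new() }
    }

    /// Assign a sequence number and keep a copy until it is acked.
    fn send(&mut self, msg: M) -> (u64, M) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.unacked.push_back((seq, msg.clone()));
        (seq, msg)
    }

    /// Peer confirmed everything up to and including `seq`; drop those copies.
    fn ack(&mut self, seq: u64) {
        while matches!(self.unacked.front(), Some((s, _)) if *s <= seq) {
            self.unacked.pop_front();
        }
    }

    /// Everything not yet acked, e.g. to replay after a reconnect.
    fn pending(&self) -> impl Iterator<Item = &(u64, M)> {
        self.unacked.iter()
    }
}
```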
... until fedimint#4250 is diagnosed and fixed.
It's possible that the alephbft behavior I'm seeing here is actually normal, and the root of the problems I was seeing was #4329.
Repro easily with:
(might need a few tries if you're lucky).
What you will see:
When things go OK, logs are relatively quiet.
When the randomized error is introduced, comms go crazy. I added IN and OUT logs, and it seems like communication between peers is working; there's just tons of traffic and things keep getting stuck. Notably, the error has a higher chance of being introduced the more traffic goes through the connection, possibly re-introducing the cause of the problems periodically.
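(Back-of-the-envelope for why more traffic makes it worse: with an independent per-write fault probability, a burst's chance of getting through shrinks exponentially with its size. The numbers below are illustrative, not the test's actual fault rate.)

```rust
/// With a per-write failure probability `p`, the chance that a burst of `n`
/// writes all get through is (1 - p)^n, so a retransmission storm that
/// multiplies traffic also multiplies its own odds of re-hitting the fault.
fn burst_survival_probability(p: f64, n: u32) -> f64 {
    (1.0 - p).powi(n as i32)
}

fn main() {
    // e.g. with a 1% per-write fault: 10 writes survive ~90% of the time,
    // 500 writes only ~0.7% of the time.
    for n in [10u32, 100, 500] {
        println!("n = {n:4}: {:.4}", burst_survival_probability(0.01, n));
    }
}
```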
Notably, without "refactor: improve client executor loop" the problem seems gone. Which is weird - it should be a strictly client-side change. It's unclear to me whether there's a bug there introducing the problem, or whether the more aggressive state machine execution is overloading consensus, or it's something else entirely.
Edit:
In c82edbf I made sure that after the initial failures the peers have time to recover communication without any further failures. But it doesn't help - the test fails like all the other times anyway.
My suspicion is that either something about the aleph item period/timeouts is wrong, or it has a bug.