Cluster: Fix non-leader transaction errors when leader shuts down cleanly #9891
Conversation
…n failure in dqliteNetworkDial This limits the amount of time an outbound dqlite/raft connection can have data waiting to be sent in the event of a network interruption. This in turn ensures a stalled connection is closed after 30s so it's not reused by dqlite (which would cause queries to fail). Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@MathieuBordere @freeekanayaka I would appreciate any insight you may have on this issue. This PR is useful anyway (matching what we added to the inbound connections for the dqlite proxy), but it is ultimately a workaround for the specific issue. When/how should go-dqlite/dqlite detect that a remote connection is failing and close it locally?
…liteNetworkDial Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
Do you have a way of reproducing this behaviour? Then I can investigate what's going on in dqlite.
The reproducer is fairly easy: form a cluster, cleanly shut down the leader member, then run queries on one of the remaining members.
This doesn't happen 100% of the time, but trying a couple of times should reproduce it; make sure you always test on a non-leader member. This issue has appeared since we removed the continuous background job for maintaining the LXD event websocket connections (which was causing every member to query the cluster database for members every 1s); somehow, stopping those continuous queries has prevented go-dqlite from detecting/handling a remotely closed connection. My hunch is that it's an issue in
I think this fix makes sense, and actually I thought we had this already in place. Essentially all network connections should be configured with the keep-alive and user-timeout settings. I believe it's already the case for
Yeah, we added it for inbound dqlite connections (and outbound raft connections as a side effect) in https://github.com/lxc/lxd/pull/9416, and it was recently added to the go-dqlite app in canonical/go-dqlite#179. @masnax, assuming this passes the tests, it would be a good idea to check that the go-dqlite app also sets this for outbound as well as inbound connections (if not already).
Yeah, it certainly seems to be a good idea to have the TCP user timeout here anyway for closing stalled connections.
Testsuite passed |
@masnax can you make sure we have the timeouts set both ways in go-dqlite and in the cloud repo? |
This sounds fishy indeed, and should be ideally investigated. Although I understand it might be a low-priority "problem". Maybe it's still worth opening an issue to get to the bottom of this, and have it reliably work without the need of timeouts for the graceful shutdown case. |
I think @MathieuBordere is going to look into it. |
Yes, could you just create an issue in go-dqlite please? |
There appears to be an issue in dqlite/go-dqlite that means a connection established via dqliteNetworkDial is not closed down cleanly when the remote member becomes unreachable. This sometimes (approx. 50:50) leaves the connection in the CLOSE-WAIT state with data in the send queue for circa 15 minutes until the OS releases the connection. This causes DB transactions on the non-leader members to fail with errors like:
This only seems to happen if the remote member is shut down cleanly, so that the remote end is closed cleanly and the OS is left waiting for dqlite/go-dqlite to close the local end of the connection, which it never does; instead it continues to use the connection, resulting in every write failing with an EOF error.
This PR works around the issue by limiting the amount of time an outbound dqlite/raft connection can have data waiting to be sent (similar to https://github.com/lxc/lxd/pull/9416 for inbound dqlite connections). This in turn ensures a stalled connection caused by a network interruption is closed locally after 30s, so it is not used by dqlite (which would cause queries to fail with an EOF error).
The error above still occurs, but is now limited to 30s rather than 15 minutes.