
Cluster: Fix non-leader transaction errors when leader shuts down cleanly #9891

Merged
merged 3 commits into canonical:master from tp-dqlite-close
Feb 10, 2022

Conversation

tomponline
Member

@tomponline tomponline commented Feb 10, 2022

There appears to be an issue in dqlite/go-dqlite that means a connection established via dqliteNetworkDial is not closed down cleanly when the remote member becomes unreachable. This sometimes (approx 50:50) leaves the connection in the close-wait state with data in the send queue for circa 15 minutes, until the OS releases the connection.

This causes DB transactions on the non-leader members to fail with errors like:

lxc cluster ls
Error: failed to begin transaction: call exec-sql (budget 9.999963394s): receive: header: EOF

This only seems to happen if the remote member is shut down cleanly: the remote end of the connection is closed cleanly and the OS is then waiting for dqlite/go-dqlite to close the local end (which it never does); instead it carries on using the connection, resulting in every write failing with an EOF error.

This PR works around the issue by limiting the amount of time an outbound dqlite/raft connection can have data waiting to be sent (similar to https://github.com/lxc/lxd/pull/9416 for inbound dqlite connections). This in turn ensures a stalled connection caused by a network interruption is closed locally after 30s, so it's not used by dqlite and doesn't cause queries to fail with an EOF error.

The error above still occurs, but is now limited to 30s rather than 15 minutes.
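For reference, the general technique here is to enable keep-alive and set TCP_USER_TIMEOUT on the outbound socket at dial time, so the kernel gives up on unacknowledged data after a bounded period. A minimal sketch of this (the helper name and values are illustrative, not the exact dqliteNetworkDial code):

```go
package main

import (
	"net"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

// dialWithUserTimeout dials a TCP connection with keep-alive probes enabled and
// a TCP_USER_TIMEOUT applied, so that data stuck in the send queue causes the
// kernel to drop the connection after 30s instead of circa 15 minutes.
// This is a sketch of the technique only, not the actual LXD code.
func dialWithUserTimeout(address string) (net.Conn, error) {
	dialer := &net.Dialer{
		Timeout:   10 * time.Second,
		KeepAlive: 3 * time.Second, // periodic keep-alive probes on idle connections
		Control: func(network, addr string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				// TCP_USER_TIMEOUT is expressed in milliseconds.
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_USER_TIMEOUT, 30000)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}

	return dialer.Dial("tcp", address)
}
```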

…n failure in dqliteNetworkDial

This limits the amount of time an outbound dqlite/raft connection can have data waiting to be sent in the event of a network interruption.
This in turn ensures a stalled connection is closed after 30s, so it's not reused by dqlite and doesn't cause queries to fail.

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@tomponline
Member Author

tomponline commented Feb 10, 2022

@MathieuBordere @freeekanayaka I would appreciate any insight you may have on this issue (this PR is useful anyway, like the equivalent change we added for inbound dqlite proxy connections, but it is ultimately a workaround for the specific issue). When/how should go-dqlite/dqlite detect that a remote connection is failing and close it locally?

…liteNetworkDial

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@MathieuBordere
Contributor

@MathieuBordere @freeekanayaka I would appreciate any insight you may have on this issue (this PR is useful anyway, like the equivalent change we added for inbound dqlite proxy connections, but it is ultimately a workaround for the specific issue). When/how should go-dqlite/dqlite detect that a remote connection is failing and close it locally?

Do you have a way of reproducing this behaviour? Then I can investigate what's going on in dqlite.

@tomponline
Member Author

tomponline commented Feb 10, 2022

Do you have a way of reproducing this behaviour? Then I can investigate what's going on in dqlite.

The reproducer is fairly easy:

  1. Launch 3 Ubuntu Focal VMs and install the lxd edge snap (sudo snap install lxd --edge).
  2. Set up a 3-node LXD cluster (sudo lxd init).
  3. Once the cluster is up and running, shut down the leader member cleanly (i.e. lxc stop and not lxc stop -f).
  4. On the third member (the one not promoted to leader), check sudo ss -tnp and you should see an outbound LXD 8443 connection to the shut-down member in close-wait state (the logging this PR adds has identified that it is an outbound dqlite connection and not a raft connection).
  5. Wait a few seconds and then try running lxc cluster ls on the non-leader member; it should start giving the error at that point.

This doesn't happen 100% of the time, but trying a couple of times should reproduce it, making sure you always test on a non-leader member.

This issue has appeared since we removed the continuous background job for maintaining the LXD event websocket connections (which was causing every member to query the cluster database for members every 1s); somehow, stopping those continuous queries has prevented go-dqlite from detecting/handling a remotely closed connection.

My hunch is that it's an issue in driverError in go-dqlite: somehow it is not handling/detecting the EOF error and therefore not returning the driver.ErrBadConn error that would presumably cause the connection to be taken out of use by dqlite.
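For context, database/sql only discards a pooled connection when the driver returns driver.ErrBadConn, so the mapping I'd expect to see (purely illustrative here, not the actual go-dqlite code) is along these lines:

```go
package main

import (
	"database/sql/driver"
	"errors"
	"io"
)

// driverError sketches the kind of error mapping being described: a remotely
// closed connection surfaces as io.EOF, and returning driver.ErrBadConn tells
// database/sql to drop the connection from its pool and retry on a fresh one.
// This is illustrative only, not the actual go-dqlite implementation.
func driverError(err error) error {
	if errors.Is(err, io.EOF) {
		return driver.ErrBadConn
	}

	return err
}
```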

@tomponline tomponline changed the title from "Cluster: Enable TCP user timeouts on outbound dqlite and raft connections" to "Cluster: Fix non-leader failed to begin transaction errors when leader shuts down cleanly" Feb 10, 2022
@tomponline tomponline changed the title from "Cluster: Fix non-leader failed to begin transaction errors when leader shuts down cleanly" to "Cluster: Fix non-leader transaction errors when leader shuts down cleanly" Feb 10, 2022
@freeekanayaka
Contributor

I think this fix makes sense, and actually I thought we had this already in place.

Essentially all network connections should be configured with the keep-alive and user-timeout settings. I believe it's already the case for go-dqlite/app, but maybe it was not properly done for LXD? It might be worth going through both go-dqlite/app and LXD to ensure that all places where a network connection is established also apply these settings.

@tomponline
Member Author

I think this fix makes sense, and actually I thought we had this already in place.

Essentially all network connections should be configured with the keep-alive and user-timeout settings. I believe it's already the case for go-dqlite/app, but maybe it was not properly done for LXD? It might be worth going through both go-dqlite/app and LXD to ensure that all places where a network connection is established also apply these settings.

Yeah, we added it for inbound dqlite connections (and outbound raft connections as a side effect) in https://github.com/lxc/lxd/pull/9416, and it has recently been added to the go-dqlite app in canonical/go-dqlite#179.

Also, @masnax, assuming this passes the tests, it would be a good idea to check that the go-dqlite app also sets this for outbound as well as inbound connections (if it doesn't already).

@tomponline
Member Author

tomponline commented Feb 10, 2022

I think this fix makes sense, and actually I thought we had this already in place.

Essentially all network connections should be configured with the keep-alive and user-timeout settings. I believe it's already the case for go-dqlite/app, but maybe it was not properly done for LXD? It might be worth going through both go-dqlite/app and LXD to ensure that all places where a network connection is established also apply these settings.

Yeah, it certainly seems to be a good idea to have the TCP user timeout here anyway for closing stalled connections.
What is odd in this case is that I am not forcefully killing the remote member VM; it is being shut down cleanly, so the connection on the non-leader is in close-wait because it received a close from the leader and is waiting for the local application to close the connection (and also remove it from the pool of connections), which doesn't seem to be happening despite writes to the connection causing an EOF error.

@lxc-jenkins

Testsuite passed

@stgraber
Contributor

@masnax can you make sure we have the timeouts set both ways in go-dqlite and in the cloud repo?

@stgraber stgraber merged commit 45af268 into canonical:master Feb 10, 2022
@tomponline tomponline deleted the tp-dqlite-close branch February 10, 2022 14:23
@freeekanayaka
Contributor

I think this fix makes sense, and actually I thought we had this already in place.
Essentially all network connections should be configured with the keep-alive and user-timeout settings. I believe it's already the case for go-dqlite/app, but maybe it was not properly done for LXD? It might be worth going through both go-dqlite/app and LXD to ensure that all places where a network connection is established also apply these settings.

Yeah, it certainly seems to be a good idea to have the TCP user timeout here anyway for closing stalled connections. What is odd in this case is that I am not forcefully killing the remote member VM; it is being shut down cleanly, so the connection on the non-leader is in close-wait because it received a close from the leader and is waiting for the local application to close the connection (and also remove it from the pool of connections), which doesn't seem to be happening despite writes to the connection causing an EOF error.

This sounds fishy indeed, and should ideally be investigated, although I understand it might be a low-priority "problem". Maybe it's still worth opening an issue to get to the bottom of this, and have it work reliably without the need for timeouts in the graceful shutdown case.

@tomponline
Member Author

This sounds fishy indeed, and should ideally be investigated, although I understand it might be a low-priority "problem". Maybe it's still worth opening an issue to get to the bottom of this, and have it work reliably without the need for timeouts in the graceful shutdown case.

I think @MathieuBordere is going to look into it.

@MathieuBordere
Contributor

MathieuBordere commented Feb 10, 2022

This sounds fishy indeed, and should ideally be investigated, although I understand it might be a low-priority "problem". Maybe it's still worth opening an issue to get to the bottom of this, and have it work reliably without the need for timeouts in the graceful shutdown case.

I think @MathieuBordere is going to look into it.

Yes, could you just create an issue in go-dqlite please?

@tomponline
Member Author

This sounds fishy indeed, and should ideally be investigated, although I understand it might be a low-priority "problem". Maybe it's still worth opening an issue to get to the bottom of this, and have it work reliably without the need for timeouts in the graceful shutdown case.

I think @MathieuBordere is going to look into it.

Yes, could you just create an issue in go-dqlite please?

canonical/go-dqlite#182
