
Cluster: Fix non-leader transaction errors when leader shuts down cleanly #9891

Merged
merged 3 commits into canonical:master from tp-dqlite-close
Feb 10, 2022

Conversation

tomponline
Member

@tomponline tomponline commented Feb 10, 2022

There appears to be an issue in dqlite/go-dqlite that means a connection established via dqliteNetworkDial is not closed down cleanly when the remote member becomes unreachable. This sometimes (approx 50:50) leaves the connection in the close-wait state with data in the send queue for circa 15 minutes, until the OS releases the connection.

This causes DB transactions on the non-leader members to fail with errors like:

lxc cluster ls
Error: failed to begin transaction: call exec-sql (budget 9.999963394s): receive: header: EOF

This only seems to happen if the remote member is shut down cleanly: the remote end of the connection is closed cleanly and the OS is then waiting for dqlite/go-dqlite to close the local end (which it never does); instead it carries on using the connection, resulting in every write failing with an EOF error.

This PR works around the issue by limiting the amount of time an outbound dqlite/raft connection can have data waiting to be sent (similar to https://github.com/lxc/lxd/pull/9416 for inbound dqlite connections). This in turn ensures a stalled connection caused by a network interruption is closed locally after 30s, so it's not used by dqlite and doesn't cause queries to fail with an EOF error.

The error above still occurs, but is now limited to 30s rather than 15 minutes.
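For reference, the general technique here is to enable keep-alive and set TCP_USER_TIMEOUT on the outbound socket at dial time, so the kernel gives up on unacknowledged data after a bounded period. A minimal sketch of this (the helper name and values are illustrative, not the exact dqliteNetworkDial code):

```go
package main

import (
	"net"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

// dialWithUserTimeout dials a TCP connection with keep-alive probes enabled and
// a TCP_USER_TIMEOUT applied, so that data stuck in the send queue causes the
// kernel to drop the connection after 30s instead of circa 15 minutes.
// This is a sketch of the technique only, not the actual LXD code.
func dialWithUserTimeout(address string) (net.Conn, error) {
	dialer := &net.Dialer{
		Timeout:   10 * time.Second,
		KeepAlive: 3 * time.Second, // periodic keep-alive probes on idle connections
		Control: func(network, addr string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				// TCP_USER_TIMEOUT is expressed in milliseconds.
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_USER_TIMEOUT, 30000)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}

	return dialer.Dial("tcp", address)
}
```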

…n failure in dqliteNetworkDial

This limits the amount of time an outbound dqlite/raft connection can have data waiting to be sent in the event of a network interruption.
This in turn ensures a stalled connection is closed after 30s, so it's not reused by dqlite and doesn't cause queries to fail.

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@tomponline
Member Author

tomponline commented Feb 10, 2022

@MathieuBordere @freeekanayaka I would appreciate any insight you may have on this issue (this PR is useful anyway, like the equivalent change we added for inbound dqlite proxy connections, but it is ultimately a workaround for the specific issue). When/how should go-dqlite/dqlite detect that a remote connection is failing and close it locally?

…liteNetworkDial

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@MathieuBordere
Contributor

@MathieuBordere @freeekanayaka I would appreciate any insight you may have on this issue (this PR is useful anyway, like the equivalent change we added for inbound dqlite proxy connections, but it is ultimately a workaround for the specific issue). When/how should go-dqlite/dqlite detect that a remote connection is failing and close it locally?

Do you have a way of reproducing this behaviour? Then I can investigate what's going on in dqlite.

@tomponline
Member Author

tomponline commented Feb 10, 2022

Do you have a way of reproducing this behaviour? Then I can investigate what's going on in dqlite.

The reproducer is fairly easy:

  1. Launch 3 Ubuntu Focal VMs and install the lxd edge snap (sudo snap install lxd --edge).
  2. Set up a 3-node LXD cluster (sudo lxd init).
  3. Once the cluster is up and running, shut down the leader member cleanly (i.e. lxc stop and not lxc stop -f).
  4. On the third member (the one not promoted to leader), check sudo ss -tnp and you should see an outbound LXD 8443 connection to the shut-down member in close-wait state (the logging this PR adds has identified that it is an outbound dqlite connection and not a raft connection).
  5. Wait a few seconds and then try running lxc cluster ls on the non-leader member; it should start giving the error at that point.

This doesn't happen 100% of the time, but trying a couple of times should reproduce it, making sure you always test on a non-leader member.

This issue has appeared since we removed the continuous background job for maintaining the LXD event websocket connections (which was causing every member to query the cluster database for members every 1s); somehow, stopping those continuous queries has prevented go-dqlite from detecting/handling a remotely closed connection.

My hunch is that it's an issue in driverError in go-dqlite: somehow it is not handling/detecting the EOF error and therefore not returning the driver.ErrBadConn error that would presumably cause the connection to be taken out of use by dqlite.
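For context, database/sql only discards a pooled connection when the driver returns driver.ErrBadConn, so the mapping I'd expect to see (purely illustrative here, not the actual go-dqlite code) is along these lines:

```go
package main

import (
	"database/sql/driver"
	"errors"
	"io"
)

// driverError sketches the kind of error mapping being described: a remotely
// closed connection surfaces as io.EOF, and returning driver.ErrBadConn tells
// database/sql to drop the connection from its pool and retry on a fresh one.
// This is illustrative only, not the actual go-dqlite implementation.
func driverError(err error) error {
	if errors.Is(err, io.EOF) {
		return driver.ErrBadConn
	}

	return err
}
```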

@tomponline tomponline changed the title from "Cluster: Enable TCP user timeouts on outbound dqlite and raft connections" to "Cluster: Fix non-leader failed to begin transaction errors when leader shuts down cleanly" Feb 10, 2022
@tomponline tomponline changed the title from "Cluster: Fix non-leader failed to begin transaction errors when leader shuts down cleanly" to "Cluster: Fix non-leader transaction errors when leader shuts down cleanly" Feb 10, 2022
@freeekanayaka
Contributor

I think this fix makes sense, and actually I thought we had this already in place.

Essentially all network connections should be configured with the keep-alive and user-timeout settings. I believe it's already the case for go-dqlite/app, but maybe it was not properly done for LXD? It might be worth going through both go-dqlite/app and LXD to ensure that all places where a network connection is established also apply these settings.

@tomponline
Member Author

I think this fix makes sense, and actually I thought we had this already in place.

Essentially all network connections should be configured with the keep-alive and user-timeout settings. I believe it's already the case for go-dqlite/app, but maybe it was not properly done for LXD? It might be worth going through both go-dqlite/app and LXD to ensure that all places where a network connection is established also apply these settings.

Yeah, we added it for inbound dqlite connections (and outbound raft connections as a side effect) in https://github.com/lxc/lxd/pull/9416, and it has recently been added to the go-dqlite app in canonical/go-dqlite#179.

Also, @masnax, assuming this passes the tests, it would be a good idea to check that the go-dqlite app also sets this for outbound as well as inbound connections (if it doesn't already).

@tomponline
Member Author

tomponline commented Feb 10, 2022

I think this fix makes sense, and actually I thought we had this already in place.

Essentially all network connections should be configured with the keep-alive and user-timeout settings. I believe it's already the case for go-dqlite/app, but maybe it was not properly done for LXD? It might be worth going through both go-dqlite/app and LXD to ensure that all places where a network connection is established also apply these settings.

Yeah, it certainly seems to be a good idea to have the TCP user timeout here anyway for closing stalled connections.
What is odd in this case is that I am not forcefully killing the remote member VM; it is being shut down cleanly, so the connection on the non-leader is in close-wait because it received a close from the leader and is waiting for the local application to close the connection (and also remove it from the pool of connections), which doesn't seem to be happening despite writes to the connection causing an EOF error.

@lxc-jenkins

Testsuite passed

@stgraber
Contributor

@masnax can you make sure we have the timeouts set both ways in go-dqlite and in the cloud repo?

@stgraber stgraber merged commit 45af268 into canonical:master Feb 10, 2022
@tomponline tomponline deleted the tp-dqlite-close branch February 10, 2022 14:23
@freeekanayaka
Contributor

I think this fix makes sense, and actually I thought we had this already in place.
Essentially all network connections should be configured with the keep-alive and user-timeout settings. I believe it's already the case for go-dqlite/app, but maybe it was not properly done for LXD? It might be worth going through both go-dqlite/app and LXD to ensure that all places where a network connection is established also apply these settings.

Yeah, it certainly seems to be a good idea to have the TCP user timeout here anyway for closing stalled connections. What is odd in this case is that I am not forcefully killing the remote member VM; it is being shut down cleanly, so the connection on the non-leader is in close-wait because it received a close from the leader and is waiting for the local application to close the connection (and also remove it from the pool of connections), which doesn't seem to be happening despite writes to the connection causing an EOF error.

This sounds fishy indeed, and should ideally be investigated, although I understand it might be a low-priority "problem". Maybe it's still worth opening an issue to get to the bottom of this, and have it work reliably without the need for timeouts in the graceful shutdown case.

@tomponline
Member Author

This sounds fishy indeed, and should ideally be investigated, although I understand it might be a low-priority "problem". Maybe it's still worth opening an issue to get to the bottom of this, and have it work reliably without the need for timeouts in the graceful shutdown case.

I think @MathieuBordere is going to look into it.

@MathieuBordere
Contributor

MathieuBordere commented Feb 10, 2022

This sounds fishy indeed, and should ideally be investigated, although I understand it might be a low-priority "problem". Maybe it's still worth opening an issue to get to the bottom of this, and have it work reliably without the need for timeouts in the graceful shutdown case.

I think @MathieuBordere is going to look into it.

Yes, could you just create an issue in go-dqlite please?

@tomponline
Member Author

This sounds fishy indeed, and should ideally be investigated, although I understand it might be a low-priority "problem". Maybe it's still worth opening an issue to get to the bottom of this, and have it work reliably without the need for timeouts in the graceful shutdown case.

I think @MathieuBordere is going to look into it.

Yes, could you just create an issue in go-dqlite please?

canonical/go-dqlite#182
