driver: Detect EOF in driverError #186
Conversation
Excellent, thanks!
@MathieuBordere once this is merged, could you please create a new release version and tag for this so we can update LXD's dependencies ready for LXD 5.1?
With this change you would be able to re-merge the dqlite change that was reverted, right?
Yes indeed.
But maybe I should test for EOF at the end of the other error handling? I'm just a bit confused by line 775 in 5724311.
I think that it is fine as it is, as it will find EOF anywhere in the error chain (it calls Unwrap internally).
@MathieuBordere can you fix up the test golint issues please (looks like just comment annotation issues)? https://github.com/canonical/go-dqlite/pull/186/files
Looks good to me.
It somehow reinforces the feeling that in the future we should also work on proper error propagation vs closing connections, but I'm happy that it solves the problem for now.
Done, the
You have a good point. Yes, I do think it's a good idea to place this at the end of the chain, in the default statement, and yes you should be fine with the
So hopefully something like:

```go
default:
	// When using a TLS connection, the underlying error might get
	// wrapped by the stdlib itself with the new errors wrapping
	// conventions available since go 1.13. In that case we check
	// the underlying error with Unwrap() instead of Cause().
	if root, ok := err.(unwrappable); ok {
		err = root.Unwrap()
	}
	switch err.(type) {
	case *net.OpError:
		log(client.LogDebug, "network connection lost: %v", err)
		return driver.ErrBadConn
	}
	// Note: io.EOF is a sentinel error value, not a type, so it cannot
	// be matched in a type switch; use errors.Is (or ==) instead.
	if errors.Is(err, io.EOF) {
		log(client.LogDebug, "network connection lost: %v", err)
		return driver.ErrBadConn
	}
```

should work.
One thing that concerns me is that I believe the proxy (both the LXD one and the
I suppose if the LXD tests pass OK then it should be fine.
suite is running now, will report back.
There are also tests in this repo that check that the result set iterators return io.EOF at the end of the rows, so if they pass that will be encouraging. But moving the check to the
I will mark this as draft for now and investigate further before quickly patching up the issue uncovered by that dqlite PR; I don't want to add to the mess. I will handle the dqlite leadership loss in a different way than closing the connection.
Which way are you thinking?
It will close the SQLite leader connection to the database, eliminating the (see line 752 in 5724311)
Looking again at the description of #354, I feel there might be a bit of confusion about what Essentially,

In that case the client can't make any assumption about whether the commit was (or will be) actually successful, because there's no way to know if a quorum of nodes had received the entry.

This is basically the situation described in section 6.3 (Implementing linearizable semantics) of the Raft dissertation, and the solution indicated there would be to implement client sessions and request IDs in the FSM. With that in place, the client would retry the commit request as soon as it finds a new leader; using sessions and request IDs, the new leader would know whether it is a brand new commit request or actually a duplicated one (since the original one got committed despite the client having received a

Now, after having described what
^^^ at this point what I believe should happen is:
And then, proceeding in the problematic sequence of events:
^^^ at this point the assertion shouldn't be triggered since the server had rolled back that pending transaction. What I described should result in relatively small changes in the
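The section 6.3 scheme mentioned above could be sketched roughly as follows (purely illustrative; the `fsm` and `applyOnce` names are hypothetical and nothing like this exists in dqlite today): the state machine remembers the last request ID applied per client session, so a commit retried against a new leader is detected as a duplicate rather than applied twice.

```go
package main

import "fmt"

// fsm is a toy replicated state machine that deduplicates client
// requests using (session ID, request ID) pairs, as suggested in
// section 6.3 of the Raft dissertation.
type fsm struct {
	counter int               // the actual state: a simple counter
	applied map[string]uint64 // highest request ID applied per session
}

// applyOnce applies the command unless this (session, reqID) pair was
// already applied, in which case the retry is a no-op.
func (f *fsm) applyOnce(session string, reqID uint64) bool {
	if reqID <= f.applied[session] {
		return false // duplicate: the original request already went through
	}
	f.applied[session] = reqID
	f.counter++
	return true
}

func main() {
	f := &fsm{applied: map[string]uint64{}}
	fmt.Println(f.applyOnce("client-1", 1)) // true: first attempt
	fmt.Println(f.applyOnce("client-1", 1)) // false: retry after lost reply
	fmt.Println(f.counter)                  // 1: applied exactly once
}
```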
I understand your reasoning but
while I thought that closing the sqlite connection would be a foolproof way to get rid of all the state introduced by the start of the transaction.
Yeah, I was going to follow up since I just realized that. Indeed In this case, instead of using
Closing the connection altogether probably works too, but we'd end up in a bit of a mixed state in the gateway, which feels potentially fragile, or will always need some care (i.e. we have to know that there are cases where the leader object or SQLite object is not there because it was closed elsewhere). Having a single path for releasing resources usually simplifies things.
Yeah, I thought about this too, but there is a case where this can lead to inconsistent data, I think.
Okay, thanks for pointing this out. I actually had also another idea, which I felt was actually better but a bit more subtle. Your point makes me feel that perhaps it's actually the best approach.

Basically we would not do anything in the Instead we should turn the

The only two things that I'm not totally sure about are:
Point 1. is kind of minor, but something that would be nice to have.
One possible way to solve point 1. would maybe be to use some sort of flag that gets set in the
I don't really feel too comfortable with that; it feels a bit brittle as I don't fully understand it. I'd rather have SQLite itself do the cleanup for us for whatever state it introduced. I think a case can be made that a connection to a leader node that has at some point lost leadership is "busted", shouldn't be reused, and should be cleaned up. I would propose to close the SQLite leader connection and add extra checks in functions that now assume that
Ok, I'll open a PR for that.
I'll give this another go; if the LXD test suite is happy, I'll probably merge it.
Thanks! :)
Possibly related, possibly to be implemented in conjunction with this change: https://pkg.go.dev/database/sql/driver
I've been looking at the mysql Go driver a bit. What they do is implement SessionResetter. This triggers a stale connection check (https://github.com/go-sql-driver/mysql/blob/081308f66228fdc51224614d1cf414c918cc1596/packets.go#L106) on the first write on a connection taken from the connection pool, in which they detect EOF with https://github.com/go-sql-driver/mysql/blob/081308f66228fdc51224614d1cf414c918cc1596/conncheck.go#L23. The connection is discarded when e.g. EOF is detected. We could do something similar in our case. Still investigating the effect of the original change proposed in this PR.
Specifically, I will check whether the change (and existing code) violates the rules for ErrBadConn laid out in the driver docs.
edit: e.g. I think we violate this; see go-dqlite/internal/protocol/protocol.go lines 66 to 72 in 407917c
Please would you recap what exactly the issue is? Like, what conditions lead to it? A reproducer or a unit test would be great too, although I guess that might be tricky. I've tried to read the comments, but I'm still a bit lost.
The original issue has a reproducer: #182
It's not as bad as it was since I added TCP user timeouts in LXD, but I still see these sorts of errors for a while when cluster members go down.
Basically go-dqlite keeps offering up a closed connection to be used.
I'll try to come up with a reproducer in the unit tests; I'll understand it better myself too then. The crux of the problem is that as long as a connection is not marked with The
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Hi @tomponline, I'm having trouble running the LXD test suite on one of Stéphane's MAAS machines with this branch. It's probably me doing something wrong. Could you run the LXD test suite with the contents of this branch, please? It might be quicker.
Testing this now...
@MathieuBordere clean test run here using:
I'm okay to merge this if @cole-miller or @freeekanayaka don't have objections; we can always revert if something unexpected happens. This branch also ran fine with the Jepsen suite.
@@ -787,6 +787,12 @@

```go
type unwrappable interface {
	Unwrap() error
}

// TODO driver.ErrBadConn should not be returned when there's a possibility that
// the query has been executed. In our case there is a window in protocol.Call
```
It probably goes without saying, but just to be sure we are on the same page: this issue should be dealt with by implementing session support in the dqlite state machine, as per paragraph 6.3 of the Raft dissertation.
Until then, I'm fine converting io.EOF into driver.ErrBadConn if it turns out that it solves more problems than it creates (i.e. the chance of returning driver.ErrBadConn even though a query was executed is low, while the chance that converting io.EOF into driver.ErrBadConn solves actual real-world issues by retrying is high).
Still, I would have been happier if there was an end-to-end test for this change, showing that we at least precisely understand and can reproduce the exact sequence of events around the issue. With such a test in place it would be easier to avoid regressions later down the road, especially when we implement sessions as per 6.3.
I'll merge this tomorrow morning (CET) and make a new release then.
Fixes #182
Signed-off-by: Mathieu Borderé mathieu.bordere@canonical.com