-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add WithExternalConn to app.App #179
Conversation
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
4b01c4f
to
8fe4215
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, can you add some tests for the new option though? Something like creating a cluster with a couple of nodes and performing a write and a read on the db.
Thanks! Could you just add a 3rd node to the cluster in your test please? The reason is that the second node will remain in the cluster as a spare and the raft leader will not attempt to send it raft RPCs, thus not triggering So you should add a 3rd node & wait a bit for the roles to settle. |
Ah right, I forgot the spares don't do much :) |
A minor nit about the comment and we should be good. |
Fixed. |
app/proxy.go
Outdated
|
||
return tcpConn, nil | ||
} | ||
|
||
// Set TCP keepalive with 30 seconds idle time, 3 seconds retry interval with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment needs updating.
app/proxy.go
Outdated
@@ -117,6 +158,11 @@ func setKeepalive(conn *net.TCPConn) error { | |||
if err != nil { | |||
return | |||
} | |||
|
|||
err = syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, unix.TCP_USER_TIMEOUT, int(30*time.Second/time.Millisecond)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a comment here detailing some of the reason behind its presence (from the comment in LXD's SetTCPTimeouts
) would likely be beneficial for future readers, especially focussing on why the keepalive timeout itself is not sufficient.
app/app.go
Outdated
remote := <-o.Conn.acceptCh | ||
local, err := net.Dial("unix", nodeBindAddress) | ||
if err != nil { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are errors ignored here after accepting an inbound request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure to be honest, I plucked this out of LXD. In LXD we used to panic here in case of errors, but now we keep retrying.
This is trying to connect to dqlite's bind address so if that fails we probably can't connect to dqlite at all. I think it would be safe to handle the error here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering also what we do with the accepted connection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The remote accepted connection? We just copy any data back and forth between the accepted connection and dqlite's bind address. Did you mean something specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah what happens to it if an error occurs here abd we continue? Should it be closed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't see any problem with closing it.
@@ -39,3 +39,21 @@ func makeNodeDialFunc(config *tls.Config) client.DialFunc { | |||
|
|||
return dial | |||
} | |||
|
|||
func extDialFuncWithProxy(dialFunc client.DialFunc) client.DialFunc { | |||
return func(ctx context.Context, addr string) (net.Conn, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the function here accept a ctx, but doesn't pass it to the go proxy()
function call inside it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah thanks I missed that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, actually it appears the parent context is already done by the time the dialfunc is run, which seems odd. Looking to find out why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so the parent context starts from the connectWithDial
method in internal/bindings/server.go
, which is called from c
, and this is what executes the DialFunc. The context isn't cancelled until the DialFunc returns, but since we're running the proxy in a goroutine that we want to run as long as there's data between the connections, we don't want the context to be cancelled by the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK makes sense, so one is a connect timeout CTX and one is a cancelation context.
What concerns me is why in one call to proxy() a cancelation context is passed as expected, but here the background context is passed meaning it won't get cancelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I added a context to the Node
struct that starts/stops the dqlite server, which is then assigned to a variable in the c
bindings when the server starts, and is cancelled when the server stops.
This context becomes the parent context for the dial function, which in turn is only cancelled on a timeout or if the dial function returns any errors. If there aren't any errors but the server suddenly stops, the child context will also be cancelled.
app/app_test.go
Outdated
assert.Equal(t, externalAddr2, cluster[1].Address) | ||
assert.Equal(t, externalAddr3, cluster[2].Address) | ||
|
||
// Initially the node joins as spare. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is outdated.
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
In LXD, we use a 3 second keep alive timeout, and use the TCP_USER_TIMEOUT option. As we expect to use the app package in LXD at some point, I've moved those options here as well. Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
844a6c1
to
63d9e95
Compare
@masnax test seems to be getting stuck? |
Yeah I just reversed the commit. Didn't like my change to contexts :( |
This got merged without a fix for the uncancelable context? |
Gah, I missed that this is what got reverted, must have been checking an older set of commits against the new github action status. We should still wire up the ctx on the proxy to something other than a generic context.Background (but as mentioned, not pass the dial context through as that's not the right one for this). |
Yeah it did, at the very least it's not more broken as the uncancelable context was already there, I just changed the lines above it :) I dropped the last commit which added a life cycle context to the dqlite server as it was causing 1 test to hang, although I'm not exactly sure if the test was correct in the first place. @MathieuBordere Maybe you can help clarify, |
So there are 3 voters and 1 standby node, then a voter and the standby go offline. The app logic will in this case not try to reshuffle the roles, the 2 remaining online nodes are voters and still have quorum (2 out of 3). |
This takes a bunch of the logic from the dqlite proxies in lxd's
Gateway
and applies them toapp.App
via aWithExternalConn
option.WithExternalConn
takes aclient.DialFunc
and anacceptCh
which will be used for back and forth handling of the external connection.I modified the TLS keep alive options a little to look more like how we handle them in LXD here.