Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add WithExternalConn to app.App #179

Merged
merged 5 commits into from
Feb 5, 2022
Merged

Conversation

masnax
Copy link
Contributor

@masnax masnax commented Feb 1, 2022

This takes a bunch of the logic from the dqlite proxies in lxd's Gateway and applies them to app.App via a WithExternalConn option.

WithExternalConn takes a client.DialFunc and an acceptCh which will be used for back and forth handling of the external connection.

I modified the TLS keep alive options a little to look more like how we handle them in LXD here.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
@masnax masnax force-pushed the ext-conn branch 2 times, most recently from 4b01c4f to 8fe4215 Compare February 2, 2022 21:32
@masnax masnax marked this pull request as ready for review February 3, 2022 03:40
Copy link
Contributor

@MathieuBordere MathieuBordere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, can you add some tests for the new option though? Something like creating a cluster with a couple of nodes and performing a write and a read on the db.

@MathieuBordere
Copy link
Contributor

MathieuBordere commented Feb 4, 2022

LGTM, can you add some tests for the new option though? Something like creating a cluster with a couple of nodes and performing a write and a read on the db.

Thanks! Could you just add a 3rd node to the cluster in your test please? The reason is that the second node will remain in the cluster as a spare and the raft leader will not attempt to send it raft RPCs, thus not triggering extDialFuncWithProxy that is used by the raft nodes to talk to each other. Adding a 3rd node will trigger the non-leader nodes to be converted to voters which will cause extDialFuncWithProxy to be hit because the leader will start sending heartbeats to the other nodes.

So you should add a 3rd node & wait a bit for the roles to settle.

@masnax
Copy link
Contributor Author

masnax commented Feb 4, 2022

LGTM, can you add some tests for the new option though? Something like creating a cluster with a couple of nodes and performing a write and a read on the db.

Thanks! Could you just add a 3rd node to the cluster in your test please? The reason is that the second node will remain in the cluster as a spare and the raft leader will not attempt to send it raft RPCs, thus not triggering extDialFuncWithProxy that is used by the raft nodes to talk to each other. Adding a 3rd node will trigger the non-leader nodes to be converted to voters which will cause extDialFuncWithProxy to be hit because the leader will start sending heartbeats to the other nodes.

So you should add a 3rd node & wait a bit for the roles to settle.

Ah right, I forgot the spares don't do much :)
Added a third node and now they all come up as voters.

@MathieuBordere
Copy link
Contributor

LGTM, can you add some tests for the new option though? Something like creating a cluster with a couple of nodes and performing a write and a read on the db.

Thanks! Could you just add a 3rd node to the cluster in your test please? The reason is that the second node will remain in the cluster as a spare and the raft leader will not attempt to send it raft RPCs, thus not triggering extDialFuncWithProxy that is used by the raft nodes to talk to each other. Adding a 3rd node will trigger the non-leader nodes to be converted to voters which will cause extDialFuncWithProxy to be hit because the leader will start sending heartbeats to the other nodes.
So you should add a 3rd node & wait a bit for the roles to settle.

Ah right, I forgot the spares don't do much :) Added a third node and now they all come up as voters.

A minor nit about the comment and we should be good.

@masnax
Copy link
Contributor Author

masnax commented Feb 4, 2022

A minor nit about the comment and we should be good.

Fixed.

app/proxy.go Outdated

return tcpConn, nil
}

// Set TCP keepalive with 30 seconds idle time, 3 seconds retry interval with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment needs updating.

app/proxy.go Outdated
@@ -117,6 +158,11 @@ func setKeepalive(conn *net.TCPConn) error {
if err != nil {
return
}

err = syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, unix.TCP_USER_TIMEOUT, int(30*time.Second/time.Millisecond))
Copy link
Member

@tomponline tomponline Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a comment here detailing some of the reason behind its presence (from the comment in LXD's SetTCPTimeouts) would likely be beneficial for future readers, especially focussing on why the keepalive timeout itself is not sufficient.

app/app.go Outdated
remote := <-o.Conn.acceptCh
local, err := net.Dial("unix", nodeBindAddress)
if err != nil {
continue
Copy link
Member

@tomponline tomponline Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are errors ignored here after accepting an inbound request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure to be honest, I plucked this out of LXD. In LXD we used to panic here in case of errors, but now we keep retrying.

This is trying to connect to dqlite's bind address so if that fails we probably can't connect to dqlite at all. I think it would be safe to handle the error here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering also what we do with the accepted connection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remote accepted connection? We just copy any data back and forth between the accepted connection and dqlite's bind address. Did you mean something specific?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah what happens to it if an error occurs here abd we continue? Should it be closed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't see any problem with closing it.

@@ -39,3 +39,21 @@ func makeNodeDialFunc(config *tls.Config) client.DialFunc {

return dial
}

func extDialFuncWithProxy(dialFunc client.DialFunc) client.DialFunc {
return func(ctx context.Context, addr string) (net.Conn, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the function here accept a ctx, but doesn't pass it to the go proxy() function call inside it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah thanks I missed that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually it appears the parent context is already done by the time the dialfunc is run, which seems odd. Looking to find out why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so the parent context starts from the connectWithDial method in internal/bindings/server.go, which is called from c, and this is what executes the DialFunc. The context isn't cancelled until the DialFunc returns, but since we're running the proxy in a goroutine that we want to run as long as there's data between the connections, we don't want the context to be cancelled by the caller.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK makes sense, so one is a connect timeout CTX and one is a cancelation context.

What concerns me is why in one call to proxy() a cancelation context is passed as expected, but here the background context is passed meaning it won't get cancelled.

Copy link
Contributor Author

@masnax masnax Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I added a context to the Node struct that starts/stops the dqlite server, which is then assigned to a variable in the c bindings when the server starts, and is cancelled when the server stops.

This context becomes the parent context for the dial function, which in turn is only cancelled on a timeout or if the dial function returns any errors. If there aren't any errors but the server suddenly stops, the child context will also be cancelled.

app/app_test.go Outdated
assert.Equal(t, externalAddr2, cluster[1].Address)
assert.Equal(t, externalAddr3, cluster[2].Address)

// Initially the node joins as spare.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is outdated.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
In LXD, we use a 3 second keep alive timeout, and use the
TCP_USER_TIMEOUT option. As we expect to use the app package in LXD at
some point, I've moved those options here as well.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
@masnax masnax force-pushed the ext-conn branch 2 times, most recently from 844a6c1 to 63d9e95 Compare February 4, 2022 21:44
@stgraber
Copy link
Contributor

stgraber commented Feb 4, 2022

@masnax test seems to be getting stuck?

@masnax
Copy link
Contributor Author

masnax commented Feb 4, 2022

@masnax test seems to be getting stuck?

Yeah I just reversed the commit. Didn't like my change to contexts :(

@stgraber stgraber merged commit db8f7c0 into canonical:master Feb 5, 2022
@tomponline
Copy link
Member

@masnax test seems to be getting stuck?

Yeah I just reversed the commit. Didn't like my change to contexts :(

This got merged without a fix for the uncancelable context?

@stgraber
Copy link
Contributor

stgraber commented Feb 5, 2022

Gah, I missed that this is what got reverted, must have been checking an older set of commits against the new github action status. We should still wire up the ctx on the proxy to something other than a generic context.Background (but as mentioned, not pass the dial context through as that's not the right one for this).

@masnax
Copy link
Contributor Author

masnax commented Feb 5, 2022

@masnax test seems to be getting stuck?

Yeah I just reversed the commit. Didn't like my change to contexts :(

This got merged without a fix for the uncancelable context?

Yeah it did, at the very least it's not more broken as the uncancelable context was already there, I just changed the lines above it :)

I dropped the last commit which added a life cycle context to the dqlite server as it was causing 1 test to hang, although I'm not exactly sure if the test was correct in the first place.

@MathieuBordere Maybe you can help clarify, TestRolesAdjustment_CantReplaceVoter in app/app_test.go seems to drop a 3-voter (and 1 stand-by) cluster to a 2-voter cluster, but I thought this should result in loss of quorum? I'm wondering if this test was only passing before because of the uncancelable context. Adding the voter back to the cluster will once again establish quorum and the test will pass.

@MathieuBordere
Copy link
Contributor

@masnax test seems to be getting stuck?

Yeah I just reversed the commit. Didn't like my change to contexts :(

This got merged without a fix for the uncancelable context?

Yeah it did, at the very least it's not more broken as the uncancelable context was already there, I just changed the lines above it :)

I dropped the last commit which added a life cycle context to the dqlite server as it was causing 1 test to hang, although I'm not exactly sure if the test was correct in the first place.

@MathieuBordere Maybe you can help clarify, TestRolesAdjustment_CantReplaceVoter in app/app_test.go seems to drop a 3-voter (and 1 stand-by) cluster to a 2-voter cluster, but I thought this should result in loss of quorum? I'm wondering if this test was only passing before because of the uncancelable context. Adding the voter back to the cluster will once again establish quorum and the test will pass.

So there are 3 voters and 1 standby node, then a voter and the standby go offline. The app logic will in this case not try to reshuffle the roles, the 2 remaining online nodes are voters and still have quorum (2 out of 3).
There's really not much else we can or should do, the standby was there to be promoted to voter in case a voter went offline, but in this test a voter and standby went offline at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants