
Integration tests are sending requests to 0.0.0.0 #1051

Closed
Geod24 opened this issue Jul 22, 2020 · 22 comments · Fixed by #1085

@Geod24
Collaborator

Geod24 commented Jul 22, 2020

Just do a docker-compose up and see the logs:

node-3_1  | 2020-07-22 08:03:04,541 Trace [agora.network.NetworkClient] - Request 'getPublicKey' to http://0.0.0.0:3826 failed: Failed to connect to 0.0.0.0:3826: refused
node-3_1  | 2020-07-22 08:03:04,541 Trace [agora.network.NetworkClient] - Request 'getPublicKey' to http://0.0.0.0:2826 failed: Failed to connect to 0.0.0.0:2826: refused
node-3_1  | 2020-07-22 08:03:06,43 Trace [agora.network.NetworkClient] - Request 'getPublicKey' to http://0.0.0.0:4826 failed: Failed to connect to 0.0.0.0:4826: refused
node-3_1  | 2020-07-22 08:03:06,43 Trace [agora.network.NetworkClient] - Request 'getPublicKey' to http://0.0.0.0:3826 failed: Failed to connect to 0.0.0.0:3826: refused
node-3_1  | 2020-07-22 08:03:06,44 Trace [agora.network.NetworkClient] - Request 'getPublicKey' to http://0.0.0.0:2826 failed: Failed to connect to 0.0.0.0:2826: refused

I don't see why we would ever send a message to 0.0.0.0?

@Geod24 Geod24 added the type-bug and C. Network Communication labels on Jul 22, 2020
@Geod24 Geod24 added this to the 2. Validator milestone on Jul 22, 2020
@Geod24
Collaborator Author

Geod24 commented Jul 26, 2020

Here is a stack trace:

632(7): object.Exception@source/agora/network/NetworkManager.d(706): Wrong URL instantiated
----------------
??:? agora.api.Validator.API agora.network.NetworkManager.NetworkManager.getClient(immutable(char)[], core.time.Duration) [0x10d68ce81]
??:? void agora.network.NetworkManager.NetworkManager.ConnectionTask.connect() [0x10d68c765]
??:? void vibe.core.task.TaskFuncInfo.set!(void delegate()).set(ref void delegate()).callDelegate(ref vibe.core.task.TaskFuncInfo) [0x10d78461b]
??:? void vibe.core.task.TaskFuncInfo.call() [0x10dbec2f1]
??:? nothrow void vibe.core.task.TaskFiber.run() [0x10dbebb33]
??:? fiber_entryPoint [0x10ddef54b]

@Geod24
Collaborator Author

Geod24 commented Jul 27, 2020

Better trace:

core.exception.AssertError@source/agora/network/NetworkManager.d(506): Assertion failure
----------------
??:? _d_assert [0x103de5097]
??:? void agora.network.NetworkManager.NetworkManager.addAddress(immutable(char)[]) [0x10368cad7]
??:? int agora.network.NetworkManager.NetworkManager.addAddresses(agora.common.Set.Set!(immutable(char)[]).Set).__foreachbody2(immutable(char)[]) [0x10368cc3f]
??:? int agora.common.Set.Set!(immutable(char)[]).Set.opApply(scope int delegate(immutable(char)[])) [0x10368cbea]
??:? void agora.network.NetworkManager.NetworkManager.addAddresses(agora.common.Set.Set!(immutable(char)[]).Set) [0x10368bdcb]
??:? agora.network.NetworkManager.NetworkManager agora.network.NetworkManager.NetworkManager.__ctor(const(agora.common.Config.NodeConfig), const(agora.common.BanManager.BanManager.Config), const(immutable(char)[][]), const(immutable(char)[][]), agora.common.Metadata.Metadata, agora.common.Task.TaskManager) [0x103685037]
??:? agora.network.NetworkManager.NetworkManager agora.node.FullNode.FullNode.getNetworkManager(const(agora.common.Config.NodeConfig), const(agora.common.BanManager.BanManager.Config), const(immutable(char)[][]), const(immutable(char)[][]), agora.common.Metadata.Metadata, agora.common.Task.TaskManager) [0x10367b7a1]
??:? agora.node.FullNode.FullNode agora.node.FullNode.FullNode.__ctor(const(agora.common.Config.Config), void delegate() nothrow @trusted) [0x103678425]
??:? agora.node.Validator.Validator agora.node.Validator.Validator.__ctor(const(agora.common.Config.Config)) [0x103677edb]
??:? agora.node.FullNode.FullNode agora.node.Runner.runNode(agora.common.Config.Config) [0x103677b5a]
??:? agora.node.FullNode.FullNode agora.node.main.main(immutable(char)[][]).__lambda2() [0x1036778d8]
??:? void vibe.core.task.TaskFuncInfo.set!(agora.node.FullNode.FullNode delegate()).set(ref agora.node.FullNode.FullNode delegate()).callDelegate(ref vibe.core.task.TaskFuncInfo) [0x1037fc00b]
??:? void vibe.core.task.TaskFuncInfo.call() [0x103bea421]
??:? nothrow void vibe.core.task.TaskFiber.run() [0x103be9c63]
??:? fiber_entryPoint [0x103ded74b]

@Geod24
Collaborator Author

Geod24 commented Jul 27, 2020

Urgh, this is because of registerListener.
We have this function: https://github.com/bpfkorea/agora/blob/a108551600f3991726364cd90cb303a27cabc324/source/agora/network/NetworkManager.d#L819-L831
Notice that it's using NodeConfig.address, which is not the node's public address: it is the listen (bind) address, most of the time 0.0.0.0.
This getAddress function is used when a handshake is completed: https://github.com/bpfkorea/agora/blob/a108551600f3991726364cd90cb303a27cabc324/source/agora/network/NetworkManager.d#L403-L404

This means that we register 0.0.0.0 as a listener, which the node can never reach. Not to mention that registerListener is an obvious DoS vector, because one can register the address of an arbitrary server and make the network DoS that target. We need to rethink our strategy here.
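
To make the failure mode concrete, here is a hypothetical, simplified mirror of what that getAddress ends up doing (parameter names are invented; this is not the actual implementation):

import std.format : format;

// Hypothetical, simplified mirror of getAddress: the URL is built from the
// configured *bind* address, so a node listening on the wildcard address
// ends up advertising an endpoint that no peer can ever reach.
string getAddress(string bindAddress, ushort port)
{
    return format("http://%s:%s", bindAddress, port);
}

unittest
{
    // A node bound to the wildcard address advertises an unreachable URL
    assert(getAddress("0.0.0.0", 2826) == "http://0.0.0.0:2826");
}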

@AndrejMitrovic
Contributor

Copying my comments from Slack:

It was a workaround, not a permanent solution.

Actually it was only a workaround because the quorum sets were not known. We could remove it entirely now that we know which nodes listen to each other.

@AndrejMitrovic
Contributor

Actually, is there a way to get the connecting address in vibe.d? Then it could just become register() without the parameter.

@AndrejMitrovic
Contributor

Although for localrest it's a different story.

@Geod24
Collaborator Author

Geod24 commented Jul 27, 2020

Yeah I think we should move to "establish connection" instead of "register listener".

Although for localrest it's a different story.

Yeah it just works.

@AndrejMitrovic
Contributor

How does it "just work"? There's no way of getting the "address" of the connecting node.

@Geod24
Collaborator Author

Geod24 commented Jul 27, 2020

Command.sender

@AndrejMitrovic
Contributor

I mean we will still need some way of feeding that into Agora.

@bpalaggi bpalaggi added the Story-Points:13 label on Jul 28, 2020
@AndrejMitrovic
Contributor

This might actually be simpler than I originally thought. Shouldn't it be possible to:

  • Remove registerListener from the interface
  • Keep registerListener as a public method in FullNode
  • Register a path manually in vibe.d, something like:
router.route("/register_listener")
    .post((scope HTTPServerRequest req, scope HTTPServerResponse res)
    {
        import std.format;
        // clientAddress is the remote end of the incoming connection,
        // as seen by the server
        string addr = format("%s:%s", req.clientAddress.toAddressString(),
            req.clientAddress.port());
        node.registerListener(addr);
        res.statusCode = 200;
        res.writeVoidBody();
    });

This would be called before registerRestInterface. Would this work?
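
To make the ordering concrete, a rough, untested sketch of the wiring (assuming registerListener is kept as a public method on FullNode as proposed above, and reusing the route registration call from the snippet):

import vibe.http.router : URLRouter;
import vibe.http.server : HTTPServerRequest, HTTPServerResponse;
import vibe.web.rest : registerRestInterface;
import agora.node.FullNode : FullNode;

void setupRoutes(URLRouter router, FullNode node)
{
    // 1. The manual route, registered before the generated REST routes
    router.route("/register_listener")
        .post((scope HTTPServerRequest req, scope HTTPServerResponse res)
        {
            import std.format : format;
            node.registerListener(format("%s:%s",
                req.clientAddress.toAddressString(), req.clientAddress.port()));
            res.statusCode = 200;
            res.writeVoidBody();
        });

    // 2. The generated REST interface for the rest of the node's API
    registerRestInterface(router, node);
}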

@AndrejMitrovic
Contributor

But it was just an experiment; it's likely missing a lot of stuff.

@linked0
Contributor

linked0 commented Jul 29, 2020

I can also see the following logs when running docker-compose up.

node-1_1  | [main(s17F) INF] Listening for requests on http://0.0.0.0:3826/
node-2_1  | Task terminated with uncaught exception: Request failure to http://0.0.0.0:3826 after 5 attempts
node-0_1  | Task terminated with uncaught exception: Request failure to http://0.0.0.0:4826 after 5 attempts
node-0_1  | Task terminated with uncaught exception: Request failure to http://0.0.0.0:3826 after 5 attempts
node-3_1  | Task terminated with uncaught exception: Request failure to http://0.0.0.0:3826 after 5 attempts

The goal of this issue is to make sure the NetworkManager never holds invalid addresses such as 0.0.0.0.

I think @AndrejMitrovic's suggestion of using vibe.d would be a good start. @AndrejMitrovic, thank you for your help.
After writing the code as suggested, I will open a PR and keep looking for other solutions.

@AndrejMitrovic
Contributor

It would be good to get feedback from @Geod24 because he's used and worked on vibe.d for much longer than I have. He probably knows a better way to do this.

@Geod24
Collaborator Author

Geod24 commented Jul 29, 2020

My preference would be to have a connection established between nodes that is kept alive, and have all requests going through it. That would completely remove the need for registerListener, but it needs quite some work on the vibe.d side.

The first step is to establish a connection, keep it alive, and have it registered on both sides. That should be enough for a node to "establish" a listener, and whenever it gets something it needs to gossip, it can do so over the connection directly. No need to use the same connection for queries at the beginning.

The second step is to separate the validator and full node connections into two pools. We always need to communicate with (all) validators, but full nodes are common, and we might want a smarter strategy for talking to them (e.g. round robin). In order to do this separation, we need to add an optional handshake phase where nodes send their public key, if any. Remember that the two ends could be 2 validators, 2 full nodes, or 1 full node and 1 validator.

To take this further, we might want the ability to promote / demote a connection from one pool to the other, e.g. when a validator's enrollment starts or expires. We certainly need to remove expired validators from our set, or at least demote them.

The last step would be to change things so that we can use a connection with a RestInterfaceClient.

At the moment we can just get away with step 1 (for this very issue).
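
Purely as an illustration of steps 2 and 3, here is a hypothetical sketch of the two-pool bookkeeping (every type and name below is invented for the example; this is not Agora code):

struct PublicKey { ubyte[32] data; }   // stand-in for the real key type

/// A kept-alive connection to a peer
final class PeerConnection
{
    string address;      // remote address learned when the connection was made
    PublicKey key;       // filled in by the optional handshake (validators only)
    bool isValidator;    // outcome of the handshake
}

final class ConnectionPools
{
    private PeerConnection[string] validators; // we always talk to all of these
    private PeerConnection[string] fullNodes;  // e.g. round-robin over these

    /// After the handshake, file the connection in the right pool
    void register (PeerConnection conn)
    {
        if (conn.isValidator)
            validators[conn.address] = conn;
        else
            fullNodes[conn.address] = conn;
    }

    /// Demote a validator whose enrollment expired
    void demote (string address)
    {
        if (auto conn = address in validators)
        {
            (*conn).isValidator = false;
            fullNodes[address] = *conn;
            validators.remove(address);
        }
    }

    /// Promote a full node that just enrolled as a validator
    void promote (string address, PublicKey key)
    {
        if (auto conn = address in fullNodes)
        {
            (*conn).isValidator = true;
            (*conn).key = key;
            validators[address] = *conn;
            fullNodes.remove(address);
        }
    }
}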

@AndrejMitrovic
Contributor

My preference would be to have a connection established between nodes that is kept alive, and have all requests going through it.

I think that's what vibe.d already does under the hood? At least that's my impression from reading through RestInterfaceClient / requestHTTP / connectHTTP.

We probably need to change defaultKeepAliveTimeout to a sensible default (it's set to 10 seconds).
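
If tuning that value turns out to be enough, a minimal sketch (the two-minute figure is an arbitrary example, not a recommendation):

import core.time : minutes;
import vibe.http.client : HTTPClientSettings;

// Client settings with a longer HTTP keep-alive so that idle connections
// to peers are not torn down between requests (the default is 10 seconds).
HTTPClientSettings makePeerClientSettings()
{
    auto settings = new HTTPClientSettings;
    settings.defaultKeepAliveTimeout = 2.minutes;
    return settings;
}

These settings would then have to be passed explicitly wherever the client is constructed (for RestInterfaceClient, presumably via RestInterfaceSettings.httpClientSettings); that plumbing is an assumption worth double-checking.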

@AndrejMitrovic
Contributor

However, vibe.d keeps its own internal pool for all connections, and I'm not sure how easy it would be to override it with our own connection pools. There are lots of protected members in the client that appear to make it customizable, but it's anyone's guess how well that would work.

@Geod24
Collaborator Author

Geod24 commented Jul 29, 2020

I think that's what vibe.d already does under the hood?

Yes, but we have no way to tap into that. And we can't rely on it, because the pool has an eviction strategy. You don't want to lose a connection to a validator just because an attacker suddenly sends you a burst of connections / messages and that other, less active connection gets evicted.

@AndrejMitrovic
Contributor

Yes indeed.

@linked0
Contributor

linked0 commented Jul 29, 2020

From reading your comments, we need to manage connections in two types of pools, e.g. validator_conns and fullnode_conns. Before that, I guess we also need to get correct connection information, not addresses like 0.0.0.0. Which is the right way to get that information: overriding vibe.d's internal pool, or registering a path manually in vibe.d as @AndrejMitrovic suggested?

@linked0
Contributor

linked0 commented Aug 4, 2020

I found out that the defaultKeepAliveTimeout of HTTPClientSettings and the keepAliveTimeout of HTTPServerSettings do not use TCP keepalive; they only control HTTP keep-alive. Please refer to the vibe.d code below:

  • Calculating the m_keepAliveLimit from the m_keepAliveTimeout in HTTPClientSettings
  • Using the m_keepAliveLimit in HTTPClientSettings
  • Using the m_keepAliveTimeout in HTTPServerSettings

I think we should consider other approaches.

@Geod24
Collaborator Author

Geod24 commented Aug 4, 2020

Note: This is a pretty good resource: https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/index.html
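
If we do end up needing OS-level TCP keepalive (as opposed to vibe.d's HTTP keep-alive), it has to be enabled on the socket itself. A standalone Phobos sketch, with example timer values in the spirit of that HOWTO rather than a recommendation for Agora:

import std.socket : Socket;

// Enable SO_KEEPALIVE and, where the platform supports it, set the idle time
// before the first probe and the interval between probes (both in seconds).
void enableTcpKeepalive(Socket sock, int idleSecs = 60, int intervalSecs = 10)
{
    sock.setKeepAlive(idleSecs, intervalSecs);
}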

@Geod24 Geod24 linked a pull request on Aug 7, 2020 that will close this issue