Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework NetworkManager using tasks #889

Merged
merged 4 commits into from
May 31, 2020
Merged

Conversation

AndrejMitrovic
Copy link
Contributor

Works for me locally, let's see what the CI says.

There might be some missing scope / @safe and so on, I'll add those if neccessary.

Fixes #587

@AndrejMitrovic AndrejMitrovic added the type-enhancement An improvement of existing functionalities label May 29, 2020
@AndrejMitrovic AndrejMitrovic changed the title Rework NetworkManager Rework NetworkManager using tasks May 29, 2020
@codecov
Copy link

codecov bot commented May 29, 2020

Codecov Report

Merging #889 into v0.x.x will increase coverage by 0.06%.
The diff coverage is 88.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##           v0.x.x     #889      +/-   ##
==========================================
+ Coverage   90.34%   90.41%   +0.06%     
==========================================
  Files          70       70              
  Lines        5643     5652       +9     
==========================================
+ Hits         5098     5110      +12     
+ Misses        545      542       -3     
Flag Coverage Δ
#integration 54.77% <85.57%> (-0.14%) ⬇️
#unittests 88.76% <87.50%> (+0.08%) ⬆️
Impacted Files Coverage Δ
source/agora/network/NetworkManager.d 78.28% <86.81%> (+2.84%) ⬆️
source/agora/network/NetworkClient.d 88.88% <100.00%> (ø)
source/agora/node/FullNode.d 84.84% <100.00%> (+0.47%) ⬆️
source/agora/node/Validator.d 75.00% <100.00%> (+2.58%) ⬆️
source/agora/test/Base.d 87.75% <0.00%> (-0.41%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 09bcfe5...f7a0fc5. Read the comment docs.

@AndrejMitrovic
Copy link
Contributor Author

Oh and taskManager.schedule will be very useful to have once we have that. It should make it more efficient. But it's not a strict requirement.

Copy link
Collaborator

@Geod24 Geod24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a big diff

Comment on lines 113 to 115
/// The maximum number of connection tasks to use during network discovery
public size_t connection_tasks = 10;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit eerie to expose this in a config file that might be touched by not-so-technical ppl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just make it a constant then? I'm not sure which number to use. But I guess we can start with a low number like 10 and then increase it as we see fit?

Comment on lines 176 to 177
scope (success)
this.taskman.wait(5.seconds);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the scope (success) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I don't want to suspend the fiber if an exception is thrown. I want it to kill the fiber instead.

And now that I said that out loud, I realize I should probably make discover nothrow.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buuuuuuuttttttttt... Why not simply:

while (true)
{
    this.network.discover();
    this.taskman.wait(5.seconds);
}

The wait is in a scope (success) in a loop with a single statement. Either the statement throws and the code is never executed, or the statement doesn't throw and the code is executed. What am I missing ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh right, my bad. It was copied over from the other example usage where I have a continue. I'll fix it.

source/agora/node/Validator.d Show resolved Hide resolved
source/agora/node/Validator.d Show resolved Hide resolved
@@ -57,31 +57,288 @@ mixin AddLogger!();
/// Ditto
public class NetworkManager
{
/// Node information
private static struct Node
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be something less ambiguous, like NodeInfo ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had that exact same thought. But then I realized we already have that struct, returned in getNodeInfo. I need to think of another name..

key = client.getPublicKey();
break;
}
catch (HTTPStatusException ex) // vibe.d (non-unittests)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in LocalRest ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it returns PublicKey.init. This was implemented this way as a workaround when splitting up full node vs validator node, Henry's PR.


***********************************************************************/

public this (void delegate (Set!Address addresses) onNewAddresses )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public this (void delegate (Set!Address addresses) onNewAddresses )
public this (void delegate (Set!Address addresses) onNewAddresses)

Comment on lines +292 to +273
scope (success)
this.outer.taskman.wait(1.seconds);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto why scope success ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this specific case it's because of the continue below. I didn't want to duplicate the wait call.

@AndrejMitrovic
Copy link
Contributor Author

AndrejMitrovic commented May 29, 2020

  • Fixed styling
  • Added comment about the special-case in LocalRest
  • scope (success) was used because if the Fiber shuts down, the sleep call is unnecessary
  • Renamed Node => NodeConnInfo (didn't call it NodeInfo to avoid clashes with the existing struct).
  • Removed the config option.
  • Sorry for the big diff.

Copy link
Collaborator

@Geod24 Geod24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Definitely the right directions. I'm not sure I can give this the review it deserves in the time remaining in the sprint, but I really, really want to get this through the door.

Typo in the commit message: "does not [missing 'need' I guess] to be re-triggered"

source/agora/network/NetworkManager.d Outdated Show resolved Hide resolved
source/agora/network/NetworkManager.d Show resolved Hide resolved
Comment on lines 176 to 177
scope (success)
this.taskman.wait(5.seconds);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buuuuuuuttttttttt... Why not simply:

while (true)
{
    this.network.discover();
    this.taskman.wait(5.seconds);
}

The wait is in a scope (success) in a loop with a single statement. Either the statement throws and the code is never executed, or the statement doesn't throw and the code is executed. What am I missing ?

Comment on lines 178 to 182
if (this.outer.banman.isBanned(this.address))
{
this.onFailedConnection(this.address);
return;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inner developer suggests me that we might want to make onFailedConnection returns a boolean telling whether we should retry or abort the request, and that we could move the isBanned check in there.
I don't know how sensible this would be, and I'm not asking you to do it, just raising possibilities.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is a good idea.

@AndrejMitrovic
Copy link
Contributor Author

Nice. Definitely the right directions. I'm not sure I can give this the review it deserves in the time remaining in the sprint, but I really, really want to get this through the door.

If there's anything missing (I'm sure there is), we can continue work on it. Filing another issue with the rest of the problems that have to be fixed.

@AndrejMitrovic AndrejMitrovic force-pushed the new-netman branch 2 times, most recently from e3e2eda to d0ade1d Compare May 29, 2020 09:07
@AndrejMitrovic
Copy link
Contributor Author

There are random failures due to #887 and also the occasional failure by dub not finding a file. But otherwise it's ready to go.

@AndrejMitrovic
Copy link
Contributor Author

Fixed wrong docs.

@Geod24
Copy link
Collaborator

Geod24 commented May 31, 2020

Full error: core.exception.AssertError@source/agora/consensus/protocol/Nominator.d(372): Assertion failure

This triggers quite a bit...

It was the only API endpoint wrapper which stored
the key to an internal public field, but this is
unnecessary as the cached key is only used in
one scope.
Grouping them together with the other sets.
Using the ban manager with an infinite
ban time is awkward, and further complicates
the situation where the node's IP changes.

The Ban data is persistent and this is problematic
if the Node's IP changes but we still have an
IP in the ban list that is no longer used.
The new NetworkManager now uses separate connection
and address discovery tasks.

The 'connection_tasks' config option has been added
to fine-tune the number of connection tasks a Node
should use.

The discover() method can now be called multiple times,
which is required to be able to support quorum reshuffling.

The address discovery is now an endless task which does
not need to be re-triggered, it runs in the background and
queries each client in turn every 5 seconds.

The periodic catchup routine no longer waits for full
node discovery before attempting to read blocks.
It is unnecessary to have to wait for minPeersConnected()
to be true, catchup can proceed as soon as we have
at least one node connection.

Fixes bosagora#587
@Geod24
Copy link
Collaborator

Geod24 commented May 31, 2020

Rebased on v0.x.x

@Geod24 Geod24 merged commit f8607d0 into bosagora:v0.x.x May 31, 2020
@AndrejMitrovic AndrejMitrovic deleted the new-netman branch May 31, 2020 23:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-enhancement An improvement of existing functionalities
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Network manager needs to be reworked
2 participants