Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing blocks due to high latency of maintenance block #504

Open
abitmore opened this issue Nov 24, 2017 · 9 comments
Open

Missing blocks due to high latency of maintenance block #504

abitmore opened this issue Nov 24, 2017 · 9 comments

Comments

@abitmore
Copy link
Member

According to the design, there are much more work to be done at maintenance interval, so it's expected that latency of those maintenance blocks will be higher. It's exactly what happening in the production network, logs attached below. At every maintenance interval, 3 slots will be skipped, that means we have 12 seconds for a witness to generate a maintenance block and broadcast it to the next witness and the next witness should be able to validate it before generating the next block.

However, if the next witness didn't receive the maintenance block in 3 seconds, according to current implementation of witness plugin, she will consider that block as lost, so she will generate another maintenance block. This behavior is debatable.

  • if the block is really lost, it's good for the whole network if she generate another block, so the next witness will have 9 seconds (or 12?) to validate it and generate the next block. The total block interval will be shortest.
  • if the block really need 3 seconds to generate, her new block will likely reach the next witness later than the original one, so her block will be ignored, she will miss her block.
  • a worse scenario is that active witness list and/or witness schedule can change after the maintenance block, so the next witness after different maintenance block can be different, so it could cause forking (see this link for example).

Thoughts?

Related logs:

2017-11-23T18:00:03 th_a:Witness Block Production block_production_loo ] Generated block #22049281 with timestamp 2017-11-23T18:00:03 at time 2017-11-23T18:00:03                       witness.cpp:184
2017-11-23T18:00:03 th_a:invoke handle_block         handle_block ] Got block: #22049281 time: 2017-11-23T18:00:00 latency: 3267 ms from: verbaltech2  irreversible: 22049263 (-18)                     application.cpp:499
...
2017-11-23T19:00:00 th_a:invoke handle_block         handle_block ] Got block: #22050477 time: 2017-11-23T19:00:00 latency: 833 ms from: bhuz  irreversible: 22050458 (-19)                     application.cpp:499
...
2017-11-23T20:00:01 th_a:invoke handle_block         handle_block ] Got block: #22051674 time: 2017-11-23T20:00:00 latency: 1554 ms from: delegate.baozi  irreversible: 22051655 (-19)                  application.cpp:499
...
2017-11-23T21:00:01 th_a:invoke handle_block         handle_block ] Got block: #22052871 time: 2017-11-23T21:00:00 latency: 1235 ms from: wackou  irreversible: 22052855 (-16)                  application.cpp:499
...
2017-11-23T22:00:02 th_a:invoke handle_block         handle_block ] Got block: #22054068 time: 2017-11-23T22:00:00 latency: 2001 ms from: witness.still  irreversible: 22054049 (-19)                   application.cpp:499
...
2017-11-23T23:00:01 th_a:invoke handle_block         handle_block ] Got block: #22055265 time: 2017-11-23T23:00:00 latency: 1452 ms from: delegate.freedom  irreversible: 22055246 (-19)                        application.cpp:499
...
2017-11-24T00:00:03 th_a:invoke handle_block         handle_block ] Got block: #22056462 time: 2017-11-24T00:00:00 latency: 3104 ms from: verbaltech2  irreversible: 22056442 (-20)                     application.cpp:499
2017-11-24T00:00:03 th_a:invoke handle_block         handle_block ] Got block: #22056462 time: 2017-11-24T00:00:00 latency: 3372 ms from: verbaltech2  irreversible: 22056442 (-20)                     application.cpp:499
...
2017-11-24T01:00:01 th_a:invoke handle_block         handle_block ] Got block: #22057659 time: 2017-11-24T01:00:00 latency: 1214 ms from: xeldal  irreversible: 22057643 (-16)                  application.cpp:499
...
2017-11-24T02:00:02 th_a:invoke handle_block         handle_block ] Got block: #22058856 time: 2017-11-24T02:00:00 latency: 2005 ms from: witness.still  irreversible: 22058837 (-19)                   application.cpp:499
...
2017-11-24T03:00:01 th_a:invoke handle_block         handle_block ] Got block: #22060053 time: 2017-11-24T03:00:00 latency: 1040 ms from: blckchnd  irreversible: 22060036 (-17)                        application.cpp:499
...
2017-11-24T04:00:01 th_a:invoke handle_block         handle_block ] Got block: #22061250 time: 2017-11-24T04:00:00 latency: 1161 ms from: xn-delegate  irreversible: 22061232 (-18)                     application.cpp:499
...
2017-11-24T05:00:01 th_a:invoke handle_block         handle_block ] Got block: #22062447 time: 2017-11-24T05:00:00 latency: 1624 ms from: xman  irreversible: 22062431 (-16)                    application.cpp:499
...
2017-11-24T06:00:02 th_a:invoke handle_block         handle_block ] Got block: #22063644 time: 2017-11-24T06:00:00 latency: 2891 ms from: verbaltech2  irreversible: 22063629 (-15)                     application.cpp:499
2017-11-24T06:00:03 th_a:invoke handle_block         handle_block ] Got block: #22063644 time: 2017-11-24T06:00:00 latency: 3144 ms from: verbaltech2  irreversible: 22063630 (-14)                     application.cpp:499
...
2017-11-24T07:00:01 th_a:invoke handle_block         handle_block ] Got block: #22064840 time: 2017-11-24T07:00:00 latency: 1210 ms from: in.abit  irreversible: 22064823 (-17)                 application.cpp:499
...
2017-11-24T08:00:00 th_a:invoke handle_block         handle_block ] Got block: #22066037 time: 2017-11-24T08:00:00 latency: 867 ms from: roelandp  irreversible: 22066020 (-17)                 application.cpp:499
...
2017-11-24T09:00:01 th_a:invoke handle_block         handle_block ] Got block: #22067234 time: 2017-11-24T09:00:00 latency: 1563 ms from: abc123  irreversible: 22067216 (-18)                  application.cpp:499
2017-11-24T09:00:01 th_a:invoke handle_block         handle_block ] Got block: #22067234 time: 2017-11-24T09:00:00 latency: 1835 ms from: abc123  irreversible: 22067217 (-17)                  application.cpp:499
...
2017-11-24T10:00:01 th_a:invoke handle_block         handle_block ] Got block: #22068431 time: 2017-11-24T10:00:00 latency: 1061 ms from: roelandp  irreversible: 22068417 (-14)                        application.cpp:499
...
2017-11-24T11:00:01 th_a:invoke handle_block         handle_block ] Got block: #22069628 time: 2017-11-24T11:00:00 latency: 1617 ms from: delegate.baozi  irreversible: 22069611 (-17)                  application.cpp:499
...
2017-11-24T12:00:00 th_a:invoke handle_block         handle_block ] Got block: #22070825 time: 2017-11-24T12:00:00 latency: 960 ms from: roelandp  irreversible: 22070809 (-16)                 application.cpp:499
2017-11-24T12:00:01 th_a:invoke handle_block         handle_block ] Got block: #22070825 time: 2017-11-24T12:00:00 latency: 1230 ms from: roelandp  irreversible: 22070810 (-15)                        application.cpp:499
...
2017-11-24T13:00:03 th_a:Witness Block Production block_production_loo ] Generated block #22072022 with timestamp 2017-11-24T13:00:03 at time 2017-11-24T13:00:03                       witness.cpp:184
2017-11-24T13:00:03 th_a:invoke handle_block         handle_block ] Got block: #22072022 time: 2017-11-24T13:00:00 latency: 3269 ms from: verbaltech2  irreversible: 22072007 (-15)                     application.cpp:499
2017-11-24T13:00:03 th_a:invoke handle_block         handle_block ] Got block: #22072022 time: 2017-11-24T13:00:00 latency: 3269 ms from: verbaltech2  irreversible: 22072007 (-15)                     application.cpp:499
...
2017-11-24T14:00:00 th_a:invoke handle_block         handle_block ] Got block: #22073218 time: 2017-11-24T14:00:00 latency: 955 ms from: delegate-1.lafona  irreversible: 22073201 (-17)                        application.cpp:499
...
2017-11-24T15:00:00 th_a:invoke handle_block         handle_block ] Got block: #22074415 time: 2017-11-24T15:00:00 latency: 940 ms from: delegate-1.lafona  irreversible: 22074400 (-15)                        application.cpp:499
...
2017-11-24T16:00:00 th_a:invoke handle_block         handle_block ] Got block: #22075610 time: 2017-11-24T16:00:00 latency: 933 ms from: delegate-1.lafona  irreversible: 22075593 (-17)                        application.cpp:499
...
2017-11-24T17:00:01 th_a:invoke handle_block         handle_block ] Got block: #22076807 time: 2017-11-24T17:00:00 latency: 1162 ms from: in.abit  irreversible: 22076791 (-16)                 application.cpp:499
...
2017-11-24T18:00:01 th_a:invoke handle_block         handle_block ] Got block: #22078003 time: 2017-11-24T18:00:00 latency: 1081 ms from: delegate-1.lafona  irreversible: 22077987 (-16)                       application.cpp:499
@oxarbitrage
Copy link
Member

very interesting. let me ask you some stuff to understand better. is the 3 slots skipping a parameter that can be changed in case this blocks became more expensive or it is just happening because it is the time it is currently taking ? should we be be thinking on modifying witness plugin so the next witness wait 9 secs(or whatever) instead of 3 when we are at maintain time?
unsure about the technicals but maybe a flag of last block before next witness send maint block ?

on a side note there is an old issue at #211 that proposes plugins to add tasks to maintenance, something i think it will not be a good idea.

@xeroc
Copy link
Member

xeroc commented Nov 24, 2017

Iirc, the number of blocks that can be skipped at most is a committee parameter and can thus be changed

@oxarbitrage
Copy link
Member

@abitmore
Copy link
Member Author

It's getting worse.

2018-02-16T10:00:04 th_a:invoke handle_block         handle_block ] Got block: #24472511 01756bbf1a1c9e6a864b8e82f4d4d6a29481e0f1 time
: 2018-02-16T10:00:03 latency: 1711 ms from: roelandp  irreversible: 24472495 (-16)                       application.cpp:500
2018-02-16T10:00:05 th_a:invoke handle_block         handle_block ] Got block: #24472511 01756bbf1a1c9e6a864b8e82f4d4d6a29481e0f1 time
: 2018-02-16T10:00:03 latency: 2330 ms from: roelandp  irreversible: 24472496 (-15)                       application.cpp:500
2018-02-16T10:00:07 th_a:invoke handle_block         handle_block ] Got block: #24472511 01756bbfb2eef6e3a8ab42010a93ce87555178e5 time
: 2018-02-16T10:00:00 latency: 7453 ms from: delegate-zhaomu  irreversible: 24472496 (-15)                        application.cpp:500
2018-02-16T10:00:09 th_a:invoke handle_block         handle_block ] Got block: #24472511 01756bbfff6fbe4e527fecbd3c686f8110fc3155 time
: 2018-02-16T10:00:06 latency: 3613 ms from: witness.still  irreversible: 24472496 (-15)                  application.cpp:500
2018-02-16T10:00:18 th_a:invoke handle_block         handle_block ] Got block: #24472512 01756bc0a9a7083bd4fdb0545b5ed162ac78f23f time
: 2018-02-16T10:00:18 latency: 683 ms from: wackou  irreversible: 24472496 (-16)                  application.cpp:500

@pmconrad
Copy link
Contributor

Hm, any idea what's causing this?
Perhaps need to add some timing output in various places during maintenance?
Or try to run a replay in profiling mode? That might take ages...

@abitmore
Copy link
Member Author

This is a plugin issue, although the logic is tricky.

When the witness plugin didn't receive a block in 3 seconds then it tries to produce another block, actually this can be profiled, say, if it took itself 3 seconds to produce the new block, it should know the previous witness need 3 seconds to produce and some time to propagate, so the best guess should be it should have received the block now, so best to validate it (may need time as well) and build the next block based on it, but not simply broadcast its own maintenance block (then miss its turn).

Anyway, maintenance blocks are meant to need longer time to produce, we need to figure out the best way to handle them, which is independent from what's done in the maintenance interval. Above comment is a direction.

On the other hand, you've asked, why we now need so long time to produce the block? One simple reason is we have more and more accounts, so the node need longer time to iterate through them to re-tally votes. We may need to improve this mechanism at some time, one option could be learning from Steem (update votes on every block).

@abitmore
Copy link
Member Author

Similar new log entries:

2018-02-19T22:00:04 th_a:invoke handle_block         handle_block ] Got block: #24572044 0176f08ce4785cf6577a3202517b4bda4464e40c time: 2018-02-19T22:00:00 latency: 4013 ms from: witness.still  irreversible: 24572027 (-17)                  application.cpp:500
2018-02-19T22:00:04 th_a:invoke handle_block         handle_block ] Got block: #24572044 0176f08ce4785cf6577a3202517b4bda4464e40c time: 2018-02-19T22:00:00 latency: 4584 ms from: witness.still  irreversible: 24572027 (-17)                  application.cpp:500
2018-02-19T22:00:04 th_a:invoke handle_block         handle_block ] Got block: #24572044 0176f08c0b8aa7c3f267beb5846938b844d214e7 time: 2018-02-19T22:00:03 latency: 1843 ms from: abc123  irreversible: 24572027 (-17)                 application.cpp:500
2018-02-19T22:00:07 th_a:invoke handle_block         handle_block ] Got block: #24572044 0176f08c8bf1ba0cfe3e592fa75015908711cfed time: 2018-02-19T22:00:06 latency: 1868 ms from: sahkan-bitshares  irreversible: 24572027 (-17)                       application.cpp:500
2018-02-19T22:00:18 th_a:invoke handle_block         handle_block ] Got block: #24572045 0176f08d071472f054c0628dcf9bd6e17121d5a0 time: 2018-02-19T22:00:18 latency: 385 ms from: verbaltech2  irreversible: 24572027 (-18)                     application.cpp:500

@abitmore
Copy link
Member Author

Now I tend to believe that current behavior / implementation is the best approach in the worst scenario, so likely won't change it.

I will close this issue if there is no new feedback.

@abitmore
Copy link
Member Author

Copied @pmconrad's comment in #1157 (comment) to this issue since it's related:

We should define desirable behaviour before implementing fixes.

IMO, if a node hasn't received a block from its predecessor when its time slot has come, it should simply produce its own block in time.

  • Waiting until its own slot is nearly over is counterproductive, this will only propagate the problem to the next witness in line.
  • Producing a block on time is also desirable from the user's perspective, because that will get his transaction approved more quickly.
  • Not waiting for the previous block increases the chance of a fork between the previous block and the next. Waiting for the previous block increases the chance of a fork between this block and the next. So in the end waiting doesn't gain us anything (but increases the risk if the cause is not latency but witness failure).

There is one special case though: if the previous block has been received in time but takes a long time to apply (as is often the case in a maintenance block), then it makes sense to produce one even if our slot is nearing its end. I would extend the deadline in that special case only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants