
validation: Flush state after initial sync #15218

Closed · wants to merge 1 commit

Conversation

@andrewtoth (Contributor) commented Jan 20, 2019

This fixes a common issue where initial sync is performed with a high dbcache and the node is then not shut down cleanly.

Resolves #11600.

@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch Jan 20, 2019
src/validation.cpp Outdated

// Flush state after IBD is finished, but don't block return
void (*flush_ptr)() = &FlushStateToDisk;
std::thread(flush_ptr).detach();

@practicalswift (Contributor) commented Jan 20, 2019

Could be written as std::thread(FlushStateToDisk).detach();?

@andrewtoth (Author, Contributor) commented Jan 20, 2019

No, because FlushStateToDisk is overloaded and the compiler can't determine which function to call in that case.

@practicalswift (Contributor) commented Jan 20, 2019

@andrewtoth Ah, yes! Thanks!

$ git grep ' FlushStateToDisk.*{'
src/validation.cpp:bool static FlushStateToDisk(const CChainParams& chainparams, CValidationState &state, FlushStateMode mode, int nManualPruneHeight) {
src/validation.cpp:void FlushStateToDisk() {
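
(Aside, for illustration: a minimal standalone sketch of the overload issue, with hypothetical Flush overloads standing in for FlushStateToDisk. An overloaded name passed directly to std::thread is ambiguous, so it must be disambiguated with a typed function pointer, as the patch does, or with a lambda.)

#include <thread>

void Flush() {}                // stands in for the nullary FlushStateToDisk()
void Flush(int /*height*/) {}  // stands in for the overload taking arguments

int main() {
    // std::thread(Flush).detach();  // error: cannot deduce which overload
    void (*flush_ptr)() = &Flush;    // a typed pointer selects the nullary one
    std::thread(flush_ptr).join();
    std::thread([] { Flush(); }).join();  // a lambda would also resolve it
}
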
@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch Jan 21, 2019
@laanwj (Member) commented Jan 21, 2019

Concept ACK, but I think IsInitialBlockDownload is the wrong place to implement this: it's a query function, and having it suddenly spawn a thread that flushes is unexpected.

It would be better to implement it closer to the validation logic and the database update logic itself.

@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch Jan 21, 2019
@andrewtoth (Contributor, Author) commented Jan 21, 2019

@laanwj Good point. I refactored to move this behaviour to ActivateBestChain in an area where periodic flushes are already expected.

@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch Jan 22, 2019
src/validation.cpp Outdated
static bool initially_in_ibd = true;
bool finished_ibd = initially_in_ibd && !IsInitialBlockDownload();
if (finished_ibd) {
initially_in_ibd = false;

@practicalswift (Contributor) commented Jan 22, 2019

Wrong indentation :-)

@andrewtoth (Author, Contributor) commented Jan 22, 2019

Fixed.

@laanwj (Member) commented Jan 22, 2019

> @laanwj Good point. I refactored to move this behaviour to ActivateBestChain in an area where periodic flushes are already expected.

Thanks, much better!

@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch Jan 22, 2019
src/validation.cpp Outdated
@@ -2751,7 +2751,13 @@ bool CChainState::ActivateBestChain(CValidationState &state, const CChainParams&
CheckBlockIndex(chainparams.GetConsensus());

// Write changes periodically to disk, after relay.
if (!FlushStateToDisk(chainparams, state, FlushStateMode::PERIODIC)) {
// Unless we just finished initial sync, then always write to disk
static bool initially_in_ibd = true;

@ken2812221 (Contributor) commented Jan 22, 2019

I think this variable should be a member of CChainState, to avoid sharing the same variable if we have two CChainState objects in the program.

@andrewtoth (Author, Contributor) commented Jan 23, 2019

Done.
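
(For illustration, a hypothetical standalone sketch of the suggestion above: keep the flag as per-instance state rather than a function-local static, so two CChainState objects do not share it. The names and the stand-in method are illustrative, not the real validation code.)

#include <cassert>

struct CChainState {
    bool m_initially_in_ibd{true};  // replaces `static bool initially_in_ibd`

    // Stand-in for the check done in ActivateBestChain.
    bool FinishedIbd(bool is_ibd_now) {
        const bool finished = m_initially_in_ibd && !is_ibd_now;
        if (finished) m_initially_in_ibd = false;
        return finished;
    }
};

int main() {
    CChainState a, b;
    assert(a.FinishedIbd(false));   // `a` leaves IBD: fires exactly once
    assert(!a.FinishedIbd(false));  // subsequent calls do not fire again
    assert(b.FinishedIbd(false));   // `b` tracks its own flag independently
}
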

@sdaftuar (Member) commented Jan 22, 2019

I'm not really a fan of this change -- the problem described in #11600 is from an unclean shutdown (i.e. a system crash), where our recovery code could take a long time (but typically would be much faster than doing a -reindex to recover, which is how our code used to work).

This change doesn't really solve that problem; it just changes the window in which an unclean shutdown could occur (reducing it by at most 24 hours). But extra flushes, particularly during initial sync, aren't obviously a good idea, since they harm performance. (Note that we leave IBD before we've synced all the way to the tip -- I think once we're within a day or two?)

Because we flush every day anyway, it's hard for me to say that this is really that much worse, performance-wise (after all, we don't currently support a node configuration where the UTXO set is kept entirely cached). But I'm not sure this solves anything either, and a change like this would have to be reverted if, for instance, we wanted to make the cache actually more useful on startup (something I've thought we should do for a while). So I think I'm a -0 on this change.

@andrewtoth (Contributor, Author) commented Jan 23, 2019

@sdaftuar This change also greatly improves the common workflow of spinning up a high performance instance to sync, then immediately shutting it down and using a cheaper one. Currently, you have to enter it and do a clean shutdown instead of just terminating. Similarly, when syncing to an external drive, you can now just unplug the drive or turn off the machine when finished.

I would argue that moving the window to zero hours directly after initial sync is an objective improvement. There is a lot of data that would be lost directly after sync, so why risk another 24 hours? After that, the most a user will lose is 24 hours' worth of rolling back, instead of 10 years'. Also, this change does not do any extra flushes during initial sync, only a single one after it finishes.

I can't speak to your last point about changing the way we use the cache, since I don't know what your ideas are.

@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch to 4787054 Jan 23, 2019
@sdaftuar (Member) commented Jan 23, 2019

> Currently, you have to enter it and do a clean shutdown instead of just terminating.

@andrewtoth We already support this (better, I think) with the -stopatheight argument, no?

I don't really view data that is in memory as "at risk"; I view it as a massive performance optimization that will allow a node to process new blocks at the fastest possible speed while the data hasn't yet been flushed. I also don't feel very strongly about this for the reasons I gave above, so if others want this behavior then so be it.

@sipa (Member) commented Jan 23, 2019

@sdaftuar Maybe this is a bit of a different discussion, but there is another option: namely, supporting flushing the dirty state to disk without wiping it from the cache. Based on our earlier benchmarking, we wouldn't want to do this purely for maximizing IBD performance, but it could be done at specific times to minimize losses in case of crashes (the once-per-day flush, for example, and also this IBD-is-finished one).
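
(A minimal sketch of that idea, using hypothetical types rather than the real CCoinsViewCache API: persist dirty entries to the backing store without evicting them, so the cache stays warm after the flush.)

#include <map>
#include <string>

struct Coin { std::string data; bool dirty{false}; };

struct CoinsCache {
    std::map<int, Coin> entries;

    // Unlike a flush-and-wipe, this writes dirty entries to the backing map,
    // marks them clean, and keeps them cached for future reads.
    void SyncWithoutErase(std::map<int, Coin>& disk) {
        for (auto& [id, coin] : entries) {
            if (coin.dirty) {
                coin.dirty = false;  // mark clean, but keep it cached
                disk[id] = coin;     // persist the (now clean) entry
            }
        }
    }
};

int main() {
    CoinsCache cache;
    std::map<int, Coin> disk;
    cache.entries[0] = Coin{"utxo-a", true};
    cache.SyncWithoutErase(disk);
    return (disk.count(0) == 1 && !cache.entries.empty()) ? 0 : 1;
}
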

@sdaftuar (Member) commented Jan 23, 2019

@sipa Agreed, I think that would make a lot more sense as a first-pass optimization for the periodic flushes, and it would work better for this purpose as well.

@gmaxwell (Contributor) commented Jan 24, 2019

> Currently, you have to enter it and do a clean shutdown instead of just terminating.

Well, with this, if you "just terminate" you're going to end up with a replay of several days' blocks at startup, which is still ugly, even if less bad with this change.

As an aside: if you actually shut off the computer at any time during IBD, you'll likely completely corrupt the state and need to reindex, because we don't use fsync during IBD for performance reasons.

We really need to get background writing going, so that our writes are never more than (say) a week of blocktime behind... but that is a much bigger change, so I don't suggest "just do that instead", though it would make the change here completely unnecessary.

Might it be better to trigger the flush the first time it goes 30 seconds without connecting a block and there are no queued transfers, from the scheduler thread?

@andrewtoth (Contributor, Author) commented Jan 25, 2019

> @andrewtoth We already support this (better, I think) with the -stopatheight argument, no?

@sdaftuar Ahh, I never considered using that for this purpose. Thanks!

@gmaxwell It might still be ugly to have a replay of a few days, but much better than making everything unusable for hours.

There are comments from several people in this PR about adding background writing and writing dirty state to disk without wiping the cache. This change wouldn't affect either of those improvements, and is an improvement by itself in the interim.

As for moving this to the scheduler thread, I think the current approach is better since it happens in a place where periodic flushes are already expected. Also, checking every 30 seconds for a new block wouldn't work if, for instance, the network cuts out for a few minutes.

@sipa (Member) commented Jan 25, 2019

@andrewtoth The problem is that right now, causing a flush when exiting IBD will (temporarily) kill your performance right before finishing the sync (because it leaves you with an empty cache). If instead it was a non-clearing flush, there would be no such downside.

@sdaftuar (Member) commented Jan 29, 2019

My experiment in #15265 has changed my view on this a bit -- now I think that we might as well make a change like this for now, but should change the approach slightly to do something like @gmaxwell's proposal so that we don't trigger the flush before we are done syncing:

> Might it be better to trigger the flush the first time it goes 30 seconds without connecting a block and there are no queued transfers, from the scheduler thread?

@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch 2 times, most recently from f1be35e to 442db9d Feb 10, 2019
@andrewtoth (Contributor, Author) commented Feb 10, 2019

@sdaftuar @gmaxwell I've updated this to check every 30 seconds on the scheduler thread whether there has been an update to the active chain height. The check only runs after IsInitialBlockDownload is false, which happens once the latest block is within a day of the current time.

I'm not sure how to check if there are queued transfers. If this is not sufficient, some guidance on how to do that would be appreciated.

src/init.cpp Outdated (resolved)
@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch 2 times, most recently from 79a9ed2 to 3abbfb0 Feb 10, 2019
@MarcoFalke (Member) commented Apr 22, 2019

utACK 5d9aa4c

@sdaftuar (Member) commented Apr 22, 2019

> I'm not sure how to check if there are queued transfers. If this is not sufficient, some guidance on how to do that would be appreciated.

There's a variable called nPeersWithValidatedDownloads in net_processing.cpp which indicates how many peers we are downloading blocks from. So to implement @gmaxwell's suggestion, I think you would just need to expose that variable and then check whether it equals 0.

@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch from 5d9aa4c to 44e0182 Apr 23, 2019
@andrewtoth (Contributor, Author) commented Apr 23, 2019

@sdaftuar Thanks. Updated.
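
(Putting the pieces together: a hypothetical standalone sketch of the updated approach. The stubs below stand in for the real validation and net_processing symbols, and the scheduler wiring is omitted. Once IBD has ended, a 30-second poll flushes the first time the tip height has not advanced and no peers have block downloads in flight.)

#include <atomic>

std::atomic<int> g_best_height{0};              // stand-in for the active chain height
std::atomic<int> g_peers_downloading{0};        // stand-in for nPeersWithValidatedDownloads
bool IsInitialBlockDownload() { return false; } // stub of the real query
void FlushStateToDisk() {}                      // stub of the nullary overload

// Hypothetical callback, scheduled to run every 30 seconds.
void MaybeFlushAfterSync()
{
    static bool flushed = false;
    static int last_height = -1;
    if (flushed || IsInitialBlockDownload()) return;  // only act once IBD ends
    const int height = g_best_height.load();
    if (height == last_height && g_peers_downloading.load() == 0) {
        FlushStateToDisk();  // first quiet window after sync: write out the cache
        flushed = true;      // one-shot: never flush from here again
    }
    last_height = height;
}

int main() {
    MaybeFlushAfterSync();  // records the current height
    MaybeFlushAfterSync();  // height unchanged, no downloads in flight: flushes
}
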

@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch from 44e0182 to 6f76c4f Apr 23, 2019
src/net_processing.cpp Outdated (resolved)
@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch from 6f76c4f to b32fca5 Apr 23, 2019
@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch 2 times, most recently from 010dab3 to 61af83d Jul 31, 2019
@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch from 61af83d to f77c316 Jul 31, 2019
@andrewtoth andrewtoth force-pushed the andrewtoth:flush-after-ibd branch from f77c316 to d2ecb70 Jul 31, 2019
@TheBlueMatt (Contributor) commented Aug 21, 2019

-0.5. This is really not the right way to go here, I think. Namely, people with a high dbcache may also want to use it after IBD completes (i.e. because they want to connect blocks quickly), and this breaks that by wiping the cache completely. Coupling it with a change to not drop non-dirty entries while flushing is the easy fix to avoid a regression, though a more complete cleanup of background flushing would also be nice.

@jamesob (Member) commented Aug 21, 2019

> this breaks that by wiping the cache completely

On a sort of related note, I think it's kind of crazy that we couple writing back to disk with emptying the in-memory cache. Feels like those should be separate considerations, especially in a case like this.

@laanwj (Member) commented Oct 8, 2019

This PR seems to be too controversial to merge in the current state, and discussion has become inactive. Closing as up for grabs. Let me know if you start work on this again and I'll re-open it.

@MarcoFalke (Member) commented Nov 15, 2019

You could rebase on top of something like #17487, and then open a new pull request linking to this discussion, which is mostly "NACK, because this will erase the cache".

@andrewtoth (Contributor, Author) commented Nov 16, 2019

@MarcoFalke Thanks for bringing that to my attention. What is the benefit of making a new PR instead of reopening this one? Just for visibility?

@MarcoFalke (Member) commented Nov 16, 2019

Up to you. Let me know if you want this reopened.

@luke-jr (Member) commented Nov 16, 2019

The benefit is that the no-longer-applicable criticism is no longer right there at first glance.

@MarcoFalke (Member) commented Nov 22, 2019

Given that this was found not to negatively affect performance (see #15265 (comment)), I think it should be reconsidered in its current form.

@andrewtoth (Contributor, Author) commented Nov 22, 2019

@MarcoFalke I don't think the takeaway from that comment is that it doesn't affect performance. That benchmark measured IBD for pruned nodes without emptying the cache on flush, whereas this PR flushes the dbcache after IBD. I'm sure that for a non-pruned node with a high dbcache this will negatively affect performance directly after the flush, as @TheBlueMatt's comment points out.

@andrewtoth (Contributor, Author) commented Nov 22, 2019

If #17487 gets merged I will rebase this, perhaps without checking for active peers, since the cache would no longer be affected and it will make the diff smaller.

@Sjors (Member) commented Nov 23, 2019

Concept ACK on doing a Flush(erase=false) at the end of IBD once support for that is merged. It probably still makes sense to wait a little bit (e.g. checking for active peers); otherwise the user might be bothered with this flush right before sync really finishes.
