
Cache 26% more coins: Reduce CCoinsMap::value_type from 96 to 76 bytes #17060

Open · wants to merge 3 commits into base: master
Conversation

@martinus (Contributor) commented Oct 5, 2019

This is an attempt to pack the data inside CCoinsMap more tightly (replaces #16970; done after my investigations in #16957). Currently there is quite a bit of overhead in the CCoinsMap::value_type data structure (total bytes on the left):

96 CCoinsMap::value_type
    36 COutPoint
        32 uint256
         4 uint32_t
     4 >>> PADDING <<<
    56 CCoinsCacheEntry
        48 Coin
            40 CTxOut
                8 nValue
                32 CScript
             4 fCoinBase & nHeight
             4 >>> PADDING <<<
         1 flags (dirty & fresh)
         7 >>> PADDING <<< 

So there is quite a bit of padding. In my experiments I've noticed that the compiler is forced to use an alignment >= 8 only because nValue's CAmount type is int64_t, which has to be 8-byte aligned. When nValue is replaced with a 4-byte-aligned data structure, the whole CCoinsMap::value_type becomes 4-byte aligned, reducing padding.

Another 4 bytes can be saved by refactoring prevector to use only a single byte for the direct size information, reducing CScript's direct size from 28 to 27 bytes. It is still able to cache most scripts directly, as most are <= 25 bytes long.

The remaining 4 bytes (the 1-byte flag plus 3 bytes of padding in CCoinsCacheEntry) can be removed by moving the flags into Coin, stealing another 2 bits from nHeight.

Finally, the resulting data structure is 20 bytes smaller:

76 CCoinsMap::value_type
    36 COutPoint
        32 uint256
         4 uint32_t
    40 CCoinsCacheEntry
        40 Coin
            36 CTxOut
                 8 nValue
                28 CScript
            4 fCoinBase & nHeight & flags (dirty & fresh)

So we can store about 26% more data in dbcache's memory. I have evaluated this on my Intel i7-8700 (32 GB RAM, external SSD) for -reindex-chainstate with both -dbcache=500 and -dbcache=5000:


                                           time      max resident set size (KB)
    master -dbcache=500                    05:57:20  2,957,532
    2019-09-more-compact-Coin -dbcache=500 05:33:37  2,919,312
    Improvement                            6.64%     1.29%

                                            time      max resident set size (KB)
    master -dbcache=5000                    04:22:42  7,072,612
    2019-09-more-compact-Coin -dbcache=5000 04:09:16  6,533,132
    Improvement                             5.11%     7.63%

So on my machine there is a definite improvement, but honestly I am not sure whether this ~6% speedup is worth the change. Maybe the improvement is bigger with a slower disk, as ~26% more transactions can be cached.

@DrahtBot (Contributor) commented Oct 5, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #15606 ([experimental] UTXO snapshots by jamesob)
  • #9384 (CCoinsViewCache code cleanup & deduplication by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@Sjors (Member) commented Oct 6, 2019

It would be nice to test this on a Raspberry-like device; they tend to have low memory and slow storage. Unfortunately they generally also require pruning, and the way pruning currently works, we flush the coin cache.

For example I have an Orange Pi chipping away at IBD this week; although it has 2 GB of RAM, because I'm pruning it to 5 GB, it never uses more than 230 MB.

@MarcoFalke (Member) commented Oct 7, 2019

For example I have an Orange Pi chipping away at IBD this week; although it has 2 GB of RAM, because I'm pruning it to 5 GB, it never uses more than 230 MB.

A 512 GB micro SD card should do the job, no?

@fanquake (Member) commented Oct 8, 2019

@TheBlueMatt You might be interested here given you commented in #16970.

@laanwj (Member) commented Oct 9, 2019

A 512 GB micro SD card should do the job, no?

Yep, works fine.

src/primitives/transaction.h: outdated review thread (resolved)

@fanquake (Member) commented Oct 9, 2019

Please let's keep the discussion high-level before pointing out any typos.

@promag (Member) left a comment

Concept ACK. I didn't verify the claim, but 26% sounds like a good improvement to me.

AFAICT dirty and fresh are almost always read and written at the same time; maybe use a 2-bit flag? Or is the compiler doing that anyway?

@laanwj (Member) commented Oct 12, 2019

@promag that's what the bitfield syntax unsigned int dirty : 1; does: it makes the flags take up the minimum possible space.
Or maybe I'm misunderstanding what you're trying to accomplish.

@GChuf (Contributor) commented Oct 12, 2019

Would like to test this on my VM using a HDD.
I'm not sure how to benchmark this - I've tried https://github.com/chaincodelabs/bitcoinperf but haven't succeeded. Any info/help would be appreciated.

@MarcoFalke (Member) commented Oct 12, 2019

bitcoinperf is a bit messy to set up; I think it requires Docker. Is there any specific issue you ran into? It might be better to report such issues upstream at https://github.com/chaincodelabs/bitcoinperf so as not to bloat this thread.

@martinus (Contributor, Author) commented Oct 12, 2019

Would like to test this on my VM using a HDD.
I'm not sure how to benchmark this - I've tried https://github.com/chaincodelabs/bitcoinperf but haven't succeeded. Any info/help would be appreciated.

In my benchmarks I deleted debug.log, ran /usr/bin/time -v ./bitcoind -reindex-chainstate -stopatheight=594000 -printtoconsole=0 -dbcache=500, then used a hand written script to convert the debug.log into gnuplot-readable data, then used gnuplot to create the graphs.

It would be awesome to have some kind of logger in bitcoind that can periodically log statistics to a file in an easily parseable format.

@GChuf (Contributor) commented Oct 12, 2019

There were a couple of issues. Could you recommend anything else for testing -reindex-chainstate? Should I simply run bitcoind and have Linux time it...?
Or maybe I'll look at the tools mentioned in this Stack Exchange answer.

p.s. thanks for the info @martinus. I'll figure something out with what you did and those tools mentioned in the link.

@promag (Member) commented Oct 12, 2019

@laanwj I mean stuff like this:

        auto oldDirty = coin.dirty;
        auto oldFresh = coin.fresh;

could be

        auto old_cache = coin.cache;

where

    unsigned int cache : 2;   // DIRTY = 1,  FRESH = 2

@GChuf (Contributor) commented Oct 15, 2019

The debug.log from the first benchmark (master) got overwritten by the second benchmark, and I haven't written a script to graph debug.log yet anyway, so I'm just going to post the /usr/bin/time -v results.
The improvement seems remarkable! I tested this on an Ubuntu 16 VM running on a Seagate SSHD (hybrid) disk. It seems, however, that the process used more CPU; I'm not sure what's going on with those percentages. Must be a VM quirk.

                                          time     max resident set size (KB)
    2019-09-more-compact-Coin dbcache=500 2:20:09  1,673,108
    master dbcache=500                    3:25:46  1,729,412
    Improvement                           31.9%    3.26%

	Command being timed: "./bitcoindmaster -reindex-chainstate -stopatheight=325000 -printtoconsole=0 -dbcache=500"
	User time (seconds): 19760.08
	System time (seconds): 1031.02
	Percent of CPU this job got: 168%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:25:46
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1729412
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 2088592
	Voluntary context switches: 38379113
	Involuntary context switches: 2185645
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

	Command being timed: "./bitcoindcoin -reindex-chainstate -stopatheight=325000 -printtoconsole=0 -dbcache=500"
	User time (seconds): 19004.87
	System time (seconds): 598.24
	Percent of CPU this job got: 233%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:20:09
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1673108
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1737560
	Voluntary context switches: 17203875
	Involuntary context switches: 2926137
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@martinus (Contributor, Author) commented Oct 15, 2019

Thanks @GChuf for testing this! These are really nice results.

It seems however the process used more CPU

I don't think so:

  • master: 19760.08 s user + 1031.02 s system = 20791.10 s
  • 2019-09-more-compact-Coin: 19004.87 s user + 598.24 s system = 19603.11 s

So the branch used ~6% less CPU. Note that its 233% figure is much higher than master's because it used that CPU over a much shorter timeframe.

The percentage is calculated as 100 * (user + system time) / elapsed; 233% means that, on average, 2.33 cores were busy with this job.

@GChuf (Contributor) commented Oct 16, 2019

The percentage is calculated 100*(user + system time) / elapsed. 233 percent means that on average 2.33 cores were running for this job.

Thanks for the explanation! The math works :)
If more testing is needed I'd be glad to help, but I think HDDs will show results similar to my SSHD, and it's clear that the improvements are real.

@jamesob (Member) commented Oct 17, 2019

Concept ACK from me; this looks like some great savings. It will be crucial to test this thoroughly across platforms, though. Will review in depth soon.

@GChuf (Contributor) commented Oct 17, 2019

I can also test it on Windows. Does anyone already have a Windows executable for this, to make my life easier? @martinus maybe?

@sipa (Member) commented Oct 17, 2019

These are pretty impressive gains, so concept ACK.

Some overall comments:

  • Since this is making the boundary between Coin and CCoinsCacheEntry a bit less clear, maybe it would be useful to either have a Coin::AssignWithoutFlags function, or even make Coin::operator= not touch the flags. That would avoid all the oldX = ...; ...; ... = oldX code, and make review easier.
  • The coding style guidelines for variables say to use snake_case in new code.

I do want to review the prevector changes in more detail.

@martinus (Contributor, Author) commented Oct 17, 2019

I can also test it on windows. Does anyone have the windows executable for this already, to make my life easier? @martinus maybe?

Sorry, I only build on Linux; I've never tried building on Windows.

Since this is making the boundary between Coin and CCoinsCacheEntry a bit less clear, maybe it would be useful to either have a Coin::AssignWithoutFlags function, or even make Coin::operator= not touch the flags. That would avoid all the oldX = ...; ...; ... = oldX code, and make review easier.

I'm also unsure whether it was necessary everywhere to back up and then restore the previous flags. Moving the flags into Coin only saves 4 bytes; I could also drop that change and do it in a separate PR.

@MarcoFalke (Member) commented Oct 17, 2019

@DrahtBot might have a windows build ready tomorrow

@dongcarl (Contributor) commented Oct 17, 2019

I have Gitian-built windows binaries of 03cb535 here for those who are keen (@GChuf): https://send.firefox.com/download/c78ac103a4dabe86/#S70uxKCUua-baEeR0v029g

@jamesob (Member) commented Oct 18, 2019

Hm, I'm getting the opposite of what I would expect in terms of benchmark results. For /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=4000 -connect=0 (run using a lightly modified version of this script) across a number of different machines, I'm seeing this branch as reliably slower than master:

bench-ssd-2  master                               4:05:52    0.45  6283.55MB 204%
bench-ssd-2  martinus/2019-09-more-compact-Coin   9:06:48    1.00  5927.98MB 347%

bench-ssd-3  martinus/2019-09-more-compact-Coin   9:07:05    1.00  5914.48MB 346%
bench-ssd-3  master                               2:59:52    0.33  6278.71MB 77%

bench-ssd-4  master                               2:33:47    0.28  6257.46MB 89%
bench-ssd-4  martinus/2019-09-more-compact-Coin   9:07:19    1.00  5834.16MB 346%

bench-ssd-5  martinus/2019-09-more-compact-Coin   9:03:16    1.00  5963.10MB 349%
bench-ssd-5  master                               3:58:48    0.44  6252.48MB 209%

bench-hdd-4  master                               6:22:47    0.64  6263.75MB 36%
bench-hdd-4  martinus/2019-09-more-compact-Coin   10:02:42   1.00  5908.06MB 314%

bench-hdd-5  martinus/2019-09-more-compact-Coin   9:52:42    1.00  5889.18MB 320%
bench-hdd-5  master                               6:54:05    0.70  6240.25MB 120%

Sample machine specs:

Hostname:            bench-ssd-2
Kernel:              Linux 4.9.0-8-amd64
OS:                  Debian GNU/Linux 9
RAM (GB):            7.71
Architecture:          x86_64
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    4
Model name:            Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz

Can anyone think of an explanation for this?


Also, FWIW, in your original PR description the graph and table show master as being faster than this branch, but when I first read it I assumed that was just a labeling error.

@martinus (Contributor, Author) commented Oct 18, 2019

Can anyone think of an explanation for this?

I really can't explain why the difference is so huge. Maybe it has something to do with the Xeon CPU? Or there is still a bug in the code that manifests on that CPU. Do you have the debug.log files, or graphs of the progress? Maybe they would show at which point the difference occurs.

I'm currently running a comparison on a very slow Intel Celeron N3050 (2 GB of RAM, -dbcache=500, external SSD). So far the results seem very close.

@DrahtBot (Contributor) commented Oct 19, 2019

Gitian builds for commit ec3ed5a (master):

Gitian builds for commit c03bdee (master and this pull):

@GChuf (Contributor) commented Oct 19, 2019

@jamesob please post the debug logs; some time ago I was looking at your results in #16801, which seemed weird as well, and debug logs could shed some much-needed light on this.
As for Xeon CPUs, I've got a Xeon E3 1240 v2. The previous test on the Ubuntu VM was run with this CPU, and I'm going to run the Windows benchmarks on it as well and post the results here. Anyway, I doubt Xeon (or any other Intel) CPUs are the problem, though it's possible.

@GChuf (Contributor) commented Oct 20, 2019

Much more modest improvements when running this on a physical Windows machine (~2% faster). The same hardware was used as in my previous benchmark.
I might have seen bigger improvements with an old HDD, but I just wanted to make sure the improvements are there on Windows as well.

master: 02:59:21
2019-09-more-compact-Coin: 02:56:05


@martinus (Contributor, Author) commented Oct 20, 2019

@GChuf, did you run the Windows benchmark on the same hardware where you got the 31% improvement on Linux?

My benchmark on the Intel Celeron N3050 has finished; the branch is about 1.4% faster there. It seems the CPU is definitely the limiting factor here: each run took over 6 days.


@GChuf (Contributor) commented Oct 20, 2019

@martinus yes, forgot to mention: the hardware was the same. The difference is that the first run was on a VM, which certainly performs worse. I think the physical-vs-VM difference is more pronounced for CPUs than for disks.

@JeremyRubin (Contributor) commented Oct 20, 2019

Some notes on other paths to shave off bytes here:

As an alternative, you could use something like the union below to compact the CAmount for the common case of a UTXO smaller than 2^31/100e6, about 21.47 BTC (evidence: https://eprint.iacr.org/2017/1095.pdf; also, by pigeonhole, since there are at most 21M coins and 100M+ UTXOs, most have to be small). This technique saves another 4 bytes. The nice thing about this change is that we don't have to touch the guts of CScript, which has implications across a bunch of sensitive areas.

Were you to combine it with the prevector and CScript changes, you would see the same 4-byte saving, bringing it down to 72 bytes.

There is also the option of hashing the COutPoint to save another 4 bytes, but I looked into this once and it seemed complicated because in some cases we use the key of the pair, IIRC. (The same trick as above could shave off 2 bytes, as almost all output indexes are < 16 bits.)

    union {
        struct {
            bool is_big : 1;
        private:
            uint32_t amount : 31;
        } read_flag;
        struct {
            struct {
            private:
                bool __flag : 1;
            public:
                uint32_t amount : 31;
            } flag_amount;
            CScript script;
        } small_value;
        struct {
            struct {
            private:
                bool __flag : 1;
            public:
                uint32_t ignore : 31;
            } flag;
            std::unique_ptr<CTxOut> ptr;
        } large_value;
    };

@jamesob (Member) commented Oct 21, 2019

Just finished another benchmark run on a significantly beefier machine with similar (odd) results. Only the debug.log for this branch was preserved since I started the run a few days ago, but (i) I'm working on some better tooling to parse debug.log files (optionally contingent on #16805) and generate tables, graphs, etc., and (ii) in future runs I'll make sure I preserve the debug.log files for each branch.

/usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=4000 -connect=0

Edit: compiled with g++ (Debian 8.3.0-6) 8.3.0.

master

{'time': '2:56:18', 'cpu_perc': '68%', 'mem_kb': 6389536, 'user_time_secs': 6754.8, 'system_time_secs': 527.61}

martinus/2019-09-more-compact-Coin (03cb535)

{'time': '9:16:02', 'cpu_perc': '635%', 'mem_kb': 5871084, 'user_time_secs': 211292.35, 'system_time_secs': 565.91}

debug.log (martinus/2019-09-more-compact-Coin)

https://transfer.sh/ncVfW/jamesob-17060-debug.log

hwinfo

hostname               bench-strong        
cpu_model_name         Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ram_gb                 31.35
os                     ['Debian GNU/Linux', '10', 'buster']
arch                   x86_64              
kernel                 4.19.0-5-amd64      
read_iops (/tmp)       29.2k
write_iops (/tmp)      9694
read_iops (/data)      396
write_iops (/data)     131

@martinus (Contributor, Author) commented Oct 21, 2019

Thanks for posting the debug.log! That's already helpful. I was hoping to find some issue where e.g. db sync takes a long time, but this is not the case. Flushing the db cache takes a few minutes, but nothing extraordinary:

2019-10-18T08:30:54Z UpdateTip: new best=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f height=0 version=0x00000001 log2_work=32.000022 tx=1 date='2009-01-03T18:15:05Z' progress=0.000000 cache=0.0MiB(0txo)
2019-10-18T10:10:27Z UpdateTip: new best=00000000000000000d533060763bd64c8b97de7fdbc407bd918cf97ff37ae69a height=372609 version=0x00000003 log2_work=83.29081 tx=82023512 date='2015-09-02T02:22:29Z' progress=0.175964 cache=4275.5MiB(35061266txo)
2019-10-18T10:17:07Z UpdateTip: new best=0000000000000000056e394d44a7ae5eb021d6171504a7dec0a833d50763fad6 height=372610 version=0x00000003 log2_work=83.290838 tx=82024083 date='2015-09-02T02:26:03Z' progress=0.175965 cache=482.8MiB(0txo)
2019-10-18T12:15:33Z UpdateTip: new best=00000000000000000168ff4ff0e7a5ddf08c707fe928be19a25a9c90e8444dab height=433063 version=0x20000000 log2_work=85.368569 tx=160962375 date='2016-10-06T00:55:30Z' progress=0.345290 cache=4276.1MiB(35338561txo)
2019-10-18T12:23:07Z UpdateTip: new best=000000000000000000b7619d54035e1ad58c762daef103369d2c04a8641fb20a height=433064 version=0x20000000 log2_work=85.368598 tx=160962901 date='2016-10-06T00:58:57Z' progress=0.345290 cache=482.8MiB(0txo)
2019-10-18T14:52:48Z UpdateTip: new best=0000000000000000002a003b9e34d263f117db2562c80171080e31d4db61bd72 height=490207 version=0x20000000 log2_work=87.294564 tx=262721239 date='2017-10-16T23:03:01Z' progress=0.563536 cache=4276.0MiB(35487228txo)
2019-10-18T15:00:54Z UpdateTip: new best=000000000000000000564f989b2bd03905df412506a20b2a4708a52b1c230787 height=490208 version=0x20000000 log2_work=87.294603 tx=262723684 date='2017-10-16T23:11:33Z' progress=0.563539 cache=482.8MiB(0txo)
2019-10-18T16:40:48Z UpdateTip: new best=00000000000000000032015afc76ed6c6e3e6fabae75f832e8f3d6f1a79bb2bd height=526412 version=0x20000000 log2_work=88.977034 tx=320908996 date='2018-06-07T10:09:17Z' progress=0.688312 cache=4275.9MiB(35483120txo)
2019-10-18T16:47:17Z UpdateTip: new best=00000000000000000031989467b9aa7c18b75ccafd0ca3f788d5e413dbb86e4a height=526413 version=0x20000000 log2_work=88.977084 tx=320909164 date='2018-06-07T10:22:49Z' progress=0.688310 cache=482.8MiB(0txo)
2019-10-18T17:43:44Z UpdateTip: new best=000000000000000000223b7a2298fb1c6c75fb0efc28a4c56853ff4112ec6bc9 height=550000 version=0x20000000 log2_work=90.011241 tx=356588225 date='2018-11-14T02:35:41Z' progress=0.764816 cache=2880.1MiB(22413485txo)

It looks like progress is made quite linearly, so whatever is slowing it down seems to happen from start to finish. The really odd thing is the extreme CPU usage on the branch: it's 31 times higher than on master.

Could you run perf top while syncing on the branch? Maybe that gives an indication of where so much CPU is spent. Or create a flamegraph with hotspot. You can't use pyperf system tune then, though, because it severely reduces the number of allowed samples.

@GChuf (Contributor) commented Oct 21, 2019

@martinus @jamesob looking at James's first test results, it occurred to me that the "problem" must be in the CPU (or at least not in the disks): benching master on SSD vs. HDD turned out ~2x faster (as expected), whereas benching this branch took 9 h on SSD and 10 h on HDD.

James, you mentioned this in #16801:

Before anyone asks: these are dedicated benchmarking machines (specs listed below). Before each run I'm dropping all caches (sudo /sbin/swapoff -a; sudo /sbin/sysctl vm.drop_caches=3;), tuning with pyperf (sudo /usr/local/bin/pyperf system tune;), and I've ensured that the CPU governors are set to performance.

Besides what Martin suggested, can you try:

  • not running pyperf & CPU governor
  • using gitian builds above
  • using -printtoconsole=0 (This probably won't change anything, but since we have no idea what's going on ...)

@jamesob (Member) commented Oct 29, 2019

Okay, I've finally gotten around to taking perf to the weird bench-strong results. I took measurements at progress=0.20 (roughly height 387,000) by doing

$ # configure with -fno-omit-frame-pointer, tune system params to enable perf, make, etc.
$ ./src/bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=7000 -printtoconsole=0 -connect=0 &
$ while ! tail -n 10000 ~/.bitcoin/debug.log | grep progress=0.2; do sleep 3; done; kill -STOP `pidof bitcoind`
[... I wait ...]
$ kill -CONT `pidof bitcoind` && sleep 0.4 && perf record -g -F 101 -p `pidof bitcoind` -o perf.$THE_BRANCH.data sleep 90

After running the above for both this branch and master at 6a97e8a, here's perf diff perf.master.data perf.martinus.data:

# Event 'cycles:ppp'
#
# Baseline  Delta Abs  Shared Object              Symbol
# ........  .........  .........................  ......................................................................................
#
     2.41%     +3.06%  bitcoind                   [.] SipHashUint256Extra
    15.36%     +2.93%  bitcoind                   [.] secp256k1_fe_mul_inner
     4.09%     -2.21%  libc-2.28.so               [.] 0x000000000007b544
     0.26%     +1.52%  bitcoind                   [.] (anonymous namespace)::ripemd160::rol
    14.02%     -1.42%  bitcoind                   [.] secp256k1_fe_sqr_inner
     1.99%     -1.40%  bitcoind                   [.] sha256d64_avx2::(anonymous namespace)::Xor
     1.48%     -1.32%  bitcoind                   [.] operator new
               +1.12%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::const_iterator::operator!=
               +1.10%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::const_iterator::operator++
     0.00%     +1.07%  bitcoind                   [.] ser_writedata32<CSizeComputer>
               +0.99%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::is_direct
               +0.96%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::size
     0.03%     +0.92%  bitcoind                   [.] sha256d64_avx2::(anonymous namespace)::Or

I think the percentages are misleading here because when I run sudo perf top --comms=b-loadblk, this is what I see:

Samples: 236K of event 'cycles:ppp', 2000 Hz, Event count (approx.): 252595020173
Overhead  Shared Object              Symbol
   2.79%  bitcoind                   [.] SipHashUint256Extra
   1.00%  bitcoind                   [.] base_blob<256u>::GetUint64
   0.56%  bitcoind                   [.] std::__detail::_Hash_code_base<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, std::__
   0.50%  bitcoind                   [.] SaltedOutpointHasher::operator()
   0.45%  bitcoind                   [.] prevector<28u, unsigned char, unsigned int, int>::const_iterator::operator*
   0.33%  bitcoind                   [.] std::_Hashtable<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, std::allocator<std::pa
   0.26%  bitcoind                   [.] ser_writedata32<CHashWriter>
   0.23%  bitcoind                   [.] std::_Hashtable<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, std::allocator<std::pa
   0.22%  bitcoind                   [.] std::__uninitialized_default_n_1<false>::__uninit_default_n<CTxOut*, unsigned long>
   0.18%  libc-2.28.so               [.] 0x000000000015878a
   0.17%  bitcoind                   [.] operator-

In other words, as near as I can figure, SipHashUint256Extra is the bottleneck on the loadblk thread, which is where the reindexing happens. Based on the (unreliable?) percentage values, it looks like this branch spends more than twice as much time hashing for map access as master does. Why this would be (or even whether it is the actual cause of the difference) is beyond my understanding.

I can upload the raw perf data files if anyone wants, but to my understanding they aren't portable. Unfortunately perf is not very well documented, and I had trouble even generating flamegraphs on a secondary machine, so I'm not really sure of the best way to make the data files portable.

@elichai (Contributor) commented Oct 29, 2019

@jamesob twice the time or twice the percentage?
I can confirm that my own benchmarks of master from ~2-3 weeks ago showed SipHashUint256Extra very high on the list (I spent a while re-implementing it with SSE instructions, just to find it's actually slower with them, lol).

(I usually use the flags perf report -g 'graph,0.5,caller'; that way it sorts the percentages by caller instead of callee. Though you probably know more about perf than I do :) )

@martinus (Contributor, Author) commented Oct 29, 2019

@jamesob did your master build already contain #16957?

@JeremyRubin (Contributor) commented Oct 29, 2019

@martinus yes it did -- are you not rebased?

Worth pointing out that a prevector<28> on master fits perfectly in a 32-byte or 64-byte cache line.

That's no longer the case after this change (which is fine; we can increase to prevector<31> where we're sensitive to it, like std::vector<prevector<>>).

Also, now that CAmounts are unaligned, they too can be extra slow to do something with.

I doubt that's what's going on here, but worth looking at before merge.

@martinus (Contributor, Author) commented Oct 29, 2019

@martinus yes it did -- are you not rebased?

I didn't see that he linked to the master revision; both master and my branch have #16957, so it's not that.

I suspect that there is something strange going on with the threading code. Maybe some timing issue causes lots of cache misses for some reason? I don't know much about how the checkqueue and surrounding classes work, though.

@martinus (Contributor, Author) commented Oct 29, 2019

I think I might have been able to reproduce this issue. I've tried ./bitcoind -datadir=/run/media/martinus/tera/bitcoin/db -reindex-chainstate -stopatheight=150000 -printtoconsole=0 -connect=0 -dbcache=15000 on master, and on an older version, before #16957. The old version used ~20 seconds of user time, master used 332 seconds. I've done git bisect and found this revision as the first bad commit: fa3a733

I don't know why, but this commit seems to have a dramatic effect on performance.

@sipa (Member) commented Oct 29, 2019

@martinus Are you aware that blocks before the assumevalid point don't get signature/script validated?

@MarcoFalke (Member) commented Oct 30, 2019

For benchmarking purposes, it might be best to set all nodes compiled from different branches to -noassumevalid or the same block hash.

@martinus (Contributor, Author) commented Oct 30, 2019

> @martinus Are you aware that blocks before the assumevalid point don't get signature/script validated?

No, I didn't know that. It also seems that since I didn't yet have that block, -reindex-chainstate signature/script-validates everything.

It might be helpful to have a list of assumevalid blocks instead of just a single block, to prevent revalidating everything when that specific marker block is not yet there.

martinus added 3 commits Sep 25, 2019
This change reduces CCoinsMap's value_type from 96 bytes to 80 bytes by
more tightly packing its data. This is achieved by these changes:

* Refactored prevector so it uses a single byte to determine its size
  when in direct mode
* Reduce CScriptBase from 28 to 27 direct bytes
* Introduced PackableCAmount to be able to align CTxOut to 4 bytes to
  prevent padding when used as a member in Coin

This tighter packing means more data can be stored in the coinsCache
before it is full and has to be flushed to disk. In my benchmark,
-reindex-chainstate was 6% faster and used 6% less memory. The cache
could fit 14% more TXOs before it had to resize.
Removed CCoinsCacheEntry's 1-byte flags field and moved it directly into
Coin. That way we can get rid of unnecessary padding, which reduces the
memory requirement for the coinsCache. We steal 2 bits from Coin's
nHeight, so only 29 bits remain. Still, we are safe until
block 2^29-1 = 536870911.
This requires operator=(CAmount), which I've implemented here as well, using static_cast for nicer formatting.

Note that it would be cleaner to make the user-defined conversion `operator CAmount()` explicit too, but that would mean adding an explicit cast practically everywhere it is used.
See "C.164: Avoid implicit conversion operators" http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c164-avoid-implicit-conversion-operators
@martinus martinus force-pushed the martinus:2019-09-more-compact-Coin branch from 03cb535 to 22d1d94 Nov 8, 2019
@DrahtBot DrahtBot removed the Needs rebase label Nov 8, 2019
@jamesob

Member

jamesob commented Nov 12, 2019

Did some benches over the weekend (bench/master.1 vs. bench/compactcoin.1) with the same setup/reindex-to-550,000 as mentioned in the allocator PR (#16801 (comment)):

host         tag                      time       time% maxmem  cpu%  dbcache
bench-ssd-2  master.1                 9:13:54    1.00  5770.10MB 344%  4000MB
bench-ssd-2  master.1                 9:15:57    1.00  5773.23MB 343%  4000MB
bench-ssd-2  compactcoin.1            9:07:01    0.98  5899.80MB 347%  4000MB
bench-ssd-2  compactcoin.1            9:07:41    0.99  5923.72MB 346%  4000MB

bench-ssd-3  compactcoin.1            9:06:32    0.98  5959.14MB 347%  4000MB
bench-ssd-3  compactcoin.1            9:07:35    0.98  6013.83MB 346%  4000MB
bench-ssd-3  master.1                 9:17:00    1.00  5771.56MB 342%  4000MB
bench-ssd-3  master.1                 9:17:24    1.00  5773.17MB 342%  4000MB

bench-ssd-4  compactcoin.1            9:08:11    0.98  5936.75MB 346%  4000MB
bench-ssd-4  compactcoin.1            9:07:46    0.98  5916.50MB 346%  4000MB
bench-ssd-4  master.1                 9:13:56    0.99  5770.13MB 344%  4000MB
bench-ssd-4  master.1                 9:17:07    1.00  5760.78MB 342%  4000MB

bench-ssd-5  compactcoin.1            9:03:06    0.98  5943.29MB 349%  4000MB
bench-ssd-5  compactcoin.1            9:04:24    0.98  5975.77MB 348%  4000MB
bench-ssd-5  master.1                 9:11:54    1.00  5769.57MB 346%  4000MB
bench-ssd-5  master.1                 9:13:56    1.00  5775.04MB 344%  4000MB

bench-strong compactcoin.1            14:59:15   0.99  5710.87MB 546%  4000MB
bench-strong compactcoin.1            14:59:00   0.99  5701.12MB 548%  4000MB
bench-strong master.1                 15:06:04   1.00  5720.08MB 544%  4000MB
bench-strong master.1                 15:04:26   1.00  5745.98MB 545%  4000MB

I see a modest 1-2% improvement in runtime, and (oddly) the SSD machines show significantly higher memory usage (usually ~200MB) for this branch.
