Increase LevelDB max_open_files #12495

Merged
merged 1 commit into bitcoin:master on Mar 29, 2018

Conversation

8 participants
@eklitzke
Member

eklitzke commented Feb 20, 2018

Currently we set max_open_files = 64 on all architectures due to concerns about file descriptor exhaustion. This is extremely expensive due to how LevelDB is designed.

When a LevelDB file handle is opened, a bloom filter and block index are decoded, and some CRCs are checked. Bloom filters and block indexes in open table handles can be checked purely in memory. This means that when doing a key lookup, if a given table file may contain a given key, all of the lookup operations can happen completely in RAM until the block itself is fetched. In the common case fetching the block is one disk seek, because the block index stores its physical offset. This is the ideal case, and what we want to happen as often as possible.

If a table file handle is not open in the table cache, then in addition to the regular system calls to open the file, the block index and bloom filter need to be decoded before they can be checked. This is expensive and is something we want to avoid.

The current setting of 64 file handles means that on a synced node, only about 4% of key lookups can be satisfied by table file handles that are actually open and in memory.

The original concerns about file descriptor exhaustion are unwarranted on most systems because:

  • On 64-bit POSIX hosts LevelDB will open up to 1000 file descriptors using mmap(), and it does not retain an open file descriptor for such files.
  • On Windows non-socket files do not interfere with the main network select() loop, so the same fd exhaustion issues do not apply there.

This change keeps the default max_open_files value (which is 1000) on all systems except 32-bit POSIX hosts (which do not use mmap()). Open file handles use about 20 KB of memory (for the block index), so the extra file handles do not cause much memory overhead. At most 1000 will be open, and a fully synced node right now has about 1500 such files.

Profile of loadblk thread before changes: https://monad.io/maxopenfiles-master.svg
Profile of loadblk thread after changes: https://monad.io/maxopenfiles-increase.svg

@@ -71,14 +71,41 @@ class CBitcoinLevelDBLogger : public leveldb::Logger {
}
};
constexpr int PermissibleMaxOpenFiles() {


@laanwj

laanwj Feb 22, 2018

Member

Add static?


@eklitzke

eklitzke Feb 22, 2018

Member

The static specifier isn't necessary because constexpr functions are evaluated at compile time. I double checked that nm ./src/bitcoind | c++filt | grep Permissible produces no output.

@@ -71,14 +71,41 @@ class CBitcoinLevelDBLogger : public leveldb::Logger {
}
};
constexpr int PermissibleMaxOpenFiles() {
#ifdef _POSIX_C_SOURCE


@laanwj

laanwj Feb 22, 2018

Member

I don't think this does what you expect it to. As I understand it, confirmed by the gcc page about it, _POSIX_C_SOURCE is supposed to be defined explicitly at the top of the compilation unit to request certain POSIX C library features. It is not defined by the compiler. So it can't be used to detect whether the target platform is POSIX.

In other places, we've just used #if[n]def WIN32, as WIN32 is the only non-POSIX platform supported in this project.


@eklitzke

eklitzke Feb 22, 2018

Member

You're absolutely right, I will fix this.

@eklitzke


Member

eklitzke commented Feb 22, 2018

I measured the impact of this change, you can see a graph here: https://monad.io/max_open_files.png

Method for creating this graph:

  • Ran bitcoind -reindex-chainstate on a full node that had already been synced on mainnet
  • Instance is a GCP n1-standard-1 host which has one vCPU pinned to host CPU, 4 GB of RAM, and very slow virtualized disks
@laanwj


Member

laanwj commented Feb 22, 2018

I measured the impact of this change, you can see a graph here: https://monad.io/max_open_files.png

Nice!

@eklitzke


Member

eklitzke commented Feb 22, 2018

The reason for this speedup isn't super obvious, and I think merits some explanation. I've spent a lot of time the last three days reading the LevelDB source code and think I understand now. It actually has nothing to do with reducing the number of open/close/mmap/munmap syscalls, as I had originally thought.

The LevelDB terminology for a .ldb file is a "table", and the tables are composed of 4 KB blocks. Each table contains a block index, which is an index that can be used to determine which block a given key will be in. The block index for a given table file is loaded into memory when that table is opened.

Increasing the max_open_files here forces Bitcoin to keep much more of the block index data actually loaded into memory. I think it might be worth actually increasing the internal limits of LevelDB on the number of mmap files beyond 1000 on 64-bit Unix platforms, as that would result in the entire block index being loaded into memory. However I think we should start with this, and then consider increasing the limits once I've done more investigation into what the overhead is.

@donaloconnor


Contributor

donaloconnor commented Feb 22, 2018

@eklitzke Cheers for referencing #12123. Replying here to get subscribed.

I am going to do more tests on a Windows machine at the weekend with your changes here and report back my results.

I am pretty familiar with Windows internals so I'll do further investigations to see if we can have similar changes on Windows.

@donaloconnor


Contributor

donaloconnor commented Feb 24, 2018

I had a closer look at Windows and the way we use the win32 apis in LevelDB.

Observations:

  1. Win32RandomAccessFile does not use memory mapped files.
  2. File opening and reading are done at the Win32 API level (CreateFile, ReadFile). There is no limit to the number of files that can be opened at this level (Windows uses HANDLEs rather than fds). Limits exist at the C runtime library level, but that is irrelevant here.

I am confident that this increase to 1000 files will not affect Windows. I have already tested this but will do more tests tomorrow.

Win32RandomAccessFile is as basic as it gets. It could be further optimized by using memory-mapped files (similar to the POSIX version); given the random-access pattern, memory-mapped files should beat regular file reads thanks to kernel-level caching.
I will modify this and see whether any significant performance improvement results; I reckon it will.

As I mentioned in #12123, we also save CPU by avoiding the CRC checksum check that's performed on each file open.

@eklitzke I hope you don't mind me replying on here. I'll continue to do more tests on the Windows sides of things. Any improvement to the IBD is a big win!

Cheers,
Donal

@eklitzke


Member

eklitzke commented Feb 26, 2018

I have a branch that measures how much extra memory this uses. On my node, the average block index size is 19797 bytes for the chainstate database. The bloom filters (which are also loaded in memory) are 6 bytes or 7 bytes depending on the file. Thus a reasonable approximation is 20 KB of data per chainstate .ldb file. This means on master the indexes use about 1 MB of memory, and with this branch they'll use about 20 MB of memory. If we loaded all of the indexes into memory (not possible right now due to the hard-coded 1000 file limit, but possible if we patched leveldb) that would be about 33 MB of memory.

@donaloconnor


Contributor

donaloconnor commented Mar 1, 2018

I ran bitcoind -reindex-chainstate on my Windows 10 PC with the full node synced.

First I ran using the default: max_open_files=64 (Blue series)
Next I ran using max_open_files=1000 (Red series).

The results show a substantial difference in sync time.

[Chart: reindex progress over time; max_open_files=64 (blue series) vs max_open_files=1000 (red series)]

Time to reach 100% with max_open_files=64 was: 12h 15m
Time to reach 100% with max_open_files=1000 was: 5h 40m

That's an over 50% reduction in the time to reindex. I didn't expect such an improvement.

@achow101


Member

achow101 commented Mar 1, 2018

That's a pretty big speedup! Does this only work on 64 bit systems? Would it be possible to do it for all systems?

@eklitzke


Member

eklitzke commented Mar 2, 2018

At least for POSIX systems, the issue is that LevelDB won't use mmap() on 32-bit systems because the virtual address space isn't large enough: the chainstate database is larger than 4 GB. On such hosts, each open file handle really does use up a file descriptor, which can obviously lead to problems in a select(2) loop if the fd count exceeds 1024. I haven't posted this data yet, but I also have data showing a large speedup on POSIX hosts with disk characteristics intermediate between those I posted earlier. That said, I really cannot imagine many users are running Bitcoin on 32-bit hosts, and it's easy enough to have conditional logic for that case anyway.

I've been working on other LevelDB improvements that will result in even more significant improvements than what I've posted here, but they're more invasive changes. So I'm interested in getting this change (or a variant of it) merged before I try to go deeper into the UTXO system.

@donaloconnor Can you comment on whether select() loops have the same 1024 fd limit on Windows?

@luke-jr


Member

luke-jr commented Mar 2, 2018

Windows by default has a limit of 64 descriptors in select(), but:

  1. This is a limit on the number of descriptors added to the fd_set, not a limit on total descriptors in the process.
  2. We override FD_SETSIZE to 1024 in compat.h.

(POSIX is supposed to allow changing FD_SETSIZE too, but this is broken in some versions of glibc)

@donaloconnor


Contributor

donaloconnor commented Mar 2, 2018

Does this only work on 64 bit systems

The LevelDB's Win32 code does not use memory mapped files (Unlike the POSIX version) so this change should improve both 32/64bit versions of Windows.

@donaloconnor Can you comment on whether select() loops have the same 1024 fd limit on Windows?

What @luke-jr said above ^. Again since the LevelDB code for Win32 is using the Win32 ReadFile/CreateFile APIs, it should not affect any POSIX functions/fd descriptors.

I'm running with a limit of 64 open files again to see if I get similar results (12-hour sync time). I'd be really happy if someone else could also test this on Windows. I find the 2x speedup hard to believe, but it does seem to be the case.

@donaloconnor


Contributor

donaloconnor commented Mar 2, 2018

Interesting stats from observing Disk I/O reported by Task Manager:

max_open_files=64 - seems to hover around 2 - 6 MB/s with bursts of 40MB/s +
max_open_files=1000 - seemed to be consistently above 60 MB/s whenever I looked at it

This backs up the theory that since the files remain open, the file cache on Windows remains warm.

*Not scientific, just observations. Can do some real disk I/O profiling later.

@eklitzke


Member

eklitzke commented Mar 3, 2018

I don't think it's related to the disk cache; it's due to the increased availability of the deserialized block indexes (and bloom filters) in memory. At least on Unix, files being open is unrelated to the availability of their contents in the page cache. In my measurements it takes four disk seeks just to open each .ldb file (which is related to parsing out the file header, bloom filter, and the block index), and the information is stored on disk in a compressed form that has to be deserialized. But once the file is open and in the table cache, all that data is right there, ready to be accessed. I started writing up a blog post about this with details about how the table format works, how the LevelDB caches interact, and what's involved during key lookup. I'll try to finish that up in case people want to learn more about the details.

I'll update this PR to also increase the limit on Windows. I am interested in hearing if people would like to see any other changes. For instance, on IRC Greg asked whether the LevelDB API offers a way to tell, for a given platform, whether the implementation will use up real file descriptors. It doesn't, but I could patch LevelDB to expose that if that's something reviewers want to see.

@donaloconnor


Contributor

donaloconnor commented Mar 3, 2018

Yes, I'm also now inclined to believe it's probably unrelated to the disk cache. Actually, I've seen higher CPU utilization in the 1000-limit case (mine was about 60-80% CPU usage, whereas with the 64 limit it was mostly around 30%).

Anyway, I just finished the tests again a few minutes ago:

max_open_files=64 -reindex-chainstate time: 13h 30m
max_open_files=1000 -reindex-chainstate time: 6h 30m

Another 2x improvement.

I started writing up a blog post about this with details about how the table format works, how the LevelDB caches interact, and what's involved during key lookup. I'll try to finish that up in case people want to learn more about the details.

Looking forward to it 👍

@eklitzke


Member

eklitzke commented Mar 3, 2018

Can you let me know if the comment looks correct regarding Windows behavior?

// See PR #12495 for further discussion.
#ifndef WIN32
if (sizeof(void*) < 8) {
    options->max_open_files = 64;
}
#endif


@luke-jr

luke-jr Mar 3, 2018

Member

Prefer, to be safe:

if (sizeof(void*) < 8 || leveldb::kMajorVersion != 1 || leveldb::kMinorVersion < 3 || leveldb::kMinorVersion > 20) {
    options->max_open_files = 64;
}
static leveldb::Options GetOptions(size_t nCacheSize)
{
leveldb::Options options;
options.block_cache = leveldb::NewLRUCache(nCacheSize / 2);
options.write_buffer_size = nCacheSize / 4; // up to two write buffers may be held in memory simultaneously
options.filter_policy = leveldb::NewBloomFilterPolicy(10);
options.compression = leveldb::kNoCompression;
options.max_open_files = 64;


@donaloconnor

donaloconnor Mar 3, 2018

Contributor

max_open_files is not set explicitly before calling SetMaxOpenFiles. It should probably be set explicitly to 1000 in SetMaxOpenFiles, before the #ifndef.


@luke-jr

luke-jr Mar 3, 2018

Member

The constructor for leveldb::Options sets it to 1000.


@donaloconnor

donaloconnor Mar 3, 2018

Contributor

I didn't scroll far enough down :) - you're right.

@donaloconnor


Contributor

donaloconnor commented Mar 3, 2018

Can you let me know if the comment looks correct regarding Windows behavior?

Comment LGTM!

@luke-jr


Member

luke-jr commented Mar 3, 2018

Might be good to log somewhere which limit is being used...

#ifndef WIN32
    if (sizeof(void*) < 8 || leveldb::kMajorVersion != 1 || leveldb::kMinorVersion < 3 || leveldb::kMinorVersion > 20) {
        options->max_open_files = 64;
        LogPrint(BCLog::LEVELDB, "LevelDB max_open_files=%d\n", options->max_open_files);
    } else {
        LogPrint(BCLog::LEVELDB, "LevelDB max_open_files=%d (default)\n", options->max_open_files);
    }
#endif
@eklitzke


Member

eklitzke commented Mar 3, 2018

I've added the log statement. I also added a static_assert() on the exact LevelDB version. That way in the future if someone upgrades LevelDB, they will be forced to check that the POSIX RandomAccessFile implementation hasn't changed in a way that will cause fd exhaustion issues. What do you think of this approach?

@luke-jr


Member

luke-jr commented Mar 3, 2018

It's a booby trap that everyone using system LevelDB libraries will need to patch out. But maybe a good idea anyway, to reinforce the danger of using other versions. It should probably at least tolerate older versions back to 1.3 (which added mmap support), though.

@eklitzke


Member

eklitzke commented Mar 3, 2018

Booby trap is intentional. I couldn't figure out how to make git log -GkMinorVersion src/leveldb work correctly with merge commits/subtrees, so I did this manually. From what I can tell the update history for LevelDB in the Bitcoin source tree is:

  • 2012-06-25 initial import of 1.5
  • 2012-12-12 upgrade to 1.7
  • 2013-08-24 upgrade to 1.13
  • 2013-12-12 upgrade to 1.15
  • 2014-05-09 upgrade to 1.17
  • 2014-10-16 upgrade to 1.18
  • 2016-12-01 upgrade to 1.19
  • 2017-06-08 upgrade to 1.20

Given that we only do the update once or twice a year, I think it's reasonable to keep it in there and leave it as a manual step for the person doing the upgrade. On the other hand, this might be overly paranoid because:

  • It's unlikely that upstream would stop using mmap, given that it's the obvious way to do things for speed reasons
  • Since the new logic is to use the default value in most cases, it seems unlikely that upstream would make a change that causes the default LevelDB options to use drastically different resources like fds
LogPrint(BCLog::LEVELDB, "LevelDB max_open_files=%d\n", options->max_open_files);
}
#endif
LogPrint(BCLog::LEVELDB, "LevelDB max_open_files=%d (default)\n", options->max_open_files);


@luke-jr

luke-jr Mar 3, 2018

Member

This will log twice!

@luke-jr


Member

luke-jr commented Mar 3, 2018

Considering that LevelDB upgrades already need to be checked for consensus compatibility, I guess it doesn't make sense to be stricter in this regard.

Perhaps a checklist for things to check on updates should be added to doc/developer-notes.md instead.

@luke-jr


Member

luke-jr commented Mar 3, 2018

    const char *default_indicator = " (default)";
#ifndef WIN32
    if (sizeof(void*) < 8) {
        options->max_open_files = 64;
        default_indicator = "";
    }
#endif
    LogPrint(BCLog::LEVELDB, "LevelDB max_open_files=%d%s\n", options->max_open_files, default_indicator);
@laanwj


Member

laanwj commented Mar 5, 2018

Using the system leveldb is generally discouraged, because we have our own fork of leveldb and we only test with that one. I'd say if you really, really want to use it, patching out a booby trap is the least of your worries.

Good idea to log the value, but yes it will get logged twice in some cases now.

@donaloconnor


Contributor

donaloconnor commented Mar 5, 2018

This seems a bit overly complicated imo. We could just log both values (set and default):

int default_max = options->max_open_files;

#ifndef WIN32
    if (sizeof(void*) < 8) {
        options->max_open_files = 64;
    }
#endif

LogPrint(BCLog::LEVELDB, "LevelDB max_open_files=%d (Default=%d)\n", options->max_open_files, default_max);

Also, I feel the static_assert seems a bit intrusive. As eklitzke pointed out, it's unlikely the use of mmap will change, and we maintain our own fork of leveldb anyway. A note in doc/developer-notes.md would be sufficient.

first introduced). Therefore when upgrading LevelDB you should review the
upstream changes to check for issues affecting consensus compatibility. For
example, an upstream change that affects CRC handling could affect consensus for
Bitcoin nodes with `.ldb` files that contain CRC errors.


@luke-jr

luke-jr Mar 6, 2018

Member

Eh, that's not what I meant... In this example:

You set database key ABC to DEF; you then get database key ABC, but due to a bug in LevelDB, it answers that the key doesn't exist.

If you now fix this LevelDB bug, you have broken consensus because all the other nodes expect key ABC to be non-existent.

@ryanofsky

Conditional utACK 32372e9 as long as you make Luke's suggested documentation fix from #12495 (comment), and clarify that only leveldb bugs returning incorrect key/value data would be a cause of concern for consensus.

references to the table file descriptors. If you are upgrading LevelDB, you must
sanity check the changes to make sure that this assumption remains valid.
In addition to reviewing the upstream changes in `env_posix.cc`, you can use `lsof` to


@ryanofsky

ryanofsky Mar 6, 2018

Contributor

Could also suggest checking the log for warnings like:

LogPrintf("Cannot create connection: non-selectable socket created (fd >= FD_SETSIZE ?)\n");

@jimpo


Contributor

jimpo commented Mar 9, 2018

ACK 32372e9

Tested on Linux x86_64. Reduced initial sync time on an EC2 m5.xlarge from 8:31:01 to 6:36:25.

@eklitzke


Member

eklitzke commented Mar 9, 2018

Updated and rebased to reword the consensus compatibility issue that @luke-jr noted.

@luke-jr

utACK

@donaloconnor


Contributor

donaloconnor commented Mar 10, 2018

Can we change the title to include Windows? Just for future reference.

@eklitzke eklitzke changed the title from Increase LevelDB max_open_files on 64-bit POSIX systems to Increase LevelDB max_open_files On Most Architectures Mar 10, 2018

@eklitzke eklitzke changed the title from Increase LevelDB max_open_files On Most Architectures to Increase LevelDB max_open_files Mar 10, 2018

@eklitzke


Member

eklitzke commented Mar 10, 2018

Updated the title for this PR, and rewrote the pull request description to reflect the discoveries/changes made since I opened the PR.

@ryanofsky


Contributor

ryanofsky commented Mar 12, 2018

utACK 21e2144. Only change since last review is in developer notes.

@eklitzke


Member

eklitzke commented Mar 15, 2018

@laanwj Can you take another look at this?

@MeshCollider MeshCollider requested a review from laanwj Mar 15, 2018

@eklitzke


Member

eklitzke commented Mar 22, 2018

@donaloconnor As you acked this, can you change your review status? It still shows up in GitHub as "requested changes".

@laanwj Sorry to ping you on this again, but you requested changes on this earlier and I think this is ready to go (unless you have further objections). I'm trying to not stack PRs in the dbwrapper code, and this is preventing me from submitting further changes for review.

Increase LevelDB max_open_files unless on 32-bit Unix.
This change significantly increases IBD performance by increasing the
amount of the UTXO index that can remain in memory. To ensure this
doesn't cause problems in the future, a static_assert on the LevelDB
version has been added, which must be updated by anyone upgrading
LevelDB.
@eklitzke


Member

eklitzke commented Mar 29, 2018

Rebased with master, as other dbwrapper changes have been merged since my last update.

@laanwj

laanwj approved these changes Mar 29, 2018

@laanwj


Member

laanwj commented Mar 29, 2018

utACK ccedbaf - with the comment that I think relying on undocumented behavior of leveldb is risky. This is only remotely acceptable because we have our own leveldb tree (https://github.com/bitcoin-core/leveldb), and don't blindly merge changes from upstream. I also think you have properly documented this.

@laanwj laanwj merged commit ccedbaf into bitcoin:master Mar 29, 2018


laanwj added a commit that referenced this pull request Mar 29, 2018

Merge #12495: Increase LevelDB max_open_files
ccedbaf Increase LevelDB max_open_files unless on 32-bit Unix. (Evan Klitzke)

Pull request description:

  Currently we set `max_open_files = 64` on all architectures due to concerns about file descriptor exhaustion. This is extremely expensive due to how LevelDB is designed.

  When a LevelDB file handle is opened, a bloom filter and block index are decoded, and some CRCs are checked. Bloom filters and block indexes in open table handles can be checked purely in memory. This means that when doing a key lookup, if a given table file may contain a given key, all of the lookup operations can happen completely in RAM until the block itself is fetched. In the common case fetching the block is one disk seek, because the block index stores its physical offset. This is the ideal case, and what we want to happen as often as possible.

  If a table file handle is not open in the table cache, then in addition to the regular system calls to open the file, the block index and bloom filter need to be decoded before they can be checked. This is expensive and is something we want to avoid.

  The current setting of 64 file handles means that on a synced node, only about 4% of key lookups can be satisfied by table file handles that are actually open and in memory.

  The original concerns about file descriptor exhaustion are unwarranted on most systems because:
   * On 64-bit POSIX hosts LevelDB will open up to 1000 file descriptors using `mmap()`, and it does not retain an open file descriptor for such files.
   * On Windows non-socket files do not interfere with the main network `select()` loop, so the same fd exhaustion issues do not apply there.

  This change keeps the default `max_open_files` value (which is 1000) on all systems except 32-bit POSIX hosts (which do not use `mmap()`). Open file handles use about 20 KB of memory (for the block index), so the extra file handles do not cause much memory overhead. At most 1000 will be open, and a fully synced node right now has about 1500 such files.

  Profile of `loadblk` thread before changes: https://monad.io/maxopenfiles-master.svg
  Profile of `loadblk` thread after changes: https://monad.io/maxopenfiles-increase.svg

Tree-SHA512: de54f77d57e9f8999eaf8d12592aab5b02f5877be8fa727a1f42cf02da2693ce25846445eb19eb138ce4e5045d1c65e14054df72faf3ff32c7655c9cfadd27a9

sidhujag added a commit to syscoin/syscoin that referenced this pull request Jun 5, 2018

underdarkskies referenced this pull request in underdarkskies/Ravencoin Jul 19, 2018


underdarkskies referenced this pull request in underdarkskies/Ravencoin Aug 1, 2018


underdarkskies referenced this pull request in underdarkskies/Ravencoin Aug 1, 2018
