port folly::JemallocNodumpAllocator #4534

yiwu-arbug · 2018-10-17T21:00:16Z

Summary:
Introduce JemallocNodumpAllocator, which allow exclusion of block cache usage from core dump. It utilize custom hook of jemalloc arena, and when jemalloc arena request memory from system, the allocator use the hook to set MADV_DONTDUMP to the memory. The implementation is basically the same as folly::JemallocNodumpAllocator, except for some minor difference:

It only support jemalloc >= 5.0
When the allocator destruct, it explicitly destruct the corresponding arena via arena.<i>.destroy via mallctl.

Depending on #4502.

yiwu-arbug · 2018-10-17T21:00:31Z

cc @interwq

facebook-github-bot

yiwu-arbug has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

siying · 2018-10-17T22:16:31Z

include/rocksdb/cache_allocator.h

  // Returns the memory size of the block allocated at p. The default
  // implementation that just returns the original allocation_size is fine.
  virtual size_t UsableSize(void* /*p*/, size_t allocation_size) const {
    // default implementation just returns the allocation size
    return allocation_size;
  }
 };
+
+// Factory interface to obtain CacheAllocator instances.
+class CacheAllocatorFactory {


The name "cache allocator" is very confusing to me. Why not call it something like "block allocator" or something like that (because it allocates blocks)?

@igorcanadi introduce it in #4437. I don't think "block allocator" is more appropriate since we use it together with Cache, not BlockCache.

The allocator is used inside BlockFetcher, not cache. The class's function has nothing to do with cache at all. It's just that contained by a cache and users can get it from the cache object. Also, "cache allocator" can be easily misunderstood to be an allocator of caches. Actually it allocates memory. I guess MemoryAllocator is a bad idea as we have another allocator interface inherited by arena. How about MemoryAllocatorWrapper?

The fact that we only use it in BlockFetcher doesn't affect the allocator abstraction I guess. Just like the Cache didn't named BlockCache and we later reuse it as also as table cache. The allocator interface can be used later to allocate other memory. I don't see problem with CacheAllocator but I'm fine with MemoryAllocator. I don't think the class needs to be a wrapper though.

After an offline discussion with @lth, I think something close to MemoryAllocator might be more appropriate. For example, we might want to extend the no dump option to memtable, by reuse the same code and interface. @siying If MemoryAllocator is not a good idea, do you have any suggestions? :)

MemoryAllocator feels great to me.

@siying I'll make that change in a separate PR.

Actually I'll make the name change in #4502.

siying · 2018-10-17T22:17:13Z

include/rocksdb/cache_allocator.h

+  virtual const char* Name() const = 0;
+
+  virtual Status NewCacheAllocator(
+      std::unique_ptr<CacheAllocator>* cache_allocator) = 0;


Do we need to generate new cache allocator? Another word, do we need this factory at all? Can we just plug in an allocator to allocate blocks?

I come up with the factory class because I want to have ShardedCache handle allocator sharding. In a followup patch, I'll update ShardedCache to shard allocator by cache key. But yeah, if we want to shard in another way (round-robin?), or don't need sharding at all, the factory class is not needed.

What's your thoughts on the fragmentation in this per shard allocator setting?

According to @interwq, Jemalloc by default initiate 4 * cpu_cores arenas and assign static arena id to threads. Assume we run on server with 56 cores, that's 224 arenas. Compare with we have by default 64 arenas for block cache, arguably we should have less fragmentation.

For users run multiple DB instance each with dedicated block cache, they can (and should) use a shared CacheAllocator for those block caches, if they wish to get the no dump option.

Regarding the fact that with this approach, we are separating allocation for normal c++ objects and block contents. I'm arguing that normal c++ object tend to be smaller than block size. By separating allocation for them, arguably we should see fragmentation reduce (instead of increase) since mixing allocation of different size class tend to increase fragmentation.

If each allocator class contains configurable number of arenas, it sounds great to me, it is much better than the previous idea of arena per block cache shard.

@yiwu-arbug I'm not sure about the assumption that other allocations are all smaller than block size. RocksDB is a library, we don't know how users allocate memory. Even looking at RocksDB, large allcoation outside blocks are just as common, just to name some: malloc for reading uncompressed blocks, exponential readahead in iterators, memtable arena, compaction writeable file buffer, table builder's buffering of key hashes, indexes and bloom filter being built, etc.

Make sense. I'll leave the consideration as inline comment in the header file. I don't think we will turn the no dump options on by default. As with any feature, user bother to do so should evaluate if they are willing to make the tradeoff. Of cause we'll be careful with the MyRocks use case.

siying · 2018-10-17T22:18:20Z

include/rocksdb/cache_allocator.h

@@ -16,14 +22,35 @@ class CacheAllocator {
  // Name of the cache allocator, printed in the log
  virtual const char* Name() const = 0;

-  // Allocate a block of at least size size
+  // Allocate a block of at least size. Has to be thread-safe.
  virtual void* Allocate(size_t size) = 0;


Is there any reason it is "void*" rather than "char*"?

Again it was introduced in #4437. But since even malloc return void*, I think it is more natural. It means it don't imply typing. Again I don't think the CacaheAllocator interface is specific to block cache.

interwq · 2018-10-23T18:40:26Z

cache/jemalloc_nodump_allocator.cc

+
+JemallocNodumpAllocator::~JemallocNodumpAllocator() {
+  assert(arena_index_ != 0);
+  std::string key = "arena." + ToString(arena_index_) + ".destroy";


Do we actually need this functionality? Destroying an arena will also release all the memory managed by the arena, which should be done carefully. If the allocator is only created once and never released, it's fine not destroying it at shutdown.

On one hand, it is pretty safe to release all memory at this point. When the allocator destruct, it means the Cache object referencing the allocator is destroyed, which means the cache is not used any more.

On the other hand, the allocator is not necessary only destroy on shutdown. Our API allow user to open a RocksDB instance then close it and then continue with other stuff. Leaking the memory usage would be a bad idea.

I'll add a comment state the allocator should not be destruct while there's allocated memory in use.

interwq · 2018-10-23T18:54:25Z

cache/jemalloc_nodump_allocator.cc

+  assert(success || original_alloc == expected);
+
+  // Set the custom hook.
+  std::unique_ptr<extent_hooks_t> new_hooks(new extent_hooks_t(*hooks));


It looks we expect there to be only 1 hooks globally? If so a static one (maybe guard this part along with the compare_exchange above) might be better, since the lifetime management for hooks can get tricky.

I would like to have separate hooks for each allocator instance, but I didn't find a way to pass non-static member method to mallocx as new hook. Any suggestion?

Not sure I totally follow. Having a extent_hooks_t member in the allocator class should work? Again we will need to ensure the hooks_t is destructed after destroying the arena.

Sorry, I didn't explain the problem well. Since extent_hooks_t is a typedef of function pointer, I can only use a static method (JemallocNodumpAllocator::Alloc) as the hook. From the static method it is not able to access non-static members. That's why I'm forced to make original_alloc_ static. But if there's a way to make it non-static, I'll try.

And yes I'm holding the extent_hooks_t as non-static member in the class and it will get destruct after arena destruct, by destructor order. So it should work?

facebook-github-bot · 2018-10-26T06:37:24Z

@yiwu-arbug has updated the pull request. Re-import the pull request

facebook-github-bot · 2018-10-26T06:37:25Z

@yiwu-arbug has updated the pull request. Re-import the pull request

facebook-github-bot

yiwu-arbug has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

interwq

LGTM!

facebook-github-bot · 2018-10-26T22:59:24Z

@yiwu-arbug has updated the pull request. Re-import the pull request

facebook-github-bot

yiwu-arbug has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

yiwu-arbug · 2018-10-27T00:04:52Z

@anand1976 I'm landing this one. If you have any comment, I'll address them in subsequent patches.

yiwu-arbug requested a review from anand1976 October 17, 2018 21:00

facebook-github-bot added the CLA Signed label Oct 17, 2018

facebook-github-bot reviewed Oct 17, 2018

View reviewed changes

siying reviewed Oct 17, 2018

View reviewed changes

yiwu-arbug changed the title ~~Introduce JemallocNodumpAllocator~~ port folly::JemallocNodumpAllocator Oct 19, 2018

interwq reviewed Oct 23, 2018

View reviewed changes

facebook-github-bot reviewed Oct 26, 2018

View reviewed changes

interwq approved these changes Oct 26, 2018

View reviewed changes

Introduce JemallocNodumpAllocator

99f3860

yiwu-arbug force-pushed the dump_jemalloc branch from 24cb318 to 99f3860 Compare October 26, 2018 22:59

facebook-github-bot reviewed Oct 26, 2018

View reviewed changes

facebook-github-bot closed this in 5f5fdda Oct 27, 2018

yiwu-arbug deleted the dump_jemalloc branch October 27, 2018 19:32

yiwu-arbug mentioned this pull request Nov 12, 2018

Update history and version for future 5.18 release. #4669

Closed

port folly::JemallocNodumpAllocator #4534

port folly::JemallocNodumpAllocator #4534

Conversation

yiwu-arbug commented Oct 17, 2018 • edited Loading

yiwu-arbug commented Oct 17, 2018

facebook-github-bot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

facebook-github-bot commented Oct 26, 2018

facebook-github-bot commented Oct 26, 2018

facebook-github-bot left a comment

Choose a reason for hiding this comment

interwq left a comment

Choose a reason for hiding this comment

facebook-github-bot commented Oct 26, 2018

facebook-github-bot left a comment

Choose a reason for hiding this comment

yiwu-arbug commented Oct 27, 2018

yiwu-arbug commented Oct 17, 2018 •

edited

Loading