Rebalancing errors leading to a crash #112

Closed
erangi opened this issue Jan 18, 2022 · 7 comments

Comments

@erangi

erangi commented Jan 18, 2022

Hi. I'm getting weird errors while working with CacheLib. The cache creation and first operations complete with no issues, but after a few seconds I start seeing a series of reports such as:

E0108 16:48:53.001791 315636 PoolRebalancer.cpp:50] Rebalancing interrupted due to exception: PoolId 80 is not a compact cache

eventually ending with a segfault at:

facebook::cachelib::MemoryPoolManager::getPoolById(signed char) const+0x20

This smells like a memory issue, but it's consistent so I don't think it's some random memory overwrite. Can you spot the root cause?

Context:

  1. This exact issue happens on 3 machines - RHEL 7, RHEL 8, and Ubuntu 20. I know the RHELs aren't supported but I was able to build and the issue is consistent.
  2. I wrap CacheLib in JNI so I can benchmark it using YCSB (which is the tool we use for benchmarking other caches). CacheLib use is pretty straightforward, see code below.
  3. If I build the JNI wrapper using a cmake config similar to the CacheLib examples, the build succeeds but some symbols are missing at launch. I therefore link CacheLib's libraries explicitly in the cmake file. I'm not sure why my case differs from the example or whether this is relevant. See cmake below.
  4. CacheLib's code is from a few days ago.

My CacheLib client code (snippets from the cachelib_api.cpp referred to by cmake):

Cache::NvmCacheConfig nvmConfig;
nvmConfig.navyConfig.setBlockSize(4096);
nvmConfig.navyConfig.setSimpleFile(cacheFile, fileCapacity, true);
nvmConfig.navyConfig.blockCache().setRegionSize(4096);
nvmConfig.navyConfig.setReaderAndWriterThreads(readerThreads, writerThreads);
nvmConfig.navyConfig.blockCache().setDataChecksum(true);
CacheConfig config;
config
    .setCacheSize(dramCapacity)
    .setCacheName("benchmarks")
    .setAccessConfig(expectedKeysInCache)
    .enableNvmCache(nvmConfig)
    .validate();
Cache* cache = new Cache(config);
long cacheSize = cache->getCacheMemoryStats().cacheSize;
defaultPool = cache->addPool("default", cacheSize);

optional<string> get(Cache* cache, const string& key)
{
    auto val = cache->find(key);
    if (!val)
        return {};
    return string(reinterpret_cast<const char*>(val->getMemory()), val->getSize());
}

bool put(Cache* cache, const string& key, const string& value)
{
    auto handle = cache->allocate(defaultPool, key, value.size());
    if (!handle)
        return false;
    std::memcpy(handle->getMemory(), value.data(), value.size());
    cache->insertOrReplace(handle);
    return true;
}

CMakeLists.txt looks as follows:

cmake_minimum_required(VERSION 3.17)
project(cachelib_api)
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_CXX_STANDARD 17)
find_package(JNI REQUIRED)
include_directories(${JNI_INCLUDE_DIRS})
include_directories(${CMAKE_INSTALL_PREFIX}/include)
set(SOURCE_FILES src/main/cpp/cachelib_api.cpp)
find_package(cachelib CONFIG REQUIRED)
add_library(cachelib_api SHARED ${SOURCE_FILES})
target_link_libraries(cachelib_api PUBLIC
  ${CMAKE_INSTALL_PREFIX}/lib/libcachelib_datatype.a
  ${CMAKE_INSTALL_PREFIX}/lib/libcachelib_allocator.a
  ${CMAKE_INSTALL_PREFIX}/lib/libcachelib_common.a
  ${CMAKE_INSTALL_PREFIX}/lib/libcachelib_shm.a
  ${CMAKE_INSTALL_PREFIX}/lib/libcachelib_navy.a
  ${CMAKE_INSTALL_PREFIX}/lib/libthriftcpp2.so
)

@sathyaphoenix
Contributor

E0108 16:48:53.001791 315636 PoolRebalancer.cpp:50] Rebalancing interrupted due to exception: PoolId 80 is not a compact cache, eventually ending with a segfault at: facebook::cachelib::MemoryPoolManager::getPoolById(signed char) const+0x20

Yeah. It sounds like a code path expects a compact cache to be present, but the pool is in fact not a compact cache. I suppose that in your setup you don't create any compact caches and have just a single default pool?

One suggestion would be to stop the pool rebalancer explicitly on startup through this API and see if the problem resurfaces as a different one.

PoolId XX is not a compact cache

This is thrown by the API here and, from my quick digging, it should not be called by any pool rebalancing code.

@erangi
Author

erangi commented Jan 19, 2022

Thanks @sathyaphoenix. I indeed don't create a compact cache, and I indeed only use a single pool (the code above is practically all the code that uses CacheLib). Stopping the pool rebalancer as you suggested allowed my runs to complete without an issue. I don't know whether it solved or just hid the real issue, but it's a step forward. What's the impact of having no pool rebalancer? Is it an acceptable setup for a single cache and a single pool? I want to make sure I'm not degrading performance with that.

@therealgymmy
Contributor

therealgymmy commented Jan 19, 2022

@erangi: running without pool rebalancing may cause memory imbalance (e.g. some allocation sizes have no memory to allocate while others have excessive free memory).
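The imbalance described above can be illustrated with a toy model. This is a minimal sketch, not CacheLib code: `ToySlabCache`, `runImbalanceScenario`, and the slab counts are all hypothetical, and the model assumes that memory is split into slabs, that each slab belongs to exactly one allocation size class, and that slabs never move between classes when no rebalancer runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (hypothetical, not CacheLib code): memory is carved into slabs,
// and once a slab is handed to a size class it stays there forever,
// because nothing rebalances slabs between classes.
struct ToySlabCache {
  std::size_t freeSlabs;                // slabs not yet owned by any class
  std::size_t itemsPerSlab;             // item slots provided by one slab
  std::vector<std::size_t> slabsOwned;  // slabs owned, per size class
  std::vector<std::size_t> itemsInUse;  // live items, per size class

  ToySlabCache(std::size_t totalSlabs, std::size_t numClasses,
               std::size_t perSlab)
      : freeSlabs(totalSlabs),
        itemsPerSlab(perSlab),
        slabsOwned(numClasses, 0),
        itemsInUse(numClasses, 0) {}

  // Returns true if an item of class `cls` fits without evicting anything.
  bool allocate(std::size_t cls) {
    assert(cls < slabsOwned.size());
    if (itemsInUse[cls] == slabsOwned[cls] * itemsPerSlab) {
      if (freeSlabs == 0) {
        return false;  // cannot borrow a slab from another class
      }
      --freeSlabs;
      ++slabsOwned[cls];
    }
    ++itemsInUse[cls];
    return true;
  }

  // Frees one item of class `cls`; the slab itself stays with the class.
  void release(std::size_t cls) {
    assert(cls < itemsInUse.size());
    assert(itemsInUse[cls] > 0);
    --itemsInUse[cls];
  }

  // Unused item slots currently stranded in class `cls`.
  std::size_t freeItemSlots(std::size_t cls) const {
    return slabsOwned[cls] * itemsPerSlab - itemsInUse[cls];
  }
};

struct ImbalanceResult {
  bool class1Allocated;         // could class 1 get memory?
  std::size_t class0FreeSlots;  // free slots stranded in class 0
};

// Class 0 takes every slab, frees all its items, then class 1 asks for memory.
ImbalanceResult runImbalanceScenario() {
  ToySlabCache cache(/*totalSlabs=*/4, /*numClasses=*/2, /*perSlab=*/10);
  for (int i = 0; i < 40; ++i) {
    cache.allocate(0);
  }
  for (int i = 0; i < 40; ++i) {
    cache.release(0);
  }
  const bool ok = cache.allocate(1);
  return {ok, cache.freeItemSlots(0)};
}
```

In this scenario class 1 cannot allocate even though class 0 holds 40 unused slots. A real cache would typically evict within the starved class rather than fail outright, but the stranded memory is the same effect.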

Can you share your cache setup and repro instructions? (Both with the rebalancing crash and with the change that made it go away.)

Looking at the error message, it smells like corruption if you just have a single pool. The message suggests that a pool with PoolId 80 exists (the 81st pool created).

Rebalancing interrupted due to exception: PoolId 80 is not a compact cache
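To make the "81st pool" reasoning concrete, here is a small sketch. It assumes pool IDs are zero-based 8-bit signed integers, which is what the `getPoolById(signed char)` signature in the crash report suggests. `ToyPoolId`, `poolOrdinal`, and `isKnownPool` are hypothetical names and not part of CacheLib.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: pool IDs are zero-based and stored in a signed 8-bit integer,
// as hinted by getPoolById(signed char) in the reported stack frame.
using ToyPoolId = std::int8_t;

// 1-based position of a pool in creation order (ID 0 is the 1st pool).
int poolOrdinal(ToyPoolId id) {
  return static_cast<int>(id) + 1;
}

// True if `id` refers to one of the `numPools` pools created so far.
bool isKnownPool(ToyPoolId id, int numPools) {
  return id >= 0 && static_cast<int>(id) < numPools;
}
```

With a single pool, only ID 0 passes `isKnownPool`, so an ID of 80 showing up in the rebalancer points at a bad value rather than at a real pool.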

@erangi
Author

erangi commented Jan 20, 2022

@therealgymmy the cache setup is in my first post, it's pretty straightforward. The only unconventional thing I do is wrap this client code in JNI so I can benchmark CacheLib using YCSB (which is Java). The extended YCSB and JNI wrapper are WIP so I'd rather not publicly share them but I can share them privately if that helps. I agree this looks like memory corruption but it's consistent over runs and machines (except the pool ID, which is often 80 but varies).

FWIW, our benchmarks use fixed key and value sizes (less realistic but simpler...), so I don't think this synthetic use case will be very challenging for a memory manager.

@therealgymmy
Contributor

@erangi: if it's a fixed key and value size, then you can disable the rebalancer. It won't do much anyway.
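Why fixed sizes leave the rebalancer little to do can be sketched as follows. The example assumes allocation sizes are rounded up to a set of size classes that grow geometrically. The functions, the bounds (64 to 4096 bytes), and the 1.25 growth factor are illustrative assumptions, not values taken from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Builds hypothetical size classes growing by `factor` from minSize to maxSize.
std::vector<std::size_t> buildToySizeClasses(std::size_t minSize,
                                             std::size_t maxSize,
                                             double factor) {
  assert(minSize > 0 && minSize <= maxSize && factor > 1.0);
  std::vector<std::size_t> classes;
  std::size_t s = minSize;
  while (s < maxSize) {
    classes.push_back(s);
    const std::size_t next = static_cast<std::size_t>(s * factor);
    s = next > s ? next : s + 1;  // always make progress
  }
  classes.push_back(maxSize);
  return classes;
}

// Index of the smallest class that fits `size`, or classes.size() if none.
std::size_t classIndexFor(const std::vector<std::size_t>& classes,
                          std::size_t size) {
  for (std::size_t i = 0; i < classes.size(); ++i) {
    if (size <= classes[i]) {
      return i;
    }
  }
  return classes.size();
}

// Number of distinct size classes touched by a workload of item sizes.
std::size_t distinctClassesUsed(const std::vector<std::size_t>& classes,
                                const std::vector<std::size_t>& itemSizes) {
  std::set<std::size_t> used;
  for (std::size_t size : itemSizes) {
    used.insert(classIndexFor(classes, size));
  }
  return used.size();
}
```

A workload in which every item has the same size touches exactly one class, so there is no second class to move memory to or from. A mixed workload spreads across several classes, which is the case where rebalancing matters.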

@erangi
Author

erangi commented Jan 22, 2022

Sounds good, thanks!

erangi closed this as completed Jan 22, 2022

@waterxjw

waterxjw commented Dec 4, 2023

The extended YCSB and JNI wrapper are WIP

@erangi I'm sorry to bother you. I would also like to benchmark CacheLib using YCSB, but I'm not familiar with YCSB.
Would you be willing to share the code or insights from your work on this integration?
Thank you.
