invalid allocationClassSizeFactor #30

Closed
tangliisu opened this issue Aug 18, 2021 · 9 comments

Comments

@tangliisu
Contributor

tangliisu commented Aug 18, 2021

When I test cachelib in a test cluster, the following error is thrown:

    E0818 21:25:01.105298 14 cachelib_cache_handler.cpp:54] invalid factor 6.93298464824273e-310

It is thrown by the check in generateAllocSizes in MemoryAllocator.cpp:

    if (factor <= 1.0) {
      throw std::invalid_argument(folly::sformat("invalid factor {}", factor));
    }

This is how I create the cachelib instance:

    Cache::Config config;
    config.setCacheSize(cache_size).setCacheName(cache_name).validate();
    return std::make_unique<Cache>(config);

where cache_size = 76GB.

I didn't set allocationClassSizeFactor anywhere, and I think it defaults to 1.25? I am not sure what this config option (allocationClassSizeFactor) is or why it ends up as (nearly) 0.
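
A side note on the reported value: 6.93298464824273e-310 is neither a plain zero nor a rounded 1.25. It is a subnormal double, and subnormals have an all-zero exponent field, so the raw 64 bits read as a small integer. The sketch below prints those bits so they can be compared with the addresses in a stack trace. The helper names `bitsOf`, `isSubnormal`, and `describe` are made up for illustration, and the code assumes 64-bit IEEE-754 doubles. On such a platform the upper 32 bits should come out as 0x7fa0, the same prefix as the library addresses in the stack trace from this report. That only hints that the field may have held unrelated bytes; it does not say where they came from.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Copy the 8 bytes of a double into an integer (memcpy avoids aliasing issues).
std::uint64_t bitsOf(double value) {
  static_assert(sizeof(double) == sizeof(std::uint64_t),
                "sketch assumes 64-bit doubles");
  std::uint64_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// True for values whose exponent field is zero but whose mantissa is not.
bool isSubnormal(double value) {
  return std::fpclassify(value) == FP_SUBNORMAL;
}

// One-line summary suitable for a log message.
std::string describe(double value) {
  char buf[96];
  std::snprintf(buf, sizeof(buf), "%.15g subnormal=%d bits=0x%016llx", value,
                isSubnormal(value) ? 1 : 0,
                static_cast<unsigned long long>(bitsOf(value)));
  return std::string(buf);
}
```

Logging `describe(6.93298464824273e-310)` next to `describe(1.25)` makes the difference between a garbage value and the documented default easy to spot.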

@tangliisu
Contributor Author

tangliisu commented Aug 18, 2021

I ran some local tests and cachelib works well. It only crashes when I go to the test cluster.

@sathyaphoenix
Contributor

That's strange. The default in the config is 1.25 here (https://github.com/facebookincubator/CacheLib/blob/main/cachelib/allocator/CacheAllocatorConfig.h#L569).

Can you share the stack trace of the exception, and also log config.allocationClassSizeFactor before creating the cache through make_unique?

Does it always crash, and does the error also happen when you manually set the factor through setDefaultAllocSizes()?
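
The logging step suggested above could look roughly like the sketch below. It deliberately does not use the real CacheLib types so that it compiles on its own: `FakeCacheConfig`, `checkFactor`, and `logAndCheck` are hypothetical stand-ins, with only the one field from this thread modelled. The `factor <= 1.0` test mirrors the check quoted from MemoryAllocator.cpp; the extra `std::isfinite` test is an addition of this sketch, not something the library is known to do.

```cpp
#include <cmath>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the real config type (NOT the CacheLib class).
struct FakeCacheConfig {
  double allocationClassSizeFactor{1.25};  // default mentioned in the thread
};

// Reject factors that the allocator would reject, plus NaN/inf (sketch-only).
void checkFactor(double factor) {
  if (!std::isfinite(factor) || factor <= 1.0) {
    std::ostringstream msg;
    msg << "invalid factor " << factor;
    throw std::invalid_argument(msg.str());
  }
}

// Log the value first, then validate it, before the cache is constructed.
void logAndCheck(const FakeCacheConfig& config, std::ostream& log) {
  log << "allocationClassSizeFactor in config is: "
      << config.allocationClassSizeFactor << '\n';
  checkFactor(config.allocationClassSizeFactor);
}
```

In the real code the same two steps would sit right before the `std::make_unique<Cache>(config)` call, so a bad value is recorded in the log even if construction then terminates the process.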

@tangliisu
Contributor Author

tangliisu commented Aug 19, 2021

Hi, I will do a private build tomorrow to check config.allocationClassSizeFactor and to see whether it still crashes when I set the factor manually. This is the stack trace of the exception:

E0818 21:25:01.105298    14 cachelib_cache_handler.cpp:54] invalid factor 6.93298464824273e-310
E0818 21:25:01.105343    14 ExceptionTracer.cpp:210] terminate() called, exception stack follows
E0818 21:25:01.105351    14 ExceptionTracer.cpp:212] Exception type: std::invalid_argument (14 frames)
    @ 00007fa0a90be092 __cxa_throw
                       /opt/folly/folly/experimental/exception_tracer/ExceptionTracerLib.cpp:58
    @ 0000564ab86e010c facebook::cachelib::MemoryAllocator::generateAllocSizes(double, unsigned int, unsigned int, bool) [clone .cold.442]
                       /opt/cachelib/cachelib/allocator/memory/MemoryAllocator.cpp:187
    @ 0000564abf74dede facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait>::getAllocatorConfig(facebook::cachelib::CacheAllocatorConfig<facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait> > const&)
                       /opt/cachelib/cachelib/../cachelib/allocator/Util.h:150
                       -> /opt/cachelib/cachelib/allocator/CacheAllocator.cpp
    @ 0000564abf79c75f facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait>::CacheAllocator(facebook::cachelib::CacheAllocatorConfig<facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait> >)
                       /opt/cachelib/cachelib/../cachelib/allocator/CacheAllocator-inl.h:34
                       -> /opt/cachelib/cachelib/allocator/CacheAllocator.cpp
    @ 0000564ab8de0021 cache_util::CreateCachelib(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
                       /usr/include/c++/8/bits/unique_ptr.h:835
                       -> /proc/self/cwd/common/cache_util/cachelib_cache_handler.cpp
    @ 0000564ab8de4fbc cache_util::CachelibCacheHandler::CachelibCacheHandler(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, cache_util::SegmentInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, cache_util::SegmentInfo> > > const&)
                       /proc/self/cwd/common/cache_util/cachelib_cache_handler.cpp:64
    @ 0000564ab8ba8197 scorpion::CreateCacheHandler(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<cache_util::CacheHandler>*)
                       /usr/include/c++/8/ext/new_allocator.h:136
                       -> /proc/self/cwd/scorpion_v2/utils.cpp
    @ 0000564ab88ef10b CreateScorpionHandlerV2(std::unique_ptr<scorpion::ScorpionHandlerV2, std::default_delete<scorpion::ScorpionHandlerV2> >*, std::shared_ptr<scorpion::model_server::ModelServer>*)
                       /proc/self/cwd/scorpion_v2/scorpion_v2.cpp:153
    @ 0000564ab86f18b7 main
                       /proc/self/cwd/scorpion_v2/scorpion_v2.cpp:238
    @ 00007fa09b3cbbf6 __libc_start_main
    @ 0000564ab88e7849 _start

E0818 21:25:01.129964    14 ExceptionTracer.cpp:214] exception stack complete
terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid factor 6.93298464824273e-310

@tangliisu
Contributor Author

Hi @sathyaphoenix, I tried a private build again. Surprisingly, cachelib did not crash and allocationClassSizeFactor is as expected:
E0819 20:13:54.022516 14 cachelib_cache_handler.cpp:52] Cachelib allocationClassFSizeFactor in config is: 1.25
I think everything is good right now. Feel free to close the ticket (although I still don't know why allocationClassSizeFactor sometimes becomes 0).

@sathyaphoenix
Contributor

Thanks for confirming. If you can, please run with ASAN enabled and see if it can provide more information. For now, I am closing this issue. Please reopen if this re-appears and needs investigation.

@tangliisu
Contributor Author

tangliisu commented Sep 2, 2021

Hi @sathyaphoenix, we finally found the root cause: we set -DFOLLY_SSE=0 to support AVX512 compiler optimizations. But cachelib requires folly::dynamic and F14Map in NvmConfig, and F14Map requires at least FOLLY_SSE=2. I think cachelib does not check for this case and just throws an error with a confusing message.

The error does not appear in the private build because we don't use any compiler optimizations in the private build pipeline. After setting FOLLY_SSE=2 in our master build pipeline, the error goes away. Do you think we could add an additional check, or a comment in NvmConfig, to avoid this issue?
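
As an illustration of the kind of check being asked for, a build-time guard might look like the sketch below. It is only a sketch under assumptions: the "at least 2" threshold is taken from the report above, the fallback `#define` exists purely so the snippet compiles without folly, and `kFollySseLevel` and `sseLevelSupported` are made-up names. A guard like this only sees the value of `FOLLY_SSE` in the translation unit where it is compiled, so it would not catch libraries that were built separately with a different setting.

```cpp
// Stand-in so the sketch compiles on its own. In a real build FOLLY_SSE is
// provided by folly's headers or by a -DFOLLY_SSE=N compiler flag.
#ifndef FOLLY_SSE
#define FOLLY_SSE 2
#endif

// Fail the build early instead of misbehaving at runtime.
static_assert(FOLLY_SSE >= 2,
              "FOLLY_SSE must be at least 2 here; check the build flags for "
              "-DFOLLY_SSE=0");

// Runtime-visible copy of the setting, handy for a startup log line.
constexpr int kFollySseLevel = FOLLY_SSE;

// Same rule as the static_assert, usable on a value read from elsewhere.
constexpr bool sseLevelSupported(int level) { return level >= 2; }
```

Compiling this with `-DFOLLY_SSE=0` stops at the `static_assert` with a message that names the flag, which is the behaviour the request above is after.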

sathyaphoenix reopened this Sep 2, 2021
@sathyaphoenix
Contributor

@tangliisu Can you share the confusing error message that you see, and also more details on how this causes the double value to be ~0? Also, please note that NvmConfig has moved away from using folly::dynamic in the main branch and now has a simple declarative API to configure it (https://cachelib.org/docs/Cache_Library_User_Guides/Configure_HybridCache). We do rely on F14Map though. Once you share the error message, we can look into an appropriate workaround.

@tangliisu
Contributor Author

tangliisu commented Sep 2, 2021

Thanks for the info. We pin cachelib to an old version, so the old NvmConfig is still there.

I could not reproduce the allocationClassSizeFactor ~0 error message in recent builds. The stack trace from a recent bad build is:

F0902 20:10:39.052548    14 dynamic.cpp:137] Check failed: 0 
*** Check failure stack trace: ***
    @     0x7f8d2b6739bd  google::LogMessage::Fail()
    @     0x7f8d2b6758a8  google::LogMessage::SendToLog()
    @     0x7f8d2b673563  google::LogMessage::Flush()
    @     0x7f8d2b6762f9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f8d2b0ce716  folly::dynamic::operator=()
    @     0x562fb47e392e  facebook::cachelib::NvmCache<>::Config::Config()
    @     0x562fb47ed0b3  facebook::cachelib::CacheAllocatorConfig<>::CacheAllocatorConfig()
    @     0x562fb483b122  facebook::cachelib::CacheAllocator<>::CacheAllocator()
    @     0x562fade7cb2a  cache_util::CreateCachelib()
    @     0x562fade80c32  cache_util::CachelibCacheHandler::CachelibCacheHandler()
    @     0x562fadc40b48  scorpion::CreateCacheHandler()
    @     0x562fad98330c  CreateScorpionHandlerV2()
    @     0x562fad7853b8  main
    @     0x7f8d1aa53bf7  (unknown)
    @     0x562fad97b61a  _start

which makes sense. But I happened to get the confusing ~0 error before we figured out the FOLLY_SSE=0 issue:

E0818 21:25:01.129964    14 ExceptionTracer.cpp:214] exception stack complete
terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid factor 6.93298464824273e-310

If cachelib still relies on F14Map, I guess we need FOLLY_SSE=2.

BTW, we integrated cachelib into our system and the perf is very impressive. We are still tuning cachelib to see if we can further reduce CPU usage.

@sathyaphoenix
Contributor

Great to hear it is working out as expected. Let us know if you need any information for tuning.

It is strange though that not setting FOLLY_SSE=2 would cause an unrelated double to be broken. cc @agordon if he has any insights to share.
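
One general way a preprocessor setting can break an unrelated field is a layout mismatch: if a macro changes the type (and size) of an earlier member, and two separately compiled pieces of code disagree on that macro, they also disagree on the offset of every later member. Whether that is what happened here is not confirmed in this thread. The sketch below only demonstrates the mechanism with hypothetical types (`SmallMap`, `LargeMap`, and the two `Config...` structs are invented for illustration and are not folly or CacheLib types).

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical members of different sizes, standing in for "the same"
// container type compiled under two different macro settings.
struct SmallMap { void* p[2]; };
struct LargeMap { void* p[7]; };

// The same logical config as two translation units might see it.
struct ConfigAsBuiltByLibrary { SmallMap map; double factor{1.25}; };
struct ConfigAsSeenByCaller   { LargeMap map; double factor{1.25}; };

// The two views disagree on where `factor` lives.
constexpr std::size_t kLibOffset = offsetof(ConfigAsBuiltByLibrary, factor);
constexpr std::size_t kCallerOffset = offsetof(ConfigAsSeenByCaller, factor);

// Simulate the mismatch: the caller laid the object out its way, and the
// library reads `factor` at the offset its own layout expects.
double factorAsReadByLibrary(const ConfigAsSeenByCaller& cfg) {
  double out = 0.0;
  std::memcpy(&out,
              reinterpret_cast<const unsigned char*>(&cfg) + kLibOffset,
              sizeof(out));
  return out;
}
```

With a caller-side object whose `map.p[2]` holds a pointer, `factorAsReadByLibrary` returns the pointer's bytes reinterpreted as a double rather than 1.25, even though `cfg.factor` itself was never touched.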
