deadlock with tbb #1402

Closed
mylanyuer opened this issue Dec 20, 2013 · 10 comments
@mylanyuer

HHVM stops handling any events while we execute some parallel requests.

Threads info:
USER %CPU PRI SCNT WCHAN USER SYSTEM TID TIME
work 95.6 - - - - - - 00:06:42
work 0.0 19 - - - - 30713 00:00:00
work 0.0 19 - - - - 31667 00:00:00
work 0.0 19 - - - - 31668 00:00:00
work 0.0 19 - - - - 31669 00:00:00
work 0.0 19 - - - - 31670 00:00:00
work 0.0 19 - - - - 31671 00:00:00
work 0.0 19 - - - - 31672 00:00:00
work 0.0 19 - - - - 31673 00:00:00
work 0.0 19 - - - - 31674 00:00:00
work 95.1 19 - - - - 31675 00:06:38
work 0.0 19 - - - - 31676 00:00:00
work 0.0 19 - - - - 31677 00:00:00
work 0.0 19 - - - - 31678 00:00:00
work 0.0 19 - - - - 31679 00:00:00
work 0.1 19 - - - - 31680 00:00:00
work 0.0 19 - - - - 32612 00:00:00

I sent an abort signal to the thread that is consuming the most CPU, and got this stack trace:

0  HPHP::bt_handler(int) at crash-reporter.cpp:0
1  __restore_rt at sigaction.c:0
2  __GI___sched_yield at :0
3  tbb::interface5::concurrent_hash_map<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node>, HPHP::stringHashCompare, tbb::tbb_allocator<std::pair<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node>>>>::lookup(bool, std::string const&, HPHP::AtomicSmartPtr<HPHP::StatCache::Node> const*, tbb::interface5::concurrent_hash_map<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node>, HPHP::stringHashCompare, tbb::tbb_allocator<std::pair<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node>>>>::const_accessor*, bool) at /home/work/hhvm/bin/hhvm:0
4  HPHP::StatCache::removePath(std::string const&, HPHP::StatCache::Node*) at /home/work/hhvm/bin/hhvm:0
5  void HPHP::StatCache::Node::touchLocked(bool) at /home/work/hhvm/bin/hhvm:0
6  HPHP::StatCache::Node::expirePaths(bool) at /home/work/hhvm/bin/hhvm:0
7  HPHP::StatCache::handleEvent(inotify_event const*) at /home/work/hhvm/bin/hhvm:0
8  HPHP::StatCache::refresh() at /home/work/hhvm/bin/hhvm:0
9  HPHP::hphp_session_init() at /home/work/hhvm/bin/hhvm:0
10 HPHP::HttpRequestHandler::handleRequest(HPHP::Transport*) at /home/work/hhvm/bin/hhvm:0
11 HPHP::ServerWorker<std::shared_ptr<HPHP::LibEventJob>, HPHP::LibEventTransportTraits>::doJobImpl(std::shared_ptr<HPHP::LibEventJob>, bool) at /home/work/hhvm/bin/hhvm:0
12 HPHP::ServerWorker<std::shared_ptr<HPHP::LibEventJob>, HPHP::LibEventTransportTraits>::doJob(std::shared_ptr<HPHP::LibEventJob>) at /home/work/hhvm/bin/hhvm:0
13 HPHP::JobQueueWorker<std::shared_ptr<HPHP::LibEventJob>, true, false, HPHP::JobQueueDropVMStack>::start() at /home/work/hhvm/bin/hhvm:0
14 HPHP::AsyncFuncImpl::ThreadFunc(void*) at /home/work/hhvm/bin/hhvm:0
15 start_thread at pthread_create.c:0
16 __clone at /opt/compiler/gcc-4.8.1/lib/libc.so.6:0


This looks like a deadlock. Could anybody give us some help?
TBB's version is 4.2, and HHVM's is 2.2.

One notable detail: one or more PHP scripts may be refreshed frequently while requests are being served.

@scannell
Contributor

Thanks for reporting this and for the investigation you've already done! One workaround is setting Server.StatCache = false in your config.hdf. If you can get us a reliable repro (which is going to be very difficult with this sort of thing), we can investigate further. Otherwise, we'll eventually get to this, but it may not be soon. (If anyone else figures it out, please let us know and submit a PR.)
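For anyone searching later, a minimal config.hdf fragment for this workaround might look like the following. This is a sketch assuming the pre-INI HDF config format; only the Server.StatCache setting itself comes from the comment above.

```
Server {
  # workaround for the TBB deadlock: disable the stat cache
  StatCache = false
}
```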

@jdelong
Contributor

jdelong commented Dec 20, 2013

Hmm. There might be a racy access to one of the non-tbb maps, and it got in a state that allowed an infinite loop?

Just a note: we're not currently using the stat cache for anything at facebook, so it's probably not very well stress-tested at this point. I'd probably recommend turning it off for now (unless you have huge numbers of requires/includes in each endpoint it's probably not going to be a very large performance hit).

@jdelong
Contributor

jdelong commented Dec 20, 2013

We just noticed we're defaulting StatCache to on in the OSS config; I'll change it to default to off.

@atdt
Contributor

atdt commented May 15, 2015

We're using the stat cache in production at Wikimedia and we hit this bug every so often. There are some notes in T89912 on our Phabricator instance. I think this should be re-opened.

@paulbiss reopened this May 15, 2015
@blblack

blblack commented Feb 2, 2016

Note that we're still observing this bug up through at least HHVM 3.6.5. The TL;DR of the ticket we linked last year ( https://phabricator.wikimedia.org/T89912#1286874 ) is that a probable workaround or fix is to change one of the StatCache mutexes to be recursive/re-entrant, but we haven't yet tested whether this alleviates the issue. The non-reentrant mutex in question is still present up through the latest master, here: https://github.com/facebook/hhvm/blob/master/hphp/runtime/base/stat-cache.cpp#L173 .

@anoakie

anoakie commented Dec 13, 2016

I believe I was hitting a similar bug. I had the stat cache enabled and didn't have any issues with 3.12, 3.13, or 3.14, but started triggering this issue in 3.15. Attached is the trace, with (I assume) Thread 7 being the offending thread holding the lock. As suggested in this thread, I've disabled the stat cache for now.
StatCache.txt

@azmng

azmng commented Mar 6, 2017

We've had the same issue with all 3.15 and 3.17 versions.
Disabling hhvm.server.stat_cache works around the problem but leads to significantly higher SYS CPU usage because of constant uncached stat() calls.
Starting from version 3.18.1 we ran into #7567, which crashes, so we are unable to say whether the deadlock was fixed.
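For reference, in the newer INI-style configuration the workaround mentioned above would be a fragment like this (a sketch; assumes your server.ini is the active config file):

```ini
; workaround: disable the stat cache to avoid the deadlock
; (as noted above, this trades it for extra uncached stat() calls)
hhvm.server.stat_cache = false
```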

@azmng

azmng commented Mar 15, 2017

After #7567 got fixed, we were able to hit the same deadlock as before with the latest 3.18.0-dev (master branch):

Enabling hhvm.server.stat_cache leads to the deadlock. 'perf top' shows the hhvm process stuck in _ZN3tbb10interface519concurrent_hash_mapISsN4HPHP19AtomicSharedPtrImplINS2_9StatCache4NodeELb0EEENS2_17stringHashCompareENS_13tbb_allocatorISt

Compiled with latest TBB:
Intel TBB 2017 Update 3
TBB_INTERFACE_VERSION == 9103

Same as before, disabling hhvm.server.stat_cache works around the problem but leads to significantly higher SYS CPU usage because of constant uncached stat() calls.

So we have to stick with 3.14.5.

@mmuehlenhoff

mmuehlenhoff commented Apr 7, 2017

@azmng #7756 fixed a locking bug in stat_cache, which hit MediaWiki reliably. I suggest you give it a shot. (For 3.18 you'll also need the fix from #7567 for an unrelated crash in stat_cache.)

@mofarrell
Contributor

Let me know if it's not fixed. It looks to be the same issue.


10 participants