
AsyncConnectionPool::CleanUpTimer segfaults #13

Open
y3llowcake opened this issue Jun 2, 2021 · 5 comments

Comments

@y3llowcake

We (Slack) have been seeing a very slow trickle of segfaults from code paths in the cleanup timer in our production environment. This issue is not new; it has been occurring for a while. I do not yet have a repro for these segfaults.

Relevant version of squangle we are running:

┌─(cy@zebu:~/sl/hhvm/third-party/squangle/src)
└─(29)% git log -n 1 --pretty=short  
commit 9b3d6adf34d4f1ec1c1713a54b9def947384b17b (HEAD)
Author: Jay Edgar <jkedgar@fb.com>

    Update state_ inside mutex

We see two distinct stack traces. The first is more frequent and appears to occur on a call to std::unordered_map::erase():

0:string"raise at ../sysdeps/unix/sysv/linux/raise.c:51"
1:string"HPHP::bt_handler at /build/hhvm/hphp/runtime/base/crash-reporter.cpp:270"
2:string"std::_Hashtable<facebook::common::mysql_client::PoolKey, at /usr/include/c++/7/bits/hashtable.h:1627"
3:string"std::_Hashtable<facebook::common::mysql_client::PoolKey, at /usr/include/c++/7/bits/hashtable.h:1864"
4:string"std::_Hashtable<facebook::common::mysql_client::PoolKey, at /usr/include/c++/7/bits/hashtable.h:755"
5:string"std::unordered_map<facebook::common::mysql_client::PoolKey, at /usr/include/c++/7/bits/unordered_map.h:797"
6:string"facebook::common::mysql_client::AsyncConnectionPool::ConnStorage::cleanupConnections at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncConnectionPool.cpp:619"
7:string"facebook::common::mysql_client::AsyncConnectionPool::CleanUpTimer::timeoutExpired at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncConnectionPool.cpp:519"
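For readers less familiar with this failure mode: a common way for an unordered_map::erase() call inside a cleanup loop to crash is erasing through an iterator that has already been invalidated. The sketch below is a generic C++ illustration of the safe idiom, assuming a cleanup pass that walks a map of per-key idle-connection lists and drops keys whose list becomes empty. PoolKey, IdleConn, Storage and cleanupIdle are hypothetical names; this is not squangle's code, and it does not claim to be the cause of the trace above.

```cpp
#include <chrono>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for illustration only; not squangle's real types.
using PoolKey = std::string;
using Clock = std::chrono::steady_clock;

struct IdleConn {
  Clock::time_point lastUsed;
};

using Storage = std::unordered_map<PoolKey, std::list<IdleConn>>;

// Removes connections idle for longer than `idleTimeout`, then drops keys
// whose list became empty. Returns the number of connections removed.
std::size_t cleanupIdle(Storage& storage, Clock::time_point now,
                        std::chrono::microseconds idleTimeout) {
  std::size_t removed = 0;
  for (auto it = storage.begin(); it != storage.end();) {
    auto& conns = it->second;
    for (auto c = conns.begin(); c != conns.end();) {
      if (now - c->lastUsed > idleTimeout) {
        c = conns.erase(c);  // list::erase returns the next valid iterator
        ++removed;
      } else {
        ++c;
      }
    }
    if (conns.empty()) {
      // unordered_map::erase(iterator) invalidates only the erased element;
      // continue from the returned iterator instead of incrementing `it`.
      it = storage.erase(it);
    } else {
      ++it;
    }
  }
  return removed;
}
```

The key point is that neither loop ever increments an iterator after erasing through it; both continue from the iterator that erase() returns.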

The second looks like a use of an invalid map iterator ('ref_iter->second', where ref_iter probably points to end()?):

0:string"raise at ../sysdeps/unix/sysv/linux/raise.c:51"
1:string"HPHP::bt_handler at /build/hhvm/hphp/runtime/base/crash-reporter.cpp:270"
2:string"facebook::common::mysql_client::AsyncMysqlClient::activeConnectionRemoved at /build/hhvm/third-party/squangle/src/squangle/mysql_client/AsyncMysqlClient.h:357"
3:string"facebook::common::mysql_client::MysqlConnectionHolder::~MysqlConnectionHolder at /build/hhvm/third-party/squangle/squangle/mysql_client/Connection.cpp:63"
4:string"facebook::common::mysql_client::MysqlConnectionHolder::~MysqlConnectionHolder at /build/hhvm/third-party/squangle/squangle/mysql_client/Connection.cpp:64"
5:string"std::default_delete<facebook::common::mysql_client::MysqlPooledHolder>::operator() at /usr/include/c++/7/bits/unique_ptr.h:78"
6:string"std::unique_ptr<facebook::common::mysql_client::MysqlPooledHolder, at /usr/include/c++/7/bits/unique_ptr.h:263"
7:string"__gnu_cxx::new_allocator<std::_List_node<std::unique_ptr<facebook::common::mysql_client::MysqlPooledHolder, at /usr/include/c++/7/ext/new_allocator.h:140"
8:string"std::allocator_traits<std::allocator<std::_List_node<std::unique_ptr<facebook::common::mysql_client::MysqlPooledHolder, at /usr/include/c++/7/bits/alloc_traits.h:487"
9:string"std::__cxx11::list<std::unique_ptr<facebook::common::mysql_client::MysqlPooledHolder, at /usr/include/c++/7/bits/stl_list.h:1815"
10:string"facebook::common::mysql_client::AsyncConnectionPool::ConnStorage::cleanupConnections at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncConnectionPool.cpp:613"
11:string"facebook::common::mysql_client::AsyncConnectionPool::CleanUpTimer::timeoutExpired at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncConnectionPool.cpp:519"
12:string"folly::AsyncTimeout::libeventCallback at /build/hhvm/third-party/folly/src/folly/io/async/AsyncTimeout.cpp:171"
13:string"folly::EventBase::loopBody at /build/hhvm/third-party/folly/src/folly/io/async/EventBase.cpp:394"
14:string"folly::EventBase::loop at /build/hhvm/third-party/folly/src/folly/io/async/EventBase.cpp:312"
15:string"folly::EventBase::loopForever at /build/hhvm/third-party/folly/src/folly/io/async/EventBase.cpp:535"
16:string"facebook::common::mysql_client::AsyncMysqlClient::<lambda()>::operator() at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncMysqlClient.cpp:80"
17:string"std::__invoke_impl<void, at /usr/include/c++/7/bits/invoke.h:60"
18:string"std::__invoke<facebook::common::mysql_client::AsyncMysqlClient::init()::<lambda()> at /usr/include/c++/7/bits/invoke.h:95"
19:string"std::thread::_Invoker<std::tuple<facebook::common::mysql_client::AsyncMysqlClient::init()::<lambda()> at /usr/include/c++/7/thread:234"
20:string"std::thread::_Invoker<std::tuple<facebook::common::mysql_client::AsyncMysqlClient::init()::<lambda()> at /usr/include/c++/7/thread:243"
21:string"std::thread::_State_impl<std::thread::_Invoker<std::tuple<facebook::common::mysql_client::AsyncMysqlClient::init()::<lambda()> at /usr/include/c++/7/thread:186"
22:string"start_thread at pthread_create.c:463"
23:string"clone at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95"
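If the guess above is right (a lookup returning end() that is then dereferenced), the defensive pattern is to check the result of find() before touching it->second. The sketch below assumes a simple per-key count of active connections; ActiveConnCounter and its methods are hypothetical names chosen for illustration, and this is neither squangle's bookkeeping code nor its eventual fix.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical per-key bookkeeping of active connections (illustrative only).
class ActiveConnCounter {
 public:
  void added(const std::string& key) { ++counts_[key]; }

  // Returns false instead of dereferencing end() when `key` is unknown,
  // i.e. when a removal arrives without a matching add.
  bool removed(const std::string& key) {
    auto it = counts_.find(key);
    if (it == counts_.end()) {
      return false;  // unmatched removal: report it rather than crash
    }
    if (--it->second == 0) {
      counts_.erase(it);  // drop keys with no remaining connections
    }
    return true;
  }

  std::size_t count(const std::string& key) const {
    auto it = counts_.find(key);
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::unordered_map<std::string, std::size_t> counts_;
};
```

An unmatched removal here surfaces as a false return value that can be logged, which would point at the underlying double-removal or ordering bug instead of hiding it behind a segfault.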

Also noteworthy is our non-standard Hack pool configuration:

new AsyncMysqlConnectionPool(darray[
	'per_key_connection_limit' => 20,
	'idle_timeout_micros' => 4000000,
	'expiration_policy' => "IdleTime",
]);
@jupyung

jupyung commented Jun 3, 2021

Thanks for reporting the issue. Our team has started looking into it and will get back to you soon.

@y3llowcake
Author

Curious whether you have any intuition about what might be causing this. Based on other crashes, we think we might be victims of subtle memory-corruption bugs in HHVM, but given the frequency and predictability of these stack traces, I am more inclined to think the bug is in squangle.

@y3llowcake
Author

I am curious whether the following commit is related to this issue: 9737cfd

@jupyung

jupyung commented Jun 1, 2022

Yes, the commit you mentioned was meant to fix a rare segfault in connection cleanup, which looks related to this issue.

@fredemmott
Contributor

Yep, cherry-picking that commit fixed this issue for Slack :)
