TCP_Proxy: onLowWatermark crash #3639

Closed
lap1817 opened this issue Jun 15, 2018 · 10 comments

@lap1817

lap1817 commented Jun 15, 2018

Title: TCP_Proxy: onLowWatermark crash

Description:

Envoy occasionally crashes with a segmentation fault in the TCP proxy. The backtrace shows that the crash originates in the onLowWatermark function. In a debug build, the failure appears as an abort.

Envoy Build SHA

2b216ca

Repro steps:

Reproduces consistently in our environment when upstream services are connected through the TCP proxy.

Backtrace

 Caught Aborted, suspect faulting address 0x18a
 Backtrace thr<420> obj</lib/x86_64-linux-gnu/libc.so.6> (If unsymbolized, use tools/stack_decode.py):
 thr<420> #0 0x7f66a6cb7c37 (unknown)
 thr<420> #1 0x7f66a6cbb027 (unknown)
 thr<420> obj<envoy>
 thr<420> #2 0x8e8c02 Envoy::Network::ConnectionImpl::readDisable()
 thr<420> #3 0xc5cce2 Envoy::TcpProxy::Filter::readDisableDownstream()
 thr<420> #4 0xc5cf91 Envoy::TcpProxy::Filter::UpstreamCallbacks::onBelowWriteBufferLowWatermark()
 thr<420> #5 0x8e93fe Envoy::Network::ConnectionImpl::onLowWatermark()
 thr<420> #6 0x8e71ea Envoy::Network::ConnectionImpl::ConnectionImpl()::{lambda()#1}::operator()()
 thr<420> #7 0x8ead9c std::_Function_handler<>::_M_invoke()
 thr<420> #8 0x4e4b13 std::function<>::operator()()
 thr<420> #9 0xc526d4 Envoy::Buffer::WatermarkBuffer::checkLowWatermark()
 thr<420> #10 0xc524ae Envoy::Buffer::WatermarkBuffer::drain()
 thr<420> #11 0xe550a9 Envoy::Buffer::OwnedImpl::write()
 thr<420> #12 0xc525bf Envoy::Buffer::WatermarkBuffer::write()
 thr<420> #13 0xc062f3 Envoy::Network::RawBufferSocket::doWrite()
 thr<420> #14 0x8e85db Envoy::Network::ConnectionImpl::close()
 thr<420> #15 0xc5e6e5 Envoy::TcpProxy::Filter::onDownstreamEvent()
 thr<420> #16 0xc61deb Envoy::TcpProxy::Filter::DownstreamCallbacks::onEvent()
 thr<420> #17 0x8e8ef4 Envoy::Network::ConnectionImpl::raiseEvent()
 thr<420> #18 0x8e886a Envoy::Network::ConnectionImpl::closeSocket()
 thr<420> #19 0x8e85ec Envoy::Network::ConnectionImpl::close()
 thr<420> #20 0xc5f2d3 Envoy::TcpProxy::Filter::onIdleTimeout()
 thr<420> #21 0xc5d064 Envoy::TcpProxy::Filter::UpstreamCallbacks::onIdleTimeout()
 thr<420> #22 0xc5e8e5 Envoy::TcpProxy::Filter::onUpstreamEvent()::{lambda()#1}::operator()()
 thr<420> #23 0xc60259 std::_Function_handler<>::_M_invoke()
 thr<420> #24 0x4e4b13 std::function<>::operator()()
 thr<420> #25 0x8e06e0 Envoy::Event::TimerImpl::TimerImpl()::{lambda()#1}::operator()()
 thr<420> #26 0x8e070f Envoy::Event::TimerImpl::TimerImpl()::{lambda()#1}::_FUN()
 thr<420> #27 0xe67b77 event_process_active_single_queue.isra.29
 thr<420> #28 0xe680de event_base_loop
 thr<420> #29 0x8d8505 Envoy::Event::DispatcherImpl::run()
 thr<420> #30 0x8c8a54 Envoy::Server::WorkerImpl::threadRoutine()
 thr<420> #31 0x8c8579 Envoy::Server::WorkerImpl::start()::{lambda()#1}::operator()()
 thr<420> #32 0x8c914a std::_Function_handler<>::_M_invoke()
 thr<420> #33 0x4e4b13 std::function<>::operator()()
 thr<420> #34 0xebe42d Envoy::Thread::Thread::Thread()::{lambda()#1}::operator()()
 thr<420> #35 0xebe452 Envoy::Thread::Thread::Thread()::{lambda()#1}::_FUN()
 thr<420> obj</lib/x86_64-linux-gnu/libpthread.so.0>
 thr<420> #36 0x7f66a7358183 start_thread
 thr<420> obj</lib/x86_64-linux-gnu/libc.so.6>
 thr<420> #37 0x7f66a6d7f03c (unknown)
 end backtrace thread 420
: - starting hot-restarter with target: /opt/smartstack/envoy/run-envoy.sh
: - forking and execing new child process at epoch 0
: - forked new child process with PID=394
: - got SIGCHLD
: - PID=394 was killed with signal=6
: - Due to abnormal exit, force killing all child processes and exiting
: - exiting due to lack of child processes

@lap1817
Author

lap1817 commented Jun 15, 2018

One more backtrace from a similar crash. This one also shows an assert failure, which probably caused the abort?

2018-06-15T16:05:11.258710+00:00 : - [2018-06-15 16:05:11.258][10388][critical][assert] source/common/network/connection_impl.cc:226] assert failure: state() == State::Open
 Caught Aborted, suspect faulting address 0x1c45
 Backtrace thr<10388> obj</lib/x86_64-linux-gnu/libc.so.6> (If unsymbolized, use tools/stack_decode.py):
 thr<10388> #0 0x7fc8a3562c37 (unknown)
 thr<10388> #1 0x7fc8a3566027 (unknown)
 thr<10388> obj<envoy>
 thr<10388> #2 0x8e8c02 Envoy::Network::ConnectionImpl::readDisable()
 thr<10388> #3 0xc5cce2 Envoy::TcpProxy::Filter::readDisableDownstream()
 thr<10388> #4 0xc5cf91 Envoy::TcpProxy::Filter::UpstreamCallbacks::onBelowWriteBufferLowWatermark()
 thr<10388> #5 0x8e93fe Envoy::Network::ConnectionImpl::onLowWatermark()
 thr<10388> #6 0x8e71ea Envoy::Network::ConnectionImpl::ConnectionImpl()::{lambda()#1}::operator()()
 thr<10388> #7 0x8ead9c std::_Function_handler<>::_M_invoke()
 thr<10388> #8 0x4e4b13 std::function<>::operator()()
 thr<10388> #9 0xc526d4 Envoy::Buffer::WatermarkBuffer::checkLowWatermark()
 thr<10388> #10 0xc524ae Envoy::Buffer::WatermarkBuffer::drain()
 thr<10388> #11 0xe550a9 Envoy::Buffer::OwnedImpl::write()
 thr<10388> #12 0xc525bf Envoy::Buffer::WatermarkBuffer::write()
 thr<10388> #13 0xc062f3 Envoy::Network::RawBufferSocket::doWrite()
 thr<10388> #14 0x8e85db Envoy::Network::ConnectionImpl::close()
 thr<10388> #15 0xc5e6e5 Envoy::TcpProxy::Filter::onDownstreamEvent()
 thr<10388> #16 0xc61deb Envoy::TcpProxy::Filter::DownstreamCallbacks::onEvent()
 thr<10388> #17 0x8e8ef4 Envoy::Network::ConnectionImpl::raiseEvent()
 thr<10388> #18 0x8e886a Envoy::Network::ConnectionImpl::closeSocket()
 thr<10388> #19 0x8e85ec Envoy::Network::ConnectionImpl::close()
 thr<10388> #20 0xc5f2d3 Envoy::TcpProxy::Filter::onIdleTimeout()
 thr<10388> #21 0xc5d064 Envoy::TcpProxy::Filter::UpstreamCallbacks::onIdleTimeout()
 thr<10388> #22 0xc5e8e5 Envoy::TcpProxy::Filter::onUpstreamEvent()::{lambda()#1}::operator()()
 thr<10388> #23 0xc60259 std::_Function_handler<>::_M_invoke()
 thr<10388> #24 0x4e4b13 std::function<>::operator()()
 thr<10388> #25 0x8e06e0 Envoy::Event::TimerImpl::TimerImpl()::{lambda()#1}::operator()()
 thr<10388> #26 0x8e070f Envoy::Event::TimerImpl::TimerImpl()::{lambda()#1}::_FUN()
 thr<10388> #27 0xe67b77 event_process_active_single_queue.isra.29
 thr<10388> #28 0xe680de event_base_loop
 thr<10388> #29 0x8d8505 Envoy::Event::DispatcherImpl::run()
 thr<10388> #30 0x8c8a54 Envoy::Server::WorkerImpl::threadRoutine()
 thr<10388> #31 0x8c8579 Envoy::Server::WorkerImpl::start()::{lambda()#1}::operator()()
 thr<10388> #32 0x8c914a std::_Function_handler<>::_M_invoke()
 thr<10388> #33 0x4e4b13 std::function<>::operator()()
 thr<10388> #34 0xebe42d Envoy::Thread::Thread::Thread()::{lambda()#1}::operator()()
 thr<10388> #35 0xebe452 Envoy::Thread::Thread::Thread()::{lambda()#1}::_FUN()
 thr<10388> obj</lib/x86_64-linux-gnu/libpthread.so.0>
 thr<10388> #36 0x7fc8a3c03183 start_thread
 thr<10388> obj</lib/x86_64-linux-gnu/libc.so.6>
 thr<10388> #37 0x7fc8a362a03c (unknown)
 end backtrace thread 10388

@mattklein123 mattklein123 added this to the 1.7.0 milestone Jun 15, 2018
@mattklein123
Member

@ggreenway are you interested in taking a look at this? I think it's just an instance of calling watermark callbacks on an already dead upstream connection. cc @alyssawilk

@mattklein123 mattklein123 added the help wanted Needs help! label Jun 18, 2018
@mattklein123 mattklein123 modified the milestones: 1.7.0, 1.8.0 Jun 18, 2018
@lap1817
Author

lap1817 commented Jun 18, 2018

@mattklein123, is there any way that we can unblock ourselves on this issue? We are holding off on a major rollout of Envoy across our fleet.

@mattklein123
Member

@lap1817 you could debug and fix the issue. Or, if you don't care about watermarks you could increase the limits by a lot such that they are never reached?

@lap1817
Author

lap1817 commented Jun 18, 2018

@mattklein123, which limit (parameter) can I raise to avoid hitting the watermarks? And what do watermarks do?

@alyssawilk
Contributor

per_connection_buffer_limit_bytes
https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/lds.proto#envoy-api-msg-listener
https://www.envoyproxy.io/docs/envoy/latest/api-v1/cluster_manager/cluster.html?highlight=per_connection_buffer_limit_bytes

controls how much data each network connection is willing to buffer.
If you make that very large, you will buffer effectively without bound (and thus be subject to OOMs) but not encounter your current difficulty :-/
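
To illustrate what the watermarks do, here is a minimal, hypothetical sketch of a watermark-aware buffer. `ToyWatermarkBuffer` and `runWatermarkDemo` are invented names for illustration, not Envoy's actual classes, and placing the low watermark at half of the high watermark is an assumption of this sketch. The idea is that crossing the high watermark fires one callback (typically used to stop reading from the peer), and draining back below the low watermark fires another (used to resume reading).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a watermark-aware buffer.
// It only tracks a byte count; it is not Envoy's WatermarkBuffer.
class ToyWatermarkBuffer {
public:
  ToyWatermarkBuffer(std::size_t high_watermark, std::function<void()> above_high,
                     std::function<void()> below_low)
      : high_watermark_(high_watermark),
        low_watermark_(high_watermark / 2), // assumption: low = high / 2
        above_high_(std::move(above_high)), below_low_(std::move(below_low)) {}

  // Adds bytes and fires the high-watermark callback once when the limit is exceeded.
  void add(std::size_t bytes) {
    size_ += bytes;
    if (!above_high_called_ && size_ > high_watermark_) {
      above_high_called_ = true;
      above_high_();
    }
  }

  // Removes bytes and fires the low-watermark callback once the buffer has
  // drained to the low watermark after previously exceeding the high one.
  void drain(std::size_t bytes) {
    size_ -= (bytes > size_ ? size_ : bytes);
    if (above_high_called_ && size_ <= low_watermark_) {
      above_high_called_ = false;
      below_low_();
    }
  }

  std::size_t size() const { return size_; }

private:
  std::size_t high_watermark_;
  std::size_t low_watermark_;
  std::function<void()> above_high_;
  std::function<void()> below_low_;
  std::size_t size_ = 0;
  bool above_high_called_ = false;
};

// Records the order in which the callbacks fire for a 100-byte limit.
std::vector<std::string> runWatermarkDemo() {
  std::vector<std::string> events;
  ToyWatermarkBuffer buffer(
      100, [&events] { events.push_back("high"); }, [&events] { events.push_back("low"); });
  buffer.add(60);   // 60 bytes buffered: under the limit, no callback
  buffer.add(60);   // 120 bytes buffered: exceeds 100, "high" fires
  buffer.drain(50); // 70 bytes buffered: still above 50, no callback
  buffer.drain(50); // 20 bytes buffered: at or below 50, "low" fires
  return events;
}
```

With a very large limit, the high-watermark branch in this sketch is never taken, so the low-watermark callback never fires either, which is why raising the limit sidesteps the crash at the cost of unbounded buffering.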

@alyssawilk
Contributor

Looking at this, it looks like we reset file_event_ in closeSocket(). readDisable() appears to assume the connection is still connected and dereferences file_event_ without a nullptr check. There's probably a better fix in the TCP proxy session to not call this callback in the shutdown state, but I think a simple check in readDisable() that returns if file_event_ is null would probably address this crash.

I doubt I can dig into the TCP proxy weirdness, but if Matt would be amenable to the quick fix I can put something together tomorrow and see if it works well enough for your use case.
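
A rough sketch of the quick fix described above, using hypothetical toy types (`ToyFileEvent` and `ToyConnection` are invented for illustration and are not Envoy's ConnectionImpl). It assumes the event registration is held in a pointer that is reset on close, and that a read-disable request arriving after close can safely be ignored.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the event registration a connection holds while open.
struct ToyFileEvent {
  bool read_enabled = true;
};

// Hypothetical connection that drops its event registration on close.
class ToyConnection {
public:
  ToyConnection() : file_event_(std::make_unique<ToyFileEvent>()) {}

  // Mirrors the teardown described above: the event registration is reset on close.
  void closeSocket() { file_event_.reset(); }

  // Returns true if the request was applied, false if it was ignored
  // because the connection has already been closed.
  bool readDisable(bool disable) {
    if (file_event_ == nullptr) {
      // Guard: a late watermark callback must not dereference a reset pointer.
      return false;
    }
    file_event_->read_enabled = !disable;
    return true;
  }

  bool isOpen() const { return file_event_ != nullptr; }
  bool readEnabled() const { return file_event_ != nullptr && file_event_->read_enabled; }

private:
  std::unique_ptr<ToyFileEvent> file_event_;
};
```

In this sketch a call such as `conn.readDisable(false)` made after `conn.closeSocket()` simply returns false instead of touching freed state. The real readDisable() does more bookkeeping than a single flag, so this only shows the shape of the guard.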

@lap1817
Author

lap1817 commented Jun 18, 2018

Thanks @alyssawilk! It would be great if we could have a build that avoids the crash. Meanwhile, I will try increasing the buffer size (though I am a bit concerned about the OOMs that may come with it).

@alyssawilk alyssawilk self-assigned this Jun 19, 2018
alyssawilk added a commit that referenced this issue Jun 19, 2018
…nnection (#3669)

Hopefully changing an outstanding tcp proxy session crash to a more minor ASSERT failure

Risk Level: Low
Testing: unit test of new code. integration tests which fail to repro the underlying bug.
Docs Changes: n/a
Release Notes: none

Hopefully ameliorates #3639

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk
Contributor

OK, theoretical workaround submitted. Can you folks try a head build and see if it stops the crashing?

stevenzzzz pushed a commit to stevenzzzz/envoy that referenced this issue Jun 19, 2018
…nnection (envoyproxy#3669)
@lap1817
Author

lap1817 commented Jun 19, 2018

@alyssawilk thanks! will try it today

alyssawilk added a commit that referenced this issue Sep 4, 2018
… closed connection (#4296)

Fixing and regression testing #3639

Risk Level: Low: avoids a no-op call during connection teardown
Testing: verified flow in integration test, wrote a regression unit test
Docs Changes: n/a
Release Notes: n/a
Fixes #3639

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>