New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
http2: remove exceptions from H/2 codec #11575
Conversation
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@@ -208,7 +209,8 @@ void ConnectionImpl::StreamImpl::encodeTrailersBase(const HeaderMap& trailers) { | |||
pending_trailers_to_encode_ = createHeaderMap<HeaderMapImpl>(trailers); | |||
} else { | |||
submitTrailers(trailers); | |||
parent_.sendPendingFrames(); | |||
auto status = parent_.sendPendingFrames(); | |||
RELEASE_ASSERT(status.ok(), "sendPendingFrames() failure in non dispatching context"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we ASSERT that dispatching_ != true in these methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly I think this might not always be true. It's definitely possible that we can encode in the context of dispatch (like an immediate local reply). I think ultimately we are going to have to pivot on whether dispatching to decide how/when to handle errors in these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, some of these methods, such as encodeHeadersBase()
can be called when dispatching data. Note that the sendPendingFrames()
is a noop when dispatching_ != true
and always returns OK status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the sendPendingFrames() is a noop when dispatching_ != true and always returns OK status.
You mean == true, right? Since in that case we send pending frames at the end of dispatch. The problem is when we are not in dispatch. That is when I think we have to latch all errors and save them for the next dispatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes, it was a typo. dispatching_ == true
of course.
@@ -146,7 +146,8 @@ void ConnectionImpl::StreamImpl::encodeHeadersBase(const std::vector<nghttp2_nv> | |||
|
|||
local_end_stream_ = end_stream; | |||
submitHeaders(final_headers, end_stream ? nullptr : &provider); | |||
parent_.sendPendingFrames(); | |||
auto status = parent_.sendPendingFrames(); | |||
RELEASE_ASSERT(status.ok(), "sendPendingFrames() failure in non dispatching context"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty concerned about adding a bunch of release asserts to the codec path.
Can we find another solution? I'm not against a TODO here if we have a plan. return the errors up and handle them or change sendPendingFrames so it say latches the error and handles it later or have two sendPendingFrames function, one called from the dispatching path, one called from the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#11503
Ideally would be replaced with ENVOY_BUGs
I think maybe separating dispatching context might actually be a good thing. It would ensure coverage over both paths, which I think has caused an issue before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/envoyproxy/envoy-setec/pull/181 for other context on this. This whole situation is a total mess on many accounts. I actually think with the current code the only sane thing to do is to crash here, as in the current code this would be an uncaught exception. I think using ENVOY_BUG here would lead to undefined behavior.
As I mention in the other PR/issue, I think the only sane way to handle errors in non-dispatch context is to latch them as mentioned, and then handle them on the next dispatch. Another option (depending on context) would be to close the underlying connection and let the close event do its job.
I think the question is whether we want to try to fix ^ in this PR, or do it iteratively. I would be in favor of doing it iteratively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think iteratively is fine, we should add some clear TODOs explaining this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have looked through (I think) all call stacks to make sure that this RELEASE_ASSERT did not modify the existing behavior. It should just replace abnormal termination due to unhandled exception. Yes all this will be cleaned up as part of hardening against untrusted upstreams.
Current PoR is to latch the error for later handling. But this is a much bigger change.
Also this change will be flag protected, so if someone is hitting any of these RELEASE_ASSERTS they can go flip to old codecs with exceptions until the problem is resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes all this will be cleaned up as part of hardening against untrusted upstreams.
Pedantically (and I know you know this), this is not just about untrusted upstreams. In the current code it's possible (I think) to hit flood exceptions in the out of dispatch encode cases. We need to protect against this also.
But yes I think we are all on the same page about latching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I want to come back to this because I'm still uncomfortable with the release assert. Let's say there's some corner case where sendPendingFrames fails unexpectedly. Would we rather have a write failure trigger a crash (and potentially cause an outage - very fast to debug but potentially expensive) or log an error and increment a stat and fail (no outage, but maybe have to spend a bunch of time figuring out why traffic for RandomClient wasn't working as expected). I know Envoy culture has tended towards the fast crash but I'd really prefer we err in the direction of stat and log. Or maybe we could rewrite sendPendingFrames to take an argument "am I being error checked" now and do exceptions once we've fixed whatever bug we think there might be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not problem replacing RELEASE_ASSERT with that new macro we have added recently to do logging and stat increment. My gut feeling is that it might not be much of an improvement, given that it is very likely something else will crash elsewhere very soon as we are proceeding with processing data where before we would not. It is possible however that in some cases it would not crash, and it is worth reducing the likelihood of the worst outcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General approach looks good, just curious if there is any additional testing needed or if this is a pure refactor on all acounts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks in general this looks like the right approach. My suggestion per the comment thread is to do the minimal amount of changes necessary to match the current behavior, even though we know it's broken in various ways, then we can go back and fix the broken behavior.
/wait
Signed-off-by: Yan Avlasov <yavlasov@google.com>
There should be no changes to unit tests as the change is purely internal to the codec and does not change the public API. |
Signed-off-by: Yan Avlasov <yavlasov@google.com>
The clang-tidy error is not related to my change but to a problem in the |
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Signed-off-by: Yan Avlasov <yavlasov@google.com>
clang tidy errors are pre-existing |
@@ -1070,25 +1107,40 @@ void ConnectionImpl::sendSettings( | |||
} | |||
} | |||
|
|||
int ConnectionImpl::setAndCheckNghttp2CallbackStatus(Status&& status) { | |||
nghttp2_callback_status_ = std::move(status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASSERT that the status was ok in case of overwriting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure we can safely ASSERT this. I'm thinking that it might be possible that sendPendingFrames is called twice, i.e. by filters. I think the question is whether we want to keep the first error status or the last.
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@@ -146,7 +146,8 @@ void ConnectionImpl::StreamImpl::encodeHeadersBase(const std::vector<nghttp2_nv> | |||
|
|||
local_end_stream_ = end_stream; | |||
submitHeaders(final_headers, end_stream ? nullptr : &provider); | |||
parent_.sendPendingFrames(); | |||
auto status = parent_.sendPendingFrames(); | |||
RELEASE_ASSERT(status.ok(), "sendPendingFrames() failure in non dispatching context"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I want to come back to this because I'm still uncomfortable with the release assert. Let's say there's some corner case where sendPendingFrames fails unexpectedly. Would we rather have a write failure trigger a crash (and potentially cause an outage - very fast to debug but potentially expensive) or log an error and increment a stat and fail (no outage, but maybe have to spend a bunch of time figuring out why traffic for RandomClient wasn't working as expected). I know Envoy culture has tended towards the fast crash but I'd really prefer we err in the direction of stat and log. Or maybe we could rewrite sendPendingFrames to take an argument "am I being error checked" now and do exceptions once we've fixed whatever bug we think there might be?
// Returns Ok Status on success or error if outbound queue limits were exceeded. | ||
Status addOutboundFrameFragment(Buffer::OwnedImpl& output, const uint8_t* data, size_t length); | ||
virtual Status checkOutboundQueueLimits() PURE; | ||
Status incrementOutboundFrameCount(bool is_outbound_flood_monitored_control_frame); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we think of ABSL_MUST_USE_RESULT for these to make it clear (and compiler checked ) we're doing the right thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole Status
class is marked like this here, so any time a method returns Status it is marked that it must be checked.
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM w/ two small comments/questions.
My personal view is that we should use RELEASE_ASSERT vs. ENVOY_BUG in this case, as crashing is better than undefined behavior (and consistent w/ what we have today), but will defer to all of you.
@@ -1749,7 +1749,12 @@ TEST_P(Http2CodecImplTest, PingFlood) { | |||
buffer.move(frame); | |||
})); | |||
|
|||
EXPECT_THROW(client_->sendPendingFrames(), ServerCodecError); | |||
// Legacy codec does not propagate error details and uses generic error message | |||
EXPECT_THROW_WITH_MESSAGE( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it seems odd to me that in the tests we still use exceptions. Would these be better written as a helper which pivots what it does based on new vs. old and then we can just clean up old later? I don't feel strongly about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the artifact of test construction as it uses codec to test codec. So when dispatch of the codec under test fails the stack has to be unwound through the client codec that was used to build the test wire bytes. The test really needs to be redesigned to avoid doing this. Note that these exceptions are specific to the test, they are not codec exceptions.
// internal mechanisms. Most flood protection is done by Envoy's codec and this error | ||
// should never be returned. However it is handled here in case nghttp2 has some flood | ||
// protections that Envoy's codec does not have. | ||
if (rc == NGHTTP2_ERR_FLOODED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this path have coverage? If not can you add a TODO and try to add it later? In general do we have good coverage on all of the new error handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we do not have test coverage for this case. We try to avoid tripping this case in the integration tests:
envoy/test/integration/http2_integration_test.cc
Line 1502 in dbbcc69
// nghttp2 library has its own internal mitigation for outbound control frames (see |
It is a good question what should be done with internal nghttp2 flood protection. You could not disable it in the past (maybe you can now), so we just set the flood detection limits lower in Envoy. I could construct test for this specifically. Maybe not an integration test as it depends on being able to do a big read in one go, which is harder to control in the integration test.
No, we do not have any tests for error cases that have would end in RELEASE_ASSERT now. These would have to be added as RELEASE_ASSERTs are removed.
This reverts commit 74bf197. Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
FWIW I'm fine with the release asserts as long as it's off by default, so feel free to land this with me in absentia if we don't sort out clang tidy by then :-) |
Signed-off-by: Yan Avlasov <yavlasov@google.com>
I have unfortunately no idea how to fix the clang-tidy error. I checked command lines for clang-tidy and they appear to be correct and include path to the source/common/common/statusor.h. clang-tidy command lines for codec_impl.cc and codec_impl.h seems to be the same yet clang-tidy fails with this error for the .h file but passes for the .cc file. Searching did not help. |
Remove all throw statements from H/2 codec and replace error plumbing with Status and StatusOr objects. Signed-off-by: Yan Avlasov <yavlasov@google.com>
Remove all throw statements from H/2 codec and replace error plumbing with Status and StatusOr objects. Signed-off-by: Yan Avlasov <yavlasov@google.com> Signed-off-by: chaoqinli <chaoqinli@google.com>
Commit Message: Remove all throw statements from H/1 codec This change removed all uses of C++ exceptions from H/1 codec. I modeled the flow after Yan's H/2 work (#11575). Codec status are set in uniform helper methods. This is the only change from the previous PR (#11101), besides merging newer exceptions. This change replaces all throw statements with a return of corresponding error Status and adds plumbing to return the status to codec callers. The dispatch() method returns the encountered error to the caller, which will be handled accordingly. The calls to the RequestEncoder::encodeHeaders() NOT called from dispatch() method will RELEASE_ASSERT if an error code is returned. This does not alter the existing behavior of abnormally terminating the process, just the method of termination: RELEASE_ASSERT vs uncaught exception. Risk Level: High (Codec changes) Signed-off-by: Asra Ali <asraa@google.com>
Commit Message:
Remove all throw statements from H/2 codec
Additional Description:
This change removed all uses of C++ exceptions from H/2 codec.
There are two cases that are important to consider in this change: processing incoming data and building outgoing data.
Processing of incoming data is done by the
dispatch()
method. Thedispatch()
method had thetry/catch
block that would handle all errors, thrown as exceptions, encountered during processing of incoming data by aborting the connection.Building of outgoing data happens in multiple different contexts, such as sending health check requests, early responses or during processing of upstream responses. Only one context had the
try/catch
block - the processing of upstream response. In all other contexts a thrown exception would have caused abnormal process termination. The call sites in these cases areencode...()
methods (+ a few others) which then call thesendPendingFrames()
method to build outbound frames.This change replaces all
throw
statements with areturn
of corresponding errorStatus
and adds plumbing to return the status to codec callers.The
dispatch()
method returns the encountered error to the caller, which will be handled accordingly.The calls to the
sendPendingFrames()
NOT called fromdispatch()
method willRELEASE_ASSERT
if an error code is returned. This does not alter the existing behavior of abnormally terminating the process, just the method of termination:RELEASE_ASSERT
vs uncaught exception.Risk Level: High (Codec changes)
Testing: Unit Tests
Docs Changes: N/A
Release Notes: N/A