Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http2: remove exceptions from H/2 codec #11575

Merged
merged 17 commits into from Jul 31, 2020

Conversation

yanavlasov
Copy link
Contributor

@yanavlasov yanavlasov commented Jun 12, 2020

Commit Message:
Remove all throw statements from H/2 codec

Additional Description:
This change removed all uses of C++ exceptions from H/2 codec.
There are two cases that are important to consider in this change: processing incoming data and building outgoing data.
Processing of incoming data is done by the dispatch() method. The dispatch() method had the try/catch block that would handle all errors, thrown as exceptions, encountered during processing of incoming data by aborting the connection.
Building of outgoing data happens in multiple different contexts, such as sending health check requests, early responses or during processing of upstream responses. Only one context had the try/catch block - the processing of upstream response. In all other contexts a thrown exception would have caused abnormal process termination. The call sites in these cases are encode...() methods (+ a few others) which then call the sendPendingFrames() method to build outbound frames.

This change replaces all throw statements with a return of corresponding error Status and adds plumbing to return the status to codec callers.
The dispatch() method returns the encountered error to the caller, which will be handled accordingly.

The calls to the sendPendingFrames() NOT called from dispatch() method will RELEASE_ASSERT if an error code is returned. This does not alter the existing behavior of abnormally terminating the process, just the method of termination: RELEASE_ASSERT vs uncaught exception.

Risk Level: High (Codec changes)
Testing: Unit Tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@@ -208,7 +209,8 @@ void ConnectionImpl::StreamImpl::encodeTrailersBase(const HeaderMap& trailers) {
pending_trailers_to_encode_ = createHeaderMap<HeaderMapImpl>(trailers);
} else {
submitTrailers(trailers);
parent_.sendPendingFrames();
auto status = parent_.sendPendingFrames();
RELEASE_ASSERT(status.ok(), "sendPendingFrames() failure in non dispatching context");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we ASSERT that dispatching_ != true in these methods?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly I think this might not always be true. It's definitely possible that we can encode in the context of dispatch (like an immediate local reply). I think ultimately we are going to have to pivot on whether dispatching to decide how/when to handle errors in these functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, some of these methods, such as encodeHeadersBase() can be called when dispatching data. Note that the sendPendingFrames() is a noop when dispatching_ != true and always returns OK status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the sendPendingFrames() is a noop when dispatching_ != true and always returns OK status.

You mean == true, right? Since in that case we send pending frames at the end of dispatch. The problem is when we are not in dispatch. That is when I think we have to latch all errors and save them for the next dispatch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, it was a typo. dispatching_ == true of course.

source/common/http/http2/codec_impl.cc Outdated Show resolved Hide resolved
@@ -146,7 +146,8 @@ void ConnectionImpl::StreamImpl::encodeHeadersBase(const std::vector<nghttp2_nv>

local_end_stream_ = end_stream;
submitHeaders(final_headers, end_stream ? nullptr : &provider);
parent_.sendPendingFrames();
auto status = parent_.sendPendingFrames();
RELEASE_ASSERT(status.ok(), "sendPendingFrames() failure in non dispatching context");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty concerned about adding a bunch of release asserts to the codec path.
Can we find another solution? I'm not against a TODO here if we have a plan. return the errors up and handle them or change sendPendingFrames so it say latches the error and handles it later or have two sendPendingFrames function, one called from the dispatching path, one called from the other?

Copy link
Contributor

@asraa asraa Jun 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#11503
Ideally would be replaced with ENVOY_BUGs

I think maybe separating dispatching context might actually be a good thing. It would ensure coverage over both paths, which I think has caused an issue before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://github.com/envoyproxy/envoy-setec/pull/181 for other context on this. This whole situation is a total mess on many accounts. I actually think with the current code the only sane thing to do is to crash here, as in the current code this would be an uncaught exception. I think using ENVOY_BUG here would lead to undefined behavior.

As I mention in the other PR/issue, I think the only sane way to handle errors in non-dispatch context is to latch them as mentioned, and then handle them on the next dispatch. Another option (depending on context) would be to close the underlying connection and let the close event do its job.

I think the question is whether we want to try to fix ^ in this PR, or do it iteratively. I would be in favor of doing it iteratively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think iteratively is fine, we should add some clear TODOs explaining this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have looked through (I think) all call stacks to make sure that this RELEASE_ASSERT did not modify the existing behavior. It should just replace abnormal termination due to unhandled exception. Yes all this will be cleaned up as part of hardening against untrusted upstreams.
Current PoR is to latch the error for later handling. But this is a much bigger change.
Also this change will be flag protected, so if someone is hitting any of these RELEASE_ASSERTS they can go flip to old codecs with exceptions until the problem is resolved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes all this will be cleaned up as part of hardening against untrusted upstreams.

Pedantically (and I know you know this), this is not just about untrusted upstreams. In the current code it's possible (I think) to hit flood exceptions in the out of dispatch encode cases. We need to protect against this also.

But yes I think we are all on the same page about latching.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I want to come back to this because I'm still uncomfortable with the release assert. Let's say there's some corner case where sendPendingFrames fails unexpectedly. Would we rather have a write failure trigger a crash (and potentially cause an outage - very fast to debug but potentially expensive) or log an error and increment a stat and fail (no outage, but maybe have to spend a bunch of time figuring out why traffic for RandomClient wasn't working as expected). I know Envoy culture has tended towards the fast crash but I'd really prefer we err in the direction of stat and log. Or maybe we could rewrite sendPendingFrames to take an argument "am I being error checked" now and do exceptions once we've fixed whatever bug we think there might be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not problem replacing RELEASE_ASSERT with that new macro we have added recently to do logging and stat increment. My gut feeling is that it might not be much of an improvement, given that it is very likely something else will crash elsewhere very soon as we are proceeding with processing data where before we would not. It is possible however that in some cases it would not crash, and it is worth reducing the likelihood of the worst outcome.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General approach looks good, just curious if there is any additional testing needed or if this is a pure refactor on all acounts.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks in general this looks like the right approach. My suggestion per the comment thread is to do the minimal amount of changes necessary to match the current behavior, even though we know it's broken in various ways, then we can go back and fix the broken behavior.

/wait

source/common/http/http2/codec_impl.cc Outdated Show resolved Hide resolved
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@yanavlasov
Copy link
Contributor Author

There should be no changes to unit tests as the change is purely internal to the codec and does not change the public API.

Signed-off-by: Yan Avlasov <yavlasov@google.com>
@yanavlasov
Copy link
Contributor Author

The clang-tidy error is not related to my change but to a problem in the LinkedObject::moveIntoList

@stale
Copy link

stale bot commented Jul 3, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 3, 2020
@stale
Copy link

stale bot commented Jul 11, 2020

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Jul 11, 2020
@yanavlasov yanavlasov reopened this Jul 20, 2020
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 20, 2020
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@yanavlasov yanavlasov changed the title WiP http2: remove exceptions from H/2 codec http2: remove exceptions from H/2 codec Jul 21, 2020
@yanavlasov
Copy link
Contributor Author

clang tidy errors are pre-existing

@@ -1070,25 +1107,40 @@ void ConnectionImpl::sendSettings(
}
}

int ConnectionImpl::setAndCheckNghttp2CallbackStatus(Status&& status) {
nghttp2_callback_status_ = std::move(status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASSERT that the status was ok in case of overwriting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure we can safely ASSERT this. I'm thinking that it might be possible that sendPendingFrames is called twice, i.e. by filters. I think the question is whether we want to keep the first error status or the last.

Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@@ -146,7 +146,8 @@ void ConnectionImpl::StreamImpl::encodeHeadersBase(const std::vector<nghttp2_nv>

local_end_stream_ = end_stream;
submitHeaders(final_headers, end_stream ? nullptr : &provider);
parent_.sendPendingFrames();
auto status = parent_.sendPendingFrames();
RELEASE_ASSERT(status.ok(), "sendPendingFrames() failure in non dispatching context");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I want to come back to this because I'm still uncomfortable with the release assert. Let's say there's some corner case where sendPendingFrames fails unexpectedly. Would we rather have a write failure trigger a crash (and potentially cause an outage - very fast to debug but potentially expensive) or log an error and increment a stat and fail (no outage, but maybe have to spend a bunch of time figuring out why traffic for RandomClient wasn't working as expected). I know Envoy culture has tended towards the fast crash but I'd really prefer we err in the direction of stat and log. Or maybe we could rewrite sendPendingFrames to take an argument "am I being error checked" now and do exceptions once we've fixed whatever bug we think there might be?

// Returns Ok Status on success or error if outbound queue limits were exceeded.
Status addOutboundFrameFragment(Buffer::OwnedImpl& output, const uint8_t* data, size_t length);
virtual Status checkOutboundQueueLimits() PURE;
Status incrementOutboundFrameCount(bool is_outbound_flood_monitored_control_frame);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we think of ABSL_MUST_USE_RESULT for these to make it clear (and compiler checked ) we're doing the right thing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole Status class is marked like this here, so any time a method returns Status it is marked that it must be checked.

Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM w/ two small comments/questions.

My personal view is that we should use RELEASE_ASSERT vs. ENVOY_BUG in this case, as crashing is better than undefined behavior (and consistent w/ what we have today), but will defer to all of you.

@@ -1749,7 +1749,12 @@ TEST_P(Http2CodecImplTest, PingFlood) {
buffer.move(frame);
}));

EXPECT_THROW(client_->sendPendingFrames(), ServerCodecError);
// Legacy codec does not propagate error details and uses generic error message
EXPECT_THROW_WITH_MESSAGE(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it seems odd to me that in the tests we still use exceptions. Would these be better written as a helper which pivots what it does based on new vs. old and then we can just clean up old later? I don't feel strongly about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the artifact of test construction as it uses codec to test codec. So when dispatch of the codec under test fails the stack has to be unwound through the client codec that was used to build the test wire bytes. The test really needs to be redesigned to avoid doing this. Note that these exceptions are specific to the test, they are not codec exceptions.

// internal mechanisms. Most flood protection is done by Envoy's codec and this error
// should never be returned. However it is handled here in case nghttp2 has some flood
// protections that Envoy's codec does not have.
if (rc == NGHTTP2_ERR_FLOODED) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this path have coverage? If not can you add a TODO and try to add it later? In general do we have good coverage on all of the new error handling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we do not have test coverage for this case. We try to avoid tripping this case in the integration tests:

// nghttp2 library has its own internal mitigation for outbound control frames (see

It is a good question what should be done with internal nghttp2 flood protection. You could not disable it in the past (maybe you can now), so we just set the flood detection limits lower in Envoy. I could construct test for this specifically. Maybe not an integration test as it depends on being able to do a big read in one go, which is harder to control in the integration test.

No, we do not have any tests for error cases that have would end in RELEASE_ASSERT now. These would have to be added as RELEASE_ASSERTs are removed.

This reverts commit 74bf197.

Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@alyssawilk
Copy link
Contributor

FWIW I'm fine with the release asserts as long as it's off by default, so feel free to land this with me in absentia if we don't sort out clang tidy by then :-)

Signed-off-by: Yan Avlasov <yavlasov@google.com>
@yanavlasov
Copy link
Contributor Author

yanavlasov commented Jul 30, 2020

I have unfortunately no idea how to fix the clang-tidy error. I checked command lines for clang-tidy and they appear to be correct and include path to the source/common/common/statusor.h. clang-tidy command lines for codec_impl.cc and codec_impl.h seems to be the same yet clang-tidy fails with this error for the .h file but passes for the .cc file. Searching did not help.
Also since this is compiler error you can not // NOLINT it.
I'm unsure what to do at this point.

@alyssawilk alyssawilk removed their assignment Jul 30, 2020
@yanavlasov yanavlasov merged commit a2ebf09 into envoyproxy:master Jul 31, 2020
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this pull request Aug 7, 2020
Remove all throw statements from H/2 codec and replace error plumbing with Status and StatusOr objects.

Signed-off-by: Yan Avlasov <yavlasov@google.com>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this pull request Aug 7, 2020
Remove all throw statements from H/2 codec and replace error plumbing with Status and StatusOr objects.

Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: chaoqinli <chaoqinli@google.com>
asraa added a commit that referenced this pull request Aug 10, 2020
Commit Message:
Remove all throw statements from H/1 codec
This change removed all uses of C++ exceptions from H/1 codec. I modeled the flow after Yan's H/2 work (#11575). Codec status are set in uniform helper methods. This is the only change from the previous PR (#11101), besides merging newer exceptions.

This change replaces all throw statements with a return of corresponding error Status and adds plumbing to return the status to codec callers. The dispatch() method returns the encountered error to the caller, which will be handled accordingly.

The calls to the RequestEncoder::encodeHeaders() NOT called from dispatch() method will RELEASE_ASSERT if an error code is returned. This does not alter the existing behavior of abnormally terminating the process, just the method of termination: RELEASE_ASSERT vs uncaught exception.

Risk Level: High (Codec changes)
Signed-off-by: Asra Ali <asraa@google.com>
@yanavlasov yanavlasov deleted the exceptions-1 branch August 19, 2020 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants