Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP2 Proactive GOAWAY on Drain - Preamble #17026

Conversation

murray-stripe
Copy link
Contributor

Commit Message:

The initial set of changes (broken up to reduce PR/change size) to
enable proactive GOAWAY signals on drain events. These changes
(currently a no-op) open the way for a push-based notification model to
the Http::ConnectionManagerImpl class (or any other interested party).
Allowing a move away from the pull-based model that currently exists,
which causes issues such as #14350.

The primary change here is to introduce the idea of a "drain tree" which
allows the drain-manager to be structured with parent and child
managers. Each manager may drain independent of it's parent and all
drain actions cascade downward through the tree. The parent <-> child
relationship is thread-safe.

In addition, drain events can now be registered via callbacks to be
invoked when draining starts. This is a non-thread-safe operation.

A new thread-safe callback manager utility has been written to support
the parent <-> child drain-manager relationship.

Additional test coverage has been added for the drain-manager and
thread-safe callback-manager, as well as updates to necessary mock
classes.

Because the callback registration impacts the DrainDecision interface,
necessary no-op implementations have been addedto various classes
implementing this interface. The final versions will be wired up with
correct behavior in subsequent PRs that take advantage of the new
functionality.

Additional Description: This PR was created at the advisement of @antoniovicente in relation to #16201 as an effort to reduce the amount of change being reviewed in a single PR. The content of this PR is meant to be a no-op with #16201 enacting the change (and will be updated once this is merged to reduce it's scope).

Risk Level: Low. Changes included should be no-op.
Testing: Additional test coverage included.
Docs Changes: None.
Release Notes: None.
Platform Specific Features: N/A
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
[Optional API Considerations:]

The initial set of changes (broken up to reduce PR/change size) to
enable proactive GOAWAY signals on drain events. These changes
(currently a no-op) open the way for a push-based notification model to
the Http::ConnectionManagerImpl class (or any other interested party).
Aloowing a move _away_ from the pull-based model that currently exists,
which causes issues such as envoyproxy#14350.

The primary change here is to introduce the idea of a "drain tree" which
allows the drain-manager to be structured with parent and child
managers. Each manager may drain independent of it's parent and all
drain actions cascade downward through the tree. The parent <-> child
relationship is thread-safe.

In addition, drain events can now be registered via callbacks to be
invoked when draining starts. This is a non-thread-safe operation.

A new thread-safe callback manager utility has been written to support
the parent <-> child drain-manager relationship.

Additional test coverage has been added for the drain-manager and
thread-safe callback-manager, as well as updates to necessary mock
classes.

Because the callback registration impacts the DrainDecision interface,
necessary no-op implementations have been addedto various classes
implementing this interface. The final versions will be wired up with
correct behavior in subsequent PRs that take advantage of the new
functionality.

Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: John Murray <murray@stripe.com>
@murray-stripe
Copy link
Contributor Author

startDrainSequence can only be called once on each DrainManagerImpl as enforced by ASSERTs. It seems to me that PerFilterChainFactoryContextImpl and InstanceImpl::drainListeners() can both invoke startDrainSequence, which would result in startDrainSequence being called twice on the same object: once via child relations on the global drain impl owned by the server instance and again directly via the filter chain manager.

This was a super good callout. I had anticipated for double-calling in the order of 1. direct, 2. parent; but not the other way around. I have updated the code to allow for multiple invocations and simply handling multiple on-drain-complete callbacks. I've added additional test cases to validate this behavior.

@murray-stripe
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17026 (comment) was created by @murray-stripe.

see: more, trace.

@murray-stripe
Copy link
Contributor Author

The MacOS tests keep being cancelled (I'm guessing it's timing out), but are otherwise passing up until that point. This is ready for review.

source/common/common/callback_impl.cc Outdated Show resolved Hide resolved
source/common/common/callback_impl.h Outdated Show resolved Hide resolved
Signed-off-by: John Murray <murray@stripe.com>
@murray-stripe
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17026 (comment) was created by @murray-stripe.

see: more, trace.

@murray-stripe
Copy link
Contributor Author

@antoniovicente I have incorporated your feedback and this is ready for review.

Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there.

/wait

source/server/drain_manager_impl.cc Outdated Show resolved Hide resolved
source/common/common/callback_impl.cc Outdated Show resolved Hide resolved
source/common/common/callback_impl.h Outdated Show resolved Hide resolved
Signed-off-by: John Murray <murray@stripe.com>
@murray-stripe
Copy link
Contributor Author

@antoniovicente I've addressed your feedback. Really liking how things have simplified. 😄

antoniovicente
antoniovicente previously approved these changes Jun 22, 2021
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks great. Thanks for this refactor in preparation for H2 drain propagation.

/assign-from @envoyproxy/senior-maintainers for a final approval.

@repokitteh-read-only
Copy link

@envoyproxy/senior-maintainers assignee is @zuercher
for assignee is @None
a assignee is @None
final assignee is @None
approval. assignee is @None

🐱

Caused by: a #17026 (review) was submitted by @antoniovicente.

see: more, trace.

@antoniovicente
Copy link
Contributor

retrying macos timeout

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17026 (comment) was created by @antoniovicente.

see: more, trace.

@murray-stripe
Copy link
Contributor Author

macos still timing out (seems to be the case most of the time)

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17026 (comment) was created by @murray-stripe.

see: more, trace.

zuercher
zuercher previously approved these changes Jun 23, 2021
Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I found some small issues, in comments and had one question about a test. None of these are blocking.

envoy/server/drain_manager.h Outdated Show resolved Hide resolved
test/server/drain_manager_impl_test.cc Outdated Show resolved Hide resolved
test/server/drain_manager_impl_test.cc Outdated Show resolved Hide resolved
test/server/drain_manager_impl_test.cc Outdated Show resolved Hide resolved
test/common/common/callback_impl_test.cc Show resolved Hide resolved
Signed-off-by: John Murray <murray@stripe.com>
@murray-stripe
Copy link
Contributor Author

@zuercher Thank you for the review! I have addressed your feedback.

@murray-stripe
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17026 (comment) was created by @murray-stripe.

see: more, trace.

@murray-stripe
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17026 (comment) was created by @murray-stripe.

see: more, trace.

@murray-stripe
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17026 (comment) was created by @murray-stripe.

see: more, trace.

@murray-stripe
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17026 (comment) was created by @murray-stripe.

see: more, trace.

@murray-stripe
Copy link
Contributor Author

gah, keep getting some failures in the coverage build for //test/extensions/filters/http/kill_request:kill_request_filter_integration_test. Seems to be pretty flakey. will give one more re-test and then merge in main if one more doesn't work.

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17026 (comment) was created by @murray-stripe.

see: more, trace.

@murray-stripe
Copy link
Contributor Author

@zuercher This is ready for your review/merge again.

Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@zuercher zuercher merged commit 96dd735 into envoyproxy:main Jun 28, 2021
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very cool stuff. A few post-submit drive by comments for your next PR. Thank you @murray-stripe!

source/server/drain_manager_impl.h Show resolved Hide resolved
envoy/network/drain_decision.h Show resolved Hide resolved
@alyssawilk
Copy link
Contributor

Hey @murray-stripe could you take a look at
https://storage.googleapis.com/envoy-postsubmit/main/coverage/source/common/common/callback_impl.cc.gcov.html and do a follow-up with some extra unit testing? I think this PR has caused some flakes with falling under our coverage bar. I'm addressing with #17190 but it'd be great to add another test and bump the number back up.

Thanks!

chrisxrepo pushed a commit to chrisxrepo/envoy that referenced this pull request Jul 8, 2021
The initial set of changes (broken up to reduce PR/change size) to
enable proactive GOAWAY signals on drain events. These changes
(currently a no-op) open the way for a push-based notification model to
the Http::ConnectionManagerImpl class (or any other interested party).
Aloowing a move _away_ from the pull-based model that currently exists,
which causes issues such as envoyproxy#14350.

The primary change here is to introduce the idea of a "drain tree" which
allows the drain-manager to be structured with parent and child
managers. Each manager may drain independent of it's parent and all
drain actions cascade downward through the tree. The parent <-> child
relationship is thread-safe.

In addition, drain events can now be registered via callbacks to be
invoked when draining starts. This is a non-thread-safe operation.

A new thread-safe callback manager utility has been written to support
the parent <-> child drain-manager relationship.

Additional test coverage has been added for the drain-manager and
thread-safe callback-manager, as well as updates to necessary mock
classes.

Because the callback registration impacts the DrainDecision interface,
necessary no-op implementations have been addedto various classes
implementing this interface. The final versions will be wired up with
correct behavior in subsequent PRs that take advantage of the new
functionality.

Risk Level: Low. Changes included should be no-op.
Testing: Additional test coverage included.
Docs Changes: n/a
Release Notes: n/a
Signed-off-by: John Murray <murray@stripe.com>
Signed-off-by: chris.xin <xinchuantao@qq.com>
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
The initial set of changes (broken up to reduce PR/change size) to
enable proactive GOAWAY signals on drain events. These changes
(currently a no-op) open the way for a push-based notification model to
the Http::ConnectionManagerImpl class (or any other interested party).
Aloowing a move _away_ from the pull-based model that currently exists,
which causes issues such as envoyproxy#14350.

The primary change here is to introduce the idea of a "drain tree" which
allows the drain-manager to be structured with parent and child
managers. Each manager may drain independent of it's parent and all
drain actions cascade downward through the tree. The parent <-> child
relationship is thread-safe.

In addition, drain events can now be registered via callbacks to be
invoked when draining starts. This is a non-thread-safe operation.

A new thread-safe callback manager utility has been written to support
the parent <-> child drain-manager relationship.

Additional test coverage has been added for the drain-manager and
thread-safe callback-manager, as well as updates to necessary mock
classes.

Because the callback registration impacts the DrainDecision interface,
necessary no-op implementations have been addedto various classes
implementing this interface. The final versions will be wired up with
correct behavior in subsequent PRs that take advantage of the new
functionality.

Risk Level: Low. Changes included should be no-op.
Testing: Additional test coverage included.
Docs Changes: n/a
Release Notes: n/a
Signed-off-by: John Murray <murray@stripe.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants