Skip to content

Comments

Updating ext_proc filter clear route cache logic.#26388

Merged
htuch merged 26 commits intoenvoyproxy:mainfrom
jstraceski:ext_proc_clear_cache
May 5, 2023
Merged

Updating ext_proc filter clear route cache logic.#26388
htuch merged 26 commits intoenvoyproxy:mainfrom
jstraceski:ext_proc_clear_cache

Conversation

@jstraceski
Copy link
Contributor

@jstraceski jstraceski commented Mar 28, 2023

Commit Message: Modifying clear route cache logic for the ext_proc filter.
Additional Description: Adding in the ability to disable route cache clearing ont he ext_proc filter. Changing the behavior of the clearing to check if there is a header mutation.
Risk Level: Low
Testing: test/extensions/filters/http/ext_proc/filter_test.cc
Docs Changes: docs/root/configuration/http/http_filters/ext_proc_filter.rst
Release Notes: changelogs/current.yaml
Platform Specific Features:

@repokitteh-read-only
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #26388 was opened by jstraceski.

see: more, trace.

@tyxia
Copy link
Member

tyxia commented Mar 28, 2023

/assign @tyxia

@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @adisuissa
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #26388 was synchronize by jstraceski.

see: more, trace.

Copy link
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Left some comments.
Besides:

  • Could you run the command sudo ./ci/run_envoy_docker.sh '\''./ci/do_ci.sh format'\'' && sh bazel/setup_clang.sh /usr' to try to fix format, so that we can kick off CI/CD
  • Could you add a simple unit test? You probably can refer to ClearRouteCache test in ext_proc/filter_test.cc

@jstraceski jstraceski changed the title Adding modification check to clearing the route cache of the ext_proc filter. WIP Adding modification check to clearing the route cache of the ext_proc filter. Apr 6, 2023
Copy link
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are getting closer. I left some comments about test and code comment.
PS: Can you also change this PR from draft to ready for review?

@jstraceski jstraceski force-pushed the ext_proc_clear_cache branch 2 times, most recently from 7e20490 to 67c352e Compare April 10, 2023 20:24
@jstraceski jstraceski marked this pull request as ready for review April 11, 2023 17:47
@jstraceski jstraceski requested a review from snowp as a code owner April 11, 2023 17:47
@tyxia
Copy link
Member

tyxia commented Apr 12, 2023

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #26388 (comment) was created by @tyxia.

see: more, trace.

Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this!
Left minor comments on the API.

A high-level comment (probably not in the scope of this PR), from the point-of-view of the external processor, it will be difficult to reason why the external-processor sent a response to clear the route cache, but the Envoy did not act accordingly. Specifically, the provider of an ext-proc service will see that their service works for Envoy deployment A (that allows cache clearing) but not for Envoy deployment B (that prevents cache clearing), but will not be able to debug why this is happening.
A better approach may be to "publish a contract" when connecting to the external-processor stating what features are enabled/disabled. @htuch WDYT?

@tyxia
Copy link
Member

tyxia commented Apr 12, 2023

Thanks for adding this! Left minor comments on the API.

A high-level comment (probably not in the scope of this PR), from the point-of-view of the external processor, it will be difficult to reason why the external-processor sent a response to clear the route cache, but the Envoy did not act accordingly. Specifically, the provider of an ext-proc service will see that their service works for Envoy deployment A (that allows cache clearing) but not for Envoy deployment B (that prevents cache clearing), but will not be able to debug why this is happening. A better approach may be to "publish a contract" when connecting to the external-processor stating what features are enabled/disabled. @htuch WDYT?

The context/intention (we probably could add it to PR description) is: third-party extension server (which send back response) can do whatever they want. On the other hand, our product (which create ext_proc filter through filter config) want to force clearing behavior off regardless extension's behavior (e.g., due to product requirement, security, etc). So, ext_proc don't really need to set up the contract with extension or report for permission.

I agree it is good to improve the observability/debuggability. We can add the log (,even stats) when common_response.clear_route_cache() == true && filter_.config().disableRouteCacheClearing() == true for this behavior. WDYT?

@adisuissa
Copy link
Contributor

The context/intention (we probably could add it to PR description) is: third-party extension server (which send back response) can do whatever they want. On the other hand, our product (which create ext_proc filter through filter config) want to force clearing behavior off regardless extension's behavior (e.g., due to product requirement, security, etc). So, ext_proc don't really need to set up the contract with extension or report for permission.

I agree that from a security perspective this is a valid feature. FWIW, other fields should also be restricted, depending on the threat-model.

I think we should address the general case, where the extension and the deployment are handled by different entities, and the extension can either be a third-part extensions server or a specific service. The case of an extensions server is probably easier (if the extension server and Envoy are managed by the same entity), as the extensions server should prevent accepting extensions that clear the routes, and this "contract" should be enforced by the extensions server.
The added benefits for a "contract" for the extensions server are: 1) it can clearly reason what kind of extensions can be executed using the Envoy; 2) allows the operators to debug things on their end and do not need to access Envoy's logs/stats. I think it is better to decouple the things as much as possible, and as things stand, if the Envoy deployer updates the ext-proc config on their side, they may break the third-party extension server.

I agree it is good to improve the observability/debuggability. We can add the log (,even stats) when common_response.clear_route_cache() == true && filter_.config().disableRouteCacheClearing() == true for this behavior. WDYT?

I thought of suggesting counters (IMHO logs are typically less useful in large deployments), but the downsides are that stats require more memory, are useful to know if an event happens but not so useful to understand what triggered that event (e.g., which response from the external processor caused it), and don't scale if a similar fix will be needed on different fields.

@tyxia
Copy link
Member

tyxia commented Apr 13, 2023

The context/intention (we probably could add it to PR description) is: third-party extension server (which send back response) can do whatever they want. On the other hand, our product (which create ext_proc filter through filter config) want to force clearing behavior off regardless extension's behavior (e.g., due to product requirement, security, etc). So, ext_proc don't really need to set up the contract with extension or report for permission.

I agree that from a security perspective this is a valid feature. FWIW, other fields should also be restricted, depending on the threat-model.

I think we should address the general case, where the extension and the deployment are handled by different entities, and the extension can either be a third-part extensions server or a specific service. The case of an extensions server is probably easier (if the extension server and Envoy are managed by the same entity), as the extensions server should prevent accepting extensions that clear the routes, and this "contract" should be enforced by the extensions server. The added benefits for a "contract" for the extensions server are: 1) it can clearly reason what kind of extensions can be executed using the Envoy; 2) allows the operators to debug things on their end and do not need to access Envoy's logs/stats. I think it is better to decouple the things as much as possible, and as things stand, if the Envoy deployer updates the ext-proc config on their side, they may break the third-party extension server.

Please first let me confirm I understand your "contract" and "this contract should be enforced by the extensions server" correctly. Are you suggesting : 1) ext_proc reports that setting clear_route_cache is off to the external processor (to which ext_proc call out) 2) External processor evaluate it and make the final decision whether clear route cache or not.

AFAIU, this is contrary to what we want to achieve here because the external processor can choose to ignore ext_proc setting and that is out of our control.

I think ext_proc can report its setting/feature to external processor to help their observability but our end goal here is enforcement on ext_proc side not external processor side. @htuch Please correct me if I misunderstand our goal.

@adisuissa
Copy link
Contributor

Please first let me confirm I understand your "contract" and "this contract should be enforced by the extensions server" correctly. Are you suggesting : 1) ext_proc reports that setting clear_route_cache is off to the external processor (to which ext_proc call out) 2) External processor evaluate it and make the final decision whether clear route cache or not.

AFAIU, this is contrary to what we want to achieve here because the external processor can choose to ignore ext_proc setting and that is out of our control.

I think ext_proc can report its setting/feature to external processor to help their observability but our end goal here is enforcement on ext_proc side not external processor side. @htuch Please correct me if I misunderstand our goal.

This is similar to how Envoy talks with an xDS-server and sends a list of all supported extensions. It allows the server to decide what kind of configs to send back. If the server sends a config with a wrong extension, the client can reject the config.

In the ext-proc case, the Envoy will send the "contract" (supported abilities). If it is configured to disable route-cleaning, it will disregard whatever the ext-proc service returns in the field (just as this PR suggests). The only difference is that the ext-proc service operator can clearly observe that the Envoy that connects to it does or doesn't support some feature and they can debug things on their end without needing to look at the logs/counters on the Envoy side.

@tyxia
Copy link
Member

tyxia commented Apr 14, 2023

Please first let me confirm I understand your "contract" and "this contract should be enforced by the extensions server" correctly. Are you suggesting : 1) ext_proc reports that setting clear_route_cache is off to the external processor (to which ext_proc call out) 2) External processor evaluate it and make the final decision whether clear route cache or not.
AFAIU, this is contrary to what we want to achieve here because the external processor can choose to ignore ext_proc setting and that is out of our control.
I think ext_proc can report its setting/feature to external processor to help their observability but our end goal here is enforcement on ext_proc side not external processor side. @htuch Please correct me if I misunderstand our goal.

This is similar to how Envoy talks with an xDS-server and sends a list of all supported extensions. It allows the server to decide what kind of configs to send back. If the server sends a config with a wrong extension, the client can reject the config.

In the ext-proc case, the Envoy will send the "contract" (supported abilities). If it is configured to disable route-cleaning, it will disregard whatever the ext-proc service returns in the field (just as this PR suggests). The only difference is that the ext-proc service operator can clearly observe that the Envoy that connects to it does or doesn't support some feature and they can debug things on their end without needing to look at the logs/counters on the Envoy side.

I am not sure if we need full/similar capability of xDS for service callout since data <->control plane seems to be more complicated (at least at this moment). Though, the mechanism you described sound like a good design . Maybe we can evaluate to see if a subset of feature will be sufficient

If it is configured to disable route-cleaning, it will disregard whatever the ext-proc service returns in the field (just as this PR suggests)

Since we need this discard behavior anyways to enable ext_proc side enforcement, then IMHO we can proceed with this PR. Adding contracts can be a separate part that will be added later (e.g., as post-MVP) if it is determined to be desired for service callout. WDYT?

@adisuissa
Copy link
Contributor

Since we need this discard behavior anyways to enable ext_proc side enforcement, then IMHO we can proceed with this PR. Adding contracts can be a separate part that will be added later (e.g., as post-MVP) if it is determined to be desired for service callout. WDYT?

Yeah, as stated in the first comment, this is probably not part of this PR.

@htuch
Copy link
Member

htuch commented Apr 14, 2023

@adisuissa yeah, I can see this being something to add in the future as the range of behaviors becomes harder to debug.

I do think we need something here that is more than silent failure though, that is basically impossible to debug, even in the situation where you control client / server. Typically a stat would make sense and/or we could consider this a protocol violation and terminate the connection with the ext_proc server.

Copy link
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, Could you add the stat to track the case that common_response.clear_route_cache() = true && filter_.config().disableRouteCacheClearing() = true , as discussed in the comments.

Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
Left minor comments on the API.
Also +1 on the prior comment to add a stat to track this.
/wait

Signed-off-by: Joseph Straceski <jstraceski@google.com>
Signed-off-by: Joseph Straceski <jstraceski@google.com>
@adisuissa
Copy link
Contributor

CI error:
/tmp/tmp2etuagir/generated/rst/version_history/v1.27/v1.27.0.rst:76: WARNING: undefined label: 'envoy_v3_api_field_service.ext_proc.v3.externalprocessor.disable_clear_route_cache'

seems that the field identifier is incorrect. Can you PTAL?

@adisuissa
Copy link
Contributor

Snow is OOO.
@wbpcode can you review this code as a non-Googler?
/assign @wbpcode

Signed-off-by: Joseph Straceski <jstraceski@google.com>
Signed-off-by: Joseph Straceski <jstraceski@google.com>
Copy link
Member

@wbpcode wbpcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contributions. And only some comments need to be addressed, others LGTM.

@wbpcode
Copy link
Member

wbpcode commented Apr 29, 2023

/wait

Signed-off-by: Joseph Straceski <jstraceski@google.com>
tyxia
tyxia previously approved these changes Apr 30, 2023
Copy link
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Signed-off-by: Joseph Straceski <jstraceski@google.com>
wbpcode
wbpcode previously approved these changes May 1, 2023
Copy link
Member

@wbpcode wbpcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

@wbpcode
Copy link
Member

wbpcode commented May 1, 2023

ping @adisuissa to merge after ci is done.

tyxia
tyxia previously approved these changes May 1, 2023
@jstraceski
Copy link
Contributor Author

@adisuissa /ping should be all set to be merged, I believe.

@adisuissa
Copy link
Contributor

Needs a senior maintainer review.
/assign @htuch

@jstraceski jstraceski changed the title Adding modification check to clearing the route cache of the ext_proc filter. Updating ext_proc filter clear route cache logic. May 4, 2023
Signed-off-by: Joseph Straceski <jstraceski@google.com>
@jstraceski jstraceski dismissed stale reviews from tyxia and wbpcode via 0e19f9c May 4, 2023 13:12
filter_callbacks_->downstreamCallbacks()->clearRouteCache();
} else {
filter_.stats().clear_route_cache_ignored_.inc();
ENVOY_LOG(debug, "NOT clearing route cache, no header mutations detected");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but can I ask why we bundled the "avoid clearing route cache when no header mutations" into this PR?

IMHO it probably makes sense, although there might be valid reasons (albeit contrived) to use ext_proc to force a repick without this, e.g. if an earlier filter had set some headers and we're not having them take effect. As I said, a bit contrived and it's probably fine, just not sure why we needed it in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had included it in the scope of this pr earlier on because I had assumed it was part of the plan for the change, it was mentioned in one of the readiness documents in some notes. It seemed related enough to include.
Though, it does make sense to split it up into another pr. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, if earlier filter mutate the header to affect route selection, technically that filter should clear route cache itself in place. Leaving it to other filters to clear seems not very robust behavior to me.

Yea, i don't have strong opinion. Either this PR or separate PR works for me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I don't see any objections to combining, but please keep in mind for future PRs that ideally we have orthogonal PRs to support things like bisecting and change / risk management.

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@htuch htuch merged commit a3f01a5 into envoyproxy:main May 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants