Updating ext_proc filter clear route cache logic.#26388
Updating ext_proc filter clear route cache logic.#26388htuch merged 26 commits intoenvoyproxy:mainfrom
Conversation
|
/assign @tyxia |
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
There was a problem hiding this comment.
Thanks for working on this! Left some comments.
Besides:
- Could you run the command
sudo ./ci/run_envoy_docker.sh '\''./ci/do_ci.sh format'\'' && sh bazel/setup_clang.sh /usr'to try to fix format, so that we can kick off CI/CD - Could you add a simple unit test? You probably can refer to
ClearRouteCachetest in ext_proc/filter_test.cc
source/extensions/filters/common/mutation_rules/mutation_rules.h
Outdated
Show resolved
Hide resolved
7e20490 to
67c352e
Compare
|
/retest |
|
Retrying Azure Pipelines: |
adisuissa
left a comment
There was a problem hiding this comment.
Thanks for adding this!
Left minor comments on the API.
A high-level comment (probably not in the scope of this PR), from the point-of-view of the external processor, it will be difficult to reason why the external-processor sent a response to clear the route cache, but the Envoy did not act accordingly. Specifically, the provider of an ext-proc service will see that their service works for Envoy deployment A (that allows cache clearing) but not for Envoy deployment B (that prevents cache clearing), but will not be able to debug why this is happening.
A better approach may be to "publish a contract" when connecting to the external-processor stating what features are enabled/disabled. @htuch WDYT?
The context/intention (we probably could add it to PR description) is: third-party extension server (which send back response) can do whatever they want. On the other hand, our product (which create ext_proc filter through filter config) want to force clearing behavior off regardless extension's behavior (e.g., due to product requirement, security, etc). So, ext_proc don't really need to set up the contract with extension or report for permission. I agree it is good to improve the observability/debuggability. We can add the log (,even stats) when |
I agree that from a security perspective this is a valid feature. FWIW, other fields should also be restricted, depending on the threat-model. I think we should address the general case, where the extension and the deployment are handled by different entities, and the extension can either be a third-part extensions server or a specific service. The case of an extensions server is probably easier (if the extension server and Envoy are managed by the same entity), as the extensions server should prevent accepting extensions that clear the routes, and this "contract" should be enforced by the extensions server.
I thought of suggesting counters (IMHO logs are typically less useful in large deployments), but the downsides are that stats require more memory, are useful to know if an event happens but not so useful to understand what triggered that event (e.g., which response from the external processor caused it), and don't scale if a similar fix will be needed on different fields. |
Please first let me confirm I understand your "contract" and "this contract should be enforced by the extensions server" correctly. Are you suggesting : 1) ext_proc reports that setting AFAIU, this is contrary to what we want to achieve here because the external processor can choose to ignore ext_proc setting and that is out of our control. I think ext_proc can report its setting/feature to external processor to help their observability but our end goal here is enforcement on ext_proc side not external processor side. @htuch Please correct me if I misunderstand our goal. |
This is similar to how Envoy talks with an xDS-server and sends a list of all supported extensions. It allows the server to decide what kind of configs to send back. If the server sends a config with a wrong extension, the client can reject the config. In the ext-proc case, the Envoy will send the "contract" (supported abilities). If it is configured to disable route-cleaning, it will disregard whatever the ext-proc service returns in the field (just as this PR suggests). The only difference is that the ext-proc service operator can clearly observe that the Envoy that connects to it does or doesn't support some feature and they can debug things on their end without needing to look at the logs/counters on the Envoy side. |
I am not sure if we need full/similar capability of xDS for service callout since data <->control plane seems to be more complicated (at least at this moment). Though, the mechanism you described sound like a good design . Maybe we can evaluate to see if a subset of feature will be sufficient
Since we need this discard behavior anyways to enable ext_proc side enforcement, then IMHO we can proceed with this PR. Adding contracts can be a separate part that will be added later (e.g., as post-MVP) if it is determined to be desired for service callout. WDYT? |
Yeah, as stated in the first comment, this is probably not part of this PR. |
|
@adisuissa yeah, I can see this being something to add in the future as the range of behaviors becomes harder to debug. I do think we need something here that is more than silent failure though, that is basically impossible to debug, even in the situation where you control client / server. Typically a stat would make sense and/or we could consider this a protocol violation and terminate the connection with the ext_proc server. |
tyxia
left a comment
There was a problem hiding this comment.
Mostly LGTM, Could you add the stat to track the case that common_response.clear_route_cache() = true && filter_.config().disableRouteCacheClearing() = true , as discussed in the comments.
Signed-off-by: Joseph Straceski <jstraceski@google.com>
Signed-off-by: Joseph Straceski <jstraceski@google.com>
|
CI error: seems that the field identifier is incorrect. Can you PTAL? |
Signed-off-by: Joseph Straceski <jstraceski@google.com>
Signed-off-by: Joseph Straceski <jstraceski@google.com>
wbpcode
left a comment
There was a problem hiding this comment.
Thanks for your contributions. And only some comments need to be addressed, others LGTM.
|
/wait |
Signed-off-by: Joseph Straceski <jstraceski@google.com>
Signed-off-by: Joseph Straceski <jstraceski@google.com>
|
ping @adisuissa to merge after ci is done. |
|
@adisuissa /ping should be all set to be merged, I believe. |
|
Needs a senior maintainer review. |
Signed-off-by: Joseph Straceski <jstraceski@google.com>
| filter_callbacks_->downstreamCallbacks()->clearRouteCache(); | ||
| } else { | ||
| filter_.stats().clear_route_cache_ignored_.inc(); | ||
| ENVOY_LOG(debug, "NOT clearing route cache, no header mutations detected"); |
There was a problem hiding this comment.
LGTM, but can I ask why we bundled the "avoid clearing route cache when no header mutations" into this PR?
IMHO it probably makes sense, although there might be valid reasons (albeit contrived) to use ext_proc to force a repick without this, e.g. if an earlier filter had set some headers and we're not having them take effect. As I said, a bit contrived and it's probably fine, just not sure why we needed it in this PR.
There was a problem hiding this comment.
I had included it in the scope of this pr earlier on because I had assumed it was part of the plan for the change, it was mentioned in one of the readiness documents in some notes. It seemed related enough to include.
Though, it does make sense to split it up into another pr. What do you think?
There was a problem hiding this comment.
IMHO, if earlier filter mutate the header to affect route selection, technically that filter should clear route cache itself in place. Leaving it to other filters to clear seems not very robust behavior to me.
Yea, i don't have strong opinion. Either this PR or separate PR works for me.
There was a problem hiding this comment.
OK, I don't see any objections to combining, but please keep in mind for future PRs that ideally we have orthogonal PRs to support things like bisecting and change / risk management.
Commit Message: Modifying clear route cache logic for the ext_proc filter.
Additional Description: Adding in the ability to disable route cache clearing ont he ext_proc filter. Changing the behavior of the clearing to check if there is a header mutation.
Risk Level: Low
Testing: test/extensions/filters/http/ext_proc/filter_test.cc
Docs Changes: docs/root/configuration/http/http_filters/ext_proc_filter.rst
Release Notes: changelogs/current.yaml
Platform Specific Features: