Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ext_authz(http): Envoy turns AuthorizationResponse status 500 into 403 and drops body #14001

Open
yanniszark opened this issue Nov 12, 2020 · 27 comments

Comments

@yanniszark
Copy link

Title: ext_authz(http): Envoy turns AuthorizationResponse status 500 into 403 and drops body

Description:

  1. User makes an HTTP request against Envoy.
  2. Envoy makes AuthorizationRequest to ext_authz filter's http server.
  3. ext_authz server replies with 503 and a body.
  4. Envoy returns 403 and no body.

Repro steps:
This came up as part of a project so it's easier to include instructions for that project:

# Clone and checkout branch made for reproducing this issue (see last commit)
git clone git@github.com:arrikto/oidc-authservice.git
cd oidc-authservice
git checkout feature-envoy-ext-authz-repo

# Run the E2E test to get a K3d cluster up with everything running.
# The test will fail, but it's expected.
make bin/deps
make e2e

# Port-forward the service locally
export KUBECONFIG=$HOME/.k3d/kubeconfig-e2e-test-cluster.yaml
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80

# Make a request and see that it returns a 403
curl -v http://localhost:8080

Config:
Here is the config dump taken from port 15000 of the istio ingressgateway proxy:
config_dump.log

Logs:
access_log.log

@yanniszark yanniszark added bug triage Issue requires triage labels Nov 12, 2020
@derekargueta
Copy link
Member

I don't think this a bug, handling errors from the auth service service is either fail-open or fail-closed (default).

if (config_->failureModeAllow()) {
ENVOY_STREAM_LOG(trace, "ext_authz filter allowed the request with error", *callbacks_);
stats_.failure_mode_allowed_.inc();
if (cluster_) {
config_->incCounter(cluster_->statsScope(), config_->ext_authz_failure_mode_allowed_);
}
continueDecoding();
} else {
ENVOY_STREAM_LOG(
trace, "ext_authz filter rejected the request with an error. Response status code: {}",
*callbacks_, enumToInt(config_->statusOnError()));
callbacks_->streamInfo().setResponseFlag(
StreamInfo::ResponseFlag::UnauthorizedExternalService);
callbacks_->sendLocalReply(config_->statusOnError(), EMPTY_STRING, nullptr, absl::nullopt,
RcDetails::get().AuthzError);
}

Proxying an auth service error to the client may just be a feature request. You can set the error status code with status_on_error if you want to return a 500 instead

@mattklein123 mattklein123 added question Questions that are neither investigations, bugs, nor enhancements and removed bug triage Issue requires triage labels Nov 13, 2020
@yanniszark
Copy link
Author

yanniszark commented Nov 14, 2020

@derekargueta the documentation for status_on_error says:

Sets the HTTP status that is returned to the client when there is a network error between the filter and the authorization server. The default status is HTTP 403 Forbidden.

However, this is not the case. If I understand correctly, filter == Envoy and authorization server == my http authservice. In this case, there is no network error between them, but something went wrong in the authservice.

In addition, when I tried returning error code 400 Bad Request, the error code was not transformed into a 403 and the message was kept.

Which means this is a problem only with 500 (and maybe 5xx) http codes.
In the ext_authz code, is the case of status 500 and network failure between envoy/authservice handled as the same case? If so, why?
It seems weird to handle other codes (3xx, 4xx) correctly, but not 5xx codes.

@yanniszark
Copy link
Author

kind ping @derekargueta @mattklein123

@esmet
Copy link
Contributor

esmet commented Dec 10, 2020

This issue #4124 and the associated PR #4199 introduced this behavior, I think.

@yanniszark
Copy link
Author

Thanks @esmet! I see this is triaged as a question and I don't think it's correct.
@mattklein123 @derekargueta would you agree this is a bug, as described here:
#14001 (comment)

The ext_authz filter needs to distinguish between a network error and getting back HTTP 500, in which case it could reach the AuthService successfully.

@esmet
Copy link
Contributor

esmet commented Dec 10, 2020

@yanniszark I agree. It sounds like we want to add configuration to set the definition of an error for the external authorization server. Similar to https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-retry-on but of course we can tweak it so that it makes the most sense for the ext_authz use case. What do you think?

@alyssawilk feel free to assign this one to me

@yanniszark
Copy link
Author

yanniszark commented Dec 10, 2020

It sounds like we want to add configuration to set the definition of an error for the external authorization server

@esmet sorry, I'm not sure I understood that. IMO, all error returned by the AuthService should be surfaced to the downstream client. So if the AuthService returns 503 with a body, this response will be proxied back to the client, same as other non-2xx codes (3xx, 4xx).

@esmet
Copy link
Contributor

esmet commented Dec 10, 2020

By definition of error, I mean the case where a call to the authorization service is considered "failed" as opposed to "completed, but with an error".

Today, the ext_authz filter considers "completed, but with a 5xx error" to be "failed" though one could argue that the call succeeded, it just came back with a 5xx denial. This would fit the use case that you are describing.

We could add configuration to allow you to specify what you consider a failure. It could be just network errors, or network errors + 5xx errors, or something else, sort of like the 'retry-on' logic I linked above.

The default value for this configuration would be the current behavior, which is network errors and 5xx codes. For your use case, you'd configure it to only consider network errors as "failures" (for which the response code to the client is the preconfigured error response code) and everything else would be considered successful communication to the auth service, such that 2XX means pass through and everything else means deny with the code, headers, and body provided by the auth service.

@yanniszark
Copy link
Author

@esmet got it! IMO the current behavior of

Today, the ext_authz filter considers "completed, but with a 5xx error" to be "failed"

is a bug, but if you feel it makes sense to keep as a feature then all the power to you!

@dio
Copy link
Member

dio commented Dec 11, 2020

I agree that probably making the "definition of errors" to be configurable seems useful.

@esmet
Copy link
Contributor

esmet commented Dec 11, 2020

@esmet got it! IMO the current behavior of

Today, the ext_authz filter considers "completed, but with a 5xx error" to be "failed"

is a bug, but if you feel it makes sense to keep as a feature then all the power to you!

I could see why someone would want 5XX errors to be considered failures and not legitimate responses to send downstream. For example, one might not want someone to know that their auth service is actually down (or dying) or if some proxy between Envoy and the auth service is unhealthy.

At any rate, the existing behavior is now canon so that's the primary reason we'd to keep it as the default. @dio do you think mimicking the retry-on configuration is the right path forward?

I looked into the code and was sort of surprised to see that the retry-on values (gateway-error, 5xx, etc...) is largely implemented in code (retry_state_impl.cc) as opposed to a proto API that we could just reuse.

@dio
Copy link
Member

dio commented Dec 11, 2020

do you think mimicking the retry-on configuration is the right path forward?

I think so. Yes, it actually controlled by an array of pre-defined strings. Probably we can have something like: failed_on: ['gateway-error', '5xx']?. Thoughts?

@esmet
Copy link
Contributor

esmet commented Dec 12, 2020

do you think mimicking the retry-on configuration is the right path forward?

I think so. Yes, it actually controlled by an array of pre-defined strings. Probably we can have something like: failed_on: ['gateway-error', '5xx']?. Thoughts?

Sounds right. I'll try this soon.

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jan 11, 2021
@github-actions
Copy link

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@yanniszark
Copy link
Author

This is very much still an issue!

@esmet
Copy link
Contributor

esmet commented Jan 18, 2021

Let's move this to "no stale bot"

@mattklein123 mattklein123 reopened this Jan 18, 2021
@mattklein123 mattklein123 added help wanted Needs help! and removed question Questions that are neither investigations, bugs, nor enhancements stale stalebot believes this issue/PR has not been touched recently labels Jan 18, 2021
@ninadm
Copy link

ninadm commented Jun 30, 2021

No new activity on this yet? it would be super useful to differentiate between network drops vs 5xx errors

@mszlgr
Copy link

mszlgr commented Nov 15, 2021

Is this one still on road map? +1 on "it would be super useful"

@erplsf
Copy link

erplsf commented May 24, 2022

there are people that still need this - for example we have a flow where we return 5xx codes but they are intended to be handled specially by some applications.

@imrenagi
Copy link
Contributor

hi, can I pick this up? Thanks!

@dio
Copy link
Member

dio commented Sep 23, 2022

@imrenagi I think you can take it. Please let me know if you need help. Thanks!

cc. @ggreenway.

@dio
Copy link
Member

dio commented Oct 28, 2022

@imrenagi, after re-reading #14001 (comment). I think I agree with @yanniszark. So, we need to introduce runtime guarding, to deprecate the current behavior (with runtime guarding we allow users to fallback to the previous (current) behavior).

Basically, you need the following patch:

diff --git a/source/common/runtime/runtime_features.cc b/source/common/runtime/runtime_features.cc
index 36832d73f5..47e4297c58 100644
--- a/source/common/runtime/runtime_features.cc
+++ b/source/common/runtime/runtime_features.cc
@@ -43,6 +43,7 @@ RUNTIME_GUARD(envoy_reloadable_features_delta_xds_subscription_state_tracking_fi
 RUNTIME_GUARD(envoy_reloadable_features_do_not_await_headers_on_upstream_timeout_to_emit_stats);
 RUNTIME_GUARD(envoy_reloadable_features_do_not_count_mapped_pages_as_free);
 RUNTIME_GUARD(envoy_reloadable_features_enable_compression_bomb_protection);
+RUNTIME_GUARD(envoy_reloadable_features_ext_authz_http_service_5xx_is_denied);
 RUNTIME_GUARD(envoy_reloadable_features_fix_hash_key);
 RUNTIME_GUARD(envoy_reloadable_features_get_route_config_factory_by_type);
 RUNTIME_GUARD(envoy_reloadable_features_http2_delay_keepalive_timeout);
diff --git a/source/extensions/filters/common/ext_authz/ext_authz_http_impl.cc b/source/extensions/filters/common/ext_authz/ext_authz_http_impl.cc
index 3157136b43..a5d02c28dc 100644
--- a/source/extensions/filters/common/ext_authz/ext_authz_http_impl.cc
+++ b/source/extensions/filters/common/ext_authz/ext_authz_http_impl.cc
@@ -317,11 +317,14 @@ void RawHttpClientImpl::onBeforeFinalizeUpstreamSpan(
 ResponsePtr RawHttpClientImpl::toResponse(Http::ResponseMessagePtr message) {
   const uint64_t status_code = Http::Utility::getResponseStatus(message->headers());
 
-  // Set an error status if the call to the authorization server returns any of the 5xx HTTP error
-  // codes. A Forbidden response is sent to the client if the filter has not been configured with
-  // failure_mode_allow.
-  if (Http::CodeUtility::is5xx(status_code)) {
-    return std::make_unique<Response>(errorResponse());
+  if (!Runtime::runtimeFeatureEnabled(
+          "envoy.reloadable_features.ext_authz_http_service_5xx_is_denied")) {
+    // Set an error status if the call to the authorization server returns any of the 5xx HTTP error
+    // codes. A Forbidden response is sent to the client if the filter has not been configured with
+    // failure_mode_allow.
+    if (Http::CodeUtility::is5xx(status_code)) {
+      return std::make_unique<Response>(errorResponse());
+    }
   }
 
   // Extract headers-to-remove from the storage header coming from the

Hence with this, when this RawHttpClientImpl receives 5xx, it returns a response object with 5xx as the status code, with a non-empty body and all.

Note I put ext_authz_http_service_5xx_is_denied (meaning we are now classifying 5xx as a denied response, not an error one) as the name for the flag, feel free to choose a better name when you have it! (probably the longer version: ext_authz_http_service_5xx_is_denied_instead_of_error).

Please add unit and integration tests.

Good luck!

@imrenagi
Copy link
Contributor

Sure @dio . Will take a look and try to update my PR. thanks!

@dragoscalinescu
Copy link

+1 on this request.

There should be a way to make a difference between network failures when communicating with the external authorization services vs the external authorization service denies the request.

@yanniszark have you found any workarounds for this in the meantime?

@yanniszark
Copy link
Author

@dragoscalinescu the obvious workaround is to send a 40x error, which should go through, and include extra information in the response body.
Otherwise, I think getting involved in the PRs is the quickest way to solve this.
Seems like there was interest but the code didn't make it all the way though to upsteam.

@dragoscalinescu
Copy link

@yanniszark thanks for your reply. Initially I thought every error status code gets converted to the status_on_error value.

I figured it out you can do a 4xx from the authz service to bypass this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet