Considerable amount of GRPC error details can trigger ingress response header limit #11284

Open
megglos opened this issue Dec 16, 2022 · 10 comments
Assignees
Labels
component/gateway component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug likelihood/high A recurring issue scope/gateway Marks an issue or PR to appear in the gateway section of the changelog severity/high Marks a bug as having a noticeable impact on the user with no known workaround

Comments

@megglos
Contributor

megglos commented Dec 16, 2022

Description

In some cases, such as an invalid BPMN model, a considerable number of errors can occur. These are passed back to the client via the grpc-status-details-bin HTTP header, see.
When Zeebe is accessed through a reverse proxy/ingress like nginx, this behavior can exceed the proxy's response header limits, resulting in error logs such as:

[error] 2243#2243: *1868588 upstream sent too big header while reading response header from upstream

leading to a 502 response to the client by the proxy.
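
To illustrate why the limit is easy to hit: gRPC transmits binary metadata (keys ending in -bin) base64-encoded, so the header is roughly a third larger than the raw payload. The following is a rough sketch only. It assumes a 4 KiB proxy header buffer (the real limit depends on the proxy configuration), and it counts just the raw message bytes, which underestimates the actual size because the serialized status adds its own overhead. The function name and the sample messages are hypothetical.

```python
import base64

# Assumption: the effective limit depends on the proxy's buffer settings.
ASSUMED_HEADER_LIMIT_BYTES = 4096


def estimated_header_bytes(messages):
    """Lower-bound estimate of the base64-encoded size of the error details."""
    raw = "\n".join(messages).encode("utf-8")
    return len(base64.b64encode(raw))


# Hypothetical validation messages, one per invalid element.
messages = [
    f"Element: task_{i} - ERROR: Must have a non-empty job type"
    for i in range(100)
]
size = estimated_header_bytes(messages)
print(size, size > ASSUMED_HEADER_LIMIT_BYTES)
```

With around a hundred such messages the estimate is already well above the assumed buffer size, which matches the "too big header" symptom above.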

We may consider limiting the number of error details returned, e.g. to 5 entries, followed by a summary entry if there are more errors ("and N more errors"). This would still allow enforcing a response header limit on any client-facing proxy while making sure the client gets back useful information to actually fix the model errors, no matter how many errors they provoked.
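
A minimal sketch of that idea, written in Python for brevity and not taken from the Zeebe code base. The function name `limit_error_details` and the plain list-of-strings representation of the errors are assumptions made for illustration.

```python
def limit_error_details(errors, max_entries=5):
    """Keep at most max_entries errors and append a summary entry for the rest.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if max_entries < 0:
        raise ValueError("max_entries must not be negative")
    if len(errors) <= max_entries:
        return list(errors)
    remaining = len(errors) - max_entries
    return list(errors[:max_entries]) + [f"and {remaining} more errors"]


# Example: eight errors are reduced to five entries plus one summary entry.
limited = limit_error_details([f"error {i}" for i in range(1, 9)])
for entry in limited:
    print(entry)
```

Note that capping the number of entries does not by itself bound the header size, since individual error messages can be arbitrarily long.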

SUPPORT-15410

@megglos megglos added the kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. label Dec 16, 2022
@Zelldon
Member

Zelldon commented Dec 16, 2022

See related discussion and investigation in slack https://camunda.slack.com/archives/CT702EPFH/p1671124412586789

@Zelldon Zelldon added scope/gateway Marks an issue or PR to appear in the gateway section of the changelog component/gateway labels Jan 4, 2023
@korthout
Member

@megglos Can you clarify for us what the chances are of running into this situation? And what is the impact? Does the user receive a response at all?

@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024
@jfriedenstab

Hi @korthout and @megglos,

I'm finally following up on this issue from the Web Modeler side, as we're still seeing the 502 Bad Gateway error quite often in our logs (> 300 times over the last 90 days). So the chances of running into this situation seem to be relatively high :).

The user will actually receive a response, but it's not really helpful:

image

Do you see any chances that the problem could be fixed on the Zeebe side any time soon?

@megglos
Contributor Author

megglos commented Mar 20, 2024

@jfriedenstab 300 out of how many total deployment requests? 🤓

Would you consider a truncated error like suggested here

We may consider limiting the amount of error details returned to e.g. a limit of 5 entries, followed by a summary error entry if there are more errors "and N more errors"

would be a good solution to educate the user about the modeling errors?

The user would at least become aware of some specific errors and, once those are resolved, learn about the remaining ones. On the other hand, wouldn't it be better to raise those issues directly in the modeler via validation, to provide more helpful errors there (e.g. on the affected elements)? If I'm not mistaken we have such validation already, and I wonder how users can still deploy invalid models from the modeler 🤔

@jfriedenstab

jfriedenstab commented Mar 20, 2024

Thanks a lot for your quick reply, @megglos!

300 out of how many total deployment requests? 🤓

I can't tell you the total number of requests in the 90-day period, but in the last 24 hours alone there were ~1200 deployment requests. Ok, so the ratio of requests that failed with the 502 error is pretty small 😊.

Would you consider a truncated error like suggested here would be a good solution?

Yes, I'd say it's a good solution 👍🏻.

Wouldn't it be better to raise those issues directly in the modeler via validation to provide more helpful errors directly in the modeler (e.g on the affected elements)?

I agree that, ideally, we would already show the user all the model errors before they try to deploy. However, I think the linting/validation rules that are in place are not able to catch all the possible error cases. So, some errors would only be detected when the model is actually deployed.

I wonder how users can deploy invalid models from the modeler still 🤔

True, we could prevent deployments if the linting detects errors. If I remember correctly though, we took a conscious decision a while ago to allow deployments anyway (sorry, I don't recall the details; edit: the reasoning behind this decision can be found in this Slack thread).

Anyway, models can also be deployed from outside the Web Modeler (via zbctl, etc.). So it would probably make sense to return a more helpful error message for these scenarios, too.

@jfriedenstab

Hi @megglos,
Just a friendly reminder about this issue 🙂.
FYI: It might occur more often in Web Modeler when we release process applications (if there are multiple invalid files in a deployment bundle).

@korthout korthout added kind/bug Categorizes an issue or PR as a bug severity/high Marks a bug as having a noticeable impact on the user with no known workaround and removed kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. labels May 1, 2024
@korthout
Member

korthout commented May 1, 2024

Thanks @jfriedenstab. I've moved it back into our inbox for triage.

As I understand, the problem leads to errors in the reverse proxy and the client not receiving the error response. That sounds like a bug to me. It's also high severity as there is no workaround available.

I find it difficult to specify a likelihood for this.

@jfriedenstab

Thank you @korthout!

FYI: The Zeebe Java client will receive the following error response:

io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP status code 502
invalid content-type: text/html
headers: Metadata(:status=502,date=Wed, 28 Feb 2024 23:11:02 GMT,content-type=text/html,strict-transport-security=max-age=63072000; includeSubDomains,content-length=150)
DATA-----------------------------
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>

Regarding the likelihood: The problem seems to occur regularly. Sentry (that we use for error tracking in Web Modeler) recorded 433 events from 75 different users over the last 90 days.

@megglos megglos added the likelihood/high A recurring issue label May 6, 2024
@megglos
Contributor Author

megglos commented May 6, 2024

ZPA-Triage:

  • we consider the likelihood high - thanks @jfriedenstab for the data ❤
  • solution idea: truncate the error message to a configurable character limit (most primitive)
  • we could check what limit we have in our infra on SaaS to determine a default value - magic
  • it would be worth checking if there is some other way to transmit the error, but we can consider that out of scope

@jfriedenstab

it would be worth checking if there is some other way to transmit the error, but we can consider that out of scope

Maybe something to consider for the future when you add a deploy endpoint to the C8/Zeebe REST API (getting the errors back in a more structured format would also be nice) 🙃.
