Considerable amount of GRPC error details can trigger ingress response header limit #11284

megglos · 2022-12-16T07:03:48Z

Description

In some cases, such as an invalid BPMN model, a considerable number of errors can occur. These are passed back to the client via the grpc-status-details-bin HTTP header, see.
When Zeebe is accessed through a reverse proxy/ingress such as nginx, this behavior can exceed the proxy's response header limit, resulting in error logs such as:

[error] 2243#2243: *1868588 upstream sent too big header while reading response header from upstream

leading to a 502 response to the client by the proxy.
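
To get a feel for how quickly the header grows: gRPC transmits `-bin` metadata values base64-encoded, so every 3 bytes of error details become roughly 4 header characters. The sketch below estimates the header size for a list of error messages. The 4096-byte limit and the sample message text are assumptions for illustration (the actual limit depends on the proxy configuration), and the real header carries a serialized status message with additional framing that is ignored here.

```python
import base64

# Assumed proxy header buffer size in bytes; the real value depends on
# the proxy configuration and is not taken from this issue.
ASSUMED_HEADER_LIMIT = 4096


def encoded_header_size(messages):
    """Estimate the base64-encoded size of the joined error messages.

    Protobuf framing is ignored, so the real header is somewhat larger.
    """
    raw = "\n".join(messages).encode("utf-8")
    return len(base64.b64encode(raw))


def exceeds_limit(messages, limit=ASSUMED_HEADER_LIMIT):
    """Return True if the estimated header size is above the limit."""
    return encoded_header_size(messages) > limit


# Hypothetical validation error, repeated once per invalid element.
sample = ["ERROR: element 'task_%d' must have a valid job type" % i
          for i in range(100)]
```

With these assumed numbers, a handful of errors fits comfortably, while a model with a hundred invalid elements would not.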

We may consider limiting the amount of error details returned to e.g. a limit of 5 entries, followed by a summary error entry if there are more errors "and N more errors". This would still allow enforcing a response header limit on any client-facing proxy while making sure the client gets back useful information to actually fix the model errors, no matter how many errors they provoked.
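
The proposed limit could look roughly like the following. This is a minimal sketch, not Zeebe code: the function name `limit_error_details` and the representation of errors as a list of strings are assumptions made for illustration.

```python
def limit_error_details(errors, max_entries=5):
    """Keep the first max_entries errors and summarize the rest.

    errors is assumed to be a list of error message strings.
    """
    if len(errors) <= max_entries:
        return list(errors)
    remaining = len(errors) - max_entries
    noun = "error" if remaining == 1 else "errors"
    # Append one summary entry in place of the dropped errors.
    return list(errors[:max_entries]) + ["and %d more %s" % (remaining, noun)]
```

Note that a count limit alone does not strictly bound the header size if individual entries are long.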

SUPPORT-15410


Zelldon · 2022-12-16T07:13:57Z

See related discussion and investigation in Slack https://camunda.slack.com/archives/CT702EPFH/p1671124412586789

korthout · 2023-01-20T09:37:37Z

@megglos Can you clarify for us what the chances are of running into this situation? And what is the impact? Does the user receive a response at all?

jfriedenstab · 2024-03-20T14:21:05Z

Hi @korthout and @megglos,

I'm finally following up on this issue from the Web Modeler side, as we're still seeing the 502 Bad Gateway error quite often in our logs (> 300 times over the last 90 days). So the chances of running into this situation seem to be relatively high :).

The user will actually receive a response, but it's not really helpful:

Do you see any chances that the problem could be fixed on the Zeebe side any time soon?

megglos · 2024-03-20T15:02:42Z

@jfriedenstab 300 out of how many total deployment requests? 🤓

Would you consider a truncated error like suggested here

We may consider limiting the amount of error details returned to e.g. a limit of 5 entries, followed by a summary error entry if there are more errors "and N more errors"

would be a good solution to educate the user about the modeling errors?

The user may at least be aware of some specific errors and, once those are resolved, may learn about the remaining ones. On the other hand, wouldn't it be better to raise those issues directly in the modeler via validation to provide more helpful errors directly in the modeler (e.g. on the affected elements)? If I'm not mistaken we have such validation already and I wonder how users can still deploy invalid models from the modeler 🤔

jfriedenstab · 2024-03-20T16:37:46Z

Thanks a lot for your quick reply, @megglos!

300 out of how many total deployment requests? 🤓

I can't tell you the total number of requests in the 90-day period, but in the last 24 hours alone there were ~1200 deployment requests. Ok, so the ratio of requests that failed with the 502 error is pretty small 😊.

Would you consider a truncated error like suggested here would be a good solution?

Yes, I'd say it's a good solution 👍🏻.

Wouldn't it be better to raise those issues directly in the modeler via validation to provide more helpful errors directly in the modeler (e.g on the affected elements)?

I agree that, ideally, we would already show the user all the model errors before they try to deploy. However, I think the linting/validation rules that are in place are not able to catch all the possible error cases. So, some errors would only be detected when the model is actually deployed.

I wonder how users can deploy invalid models from the modeler still 🤔

True, we could prevent deployments if the linting detects errors. If I remember correctly though, we took a conscious decision a while ago to allow deployments anyway (~~sorry, I don't recall the details~~; edit: the reasoning behind this decision can be found in this Slack thread).

Anyway, models can also be deployed from outside the Web Modeler (via zbctl, etc.). So it would probably make sense to return a more helpful error message for these scenarios, too.

jfriedenstab · 2024-04-26T13:21:52Z

Hi @megglos,
Just a friendly reminder about this issue 🙂.
FYI: It might occur more often in Web Modeler when we release process applications (if there are multiple invalid files in a deployment bundle).

korthout · 2024-05-01T14:46:28Z

Thanks @jfriedenstab. I've moved it back into our inbox for triage.

As I understand, the problem leads to errors in the reverse proxy and the client not receiving the error response. That sounds like a bug to me. It's also high severity as there is no workaround available.

I find it difficult to specify a likelihood for this.

jfriedenstab · 2024-05-03T12:03:37Z

Thank you @korthout!

FYI: The Zeebe Java client will receive the following error response:

io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP status code 502
invalid content-type: text/html
headers: Metadata(:status=502,date=Wed, 28 Feb 2024 23:11:02 GMT,content-type=text/html,strict-transport-security=max-age=63072000; includeSubDomains,content-length=150)
DATA-----------------------------
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
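
Until the header size is bounded on the Zeebe side, a client could at least recognize this case and show a more helpful message than the raw HTML. A minimal sketch, assuming the client has the gRPC status name and description available as strings; the function names and the hint text are hypothetical, and the check relies only on the description format shown above.

```python
def is_proxy_bad_gateway(status_code, description):
    """Return True if a gRPC error looks like a 502 produced by a proxy.

    status_code is the gRPC status name (e.g. "UNAVAILABLE") and
    description is the error text reported by the client library.
    """
    return status_code == "UNAVAILABLE" and "HTTP status code 502" in description


def user_hint(status_code, description):
    """Map a proxy 502 to a readable hint; pass other errors through."""
    if is_proxy_bad_gateway(status_code, description):
        return ("The gateway response could not be passed through the proxy "
                "(502). One possible cause is a deployment with many "
                "validation errors.")
    return description
```

A 502 can have other causes as well, so the hint names the oversized error response only as one possibility.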

Regarding the likelihood: The problem seems to occur regularly. Sentry (which we use for error tracking in Web Modeler) recorded 433 events from 75 different users over the last 90 days.

megglos · 2024-05-06T13:01:24Z

ZPA-Triage:

we consider the likelihood high - thanks @jfriedenstab for the data ❤
solution idea: truncate the error message to a configurable character limit (most primitive)
we could check what limit we have in our infra on SaaS to determine a default value - magic
it would be worth checking whether there is some other way to transmit the error, but we can consider that out of scope
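
The "most primitive" solution idea above could be sketched as follows. This is an illustration only: the function name, the `max_chars` parameter, and the `...` suffix are assumptions, not an existing Zeebe setting.

```python
def truncate_message(message, max_chars, suffix="..."):
    """Truncate message to at most max_chars characters, suffix included."""
    if max_chars < len(suffix):
        raise ValueError("max_chars must be at least len(suffix)")
    if len(message) <= max_chars:
        return message
    # Reserve room for the suffix so the result never exceeds max_chars.
    return message[: max_chars - len(suffix)] + suffix
```

A character limit is not the same as a header byte limit: non-ASCII characters take more than one byte in UTF-8, and the base64 encoding of the `-bin` header adds roughly a third on top, so a default value would need some headroom below the proxy limit.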

jfriedenstab · 2024-05-07T15:04:31Z

it would be worth checking whether there is some other way to transmit the error, but we can consider that out of scope

Maybe something to consider for the future when you add a deploy endpoint to the C8/Zeebe REST API (getting the errors back in a more structured format would also be nice) 🙃.

megglos added the kind/toil label (Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.) Dec 16, 2022

Zelldon mentioned this issue Dec 27, 2022

A huge rejection reason causes an overflow in the record metadata #6442

Open

Zelldon added the scope/gateway (Marks an issue or PR to appear in the gateway section of the changelog) and component/gateway labels Jan 4, 2023

romansmirnov added the component/zeebe label (Related to the Zeebe component/team) Mar 5, 2024

korthout added the kind/bug (Categorizes an issue or PR as a bug) and severity/high (Marks a bug as having a noticeable impact on the user with no known workaround) labels and removed the kind/toil label (Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.) May 1, 2024

megglos added the likelihood/high label (A recurring issue) May 6, 2024

megglos assigned mustafadagher May 6, 2024

