[Question] Loki plugin out of order error behaviour? #3082
Comments
Currently observed behavior (1.7.1):
I think HTTP code 400 from Loki should be treated as an Error status, as stated in the Fluent Bit retry docs [1], and not be scheduled for retry. In fact, per the HTTP spec, a 400 response must not be retried with the same request. A configuration option would be a good compromise ;) P.S.: right now this behavior is a showstopper for our adoption of Fluent Bit in a Kubernetes environment. [1] https://github.com/fluent/fluent-bit-docs/blob/master/administration/scheduling-and-retries.md
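For reference, the scheduling doc in [1] covers the Retry_Limit output option, which at least bounds how often a rejected chunk is rescheduled. A minimal sketch of a Loki output using it (the host and port are placeholders):

```ini
[OUTPUT]
    Name        loki
    Match       *
    # Placeholder endpoint.
    Host        loki.example.local
    Port        3100
    # Cap retries so a chunk rejected with HTTP 400 is not rescheduled
    # forever; it is dropped once the limit is reached.
    Retry_Limit 2
```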
Bad workaround 1:
But if you have at least hundreds of "bad" chunks this does not save you, because of the uncontrollable retry pause and the retry logic that cannot be disabled. Bad workaround 2: disable timestamp parsing on your inputs.
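For reference, "disable timestamp parsing" here means using a parser with no Time_Key/Time_Format, so Fluent Bit stamps each record with its collection time and record timestamps follow collection order. A minimal sketch (the parser name is illustrative):

```ini
[PARSER]
    # Hypothetical parser that skips timestamp extraction entirely.
    Name    json_collect_time
    Format  json
    # No Time_Key / Time_Format: Fluent Bit keeps the collection time
    # as the record timestamp instead of the time found in the log.
```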
@edsiper Sorry for the mention... Is my observation correct? Can the Loki output treat a 400 code as a fatal error and skip that chunk?
@vladimirfx are you talking about a whole chunk or a single record?
The whole chunk gets stuck. I don't know how many records are in a chunk, but in the log I see:
"1 out of 1" - doesn't this mean 1 record per chunk? It would be nice to skip only the invalid record, but that is a far more complex task. Unfortunately, log quality (its ordering) is not something we can control, and we have at least 4 cases (pods that regularly emit out-of-order records) in a relatively small cluster. The ability to ignore these errors is vital for log shipping for the whole cluster. Here: fluent-bit/plugins/out_loki/loki.c, line 1129 (commit 1e8c002).
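For illustration only, here is a rough sketch (not the actual code at the line referenced above) of how a flush callback could map Loki's HTTP status to the return codes used by the Fluent Bit output-plugin API, which is essentially what "treat 400 as fatal and skip the chunk" would mean:

```c
#include <fluent-bit/flb_output_plugin.h>

/* Hypothetical helper: map Loki's HTTP response status to the standard
 * output-plugin return codes (FLB_OK / FLB_RETRY / FLB_ERROR). */
static int map_loki_http_status(int http_status)
{
    if (http_status >= 200 && http_status < 300) {
        /* Chunk accepted by Loki. */
        return FLB_OK;
    }
    if (http_status >= 400 && http_status < 500) {
        /* Client errors such as 400 "entry out of order" will fail the
         * same way on every retry, so drop the chunk instead of
         * rescheduling it. */
        return FLB_ERROR;
    }
    /* 5xx and transport-level failures are treated as transient. */
    return FLB_RETRY;
}

/* Inside the flush callback this would be used roughly as:
 *     FLB_OUTPUT_RETURN(map_loki_http_status(resp_status));
 */
```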
Hello, may I ask what the relation is between plugins like ES and Loki, such that if the ES output fails to forward logs then other plugins like Loki are also impacted, while others like stdout keep working?
@giovanni-ferrari I don't know of any reason for a correlation between output plugins.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I'm still interested in a full definition of this behaviour. |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I'd still like this to be documented. |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I'd still like this to be documented. |
We are seeing this too, and it's a showstopper for us as well. Our use case is the opposite of the OP's: in our case we DON'T want ANYTHING discarded, ever, but we don't care about the log time being precise; collection time is good enough as long as the original record is passed through somehow. To achieve that, we're trying JSON logging of Kubernetes stdout containers and have Time_Keep off in the parser.
Is that the right way to ask for collection time instead of log time? Either way, we still get out-of-order errors and dropped chunks.
Thanks for letting me tag along.
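For reference, a minimal parser sketch of the setup described above (the parser name and time format are illustrative). As far as I know, Time_Keep only controls whether the raw time field stays in the record, while omitting Time_Key/Time_Format, as in the earlier sketch, is what makes Fluent Bit fall back to collection time:

```ini
[PARSER]
    # Hypothetical parser for JSON container logs.
    Name        k8s_json
    Format      json
    # These two lines make Fluent Bit use the timestamp found in the log;
    # removing them makes it use the collection time instead.
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z
    # Time_Keep Off only drops the raw "time" field from the record;
    # the parsed value above is still used as the record timestamp.
    Time_Keep   Off
```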
Seeing this same issue with 1.8.1 and Loki 2.2.1.
The final solution is a batching queue implementation (to be implemented this year). For now, we will merge #3785 to avoid situations where chunks are retried forever; in that case they should be skipped.
Thanks for the fix for 400 errors @edsiper, that will make a big difference for now. I think the work that @owen-d is doing on grafana/loki#1544 should help with or even solve this; personally, it should allow me to move back to pushing my logs from Fluent Bit to Fluentd and letting Fluentd push to Loki.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This should not be closed; it's still a concern. |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Loki already seems to support accepting out-of-order logs; please see: https://github.com/grafana/loki/blob/11a0d28b611f834c81eee206e26a66b057e443a2/docs/sources/configuration/_index.md#accept-out-of-order-writes
@JeffLuoo only Loki in Grafana Cloud is accepting out-of-order writes right now; for the OSS version we're waiting on v2.4.0. This question is still valid, as the changes just enable a write window, and anything outside that window will still be out of order.
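For reference, once OSS Loki 2.4.0 is out, the behaviour linked above is controlled by a per-tenant limit. A minimal sketch of the relevant Loki configuration, assuming the unordered_writes limit described in the linked docs:

```yaml
# Loki >= 2.4 (sketch): accept out-of-order writes, within the
# ingester's accepted time window, for all tenants.
limits_config:
  unordered_writes: true
```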
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity. |
I'm interested in the behaviour when Loki returns an out-of-order error to Fluent Bit. I'm hoping that the log entry in question is discarded after the error is logged, and that the rest of the chunk then continues to be processed?