
[Question] Loki plugin out of order error behaviour? #3082

Closed
stevehipwell opened this issue Feb 16, 2021 · 24 comments

Comments

@stevehipwell

What is the behaviour when Loki returns an out-of-order error to Fluent Bit? I'm hoping that the offending log entry is discarded after the error is logged, and that the rest of the chunk continues to be processed?

@vladimirfx

vladimirfx commented Feb 24, 2021

Currently observed behavior (1.7.1):

  1. An out-of-order chunk goes to the retry queue.
  2. New chunks wait until the retry succeeds (which never happens).
  3. The retry timeout grows exponentially, so a few bad chunks effectively block all output to Loki AND, because of "Fluent Bit stops sending logs to Loki when any of the outputs goes offline" #3088, eventually all outputs entirely.

I think HTTP code 400 from Loki should be treated as an Error status, as described in the Fluent Bit retry docs [1], and not be scheduled for retry. In fact, according to the HTTP spec, a 400 must not be retried with the same request. A configuration option would be a good compromise ;)

P.S.: right now this behaviour is a showstopper for our adoption of Fluent Bit in a Kubernetes environment.

[1] https://github.com/fluent/fluent-bit-docs/blob/master/administration/scheduling-and-retries.md

@vladimirfx

Bad workaround 1:

Retry_Limit 1

But if you have hundreds of "bad" chunks this does not save you, because the retry pause is uncontrollable and the retry logic cannot be disabled.

Bad workaround 2: disable timestamp parsing on your inputs.
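
For reference, workaround 1 (Retry_Limit) goes in the Loki [OUTPUT] section, something like this (a sketch only; the Match/Host values here are placeholders for your own deployment):

[OUTPUT]
    Name        loki
    Match       kube.*
    Host        loki.logging.svc
    Port        3100
    # workaround 1: allow a single retry instead of backing off forever
    Retry_Limit 1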

@vladimirfx

@edsiper Sorry for the mention... Is my observation correct? Can the Loki output treat the 400 code as a fatal error and skip that chunk?

@stevehipwell
Author

@vladimirfx are you talking about a whole chunk or a single record?

@vladimirfx

> @vladimirfx are you talking about a whole chunk or a single record?

The whole chunk gets stuck. I don't know how many records are in a chunk, but in the log I see:

[2021/02/26 06:52:01] [ warn] [engine] failed to flush chunk '1-1614318381.304111707.flb', retry in 8 seconds: task_id=1488, input=tail.1 > output=loki.0 (out_id=0)
[2021/02/26 06:52:09] [error] [output:loki:loki.0] loki.logging.svc:3100, HTTP status=400
entry with timestamp 2021-02-26 05:46:21.347975574 +0000 UTC ignored, reason: 'entry out of order' for stream: {container_name="grafana-operator", host="dev-k3s02", job="fluentbit-kube", name="grafana-operator", namespace="monitoring", pod_template_hash="8545598579"},
total ignored: 1 out of 1

1 out of 1 - doesn't this mean 1 record per chunk?

It would be nice to skip only the invalid record, but that is a much more complex task.

Unfortunately log quality (its ordering) is not something we can control, and we have at least 4 cases (pods that emit out-of-order records regularly) in a relatively small cluster. The ability to ignore these errors is vital for log shipping for the whole cluster.

In the Loki output code, there is this check:

if (c->resp.status < 200 || c->resp.status > 205) {

where all unsuccessful response codes are treated as retryable, but 400 is a special case and should return FLB_ERROR, I think.
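
A rough sketch of what that special case could look like (not an actual patch; field and helper names such as ctx->tcp_host and u_conn are assumed from the surrounding plugin code):

if (c->resp.status < 200 || c->resp.status > 205) {
    if (c->resp.status == 400) {
        /* Loki rejected the payload permanently (e.g. out-of-order entries);
         * retrying the same chunk can never succeed, so drop it */
        flb_plg_error(ctx->ins, "%s:%i, HTTP status=%i (dropping chunk)",
                      ctx->tcp_host, ctx->tcp_port, c->resp.status);
        flb_http_client_destroy(c);
        flb_upstream_conn_release(u_conn);
        FLB_OUTPUT_RETURN(FLB_ERROR);
    }
    /* any other failure keeps the existing retry path */
}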

@giovanni-ferrari

giovanni-ferrari commented Feb 26, 2021

> Currently observed behavior (1.7.1):
>
>   1. An out-of-order chunk goes to the retry queue.
>   2. New chunks wait until the retry succeeds (which never happens).
>   3. The retry timeout grows exponentially, so a few bad chunks effectively block all output to Loki AND, because of "Fluent Bit stops sending logs to Loki when any of the outputs goes offline" #3088, eventually all outputs entirely.
>
> I think HTTP code 400 from Loki should be treated as an Error status, as described in the Fluent Bit retry docs [1], and not be scheduled for retry. In fact, according to the HTTP spec, a 400 must not be retried with the same request. A configuration option would be a good compromise ;)
>
> P.S.: right now this behaviour is a showstopper for our adoption of Fluent Bit in a Kubernetes environment.
>
> [1] https://github.com/fluent/fluent-bit-docs/blob/master/administration/scheduling-and-retries.md

Hello, may I ask if you know what the relation is between plugins like ES and Loki, such that if the ES output fails to forward logs then other plugins like Loki are also impacted, while others like stdout keep working?

@vladimirfx

@giovanni-ferrari I don't know the reason for the correlation between output plugins.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 30, 2021
@stevehipwell
Author

I'm still interested in a full definition of this behaviour.

@github-actions github-actions bot removed the Stale label Mar 31, 2021
@github-actions
Contributor

github-actions bot commented May 4, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 4, 2021
@stevehipwell
Author

I'd still like this to be documented.

@github-actions github-actions bot removed the Stale label May 5, 2021
@github-actions
Contributor

github-actions bot commented Jun 4, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jun 4, 2021
@stevehipwell
Author

I'd still like this to be documented.

@github-actions github-actions bot removed the Stale label Jun 8, 2021
@mnp

mnp commented Jun 17, 2021

We are seeing this also and it's a showstopper for us as well.

Our use case is the opposite of the OP's: in our case, we DON'T want ANYTHING discarded, ever, but we don't care about the log time being precise; collection time is good enough as long as the original gets passed through somehow. To achieve that, we're parsing the JSON logs of the Kubernetes stdout containers with Time_Keep off in the parser.

custom_parsers.conf:
----
[PARSER]
    Name docker_no_time
    Format json
    Time_Keep off

fluent-bit.conf:
----
[SERVICE]
    Flush 1
    Daemon Off
    Log_Level info
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser docker_no_time
    Tag kube.*
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On
[INPUT]
    Name systemd
    Tag host.*
    Systemd_Filter _SYSTEMD_UNIT=kubelet.service
    Read_From_Tail On

[FILTER]
    Name kubernetes
    Match *
    Merge_Log Off
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On

[OUTPUT]
    name loki
    match kube.*
    tenant_id my-loki
    host lok1-loki
    port 3100
    line_format json
    auto_kubernetes_labels off

Is that the right way to ask for collection time instead of log time?

Anyway, we still get out-of-order errors and dropped chunks.

[2021/06/17 20:31:19] [ warn] [engine] chunk '1-1623961610.585590800.flb' cannot be retried: task_id=43, input=tail.0 > output=loki.0
[2021/06/17 20:31:24] [error] [output:loki:loki.0] lok1-loki:3100, HTTP status=400
entry with timestamp 2021-06-17 20:27:02.5381778 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5381858 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5381926 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5382005 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5382069 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5382131 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.538361 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
total ignored: 7 out of 7

Thanks for letting me tag along.

@Kampe

Kampe commented Jul 16, 2021

Seeing this same issue with 1.8.1 and Loki 2.2.1

@edsiper
Member

edsiper commented Jul 19, 2021

The final solution is a batching queue implementation (it will be implemented this year). For now, we will merge #3785 to avoid situations where chunks are retried forever; in that case, they should be skipped.

@stevehipwell
Author

Thanks for the fix for 400 errors @edsiper, that will make a big difference for now. I think the work that @owen-d is doing on grafana/loki#1544 should help with, or even solve, this; personally, it should allow me to move back to pushing my logs from Fluent Bit to Fluentd and letting Fluentd push to Loki.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 20, 2021
@mnp

mnp commented Aug 20, 2021

This should not be closed; it's still a concern.

@github-actions github-actions bot removed the Stale label Aug 21, 2021
@patrick-stephens patrick-stephens added this to Q4 2021: Oct - Dec in Roadmap proposal Sep 28, 2021
@github-actions
Contributor

github-actions bot commented Oct 1, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Oct 1, 2021
@JeffLuoo
Contributor

JeffLuoo commented Oct 4, 2021

@stevehipwell
Author

@JeffLuoo only Loki in Grafana Cloud is accepting out-of-order writes; for the OSS version we're waiting on v2.4.0.

This question is still valid, as the changes just enable a write window, and anything outside that window will still be out of order.
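
If I understand the 2.4 changes correctly, the out-of-order acceptance will be a per-tenant Loki setting, something like this (untested; the exact field name and the window being tied to chunk age are my assumptions from reading about the change):

limits_config:
  # accept out-of-order writes inside a bounded time window
  # (the window is reportedly tied to the chunk age settings, not unlimited)
  unordered_writes: true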

@github-actions github-actions bot removed the Stale label Oct 5, 2021
@github-actions
Contributor

github-actions bot commented Jan 4, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Jan 4, 2022
@github-actions
Contributor

github-actions bot commented Jan 9, 2022

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as completed Jan 9, 2022