[Question] Loki plugin out of order error behaviour? #3082

Closed
stevehipwell opened this issue Feb 16, 2021 · 24 comments

@stevehipwell

I'm interested in the behaviour when Loki returns an out-of-order error to Fluent Bit. I'm hoping that the log entry in question is discarded after the error is logged, and that the rest of the chunk then continues to be processed?

@vladimirfx

vladimirfx commented Feb 24, 2021

Currently observed behavior (1.7.1):

  1. An out-of-order chunk goes to the retry queue.
  2. New chunks wait until the retry succeeds (which never happens).
  3. The retry timeout grows exponentially, so a few incorrect chunks effectively block all output to Loki AND, because of "Fluent bit stops sending logs to Loki when any of the outputs goes offline" (#3088), eventually all outputs entirely.

I think HTTP code 400 from Loki should be treated as an Error status, as stated in the Fluent Bit retry docs [1], and not be scheduled for retry. In fact, according to the HTTP spec, a 400 must not be retried with the same request. A configuration option would be a good compromise ;)

P.S.: right now this behavior is a showstopper for our adoption of Fluent Bit in a Kubernetes environment.

[1] https://github.com/fluent/fluent-bit-docs/blob/master/administration/scheduling-and-retries.md

@vladimirfx

Bad workaround 1:

Retry_Limit 1

But if you have even hundreds of "bad" chunks it will not save you, because of the uncontrollable retry pause and the retry logic that cannot be disabled.
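
A minimal sketch of where workaround 1 would go in the Fluent Bit configuration (the host and port below are placeholders based on the log lines in this thread, not part of the original comment):

[OUTPUT]
    Name        loki
    Match       *
    Host        loki.logging.svc
    Port        3100
    # Retry a failed chunk at most once instead of backing off indefinitely
    Retry_Limit 1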

Bad workaround 2 - disable timestamp parsing on your inputs.

@vladimirfx

@edsiper Sorry for the mention... Is my observation correct? Can the Loki output treat a 400 code as a fatal error and skip that chunk?

@stevehipwell
Author

@vladimirfx are you talking about a whole chunk or a single record?

@vladimirfx

@vladimirfx are you talking about a whole chunk or a single record?

The whole chunk gets stuck. I don't know how many records are in a chunk, but in the log I see:

[2021/02/26 06:52:01] [ warn] [engine] failed to flush chunk '1-1614318381.304111707.flb', retry in 8 seconds: task_id=1488, input=tail.1 > output=loki.0 (out_id=0)
[2021/02/26 06:52:09] [error] [output:loki:loki.0] loki.logging.svc:3100, HTTP status=400
entry with timestamp 2021-02-26 05:46:21.347975574 +0000 UTC ignored, reason: 'entry out of order' for stream: {container_name="grafana-operator", host="dev-k3s02", job="fluentbit-kube", name="grafana-operator", namespace="monitoring", pod_template_hash="8545598579"},
total ignored: 1 out of 1

1 out of 1 - doesn't this mean 1 record per chunk?

It would be nice to skip only the invalid record, but that is a much more complex task.

Unfortunately, log quality (its ordering) is not something we can control, and we have at least 4 cases (pods that regularly emit out-of-order records) in a relatively small cluster. The ability to ignore these errors is vital for log shipping for the whole cluster.

There, in the Loki output plugin:

if (c->resp.status < 200 || c->resp.status > 205) {
all unsuccessful response codes are treated as retryable, but 400 is a special case and should return FLB_ERROR, I think.
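
For illustration, a minimal sketch of that idea (the surrounding control flow and the ret variable are assumptions based on the standard Fluent Bit output return codes, not the actual plugin code):

if (c->resp.status < 200 || c->resp.status > 205) {
    if (c->resp.status == 400) {
        /* a permanent client error such as Loki's 'entry out of order'
         * rejection: retrying the same payload can never succeed, so
         * drop the chunk instead of scheduling a retry */
        ret = FLB_ERROR;
    }
    else {
        ret = FLB_RETRY;
    }
}
else {
    ret = FLB_OK;
}
FLB_OUTPUT_RETURN(ret);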

@giovanni-ferrari

giovanni-ferrari commented Feb 26, 2021

Currently observed behavior (1.7.1):

  1. An out-of-order chunk goes to the retry queue.
  2. New chunks wait until the retry succeeds (which never happens).
  3. The retry timeout grows exponentially, so a few incorrect chunks effectively block all output to Loki AND, because of "Fluent bit stops sending logs to Loki when any of the outputs goes offline" (#3088), eventually all outputs entirely.

I think HTTP code 400 from Loki should be treated as an Error status, as stated in the Fluent Bit retry docs [1], and not be scheduled for retry. In fact, according to the HTTP spec, a 400 must not be retried with the same request. A configuration option would be a good compromise ;)

P.S.: right now this behavior is a showstopper for our adoption of Fluent Bit in a Kubernetes environment.

[1] https://github.com/fluent/fluent-bit-docs/blob/master/administration/scheduling-and-retries.md

Hello, may I ask if you know what the relation is between plugins like ES and Loki, such that if the ES output fails to forward logs then other plugins like Loki are also impacted, while others like "stdout" keep working?

@vladimirfx

@giovanni-ferrari I don't know the reason for the correlation between output plugins.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 30, 2021
@stevehipwell
Author

I'm still interested in a full definition of this behaviour.

@github-actions github-actions bot removed the Stale label Mar 31, 2021
@github-actions
Contributor

github-actions bot commented May 4, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 4, 2021
@stevehipwell
Author

I'd still like this to be documented.

@github-actions github-actions bot removed the Stale label May 5, 2021
@github-actions
Contributor

github-actions bot commented Jun 4, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jun 4, 2021
@stevehipwell
Author

I'd still like this to be documented.

@github-actions github-actions bot removed the Stale label Jun 8, 2021
@mnp

mnp commented Jun 17, 2021

We are seeing this also and it's a showstopper for us as well.

Our use case is the opposite of the OP's: in our case, we DON'T want ANYTHING discarded, ever, but we don't care about the log time being precise; collection time is good enough as long as the original is passed through somehow. To achieve that, we're trying JSON logging of Kubernetes stdout containers, so we have Time_Keep off in the parser.

custom_parsers.conf:
----
[PARSER]
    Name docker_no_time
    Format json
    Time_Keep off

fluent-bit.conf:
----
[SERVICE]
    Flush 1
    Daemon Off
    Log_Level info
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser docker_no_time
    Tag kube.*
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On
[INPUT]
    Name systemd
    Tag host.*
    Systemd_Filter _SYSTEMD_UNIT=kubelet.service
    Read_From_Tail On

[FILTER]
    Name kubernetes
    Match *
    Merge_Log Off
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On

[OUTPUT]
    name loki
    match kube.*
    tenant_id my-loki
    host lok1-loki
    port 3100
    line_format json
    auto_kubernetes_labels off

Is that the right way to ask for collection time instead of log time?

Anyway, we still get out-of-order errors and dropped chunks.

[2021/06/17 20:31:19] [ warn] [engine] chunk '1-1623961610.585590800.flb' cannot be retried: task_id=43, input=tail.0 > output=loki.0
[2021/06/17 20:31:24] [error] [output:loki:loki.0] lok1-loki:3100, HTTP status=400
entry with timestamp 2021-06-17 20:27:02.5381778 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5381858 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5381926 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5382005 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5382069 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.5382131 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
entry with timestamp 2021-06-17 20:27:02.538361 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="fluent-bit"},
total ignored: 7 out of 7

Thanks for letting me tag along.

@Kampe

Kampe commented Jul 16, 2021

Seeing this same issue with 1.8.1 and Loki 2.2.1

@edsiper
Member

edsiper commented Jul 19, 2021

The final solution is a batching queue implementation (it will be implemented this year). For now, we will merge #3785 to avoid situations where chunks are retried forever; in that case, they should be skipped.

@stevehipwell
Author

Thanks for the fix for 400 errors @edsiper, that will make a big difference for now. I think the work that @owen-d is doing on grafana/loki#1544 should help with or even solve this; personally, it should allow me to move back to just pushing my logs from Fluent Bit to Fluentd and letting Fluentd push to Loki.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 20, 2021
@mnp

mnp commented Aug 20, 2021

This should not be closed; it's still a concern.

@github-actions github-actions bot removed the Stale label Aug 21, 2021
@github-actions
Contributor

github-actions bot commented Oct 1, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Oct 1, 2021
@JeffLuoo
Contributor

JeffLuoo commented Oct 4, 2021

@stevehipwell
Author

@JeffLuoo only Loki in Grafana Cloud accepts out-of-order writes; for the OSS version we're waiting on v2.4.0.

This question is still valid, as the changes just enable a write window, and anything outside that window will still be out of order.

@github-actions github-actions bot removed the Stale label Oct 5, 2021
@github-actions
Contributor

github-actions bot commented Jan 4, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Jan 4, 2022
@github-actions
Contributor

github-actions bot commented Jan 9, 2022

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as completed Jan 9, 2022