This repository has been archived by the owner on Apr 24, 2023. It is now read-only.

Add out_kafka example #16

Merged: 14 commits into fluent:0.13-dev on Jan 16, 2018

Conversation

@solsson (Contributor) commented Jan 16, 2018

This is a rebase of #11. In addition:

  • Adds example memory limits.
  • Adds annotations for the new prometheus exporter feature.
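
For readers without the diff open, here is a minimal sketch of what such memory limits and prometheus annotations typically look like on a fluent-bit DaemonSet pod template. The annotation keys follow the common prometheus.io/* scrape convention and the values are illustrative assumptions, not necessarily the exact ones this PR adds:

# Hypothetical DaemonSet pod template excerpt (values illustrative)
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"   # fluent-bit http_server port, as seen in the logs below
        prometheus.io/path: "/api/v1/metrics/prometheus"
    spec:
      containers:
      - name: fluent-bit
        resources:
          requests:
            memory: 5Mi
          limits:
            memory: 10Mi   # example limit; whether this is enough is discussed below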

@solsson mentioned this pull request Jan 16, 2018
@edsiper merged commit 1263a3b into fluent:0.13-dev on Jan 16, 2018
@StevenACoffman

@solsson I know this is merged, but I just applied this (changed topic and namespace, otherwise identical) and am getting this:

kangaroo/fluent-bit-rq45j[fluent-bit]: [2018/01/17 03:42:04] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-rq45j[fluent-bit]: [2018/01/17 03:42:04] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-dncm5[fluent-bit]: [2018/01/17 03:42:12] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-dncm5[fluent-bit]: [2018/01/17 03:42:12] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-lw88d[fluent-bit]: [2018/01/17 03:42:13] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-lw88d[fluent-bit]: [2018/01/17 03:42:13] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [engine] started
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [out_kafka] brokers='bootstrap.kafka:9092' topics='k8s-firehose'
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [filter_kube] local POD info OK
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [filter_kube] testing connectivity with API server...
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [engine] started
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [out_kafka] brokers='bootstrap.kafka:9092' topics='k8s-firehose'
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [filter_kube] local POD info OK
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [filter_kube] testing connectivity with API server...
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:51] [ info] [filter_kube] API server connectivity OK
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:51] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:55] [ info] [filter_kube] API server connectivity OK
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:55] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:56] [error] [sqldb] error=disk I/O error
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:56] [error] [sqldb] error=disk I/O error
kangaroo/fluent-bit-g2lxp[fluent-bit]: socket: No buffer space available
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:56] [error] [io] could not create socket
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:56] [error] [filter_kube] upstream connection error
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:57] [error] [sqldb] error=disk I/O error
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:57] [error] [plugins/in_tail/tail_file.c:742 errno=2] No such file or directory
kangaroo/fluent-bit-77jx4[fluent-bit]: [2018/01/17 03:47:08] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-77jx4[fluent-bit]: [2018/01/17 03:47:08] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [engine] started
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [out_kafka] brokers='bootstrap.kafka:9092' topics='k8s-firehose'
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [filter_kube] local POD info OK
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [filter_kube] testing connectivity with API server...
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:49:04] [ info] [filter_kube] API server connectivity OK
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:49:04] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [engine] started
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [out_kafka] brokers='bootstrap.kafka:9092' topics='k8s-firehose'
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [filter_kube] local POD info OK
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [filter_kube] testing connectivity with API server...
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:14] [ info] [filter_kube] API server connectivity OK

I'm going to investigate more in the morning.

@solsson (Contributor, Author) commented Jan 17, 2018

@StevenACoffman Could it be the memory limits? You probably have significantly higher log volumes than I do. "No buffer space available" could indicate that.

@StevenACoffman

That was certainly one problem. I removed the limits, but I am still seeing some failures, and the pods are never restarted successfully, as in #17

@solsson (Contributor, Author) commented Jan 17, 2018

I only get the same kafka disconnects (~5 per hour per pod) as with fluent-bit-kafka-dev:0.5. I'm running the limits from the minikube example in QA (max 10Mi) and got one OOMKilled across 6 pods over 19h. With the old image I suspect memory use was increasing slowly over time without any increase in log volume, so I'm trying to spot that with tight limits.

@solsson (Contributor, Author) commented Jan 22, 2018

@StevenACoffman What peaks do you see in memory use with your load? I'll create a PR that increases the limits, but not by too much, because I want to see whether buffers grow (#11 (comment)) or there are memory leaks.

@StevenACoffman

These are currently the highest metrics I have in one randomly selected cluster:

container_memory_cache{container_name="POD",id="/kubepods/burstable/pod2efd65ad-fd2f-11e7-a785-0a1463037f40/124f69c80bd7311e43a9c4a1aaa3f03f500b5700eb6aec2b7ec790f836d0a101",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluent-bit-kwxbs_kangaroo_2efd65ad-fd2f-11e7-a785-0a1463037f40_0",namespace="kangaroo",pod_name="fluent-bit-kwxbs"} 28672
container_memory_rss{container_name="POD",id="/kubepods/burstable/pod2efd65ad-fd2f-11e7-a785-0a1463037f40/124f69c80bd7311e43a9c4a1aaa3f03f500b5700eb6aec2b7ec790f836d0a101",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluent-bit-kwxbs_kangaroo_2efd65ad-fd2f-11e7-a785-0a1463037f40_0",namespace="kangaroo",pod_name="fluent-bit-kwxbs"} 36864
container_memory_usage_bytes{container_name="POD",id="/kubepods/burstable/pod2efd65ad-fd2f-11e7-a785-0a1463037f40/124f69c80bd7311e43a9c4a1aaa3f03f500b5700eb6aec2b7ec790f836d0a101",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluent-bit-kwxbs_kangaroo_2efd65ad-fd2f-11e7-a785-0a1463037f40_0",namespace="kangaroo",pod_name="fluent-bit-kwxbs"} 573440
container_network_receive_bytes_total{container_name="POD",id="/kubepods/burstable/pod2efd65ad-fd2f-11e7-a785-0a1463037f40/124f69c80bd7311e43a9c4a1aaa3f03f500b5700eb6aec2b7ec790f836d0a101",image="gcr.io/google_containers/pause-amd64:3.0",interface="eth0",name="k8s_POD_fluent-bit-kwxbs_kangaroo_2efd65ad-fd2f-11e7-a785-0a1463037f40_0",namespace="kangaroo",pod_name="fluent-bit-kwxbs"} 5.69402814e+08

I've been focussing on getting things to run reliably (both kafka and fluent-bit) before fine-tuning memory limits. Most of my fluent-bit pods fell over pretty quickly in a busy cluster with the limits you have. Are you scraping the new prometheus endpoint in your test clusters? I would expect that to add a bit of overhead.

The fluent-bit memory usage documentation says:

So, if we impose a limit of 10MB for the input plugins and considering the worse case scenario of the output plugin consuming 20MB extra, as a minimum we need (30MB x 1.2) = 36MB.

Given the Mem_Buf_Limit of 5MB, we would then need around 13 MB, not counting the prometheus endpoint. I'm seeing usage higher than that.
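
For context, Mem_Buf_Limit is a per-input setting; a rough sketch of where that 5MB value would live in the example's ConfigMap (key names and paths are assumptions, not a quote of the actual file):

# Hypothetical excerpt of the fluent-bit ConfigMap
data:
  fluent-bit.conf: |
    [INPUT]
        Name           tail
        Path           /var/log/containers/*.log
        Mem_Buf_Limit  5MB   # the input is paused once this buffer is full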

@edsiper (Member) commented Jan 22, 2018

I think the problem is the lack of a "check" in out_kafka: when we deliver data to kafka, we ingest the records into the Kafka thread, which has its own memory buffer, and at that point we ACK to the Fluent Bit engine. Instead we should wait for those records to be flushed before ACKing; otherwise Fluent Bit keeps ingesting records into out_kafka.

@StevenACoffman

[screenshot: cAdvisor memory usage for the fluent-bit container in pod fluent-bit-kwxbs]

I looked in the cAdvisor UI for this container, which has been up for 3 days. I'm going to terminate it and look again.

@StevenACoffman

After 30 minutes, it stayed happily at 416K. The other ones that are using a lot of memory are complaining in their logs:

[2018/01/22 19:10:50] [debug] [filter_kube] could not merge log as requested
[2018/01/22 19:10:50] [debug] [input tail.0] [mem buf] size = 5184238
[2018/01/22 19:10:50] [debug] [input] tail.0 paused (mem buf overlimit)

This is the cluster in which I've been breaking kafka in interesting ways, so perhaps it is related to Eduardo's comment above.

@StevenACoffman

What is weird is that many of the fluent-bit pods that have been up for 3+ days in that same cluster stay around 0.5 MB (e.g., 516K, 492K, 416K). When they go over memory limits, they go way over. Maybe the nodes where fluent-bit stays low are not generating as many log files?

@solsson (Contributor, Author) commented Jan 22, 2018

Are you scraping the new prometheus endpoint in your test clusters?

Yes we do. I agree this should be accounted for in the recommended limits.

So, if we impose a limit of 10MB for the input plugins and considering the worse case scenario of the output plugin consuming 20MB extra, as a minimum we need (30MB x 1.2) = 36MB.

Given the Mem_Buf_Limit of 5MB, we would then need around 13 MB, not counting the prometheus endpoint. I'm seeing usage higher than that.

Then 20Mi could in theory be sufficient, with prometheus exports? Of course it's very annoying to have restarts of these pods, and it might affect testing negatively. A memory limit of twice the expected use isn't unreasonable, and over time metrics could tell if it can safely be reduced.
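
In resource terms, that reasoning (a limit of roughly twice the expected use) would look something like this; the numbers are illustrative, not a proposal for the manifest:

        resources:
          requests:
            memory: 10Mi   # roughly the expected steady-state use
          limits:
            memory: 20Mi   # ~2x the expected use, leaving headroom before an OOMKill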

@edsiper (Member) commented Jan 22, 2018

hmm, I think the missing part about resource usage is how much memory the output plugins also require; out_kafka in particular may consume a lot due to buffering.

@solsson (Contributor, Author) commented Jan 22, 2018

hmm, I think the missing part about resource usage is how much memory the output plugins also require; out_kafka in particular may consume a lot due to buffering.

It's interesting that @StevenACoffman runs with an experimental kafka config, which could cause producer errors.

I guess anyone who likes to tweak limits can make a calculation based on average fluentbit_output_proc_bytes_total and how much kafka downtime they want to handle before dropping log messages.
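
As a purely illustrative back-of-the-envelope version of that calculation (all numbers hypothetical):

average output rate per node ≈ 50 KiB/s   (rate of fluentbit_output_proc_bytes_total)
tolerated kafka outage       ≈ 2 min = 120 s
buffer needed                ≈ 50 KiB/s × 120 s ≈ 6 MiB  →  Mem_Buf_Limit of at least ~6MB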

I think the problem is the lack of a "check" in out_kafka: when we deliver data to kafka, we ingest the records into the Kafka thread, which has its own memory buffer, and at that point we ACK to the Fluent Bit engine. Instead we should wait for those records to be flushed before ACKing; otherwise Fluent Bit keeps ingesting records into out_kafka.

@edsiper In other words, since hitting the memory limit means a pod restart, fluent-bit will not re-process from the sources the messages that had already gone into the buffer?

What's the effect of Retry_Limit false (or an actual value) here?

@solsson (Contributor, Author) commented Jan 23, 2018

@edsiper I should set the request.required.acks=1 librdkafka property in the example daemonset's config. It's unlikely that anyone would want it higher for logs, and it's a useful example for those who want to change it to 0. How do I do that?
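
Assuming out_kafka follows the documented rdkafka.{property} convention for passing librdkafka settings through, the output section in the ConfigMap might look roughly like this (broker and topic values copied from the logs above; everything else is an illustrative assumption, not the actual file):

  output-kafka.conf: |
    [OUTPUT]
        Name                           kafka
        Match                          *
        Brokers                        bootstrap.kafka:9092
        Topics                         k8s-firehose   # topic from the logs above; yours will differ
        rdkafka.request.required.acks  1              # librdkafka property passed via the rdkafka.* prefix
        Retry_Limit                    False          # relates to the Retry_Limit question above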

@edsiper (Member) commented Jan 29, 2018

@solsson I've created fluent/fluent-bit#495 to track the enhancement.
