Unstable using latest from 0.13-dev branch #17
Do you know if there is any best practice about which DNS name to use to reach the API server? filter_kubernetes by default uses kubernetes.default.svc, but what about kubernetes.default.svc.cluster.local? (cc: @solsson)
I've seen a lot of both.
Thanks for the feedback, I will go ahead and change that.
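For reference, a minimal sketch of pinning filter_kubernetes to the fully qualified name, assuming the usual filter section of this repo's ConfigMap (the Match tag and the token/CA paths are the common in-cluster defaults, shown only for completeness):

    [FILTER]
        Name              kubernetes
        Match             kube.*
        Kube_URL          https://kubernetes.default.svc.cluster.local:443
        Kube_CA_File      /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File   /var/run/secrets/kubernetes.io/serviceaccount/token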
@StevenACoffman is this still an issue?
@edsiper You mentioned a lack of "check" in out_kafka in #18. The instability I'm seeing seems completely attributable to the memory (and CPU?) limits on nodes with lots of pre-existing logs.
From @solsson on Jan 26, 2018:
From @leahnp on May 17, 2017 0:6
Copied from original issue: samsung-cnct/kraken-logging-fluent-bit-daemonset#5 and moved to samsung-cnct/chart-fluent-bit#9
@edsiper I think you can merge #18, as further increases in the limit would make no difference; only caps to buffer sizes will. What's the effect of Mem_Buf_Limit on the input plugin at start? The desired behavior of Tail would be that parsing stops temporarily. According to http://fluentbit.io/documentation/0.12/configuration/backpressure.html#membuflimit it can be set on output plugins too, but am I correct to interpret your earlier remarks as this having no effect because the Kafka client does the buffering? Maybe fluent/fluent-bit#495 can help for a cap there, through
@solsson merged, thanks. Mem_Buf_Limit only applies to input plugins, to pause data ingestion into the engine. Since out_kafka buffers the data but does not confirm delivery, Fluent Bit issues an "OK", so in_tail keeps ingesting data. The fix is to add logic to out_kafka that really checks whether a message was delivered. If you see memory grow with a different output plugin, there is definitely something wrong; I will double-check the code anyway.
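To make that concrete, Mem_Buf_Limit goes on the input side; a sketch of capping the tail input so it pauses ingestion under backpressure (the path, tag, and 5MB value are illustrative, not this repo's exact settings):

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Tag               kube.*
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On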
I'm only running out_kafka. I will try out_kafka with
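If the librdkafka passthrough discussed in fluent/fluent-bit#495 becomes available, a client-side queue cap might look roughly like this; the rdkafka.* key, broker address, and values are assumptions for illustration, not confirmed 0.13-dev options:

    [OUTPUT]
        Name                                 kafka
        Match                                *
        Brokers                              kafka:9092
        Topics                               k8s-firehose
        # assumed passthrough of librdkafka's queue.buffering.max.kbytes (~64 MB cap)
        rdkafka.queue.buffering.max.kbytes   65536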
Hmm... I pulled the latest, removed the CPU and memory limits, and I'm getting some CrashLoopBackOff pods terminated with exit code 139, which I think is a segmentation fault (139 = 128 + 11, and signal 11 is SIGSEGV). There is no termination message. This is not on a node with excessive existing logs.
Earlier in the fluent-bit pod's output, I am getting a lot of:
I cannot get the pod on that node to come up healthy, regardless of restarts or terminate-and-recreate attempts, but the rest do.
I altered the configmap and changed the
After applying the change and deleting the pod, the DaemonSet recreated the pod and it came up healthy, after several hours of failed attempts.
FYI: 0.13-dev:0.7 is out: https://github.com/fluent/fluent-bit-kubernetes-logging/tree/0.13-dev
FYI: 0.13-dev:0.9 is out: https://github.com/fluent/fluent-bit-kubernetes-logging/tree/0.13-dev
I've upgraded and it looks good to me.
0.13-dev:0.9 is very solid so far (20 hours, large volume).
I am experiencing a lot of instability when applying the latest changes from the 0.13-dev branch, specifically #16.
Eventually, if a pod crashes on a busy node and enters CrashLoopBackOff, it never recovers. I am still investigating, but if you can see anything obvious, I would really appreciate your insight.
At first, I thought it was the memory and/or CPU limits, so I removed those, and the crashes happen much less reliably. Without limits, I'm still seeing what looks like multiple failure modes. I changed the namespace (to kangaroo) and the Kafka topic (to k8s-firehose), and I changed the Log_Level to debug. With kubernetes.default.svc as the Kube_URL, I got a few "Temporary failure in name resolution" errors, so I changed it to kubernetes.default.svc.cluster.local and have not seen the error again.
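Roughly, those changes live in the ConfigMap like this (a sketch assuming the stock [SERVICE] and kafka [OUTPUT] sections; the broker address and other values are illustrative):

    [SERVICE]
        Flush        1
        Daemon       off
        Log_Level    debug

    [OUTPUT]
        Name         kafka
        Match        *
        Brokers      kafka:9092
        Topics       k8s-firehose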
I am using kail to follow all the DaemonSet pods in parallel, but that's quite chatty, so I filter it down to errors with some context using:
kail --ds=fluent-bit | grep -A 10 -B 10 error
The output I get is: