fluent-bit fails to forward logs with 'no upstream connections available to <endpoint>' #7086

@makas45

Bug Report

Describe the bug
We forward logs to Splunk, and from time to time the Fluent Bit pods report the errors below.
Fluent Bit pods:
[2023/03/28 12:37:55] [error] [net] TCP connection failed: splunk-fluentd.monitoring.svc.cluster.local:24240 (Connection refused)
[2023/03/28 12:37:55] [error] [output:forward:forward.0] no upstream connections available
[2023/03/28 12:37:55] [ warn] [engine] failed to flush chunk '1-1680007074.805347894.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=forward.0 (out_id=0)
[2023/03/28 12:37:55] [error] [net] TCP connection failed: splunk-fluentd.monitoring.svc.cluster.local:24240 (Connection refused)
[2023/03/28 12:37:55] [error] [output:forward:forward.0] no upstream connections available
[2023/03/28 12:37:55] [ warn] [engine] failed to flush chunk '1-1680007074.856843682.flb', retry in 6 seconds: task_id=2, input=tail.0 > output=forward.0 (out_id=0)
[2023/03/28 12:37:56] [error] [net] TCP connection failed: splunk-fluentd.monitoring.svc.cluster.local:24240 (Connection refused)
[2023/03/28 12:37:56] [error] [output:forward:forward.0] no upstream connections available
[2023/03/28 12:37:56] [error] [net] TCP connection failed: splunk-fluentd.monitoring.svc.cluster.local:24240 (Connection refused)
[2023/03/28 12:37:56] [error] [output:forward:forward.0] no upstream connections available

The fluentd pods also report errors:

2023-03-29 09:52:08 +0000 [warn]: #0 [flow:outputflowname] failed to flush the buffer. retry_times=5 next_retry_time=2023-03-29 09:52:40 +0000 chunk="5f806e32832e16de5cdf89bf8714d9e6" error_class=RuntimeError error="Server error (502) for POST https://splunkendpoint/services/collector, response: <title>502 Server Error</title> Error: Server Error. The server encountered a temporary error and could not complete your request. Please try again in 30 seconds."

2023-03-29 09:52:25 +0000 [warn]: #0 [flow:outputflowname] failed to flush the buffer. retry_times=0 next_retry_time=2023-03-29 09:52:26 +0000 chunk="5f806eaa0c2dec2aa728034f9cc4b3d0" error_class=RuntimeError error="Server error (502) for POST https://splunkendpoint/services/collector, response: <title>502 Server Error</title> Error: Server Error. The server encountered a temporary error and could not complete your request. Please try again in 30 seconds."

2023-03-29 09:52:26 +0000 [warn]: #0 [flow:outputflowname] failed to flush the buffer. retry_times=6 next_retry_time=2023-03-29 09:53:30 +0000 chunk="5f806e1c8fce315d338f5e689b8a0e03" error_class=RuntimeError error="Server error (502) for POST https://splunkendpoint/services/collector, response: <title>502 Server Error</title> Error: Server Error. The server encountered a temporary error and could not complete your request. Please try again in 30 seconds."
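On the fluentd side, the repeated 502s from the Splunk HEC endpoint mean every flush of the affected chunks fails until the endpoint recovers. As a hedged sketch (assuming the aggregator uses fluent-plugin-splunk-hec, as the logging-operator SplunkHec output does; the hostname, token source, and size limits below are placeholders, not values from this issue), a file buffer with bounded exponential backoff keeps chunks on disk and avoids tight retry loops while the endpoint is down:

```
# Sketch only: buffer/retry tuning for a splunk_hec output.
# hec_host/hec_token are placeholders; all limits are illustrative.
<match **>
  @type splunk_hec
  hec_host splunkendpoint
  hec_port 443
  hec_token "#{ENV['SPLUNK_HEC_TOKEN']}"
  <buffer>
    @type file
    path /buffers/splunk
    retry_type exponential_backoff
    retry_max_interval 60   # cap the backoff so recovery is picked up quickly
    retry_timeout 1h        # give transient 502s time to clear before discarding
    chunk_limit_size 5MB
    total_limit_size 2GB
  </buffer>
</match>
```

With a file buffer, chunks that hit the 502 are retained across fluentd restarts instead of being lost from memory.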

Fluent Bit configuration:

[SERVICE]
Flush 1
Grace 5
Daemon Off
Log_Level warning
Parsers_File parsers.conf
Coro_Stack_Size 24576
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /buffers

[INPUT]
Name tail
DB /tail-db/tail-containers-state.db
DB.locking true
Exclude_Path kube-system,cnrm-system,monitoring,bats-test,management-system,argocd,managed-operators,configconnector-operator-system
Mem_Buf_Limit 128MB
Parser cri
Path /var/log/containers/*.log
Refresh_Interval 5
Skip_Long_Lines On
Tag kubernetes.*

[FILTER]
Name kubernetes
Buffer_Size 0
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Tag_Prefix kubernetes.var.log.containers
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_URL https://kubernetes.default.svc:443
Match kubernetes.*
Merge_Log On
Use_Kubelet Off

[OUTPUT]
Name forward
Match *
Host splunk-fluentd.monitoring.svc.cluster.local
Port 24240

net.keepalive on
net.keepalive_idle_timeout 30
net.keepalive_max_recycle 100
Retry_Limit 50
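With this configuration the forward output retries in memory only: storage.path is set in [SERVICE], but no storage.type is set on the input, so chunks that cannot be flushed while splunk-fluentd is refusing connections are dropped once Retry_Limit 50 is exhausted. A minimal sketch of enabling filesystem buffering so chunks survive the outage, reusing the existing /buffers path (the 1G limit is illustrative, not a recommendation):

```
# Sketch only: filesystem-backed buffering for the existing pipeline.
[SERVICE]
    storage.path /buffers
    storage.sync normal

[INPUT]
    Name tail
    storage.type filesystem

[OUTPUT]
    Name forward
    Match *
    Host splunk-fluentd.monitoring.svc.cluster.local
    Port 24240
    storage.total_limit_size 1G
```

This does not fix the "Connection refused" itself (the upstream service has no listening endpoint at that moment), but it bounds data loss while the upstream recovers.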

To Reproduce

  • Rubular link if applicable:
  • Example log message if applicable:
{"log":"YOUR LOG MESSAGE HERE","stream":"stdout","time":"2018-06-11T14:37:30.681701731Z"}
  • Steps to reproduce the problem:

Expected behavior
We would like to identify and fix the errors affecting these pods.

Screenshots

Your Environment
All environments are affected.
Version used:
- logging-operator 3.17.9 (chart repository: https://kubernetes-charts.banzaicloud.com)
- fluent/fluent-bit:1.9.5
- fluentd:v1.14.6-alpine-5

  • Configuration:
  • Environment name and version (e.g. Kubernetes? What version?):
  • Server type and version:
  • Operating System and version:
  • Filters and plugins:

Additional context
