Describe the bug
I have a Fluentd setup that publishes metrics to a Prometheus server. I have the HTTP output plugin enabled as my HTTP output, and I am using a file buffer in that output plugin.
The Prometheus metric fluentd_output_status_buffer_total_bytes for this output plugin is unreliable: it is supposed to report the size of the file buffer, but if you compare its value with the actual contents of the filesystem folder backing the buffer, the two can simply disagree. I suspect a concurrency issue in the way Fluentd calculates the value.
I had a case where a simple restart of the Fluentd Pod instantly eliminated a reported buffer size of 528 MB, even though the buffer on disk was never that large to begin with.
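For reference, here is roughly how I did the comparison. This is a minimal Python sketch, assuming the fluent-plugin-prometheus endpoint is exposed on localhost:24231 (the plugin's usual default) and using the buffer path from my configuration below; adjust both to your deployment:

import os
import urllib.request

METRICS_URL = "http://localhost:24231/metrics"  # assumption: default fluent-plugin-prometheus port
BUFFER_DIR = "/var/log/fluentd"                 # file buffer path from the configuration below

# Reported size: sum all fluentd_output_status_buffer_total_bytes samples.
reported = 0.0
with urllib.request.urlopen(METRICS_URL) as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith("fluentd_output_status_buffer_total_bytes"):
            reported += float(line.rsplit(" ", 1)[1])

# Actual size: total bytes of the buffer chunk files on disk.
actual = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, names in os.walk(BUFFER_DIR)
    for name in names
)

print(f"metric reports {reported:.0f} bytes, disk holds {actual} bytes")

In healthy operation the two numbers track each other closely; in my broken case the metric was far above what was actually on disk.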
To Reproduce
I configured Fluentd like this:

<store>
  @type http
  endpoint https://...
  json_array true # Ingestion service expects a JSON array
  open_timeout 90
  read_timeout 90
  ....
Using these settings I could observe the wrong metrics. When I went back to a 60 s timeout, the metrics were correct (60 s is the default timeout on the server side). So I assume that Fluentd does not handle well the case where the HTTP server closes the connection on its side while the HTTP output plugin is still holding on to it.
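To illustrate the failure mode I suspect: with keep-alive reuse and client timeouts longer than the server's idle timeout, the client can end up writing to a socket the server has already closed. A minimal Python sketch of that situation (ingest.example.com and /ingest are hypothetical placeholders, not the real service):

import http.client

# Hypothetical endpoint standing in for the real ingestion service.
conn = http.client.HTTPSConnection("ingest.example.com", timeout=90)

conn.request("POST", "/ingest", body=b"[]",
             headers={"Content-Type": "application/json"})
conn.getresponse().read()  # first request succeeds; connection is kept alive

# ... more than 60 s pass, and the server drops the idle connection on its side ...

try:
    # The client still believes the socket is usable and reuses it.
    conn.request("POST", "/ingest", body=b"[]",
                 headers={"Content-Type": "application/json"})
    conn.getresponse().read()
except (http.client.RemoteDisconnected, BrokenPipeError, ConnectionResetError):
    conn.close()  # this is the case the plugin has to detect and recover from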
Expected behavior
The HTTP output plugin should check the connection status and better handle the case where a connection is unexpectedly closed by the server.
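For example (a sketch of the idea only, not the plugin's actual code), the send path could fall back to a fresh connection whenever the reused one turns out to be stale:

import http.client

def post_with_retry(conn_factory, path, body, headers, conn=None):
    # conn_factory: any callable returning a new http.client.HTTPSConnection.
    conn = conn or conn_factory()
    try:
        conn.request("POST", path, body=body, headers=headers)
        return conn, conn.getresponse()
    except (http.client.RemoteDisconnected, BrokenPipeError, ConnectionResetError):
        conn.close()
        fresh = conn_factory()  # retry exactly once on a brand-new socket
        fresh.request("POST", path, body=body, headers=headers)
        return fresh, fresh.getresponse()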
Your Environment
- Fluentd version: v1.18-debian-elasticsearch7-1
- Docker image: fluent/fluentd-kubernetes-daemonset:v1.18-debian-elasticsearch7-1

Your Configuration
<source>
  @type tail
  @id in_tail_container_logs
  @label @KUBERNETES
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag "#{ENV['FLUENT_CONTAINER_TAIL_TAG'] || 'kubernetes.*'}"
  read_from_head true
  follow_inodes true
  max_line_size 25000
  <parse>
    @type cri
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%L%z
    keep_time_key false
  </parse>
</source>

<label @NORMAL>
  <match **>
    @type copy # copies all log lines to multiple <store> sections
    # Send the logs to Elasticsearch
    <store>
      @type elasticsearch
      @id out_es2
      @log_level info
      ...
      </buffer>
    </store>
    <store>
      @type http
      endpoint https://...
      json_array true
      open_timeout 60
      read_timeout 60
      <auth>
        method aws_sigv4
        aws_service osis
        aws_region eu-central-1
      </auth>
      <format>
        @type json
      </format>
      reuse_connections true
      <buffer>
        @type file
        flush_mode interval
        path /var/log/fluentd/
        chunk_limit_size 11M
        total_limit_size 15G
        flush_interval 1s
        flush_thread_count 10
        flush_at_shutdown true
        retry_max_interval 30
        retry_timeout 3600
        overflow_action drop_oldest_chunk
      </buffer>
    </store>
  </match>
</label>

Your Error Log
No reported errors in the log that could be linked to this problem.

Additional context
No response