Prometheus metric fluentd_output_status_buffer_total_bytes is unreliable #5232

@JoergStrebel

Description

Describe the bug

I have a fluentd setup that publishes metrics to a Prometheus server. I have an http output plugin enabled as my HTTP output, and it uses a file buffer.

The Prometheus metric fluentd_output_status_buffer_total_bytes for this output plugin is quite unreliable. It is supposed to report the size of the file buffer, but if you compare its value with the actual contents of the filesystem directory used for the buffer, the two may not match in some cases. I suspect there is a concurrency issue in the way fluentd calculates the value.

I had a case where a simple fluentd Pod restart instantly eliminated a reported buffer size of 528 MB! The buffer was never that large on disk to begin with.
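The mismatch can be checked directly by summing the metric from fluentd's Prometheus endpoint and comparing it with a recursive size of the buffer directory. A minimal sketch, assuming the fluent-plugin-prometheus exporter on its default port 24231 and the buffer path `/var/log/fluentd/` from the configuration below (both are assumptions, not from the report):

```python
import os
import urllib.request

METRICS_URL = "http://localhost:24231/metrics"  # fluent-plugin-prometheus default port (assumed)
BUFFER_DIR = "/var/log/fluentd/"                # file buffer path from the config below
METRIC = "fluentd_output_status_buffer_total_bytes"

def reported_buffer_bytes(metrics_text: str) -> float:
    """Sum all samples of the buffer-size metric in a Prometheus
    text-format exposition (with or without labels)."""
    total = 0.0
    for line in metrics_text.splitlines():
        # Match the exact metric name, followed by labels or the value.
        if line.startswith(METRIC) and line[len(METRIC):len(METRIC) + 1] in (" ", "{"):
            total += float(line.rsplit(" ", 1)[-1])
    return total

def on_disk_buffer_bytes(path: str) -> int:
    """Sum the sizes of all buffer chunk files under the buffer directory."""
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )

def fetch_metrics(url: str = METRICS_URL) -> str:
    """Fetch the raw metrics page from fluentd's Prometheus exporter."""
    return urllib.request.urlopen(url).read().decode()
```

Running `reported_buffer_bytes(fetch_metrics())` next to `on_disk_buffer_bytes(BUFFER_DIR)` while the bug is being reproduced would show how far the metric drifts from the actual on-disk size.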

To Reproduce

I configured fluentd like this:

<store>
          @type http
          endpoint https://...
          json_array true   #Ingestion service expects a JSON array
          open_timeout 90
          read_timeout 90
....

With these settings I could observe the wrong metrics. When I went back to a 60 s timeout, the metrics were correct (60 s was the default timeout on the server side). So I assume fluentd does not handle well the case where the HTTP server closes the connection on its side while the http output plugin is still holding on to it.
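Given that observation, a workaround is to keep the client-side timeouts at or below the server's idle timeout. A minimal sketch of the relevant fragment (endpoint elided as in the report; the 60 s values match the server-side default mentioned above):

```
<store>
  @type http
  endpoint https://...
  open_timeout 60   # keep at or below the server's idle timeout
  read_timeout 60
</store>
```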

Expected behavior

The http output plugin should check the HTTP connection status and better handle the case where a connection is closed unexpectedly.

Your Environment

- Fluentd version: v1.18-debian-elasticsearch7-1
- Docker image: fluent/fluentd-kubernetes-daemonset:v1.18-debian-elasticsearch7-1

Your Configuration

<source>
      @type tail
      @id in_tail_container_logs
      @label @KUBERNETES
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag "#{ENV['FLUENT_CONTAINER_TAIL_TAG'] || 'kubernetes.*'}"
      read_from_head true
      follow_inodes true
      max_line_size 25000
      <parse>
        @type cri
        time_key time
        time_format %Y-%m-%dT%H:%M:%S.%L%z
        keep_time_key false
      </parse>
    </source>


    <label @NORMAL>
     
      <match **>
        @type copy  #copies all log lines to multiple <store> sections
                    
        # Send the logs to Elasticsearch
        <store>
          @type elasticsearch
          @id out_es2
          @log_level info
          ...
          </buffer>
        </store>        
        <store>
          @type http
          endpoint https://...
          json_array true 
          open_timeout 60 
          read_timeout 60
          <auth>
            method aws_sigv4
            aws_service osis
            aws_region eu-central-1
          </auth>
          <format>
            @type json
          </format>
          reuse_connections true          
          <buffer>
            @type file
            flush_mode interval
            path /var/log/fluentd/
            chunk_limit_size 11M 
            total_limit_size 15G
            flush_interval 1s
            flush_thread_count 10
            flush_at_shutdown true
            retry_max_interval 30
            retry_timeout 3600
            overflow_action drop_oldest_chunk            
          </buffer>
        </store>
      </match>
    </label>

Your Error Log

No errors in the log that could be linked to this problem.

Additional context

No response
