in_monitor_agent: retry_count and slow_flush_count not resetting to zero after successful retry #3509

Closed
g3kr opened this issue Sep 15, 2021 · 5 comments
Labels
stale, waiting-for-user

Comments


g3kr commented Sep 15, 2021

Describe the bug

We are using in_monitor_agent to monitor metrics from fluentd, and we send out alerts based on the emitted metrics. We observed that the retry_count and slow_flush_count metrics do not reset to zero once things recover. Unless you restart the fluentd process/task, these numbers keep incrementing.

To Reproduce

Run fluentd with the config below and force a retry to happen by sending a large number of logs to Fluentd. Query the retry_count metric and observe that, after a successful retry, the count has not been reset.
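
For quick inspection, a small script along these lines can poll in_monitor_agent's HTTP endpoint and print the counters in question. This is a sketch only; it assumes the default endpoint http://localhost:24220/api/plugins.json, which the configuration below does not override.

    # Sketch: poll in_monitor_agent's HTTP endpoint and print the counters.
    # Assumes the default endpoint (http://localhost:24220/api/plugins.json);
    # adjust the URL if bind/port are set in the <source> block.
    import json
    import urllib.request

    def print_counters(url="http://localhost:24220/api/plugins.json"):
        with urllib.request.urlopen(url) as resp:
            plugins = json.load(resp)["plugins"]
        for p in plugins:
            if p.get("output_plugin"):
                print(p.get("plugin_id"),
                      "retry_count:", p.get("retry_count"),
                      "slow_flush_count:", p.get("slow_flush_count"))

    if __name__ == "__main__":
        print_counters()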

Expected behavior

retry_count and slow_flush_count are set back to 0 after a successful retry.

Your Environment

- Fluentd version: fluentd -v 1.12.4
- Environment: Docker running on Amazon Linux 2

Your Configuration

<source>
  @type monitor_agent
  @id in_monitor_agent
  @log_level info
  @label @INTERNALMETRICS
  tag "monitor.#{ENV['TaskID']}"
  emit_interval 60
</source>

<label @INTERNALMETRICS>
  <filter monitor.**>
    @type record_modifier
    <record>
      TaskID "#{ENV['TaskID']}"
      ECS_CLUSTER "#{ENV['ECS_CLUSTER_NAME']}"
      @timestamp ${require 'time'; Time.at(time).strftime('%Y-%m-%dT%H:%M:%S.%3N')}
    </record>
  </filter>
  <match monitor.**>
    @type copy
    <store>
      @type stdout
    </store>
    <store>
      @type elasticsearch
      host "#{ENV['ES_HOSTNAME']}"
      port 9243
      user "#{ENV['ES_USERNAME']}"
      password "#{ENV['ES_PASSWORD']}"
      scheme https
      with_transporter_log true
      ssl_verify false
      ssl_version TLSv1_2
      index_name "#{ENV['ES_index']}"
      reconnect_on_error true
      reload_connections false
      reload_on_failure true
      suppress_type_name true
      request_timeout 30s
      prefer_oj_serializer true
      type_name _doc
    </store>
  </match>
</label>

Your Error Log

{
        "_index" : "agg-metrics",
        "_type" : "_doc",
        "_id" : "WqrN6nsBy9uSnPxiS8mH",
        "_score" : 0.0,
        "_source" : {
          "plugin_id" : "es_output",
          "plugin_category" : "output",
          "type" : "elasticsearch",
          "output_plugin" : true,
          "buffer_queue_length" : 0,
          "buffer_timekeys" : [ ],
          "buffer_total_queued_size" : -135483,
          "retry_count" : 76,
          "emit_records" : 1614787,
          "emit_count" : 416410,
          "write_count" : 66858,
          "rollback_count" : 76,
          "slow_flush_count" : 39,
          "flush_time_count" : 40485392,
          "buffer_stage_length" : 1,
          "buffer_stage_byte_size" : 19338,
          "buffer_queue_byte_size" : -154821,
          "buffer_available_buffer_space_ratios" : 100.0,
          "TaskID" : "0da43d9abf1d492dbae9bb14c5bdqazx",
          "ECS_CLUSTER" : "aggregator-service-ECSCluster",
          "@timestamp" : "2021-09-15T18:51:04.842"
        }
}

Additional context

No response

@g3kr g3kr changed the title in_monitor_agent: retry_count not resetting to zero after successful retry in_monitor_agent: retry_count and slow_flush_count not resetting to zero after successful retry Sep 15, 2021

g3kr commented Sep 15, 2021

@repeatedly any observation/thoughts on this?


cosmo0920 commented Sep 16, 2021

These are cumulative counters, not gauges (resettable metrics), so not resetting them is the expected behavior.
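
A common workaround when only cumulative counters are available is to alert on the per-interval increase rather than the absolute value. A minimal sketch follows; the field names come from the record in the issue, the helper name is illustrative, and how the previous sample is stored is left out.

    # Sketch: derive per-interval deltas from cumulative counters for alerting.
    # `previous` is whatever was stored from the last emit_interval; the names
    # here are illustrative, not part of fluentd itself.
    def counter_delta(current, previous, field):
        cur = current.get(field, 0)
        prev = previous.get(field, 0)
        # A counter that goes down usually means fluentd restarted,
        # so treat the current value itself as the increase.
        return cur if cur < prev else cur - prev

    current = {"retry_count": 76, "slow_flush_count": 39}
    previous = {"retry_count": 76, "slow_flush_count": 38}

    if counter_delta(current, previous, "slow_flush_count") > 0:
        print("alert: slow flushes occurred in the last interval")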

@cosmo0920 cosmo0920 added the waiting-for-user label Sep 16, 2021

g3kr commented Sep 16, 2021

@cosmo0920 Thanks for getting back on this. In that case, is there a metric we can use for alerting on anomalies?

@github-actions

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

@github-actions github-actions bot added the stale label Dec 16, 2021

kenhys commented Dec 17, 2021

@g3kr

Maybe the count of retry steps will help:

https://docs.fluentd.org/input/monitor_agent#retry
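
The linked section describes a transient retry object that is only present while an output plugin is actively retrying, which is closer to a resettable signal than the cumulative counters. A sketch that alerts on it is below; it assumes the default HTTP endpoint and the field names from the linked docs (e.g. steps).

    # Sketch: alert on the transient "retry" object from /api/plugins.json.
    # Assumes it appears only while a plugin is retrying (per the linked docs),
    # carries a "steps" field, and disappears once the retry succeeds.
    import json
    import urllib.request

    def plugins_in_retry(url="http://localhost:24220/api/plugins.json"):
        with urllib.request.urlopen(url) as resp:
            plugins = json.load(resp)["plugins"]
        return [(p.get("plugin_id"), p["retry"].get("steps"))
                for p in plugins if p.get("retry")]

    for plugin_id, steps in plugins_in_retry():
        print(f"alert: {plugin_id} is retrying (steps={steps})")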

@kenhys kenhys closed this as completed Dec 17, 2021