in_monitor_agent: retry_count and slow_flush_count not resetting to zero after successful retry #3509

Closed
g3kr opened this issue Sep 15, 2021 · 5 comments
Labels
stale, waiting-for-user

Comments


g3kr commented Sep 15, 2021

Describe the bug

We are using in_monitor_agent to monitor metrics from fluentd, and we send out alerts based on the emitted metrics. We observed that the retry_count and slow_flush_count metrics do not reset to zero once things recover. Unless you restart the fluentd process/task, these numbers keep incrementing.

To Reproduce

Run fluentd with the config below and force a retry to happen by sending a large number of logs to Fluentd. Query the retry_count metric and observe that, after a successful retry, the count has not been reset.
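
For quick inspection, a small script along these lines can poll in_monitor_agent's HTTP endpoint and print the counters in question. This is a sketch only; it assumes the default endpoint http://localhost:24220/api/plugins.json, which the configuration below does not override.

    # Sketch: poll in_monitor_agent's HTTP endpoint and print the counters.
    # Assumes the default endpoint (http://localhost:24220/api/plugins.json);
    # adjust the URL if bind/port are set in the <source> block.
    import json
    import urllib.request

    def print_counters(url="http://localhost:24220/api/plugins.json"):
        with urllib.request.urlopen(url) as resp:
            plugins = json.load(resp)["plugins"]
        for p in plugins:
            if p.get("output_plugin"):
                print(p.get("plugin_id"),
                      "retry_count:", p.get("retry_count"),
                      "slow_flush_count:", p.get("slow_flush_count"))

    if __name__ == "__main__":
        print_counters()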

Expected behavior

retry_count and slow_flush_count are set back to 0 after a successful retry.

Your Environment

- Fluentd version: fluentd -v 1.12.4
- Environment: Docker running on Amazon Linux 2

Your Configuration

<source>
  @type monitor_agent
  @id in_monitor_agent
  @log_level info
  @label @INTERNALMETRICS
  tag "monitor.#{ENV['TaskID']}"
  emit_interval 60
</source>

<label @INTERNALMETRICS>
  <filter monitor.**>
    @type record_modifier
    <record>
      TaskID "#{ENV['TaskID']}"
      ECS_CLUSTER "#{ENV['ECS_CLUSTER_NAME']}"
      @timestamp ${require 'time'; Time.at(time).strftime('%Y-%m-%dT%H:%M:%S.%3N')}
    </record>
  </filter>
  <match monitor.**>
    @type copy
    <store>
      @type stdout
    </store>
    <store>
      @type elasticsearch
      host "#{ENV['ES_HOSTNAME']}"
      port 9243
      user "#{ENV['ES_USERNAME']}"
      password "#{ENV['ES_PASSWORD']}"
      scheme https
      with_transporter_log true
      ssl_verify false
      ssl_version TLSv1_2
      index_name "#{ENV['ES_index']}"
      reconnect_on_error true
      reload_connections false
      reload_on_failure true
      suppress_type_name true
      request_timeout 30s
      prefer_oj_serializer true
      type_name _doc
    </store>
  </match>
</label>

Your Error Log

{
        "_index" : "agg-metrics",
        "_type" : "_doc",
        "_id" : "WqrN6nsBy9uSnPxiS8mH",
        "_score" : 0.0,
        "_source" : {
          "plugin_id" : "es_output",
          "plugin_category" : "output",
          "type" : "elasticsearch",
          "output_plugin" : true,
          "buffer_queue_length" : 0,
          "buffer_timekeys" : [ ],
          "buffer_total_queued_size" : -135483,
          "retry_count" : 76,
          "emit_records" : 1614787,
          "emit_count" : 416410,
          "write_count" : 66858,
          "rollback_count" : 76,
          "slow_flush_count" : 39,
          "flush_time_count" : 40485392,
          "buffer_stage_length" : 1,
          "buffer_stage_byte_size" : 19338,
          "buffer_queue_byte_size" : -154821,
          "buffer_available_buffer_space_ratios" : 100.0,
          "TaskID" : "0da43d9abf1d492dbae9bb14c5bdqazx",
          "ECS_CLUSTER" : "aggregator-service-ECSCluster",
          "@timestamp" : "2021-09-15T18:51:04.842"
        }
}

Additional context

No response

@g3kr g3kr changed the title in_monitor_agent: retry_count not resetting to zero after successful retry in_monitor_agent: retry_count and slow_flush_count not resetting to zero after successful retry Sep 15, 2021

g3kr commented Sep 15, 2021

@repeatedly any observation/thoughts on this?


cosmo0920 commented Sep 16, 2021

These are cumulative counters, not gauges (resettable metrics), so not resetting them is the expected behavior.
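
A common workaround when only cumulative counters are available is to alert on the per-interval increase rather than the absolute value. A minimal sketch follows; the field names come from the record in the issue, the helper name is illustrative, and how the previous sample is stored is left out.

    # Sketch: derive per-interval deltas from cumulative counters for alerting.
    # `previous` is whatever was stored from the last emit_interval; the names
    # here are illustrative, not part of fluentd itself.
    def counter_delta(current, previous, field):
        cur = current.get(field, 0)
        prev = previous.get(field, 0)
        # A counter that goes down usually means fluentd restarted,
        # so treat the current value itself as the increase.
        return cur if cur < prev else cur - prev

    current = {"retry_count": 76, "slow_flush_count": 39}
    previous = {"retry_count": 76, "slow_flush_count": 38}

    if counter_delta(current, previous, "slow_flush_count") > 0:
        print("alert: slow flushes occurred in the last interval")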

@cosmo0920 cosmo0920 added the waiting-for-user label Sep 16, 2021

g3kr commented Sep 16, 2021

@cosmo0920 Thanks for getting back on this. In that case, is there a metric we can use for alerting on anomalies?

@github-actions

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

@github-actions github-actions bot added the stale label Dec 16, 2021

kenhys commented Dec 17, 2021

@g3kr

Maybe the count of retry steps will help:

https://docs.fluentd.org/input/monitor_agent#retry
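
The linked section describes a transient retry object that is only present while an output plugin is actively retrying, which is closer to a resettable signal than the cumulative counters. A sketch that alerts on it is below; it assumes the default HTTP endpoint and the field names from the linked docs (e.g. steps).

    # Sketch: alert on the transient "retry" object from /api/plugins.json.
    # Assumes it appears only while a plugin is retrying (per the linked docs),
    # carries a "steps" field, and disappears once the retry succeeds.
    import json
    import urllib.request

    def plugins_in_retry(url="http://localhost:24220/api/plugins.json"):
        with urllib.request.urlopen(url) as resp:
            plugins = json.load(resp)["plugins"]
        return [(p.get("plugin_id"), p["retry"].get("steps"))
                for p in plugins if p.get("retry")]

    for plugin_id, steps in plugins_in_retry():
        print(f"alert: {plugin_id} is retrying (steps={steps})")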

@kenhys kenhys closed this as completed Dec 17, 2021