Handling logs of huge volume with fluent-bit/fluentd #3646

@g3kr

Describe the bug

We often see a huge influx of logs from the log collectors (fluent-bit) of certain apps running on ECS. This causes the log aggregator (fluentd, also running on ECS) to restart and eventually makes Elasticsearch, our data storage and analytics layer, unstable. We have tried a few ways to alleviate this:

  1. Introduce throttling at the aggregator level (drop messages once they exceed a threshold). No manual intervention is needed.

  2. Match the log messages of the specific app and drop them before the aggregator does any processing. This needs manual intervention: the aggregator has to be redeployed with the filter.

  3. Introduce throttling at the collector level (throttle filter plugin). This still needs to be tested.

While option 1 works when the log volume is reasonable, it has not proven useful when the volume is huge (3 million records in 1 hour, roughly 830 records per second on average). The log aggregator container keeps restarting, and every restart resets the throttle limits.
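To illustrate why restarts defeat option 1, here is a minimal Python sketch of a per-app fixed-window throttle. It is not the throttle plugin's actual implementation: the class name `WindowThrottle` and its parameters are hypothetical, and the only assumption carried over is that the counters are held in process memory.

```python
import time


class WindowThrottle:
    """Hypothetical per-group fixed-window throttle (illustration only)."""

    def __init__(self, limit, period_s):
        self.limit = limit        # max records per group per window
        self.period_s = period_s  # window length in seconds
        # Counters live only in process memory, so a restart loses them.
        self._windows = {}        # group -> (window_start, count)

    def allow(self, group, now=None):
        """Return True if the record may pass, False if it should be dropped."""
        now = time.monotonic() if now is None else now
        start, count = self._windows.get(group, (now, 0))
        if now - start >= self.period_s:
            start, count = now, 0  # the window expired, start a new one
        count += 1
        self._windows[group] = (start, count)
        return count <= self.limit


throttle = WindowThrottle(limit=2, period_s=60)
results = [throttle.allow("noisy-app", now=t) for t in (0, 1, 2)]

# A "restart" creates a fresh object, so the noisy app gets a full allowance again.
restarted = WindowThrottle(limit=2, period_s=60)
after_restart = restarted.allow("noisy-app", now=3)
```

In this sketch the third record is dropped, but the same app is accepted again right after the simulated restart. If the aggregator restarts more often than the window elapses, the limit is never enforced for long.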

With option 2 we have been able to resolve the problem, but it needs an operator to restart the log aggregator with a filter for the specific app whose logs we want to drop. We normally do this by updating the AWS CloudFormation (CF) stack.

To Reproduce

Run an application that generates millions of records within a few minutes and writes them to fluentd.

Expected behavior

N/A

Your Environment

- Fluentd version: 1.14.2
- TD Agent version: N/A
- Operating system: Amazon Fargate Linux 2
- Kernel version:

Your Configuration

Current fluentd config. APP_LOGS_DROP must be set to the app that creates the huge influx of logs (it is used as the match pattern below), after which the aggregator container is restarted.

  <match "#{ENV['APP_LOGS_DROP']}">
    @type null
  </match>
  <match **>
    @type relabel
    @label @throttle
  </match>
</label>
<label @throttle>
  <filter log.**>
    @type record_modifier
    <record>
      app ${tag_parts[1]}
    </record>
  </filter>
  <filter log.**>
    @type throttle
    group_key app
    group_bucket_period_s   "#{ENV['THROTTLE_PERIOD']}"
    group_bucket_limit      "#{ENV['THROTTLE_LIMIT']}"
    group_reset_rate_s      "#{ENV['THROTTLE_RESET_RATE']}"
  </filter>
  <match log.**>
    @type relabel
    @label @continue
  </match>

Your Error Log

N/A

Additional context

I want to know whether there are other ways to approach this issue, and also ways to automate option 2.
Currently we only learn about the huge log volume through watcher alerts in Elasticsearch, once it has become unstable.
Is there a way to put conditional logic into the fluentd config that aggregates the number of records received for a particular app (tag) in a given period of time and drops them based on a condition? Basically, we are looking for ways to avoid manual intervention. Thanks in advance.
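To make the request concrete, the logic I have in mind could look like the following Python sketch. It only models the desired behavior and is not Fluentd configuration or plugin code: `find_noisy_tags`, `drop_noisy`, their parameters, and the sample tags are hypothetical. It counts records per tag within a time window and returns the tags whose count exceeds a limit, which is the value an operator currently supplies by hand through APP_LOGS_DROP.

```python
from collections import Counter


def find_noisy_tags(events, window_start, period_s, limit):
    """Return the tags with more than `limit` records in the window.

    `events` is an iterable of (timestamp_seconds, tag) pairs; the window is
    [window_start, window_start + period_s).
    """
    window_end = window_start + period_s
    counts = Counter(tag for ts, tag in events if window_start <= ts < window_end)
    return {tag for tag, n in counts.items() if n > limit}


def drop_noisy(events, noisy):
    """Keep only the events whose tag is not in the noisy set."""
    return [(ts, tag) for ts, tag in events if tag not in noisy]


# Hypothetical sample data: app-a sends three records in the first minute.
events = [
    (0, "log.app-a"),
    (1, "log.app-b"),
    (2, "log.app-a"),
    (3, "log.app-a"),
    (70, "log.app-b"),
]
noisy = find_noisy_tags(events, window_start=0, period_s=60, limit=2)
kept = drop_noisy(events, noisy)
```

Here only `log.app-a` exceeds the limit within the window, so its records are dropped while `log.app-b` passes through untouched.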
