Handling logs of huge volume with fluent-bit/fluentd #3646

### Describe the bug

We often see a huge influx of logs from the log collectors (fluent-bit) of certain apps running on ECS. This causes the log aggregator (fluentd, also running on ECS) to restart and eventually makes Elasticsearch, our data storage and analytics layer, unstable. We have tried a few ways to alleviate this:

1. Introduce throttling at the aggregator level (drop messages once the volume exceeds a threshold). No manual intervention is needed.

2. Match log messages of the specific app and drop them before the aggregator does any processing. The aggregator must be redeployed with the filter, so this needs manual intervention.

3. Introduce throttling at the collector level (throttle filter plugin). This still needs to be tested.

While option 1 works when the log volume is reasonable, it has not proven useful when the log volume is huge (3 million records in 1 hour). The log aggregator container keeps restarting, which resets the throttle limits with every restart.

Option 2 has successfully resolved the problem, but it needs an operator to restart the log aggregator with a filter for the specific app whose logs we want to drop. We normally update the AWS CloudFormation (CF) stack to achieve this.



### To Reproduce

Run an application that generates millions of records within a few minutes and writes them to fluentd.
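
A load generator along these lines can approximate the scenario. It is a minimal sketch, not the application we actually run: the record fields, the `noisy-app` name, and the assumption that the collector picks up JSON lines from the container's stdout are all illustrative.

```python
import json
import sys
import time
from typing import Iterator, TextIO


def make_record(seq: int, app: str = "noisy-app") -> str:
    """Build one JSON log line. The field names are hypothetical."""
    return json.dumps(
        {
            "app": app,
            "seq": seq,
            "ts": time.time(),
            "message": f"synthetic log line {seq}",
        }
    )


def generate(count: int, app: str = "noisy-app") -> Iterator[str]:
    """Yield `count` log lines lazily so memory use stays flat."""
    for seq in range(count):
        yield make_record(seq, app)


def emit(count: int, out: TextIO = sys.stdout) -> int:
    """Write `count` log lines to `out` and return how many were written."""
    written = 0
    for line in generate(count):
        out.write(line + "\n")
        written += 1
    out.flush()
    return written
```

Calling `emit(3_000_000)` from the container's entrypoint should produce a burst comparable to the volume described above, although the actual rate depends on the task's CPU allocation.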

### Expected behavior

N/A

### Your Environment

```markdown
- Fluentd version: 1.14.2
- TD Agent version: N/A
- Operating system: Amazon Fargate Linux 2
- Kernel version:
```


### Your Configuration

```apache
# Current fluentd config. APP_LOGS_DROP needs to be set to the app that creates
# the huge influx of logs, and the aggregator container must then be restarted.

  <match "#{ENV['APP_LOGS_DROP']}">
    @type null
  </match>
  <match **>
    @type relabel
    @label @throttle
  </match>
</label>
<label @throttle>
  <filter log.**>
    @type record_modifier
    <record>
      app ${tag_parts[1]}
    </record>
  </filter>
  <filter log.**>
    @type throttle
    group_key app
    group_bucket_period_s   "#{ENV['THROTTLE_PERIOD']}"
    group_bucket_limit      "#{ENV['THROTTLE_LIMIT']}"
    group_reset_rate_s      "#{ENV['THROTTLE_RESET_RATE']}"
  </filter>
  <match log.**>
    @type relabel
    @label @continue
  </match>
```


### Your Error Log

```shell
N/A
```


### Additional context

I want to know if there are other ways to approach this issue, and also ways to automate option 2.
Currently we only learn about the huge log volume through watcher alerts in Elasticsearch, once it has already become unstable.
Is there a way to inject something like conditional logic into the fluentd config that aggregates the number of records received for a particular app (tag) in a given period of time and drops them based on a condition? Basically, I am looking for ways to avoid manual intervention. Thanks in advance.
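
To make the request concrete, the logic I have in mind looks roughly like the sketch below. It is plain Python rather than fluentd config, and the class name, the thresholds, and the fixed-window counting scheme are assumptions for illustration, not an existing fluentd feature.

```python
from typing import Dict, Tuple


class TagWindowLimiter:
    """Count records per tag in fixed time windows; drop once a limit is exceeded."""

    def __init__(self, limit: int, period_s: float) -> None:
        self.limit = limit          # max records allowed per tag per window
        self.period_s = period_s    # window length in seconds
        # tag -> (window index, records seen so far in that window)
        self._state: Dict[str, Tuple[int, int]] = {}

    def allow(self, tag: str, now: float) -> bool:
        """Return True if the record should be forwarded, False if dropped."""
        window = int(now // self.period_s)
        seen_window, count = self._state.get(tag, (window, 0))
        if seen_window != window:
            # A new window has started, so the counter for this tag resets.
            count = 0
        count += 1
        self._state[tag] = (window, count)
        return count <= self.limit


# Hypothetical usage: at most 2 records per tag per 60-second window.
limiter = TagWindowLimiter(limit=2, period_s=60)
decisions = [limiter.allow("log.app1", t) for t in (0.0, 1.0, 2.0, 61.0)]
```

Note that the counters here live only in memory, so a sketch like this would lose its state on every container restart, which is the same weakness we hit with option 1.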
