memory usage seems to be growing indefinitely with simple forward configuration #2926

Closed
sp0f opened this issue Apr 2, 2020 · 19 comments

@sp0f

sp0f commented Apr 2, 2020

Describe the bug
Memory usage is growing indefinitely with very simple @type forward configuration. I'm shipping about 10-30G of logs daily. Mostly during evening hours. td-agent at the beginning is using about 100M of RES memory but after about 12h it's 500M and half of this time it's mostly idled because there are not many logs during the night. Memory is never freed. When I switch to any 'local' output like file or stdout memory usage is very stable. I've seen the same behavior with elasticsearch output so I guess it can be something connected with network outputs ... or just my stupidity :) There are no errors in the log file. I've tried to set RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 but it didn't fix the problem. Memory usage is still growing. I have two workers in my configuration This problem only occurs on the one with a high amount of logs generated during day hours (worker 0). A worker is reading about 50 files at once. It may be relevant that I have a lot of pattern not matched - I'm in the middle of standardizing log format for all apps.

To Reproduce
Run with a high volume of imperfectly formatted logs.

Expected behavior
Stable memory usage.

Your Environment

  • Fluentd or td-agent version: td-agent 1.9.2
  • Operating system: NAME="Amazon Linux" VERSION="2" ID="amzn" ID_LIKE="centos rhel fedora" VERSION_ID="2" PRETTY_NAME="Amazon Linux 2" ANSI_COLOR="0;33" CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2" HOME_URL="https://amazonlinux.com/"
  • Kernel version: 4.14.171-136.231.amzn2.x86_64

Your Configuration

<system>
  workers 2
  log_level warn
</system>

<worker 0>
  <source>
    @type tail
    path "/somepath/*/*.log"
    read_from_head true
    pos_file /var/log/td-agent/mpservice-es-pos-file
    tag mpservice-raw.*
    enable_stat_watcher false
    <parse>
      @type multiline
      time_key time
      time_format %Y-%m-%d %H:%M:%S,%L
      timeout 0.5
      format_firstline /^\d{4}-\d{2}-\d{2}/
      format1 /^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \| (?<level>.*) \| (?<class>.*) \| (?<thread>.*) \| (?<message>.*)/

    </parse>
  </source>

<match *.**>
  @type forward
  <server>
    host ***
    port 24224
  </server>
  <buffer>
    @type memory
    flush_interval 2s
    flush_thread_count 2
  </buffer>
</match>

</worker>

# webapps ES
<worker 1>
  <source>
    @type tail
    path "/somepath/*/*[a-z].log"
    read_from_head true
    pos_file /var/log/td-agent/webapps-es-pos-file
    tag webapps-raw.*
    enable_stat_watcher false
    <parse>
      @type multiline
      format_firstline /^\d{4}-\d{2}-\d{2}/
      format1 /^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \| (?<level>.*) \| (?<class>.*) \| (?<thread>.*) \| (?<message>.*)/
    </parse>
  </source>


<match *.**>
  @type forward
  <server>
    host ***
    port 24224
  </server>
  <buffer>
    @type memory
    flush_interval 2s
    flush_thread_count 2
  </buffer>
</match>
</worker>

Your Error Log

no errors in logs
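
One way to check whether the growth is sitting in the forward output's buffer (rather than being a real leak) would be the built-in monitor_agent input; a minimal sketch, with an arbitrary bind address and port:

<source>
  @type monitor_agent   # built-in input that exposes per-plugin metrics over HTTP
  bind 127.0.0.1
  port 24220
</source>

Querying http://127.0.0.1:24220/api/plugins.json then reports buffer_queue_length and buffer_total_queued_size for each output plugin, which makes buffered-but-unflushed data easy to tell apart from leaked memory.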
@aeber
Contributor

aeber commented Apr 6, 2020

I am currently investigating a similar issue.

My setup:

  • td-agent version: 1.9.2 / 1.10.0
  • Operating system: Debian Buster / Debian Stretch
  • Kernel version: 4.19.0-8-amd64 / 4.9.0-12-amd64

I'm using the forward output with a file buffer to forward logs to an aggregator node:

    <buffer>
      chunk_limit_size 8m
      flush_at_shutdown true
      flush_interval 5s
      path /var/log/td-agent/buffer/forward.*.buf
      total_limit_size 128m
      @type file
    </buffer>

To reproduce the issue I created a dummy input with size=1000 and rate=150 and dropped the received dummy messages on the aggregator node without doing any processing.

By lowering the chunk_limit_size to 1m and the flush_interval to 1s I was able to make memory usage grow by 0.5MB each second.
Setting RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 did not change this behaviour.
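
For reference, this is roughly what the buffer block above looks like with those reproduction values applied (only chunk_limit_size and flush_interval differ from my normal settings):

    <buffer>
      @type file
      chunk_limit_size 1m       # lowered from 8m to speed up reproduction
      flush_at_shutdown true
      flush_interval 1s         # lowered from 5s
      path /var/log/td-agent/buffer/forward.*.buf
      total_limit_size 128m
    </buffer>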

@aeber
Contributor

aeber commented Apr 7, 2020

I just did two more tests:

Null output with buffer

Using the null output with a file buffer does not lead to the increasing memory consumption seen with the forward output.

Affected fluentd versions

I could reproduce this issue with each fluentd version down to 1.7.0. Fluentd
1.6.3 seems to be the last version not affected by this behaviour.

@repeatedly
Member

Thanks. We will check it.

@repeatedly repeatedly added the bug label Apr 7, 2020
@aeber
Contributor

aeber commented Apr 7, 2020

I managed to narrow it down even further: The problem first appeared after merging #2501.

Somehow the SocketCache does not seem to really cache sockets and creates a new one every few seconds.
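
For context, the cache in question backs the keepalive handling in out_forward; a minimal sketch of the relevant knobs (the timeout value is a placeholder, not from my setup):

<match **>
  @type forward
  keepalive true            # connections are supposed to be reused via the socket cache
  keepalive_timeout 90s     # cached sockets are expired after this period
  <server>
    host ***
    port 24224
  </server>
</match>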

@ganmacs
Member

ganmacs commented Apr 8, 2020

@aeber Can you give us the actual config to reproduce it?

Also, how long did you run fluentd?
I tried to reproduce it with the following configs. I did see memory increase, but after a few minutes the increase stopped and memory usage stabilized.

Am I missing something important, or does it depend on the OS?

Sender(fluentd v1.10.0 and macOS(10.15.3))

<source>
  @type dummy
  tag test
  rate 150
  size 1000
</source>

<match test>
  @type forward

  <server>
    host 127.0.0.1
    port 24224
  </server>

  <buffer>
    @type file
    chunk_limit_size 1m
    flush_at_shutdown true
    flush_interval 1s
    path ./tmp/buf
    total_limit_size 128m
  </buffer>
</match>

Receiver(fluentd v1.10.0 and macOS(10.15.3))

<source>
  @type forward
</source>

<match test>
  @type null

  <buffer>
    flush_interval 1s
    chunk_limit_size 5m
  </buffer>
</match>

@aeber
Contributor

aeber commented Apr 9, 2020

The problem only seems to appear when using the tls transport.

The configs I'm using to reproduce this:

Sender (docker image fluentd:v1.9-1)

<source>
  @type dummy
  size 1200
  rate 450
  tag test
</source>

<match **>
  @type forward
  <buffer>
    chunk_limit_size 1m
    flush_at_shutdown true
    flush_interval 1s
    path /fluentd/buffer/forward_graylog.*.buf
    total_limit_size 128m
    @type file
  </buffer>
  heartbeat_interval 1s
  keepalive true
  phi_threshold 16
  recover_wait 10s
  <security>
    self_hostname client
    shared_key sharedkeypassword
  </security>
  send_timeout 60s
  <server>
    host server
    password testpassword
    port 24284
    username testuser
  </server>
  tls_verify_hostname true
  tls_cert_path /fluentd/certs/rootCA.pem
  tls_insecure_mode false
  transport tls
</match>

Receiver (docker image fluentd:v1.9-1)

<source>
  @type forward
  bind 0.0.0.0
  port 24284
  <security>
    self_hostname server
    shared_key sharedkeypassword
    <user>
      password testpassword
      username testuser
    </user>
    user_auth true
  </security>
  <transport tls>
    cert_path /fluentd/certs/server.pem
    insecure false
    private_key_passphrase
    private_key_path /fluentd/certs/server-key.pem
  </transport>
</source>

<match test>
  @type null
</match>

For testing this I created a local CA and server certificate with mkcert and then used the following docker-compose.yml:

version: '2'
services:
  client:
    image: fluentd:v1.9-1
    command:
      - "-v"
    volumes:
      - "./client/fluent.conf:/fluentd/etc/fluent.conf"
      - "./client/certs:/fluentd/certs"
  server:
    image: fluentd:v1.9-1
    command:
      - "-v"
    volumes:
      - "./server/fluent.conf:/fluentd/etc/fluent.conf"
      - "./server/certs:/fluentd/certs"

After running for about 20 minutes on a MacBook I see memory usage of around 200MB for the client container, and rising.
When I run this on a Linux server it only takes around 5-10 minutes to reach the same memory usage (not sure why).

@4ndr4s

4ndr4s commented Apr 12, 2020

I have a similar issue. After updating the fluentd container from version 0.12.26 to 1.9.3 (Alpine flavor), memory started growing some time after the container started running and never stopped until OOM.

[Screenshot: memory usage graph, 2020-04-12 11:54 AM]

This wasn't happening with the same configuration in old versions, so I can confirm @aeber's comment.

configuration:
fluentd version: 1.9.3

fluentd.conf:

<source>
  @type syslog
  port 514
  <transport udp>
  </transport>
  tag system
  @log_level info
</source>

<source>
  @type syslog
  port 514
  <transport tcp>
  </transport>
  tag  system
</source>

<label @FLUENT_LOG>
  <match fluent.**>
    @type null
  </match>
</label>

<filter **>
  @type elasticsearch_genid
  hash_id_key alert_hash
</filter>

<match **>
  @type copy
  <store>
    @type file
    compress gzip
    path /var/log/fluent/service_logs
    <buffer time>
      timekey 86400
      timekey_use_utc true
    </buffer>
    append true
  </store>
  <store>
    @type elasticsearch
    host elasticsearch.example.com
    logstash_format true
    include_tag_key true
    tag_key key
    flush_interval 10s
    id_key alert_hash
    remove_keys alert_hash
  </store>
  <store>
    @type stdout
  </store>
</match>

<system>
  log_level warn
  emit_error_log_interval 300
  path /var/log/fluent/fluentd
</system>

@repeatedly
Member

@4ndr4s Does this happen with the elasticsearch and file buffer combo?
Your graph shows it is not a memory leak. The issue author and you both use a memory buffer.
So if the incoming speed is faster than the output speed, memory usage grows.
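
If the input really is faster than the output, capping the memory buffer makes that visible as back-pressure instead of unbounded growth; a rough sketch with example values, not taken from your config:

<buffer>
  @type memory
  total_limit_size 256m       # cap on buffered data (the default for a memory buffer is 512MB)
  overflow_action block       # or drop_oldest_chunk / throw_exception (the default)
  flush_interval 2s
  flush_thread_count 2
</buffer>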

@4ndr4s

4ndr4s commented Apr 13, 2020

@repeatedly Yes, this happens with the elasticsearch and buffer combo configuration. I'm not sure whether the buffer configuration can influence this behavior; the only change I made besides the version was the buffer block, following the fluentd documentation. This was the previous configuration:

  <store>
    @type file
    compress gzip
    path /var/log/fluent/al_service_logs
    time_slice_format %Y%m%d
    time_slice_wait 10m
    time_format %Y%m%dT%H%M%S%z
    append true
    utc
  </store>

@4ndr4s

4ndr4s commented Apr 13, 2020

@repeatedly On second thought, I'm not sure whether that applies to the elasticsearch output too. I'll enable logging, test the elasticsearch config alone, and let you know.

@ganmacs ganmacs removed their assignment Apr 14, 2020
@aeber
Contributor

aeber commented Apr 14, 2020

I can confirm that a current build from master @ cdf75b0 no longer leaks memory with the given configuration. It has been running for 30 minutes now and memory usage is stable at ~70MB.

Is there already a timeline for a fluentd / td-agent release covering this fix?

@4ndr4s

4ndr4s commented Apr 14, 2020

@repeatedly I already removed the buffer block and went back to the previous configuration, and the behavior is the same: memory starts growing until OOM. This wasn't happening in version 0.12.26 with the same configuration and the same incoming data.


<source>
  @type syslog
  port 514
  <transport udp>
  </transport>
  tag  system
  @log_level info
</source>

<source>
  @type syslog
  port 514
  <transport tcp>
  </transport>
  tag  system
</source>

<label @FLUENT_LOG>
  <match fluent.**>
    @type null
  </match>
</label>

<filter **>
  @type elasticsearch_genid
  hash_id_key alert_hash
</filter>

<match **>
  @type copy
  <store>
    @type file
    compress gzip
    path /var/log/fluent/al_service_logs
    time_slice_format %Y%m%d
    time_slice_wait 10m
    time_format %Y%m%dT%H%M%S%z
    append true
  </store>
  <store>
    @type elasticsearch
    host elasticsearch.example.com
    logstash_format true
    include_tag_key true
    tag_key key
    flush_interval 10s
    id_key alert_hash
    remove_keys alert_hash
  </store>
  <store>
    @type stdout
  </store>
</match>

<system>
  log_level warn
  emit_error_log_interval 300
  path /var/log/fluent/fluentd
</system>

@4ndr4s

4ndr4s commented Apr 15, 2020

@repeatedly I removed the @type file store and kept only the elasticsearch and stdout stores; the issue is still happening, and memory usage continues growing until OOM.

<match **>
  @type copy
  <store>
    @type elasticsearch
    host elasticsearch.example.com
    logstash_format true
    include_tag_key true
    tag_key key
    flush_interval 10s
    id_key alert_hash
    remove_keys alert_hash
  </store>
  <store>
    @type stdout
  </store>
</match>

@repeatedly
Member

@aeber v1.10.2 has been released.

@4ndr4s Your comment is unclear. You use v0.12 with a file buffer, right?

@4ndr4s

4ndr4s commented Apr 15, 2020

@repeatedly I use v0.12 without a buffer and with the following configuration, using 3 output plugins.
I already made those changes using the 1.9.3 version:

  • use the old configuration.
  • use a memory buffer with the file output plugin.
  • use a file buffer with the file output plugin.
  • remove the file output plugin.

I got OOM with all of those configurations.
I already built a new container with version 1.10.2; I'll let you know whether it helps with the current issue. I need to wait around 24 hours.

fluentd v0.12 configuration:

<source>
  type syslog
  port 514
  protocol_type udp
  tag system
  @log_level info
</source>

<source>
  type syslog
  port 514
  protocol_type tcp
  tag  system
</source>

<match fluent.**>
  @type null
</match>

<filter **>
  @type elasticsearch_genid
  hash_id_key alert_hash
</filter>

<match **>
  type copy
  <store>
    @type file
    compress gzip
    path /var/log/fluent/al_service_logs
    time_slice_format %Y%m%d
    time_slice_wait 10m
    time_format %Y%m%dT%H%M%S%z
    append true
    utc
  </store>
  <store>
    type elasticsearch
    host elasticsearch.example.com
    logstash_format true
    include_tag_key true
    tag_key key
    flush_interval 10s
    id_key alert_hash
    remove_keys alert_hash
  </store>
  <store>
    type stdout
  </store>
</match>

<system>
  log_level warn
  emit_error_log_interval 300
  path /var/log/fluent/fluentd
</system>

@ganmacs
Member

ganmacs commented Apr 16, 2020

@4ndr4s

"I already made those changes using the 1.9.3 version:"

Can you give us the complete config, then?

By the way, can you create a new issue? This issue is not related to yours. I'll close this issue later.

@4ndr4s

4ndr4s commented Apr 16, 2020

@ganmacs Thanks for your answer. I just opened a new issue with all the details: #2955

@ashie ashie added the memory label Mar 15, 2022
@douglasawh

I am a bit confused reading this. It looks like #2945 is confirmed to fix the issue, and yet it remains open. Is this still an issue?

@ashie
Member

ashie commented May 11, 2022

I'm not sure, but I think there are 3 reports in this issue: the original report by sp0f, the report by aeber, and the report by 4ndr4s.

Although the above 3 reports are similar, their causes are probably all different.

The issue of the original report is fixed, so we should close this issue.

BTW, I recently investigated a similar issue to report 2, which is why I left this open:
fluent/fluent-package-builder#374
But now I've confirmed that it is also caused by a different root cause.

@ashie ashie closed this as completed May 11, 2022