Description
During some basic load testing, we found that upgrading Fluentd from 0.12.32 to 0.14.13 caused buffer overflow errors to be triggered very frequently.
[Error]
2017-03-01 17:02:20 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
[Steps to reproduce]
- Create a GCP instance (1 vCPU, 3.75GB, Debian GNU/Linux 8)
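An equivalent instance can be created with the gcloud CLI, for example (instance name and zone are arbitrary; n1-standard-1 corresponds to 1 vCPU / 3.75GB):
$ gcloud compute instances create fluentd-buffer-test --machine-type n1-standard-1 --image-family debian-8 --image-project debian-cloud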
- Install docker:
$ sudo apt-get install docker.io
- Create some test logs with the logs generator container:
$ mkdir ~/logs
$ sudo docker run -i -e "LOGS_GENERATOR_DURATION=1s" -e "LOGS_GENERATOR_LINES_TOTAL=1000000" gcr.io/google_containers/logs-generator:v0.1.0 2>&1 | awk '{print "{\"log\":\"" $0 "\"}"}' > ~/logs/log.log
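Each line of ~/logs/log.log ends up as a JSON object with a single log field wrapping one generated line (the exact message format depends on the logs-generator version), roughly:
{"log":"<one line of generated log output>"}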
- Create configs
$ mkdir ~/config.d
$ cat ~/config.d/config.conf
<match fluent.**>
  type null
</match>

<source>
  type tail
  format json
  time_key time
  path /var/log/containers/*.log
  pos_file /var/log/gcp-containers.log.pos
  time_format %Y-%m-%dT%H:%M:%S.%NZ
  tag reform.*
  read_from_head true
</source>

<match reform.**>
  type record_reformer
  enable_ruby true
  tag kubernetes.${tag_suffix[4].split('-')[0..-2].join('-')}
</match>

# We use 2 output stanzas - one to handle the container logs and one to handle
# the node daemon logs, the latter of which explicitly sends its logs to the
# compute.googleapis.com service rather than container.googleapis.com to keep
# them separate since most users don't care about the node logs.
<match kubernetes.**>
  type google_cloud
  # Set the buffer type to file to improve the reliability and reduce the memory consumption
  buffer_type file
  buffer_path /var/log/fluentd-buffers/kubernetes.containers.buffer
  # Set queue_full action to block because we want to pause gracefully
  # when the load exceeds the limits instead of throwing an exception
  buffer_queue_full_action block
  # Set the chunk limit conservatively to avoid exceeding the GCL limit
  # of 10MiB per write request.
  buffer_chunk_limit 2M
  # Cap the combined memory usage of this buffer and the one below to
  # 2MiB/chunk * (6 + 2) chunks = 16 MiB
  buffer_queue_limit 6
  # Never wait more than 5 seconds before flushing logs in the non-error case.
  flush_interval 5s
  # Never wait longer than 30 seconds between retries.
  max_retry_wait 30
  # Disable the limit on the number of retries (retry forever).
  disable_retry_limit
  # Use multiple threads for processing.
  num_threads 2
</match>

# Keep a smaller buffer here since these logs are less important than the user's
# container logs.
<match **>
  type google_cloud
  detect_subservice false
  buffer_type file
  buffer_path /var/log/fluentd-buffers/kubernetes.system.buffer
  buffer_queue_full_action block
  buffer_chunk_limit 2M
  buffer_queue_limit 2
  flush_interval 5s
  max_retry_wait 30
  disable_retry_limit
  num_threads 2
</match>
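For reference, under Fluentd 0.14 these v0.12-style buffer parameters go through the compatibility layer. A rough sketch of the native 0.14 equivalent of the buffer settings in the first google_cloud stanza (parameter names taken from the 0.14 buffer API; not part of the original config):
<buffer>
  @type file
  path /var/log/fluentd-buffers/kubernetes.containers.buffer
  chunk_limit_size 2M
  queue_limit_length 6
  overflow_action block
  flush_interval 5s
  retry_max_interval 30
  retry_forever true
  flush_thread_count 2
</buffer>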
- Start each container and check the logs:
$ export FLUENTD_ID=`sudo docker run -d -v $(pwd)/logs:/var/log/containers -v ~/config.d:/etc/fluent/config.d qingling128/testing:buffer-overflow-test-0-12-32`
$ sudo docker logs -f $FLUENTD_ID
$ export FLUENTD_ID=`sudo docker run -d -v $(pwd)/logs:/var/log/containers -v ~/config.d:/etc/fluent/config.d qingling128/testing:buffer-overflow-test-0-14-13`
$ sudo docker logs -f $FLUENTD_ID
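With the 0.14.13 image, the warning shown at the top of this report shows up repeatedly in the container output; a quick way to spot it:
$ sudo docker logs $FLUENTD_ID 2>&1 | grep "buffer overflow"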
Both images are built from fluentd-gcp-image; the only difference is the Fluentd version in the Gemfile. buffer-overflow-test-0-12-32 has Fluentd 0.12.32 and can process the logs successfully with the config settings above. buffer-overflow-test-0-14-13 has Fluentd 0.14.13 and can't process the logs without hitting the buffer overflow error unless we increase the buffer settings to:
buffer_chunk_limit 8M
buffer_queue_limit 32
num_threads 8
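For scale, assuming only the kubernetes.** stanza is changed, that raises the cap on the containers buffer from 2MiB/chunk * 6 chunks = 12 MiB to 8MiB/chunk * 32 chunks = 256 MiB of queued chunks.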