Bug Description
We're running the Fluentd-Elasticsearch Helm chart (Fluentd version 1.4.1) in a Kubernetes environment.
Every Pod in the DaemonSet will, after N hours (N being constant for a given Pod, between 8 and 36, and different for each DaemonSet Pod), stop replying to the k8s liveness probe and be restarted by k8s.
For example, monitoring the Pods' CPU and memory with Prometheus, we can see a given Pod's restarts over the last 24h (in the example below, it restarts every 10h):

Note: From our troubleshooting, it seems no logs are lost and no service degradation is observed; we can only tell that the Pod was restarted.
Could this be related to Fluentd's buffer becoming full or too large?
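One way we plan to verify this is Fluentd's built-in monitor_agent input, which exposes each plugin's buffer metrics (e.g. buffer_queue_length and buffer_total_queued_size) over HTTP. A minimal sketch, assuming the default port 24220 is free inside the Pod:

  # expose plugin/buffer metrics at http://<pod>:24220/api/plugins.json
  <source>
    @id in_monitor_agent
    @type monitor_agent
    bind 0.0.0.0
    port 24220
  </source>

Polling http://localhost:24220/api/plugins.json from inside the Pod would show whether the file buffer keeps growing up to the point where the liveness probe fails.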
Steps To Reproduce
Run Fluentd-Elasticsearch Helm Chart on a k8s cluster for more than 24h.
Expected behavior
Fluentd Pods are not restarted.
Your Environment
- Fluentd or td-agent version: 1.4.1
- Operating system: Debian 9 (as k8s node)
- Kernel version: 4.15
Your Configuration
system.conf: |-
  <system>
    root_dir /tmp/fluentd-buffers/
  </system>
output.conf: |-
  <match **>
    @id elasticsearch
    @type elasticsearch_dynamic
    @log_level info
    include_tag_key true
    host dev-elastic.vg-internal
    port 80
    scheme http
    ssl_version "TLSv1"
    ssl_verify false
    logstash_prefix development-${record['kubernetes']['labels']['app']}
    logstash_format true
    <buffer>
      @type file
      path /var/log/fluentd-buffers/kubernetes.system.buffer
      flush_mode interval
      retry_type exponential_backoff
      flush_thread_count 2
      flush_interval 5s
      retry_forever
      retry_max_interval 30
      overflow_action block
    </buffer>
  </match>
containers.input.conf: |-
  <source>
    @id fluentd-containers.log
    @type tail
    path /var/log/containers/*.log
    pos_file /var/log/containers.log.pos
    tag raw.kubernetes.*
    read_from_head true
    <parse>
      @type multi_format
      <pattern>
        format json
        time_key time
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </pattern>
      <pattern>
        format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
        time_format %Y-%m-%dT%H:%M:%S.%N%:z
      </pattern>
    </parse>
  </source>
  # Detect exceptions in the log output and forward them as one log entry.
  <match raw.kubernetes.**>
    @id raw.kubernetes
    @type detect_exceptions
    remove_tag_prefix raw
    message log
    stream stream
    multiline_flush_interval 5
    max_bytes 500000
    max_lines 1000
  </match>
  # Concatenate multi-line logs
  <filter **>
    @id filter_concat
    @type concat
    key message
    multiline_end_regexp /\n$/
    separator ""
  </filter>
  # Enriches records with Kubernetes metadata
  <filter kubernetes.**>
    @id filter_kubernetes_metadata
    @type kubernetes_metadata
  </filter>
  # Remove unnecessary fields (keys)
  <filter kubernetes.**>
    @type record_transformer
    remove_keys $.kubernetes.labels.pod-template-hash, $.kubernetes.namespace_id, $.kubernetes.pod_id, $.kubernetes.container_image_id, $.docker.container_id, message
  </filter>
  # Discard istio sidecar logs
  <filter kubernetes.**>
    @id filter_istio_sidecars
    @type grep
    <exclude>
      key $.kubernetes.container_name
      pattern /(istio-proxy|istio-init)/
    </exclude>
  </filter>
  # Concatenate multi-line logs for Schedule
  <filter kubernetes.var.log.containers.schedule**>
    @id filter_concat_schedule
    @type concat
    key log
    multiline_start_regexp /^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},{0,1}\d{0,3})\].*$/
    separator ""
  </filter>
  # Concatenate multi-line logs for Ambassador
  <filter kubernetes.var.log.containers.microservices-ambassador**>
    @id filter_concat_ambassador
    @type concat
    key log
    multiline_start_regexp /^ACCESS \[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\] .+$/
    separator ""
  </filter>
  # Standard Ambassador logs
  <filter kubernetes.var.log.containers.microservices-ambassador**>
    @type parser
    key_name log
    reserve_data true
    <parse>
      @type regexp
      expression /^ACCESS \[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\] "(?<http_method>[A-Z-]+) (?<requested_url>[^ ]+) (?<http_version>[^ ]+)" (?<http_response>\d+) (- (?<http-received-bytes>\d+)|[A-Z]+) (?<http-sent-bytes>\d+) (?<ambassador-duration>\d+) (?<envoy-duration>\d+) (\d+|- )*"(?<client_ip>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}|-)" "(Mozilla\/5.0 ){0,1}(\((?<client_os>(X11; Fedora; |X11; Ubuntu; |X11; |Linux; ){0,1}(?<client_os_family>[A-Za-z]+)( |;){1}[^)]*)\) (?<user_agent>.*)|(?<user_agent>.+))" "(?<request_id>[^ ]+)" "(?<target_host>[^ ]+)" "(?<backend_endpoint>[^ ]+)"$/
      types http-received-bytes:integer,http-sent-bytes:integer,ambassador-duration:integer,envoy-duration:integer
    </parse>
  </filter>
  <filter kubernetes.var.log.containers.auth** kubernetes.var.log.containers.schedule**>
    @type parser
    key_name log
    reserve_data true
    <parse>
      @type regexp
      expression /^\[(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},{0,1}\d{0,3})\] (?<log_level>[^ ]+) (Request )*(?<request>[^:]+): (?<custom_message>[^\[\]].+) \[in (?<file>.+)\][ ]*$/
      time_format %Y-%m-%d %H:%M:%S
    </parse>
  </filter>
  # Standard WSGI logs
  <filter kubernetes.var.log.containers.schedule**>
    @type parser
    key_name log
    reserve_data true
    <parse>
      @type regexp
      expression /^\[(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},{0,1}\d{0,3})\] \[(UWSGI){1}\] "(?<http_method>[^ ]+) (?<requested_url>[^ ]+) HTTP\/\d\.\d" (?<http_response>\d{3}) (?<http_reponse_size>\d+) "(?<http_client_addr>[^ ]+)" "(?<http_client_browser>.+)"$/
      time_format %Y-%m-%d %H:%M:%S
    </parse>
  </filter>
  # Standard NGINX logs
  <filter kubernetes.var.log.containers.credits** kubernetes.var.log.containers.leads**>
    @type parser
    key_name log
    reserve_data true
    <parse>
      @type regexp
      expression /^(?<client_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - \[.+\] "(?<http_method>[^ ]+) (?<requested_url>[^ ]+) (?<http_version>[^ ]+)" (?<http_response>\d{3}) (?<http_response_size>\d+) "(?<http_client_url>.+)" "(?<http_client_browser>.+)"$/
      time_format %Y-%m-%d %H:%M:%S
    </parse>
  </filter>
  # New json-formatted logs
  <filter kubernetes.var.log.containers.credits** kubernetes.var.log.containers.leads**>
    @type parser
    key_name log
    reserve_data true
    <parse>
      @type regexp
      expression /^(?<cgi_message>.*)PHP message: (\[(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},{0,1}\d{0,3})\] [^{}]+:)?( .* )?(?<vmessage>\{.+\}).*$/
      time_format %Y-%m-%d %H:%M:%S
    </parse>
  </filter>
  <filter kubernetes.var.log.containers.credits** kubernetes.var.log.containers.leads**>
    @type parser
    key_name vmessage
    reserve_data true
    <parse>
      @type json
      json_parser json
    </parse>
  </filter>
  <filter kubernetes.var.log.containers.microservices-ambassador**>
    @type parser
    key_name requested_url
    reserve_data true
    <parse>
      @type regexp
      expression /^\/v\d\/((?<service>auth)\/(?<api-feature>login|refresh).*|[a-z]+\/{0,1}(?<club_id>\d+)*\/{0,1}(members\/\d+\/(?<service>credits)|(?<service>[a-z]+)){0,1}\/{0,1}((?<api-feature>[a-z]+)|(?<api-feature>[0-9]+))*\/{0,1}(?<api-action>[a-z]+)*\?{0,1}(?<api-params>.*)*)$/
      types club_id:integer
    </parse>
  </filter>
  # Fixes json fields in Elasticsearch
  <filter kubernetes.**>
    @id filter_parser
    @type parser
    key_name log
    reserve_data true
    remove_key_name_field true
    <parse>
      @type multi_format
      <pattern>
        format json
      </pattern>
      <pattern>
        format none
      </pattern>
    </parse>
  </filter>
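A note on the <buffer> section in output.conf above, since it may bear on the buffer question: it relies on Fluentd's defaults for its size limits (for a file buffer, chunk_limit_size defaults to 256MB and total_limit_size to 64GB), and overflow_action block stalls the input side once a limit is hit. A variant we are considering caps the sizes explicitly; this is a sketch and the values are guesses for illustration, not tested settings:

  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.system.buffer
    flush_mode interval
    flush_interval 5s
    flush_thread_count 2
    retry_type exponential_backoff
    retry_forever
    retry_max_interval 30
    # assumption: cap chunks and total on-disk buffer well below the file-buffer defaults
    chunk_limit_size 8MB
    total_limit_size 512MB
    overflow_action block
  </buffer>

Separately, the fluent-plugin-elasticsearch docs describe elasticsearch_dynamic as slower than the plain elasticsearch type and suggest placeholders instead; a sketch of the same per-app prefix via a buffer chunk key (untested against our chart values):

  <match **>
    @type elasticsearch
    host dev-elastic.vg-internal
    port 80
    logstash_format true
    # ${$.kubernetes.labels.app} is filled in from the buffer chunk key below
    logstash_prefix development-${$.kubernetes.labels.app}
    <buffer tag, $.kubernetes.labels.app>
      @type file
      path /var/log/fluentd-buffers/kubernetes.system.buffer
    </buffer>
  </match>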
Your Error Log
Before the Pod is restarted, we often see an error repeated many times during the 10-15 minutes before the liveness probe fails. The error is not always the same; one example:
{"log":"2019-07-01 09:10:50 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::Parser::ParserError error=\"pattern not match with data'2019-07-01T09:10:50.297+0000 I NETWORK [conn472250] received client metadata from 127.0.0.1:45792 conn472250: { application: { name: \\\"MongoDB Shell\\\" }, driver: { name: \\\"MongoDB Internal Client\\\", version: \\\"4.0.10\\\" }, os: { type: \\\"Linux\\\", name: \\\"Ubuntu\\\", architecture: \\\"x86_64\\\", version: \\\"16.04\\\" } }\\n'\" location=nil tag=\"kubernetes.var.log.containers.schedule-db-mongodb-replicaset-0_production_mongodb-replicaset-dc618662c898452ccaa4229906556b078fb06a2b0aa22e4813f2924a2f430246.log\" time=2019-07-01 09:10:50.297995034 +0000 record={\"log\"=\u003e\"2019-07-01T09:10:50.297+0000 I NETWORK [conn472250] received client metadata from 127.0.0.1:45792 conn472250: { application: { name: \\\"MongoDB Shell\\\" }, driver: { name: \\\"MongoDB Internal Client\\\", version: \\\"4.0.10\\\" }, os: { type: \\\"Linux\\\", name: \\\"Ubuntu\\\", architecture: \\\"x86_64\\\", version: \\\"16.04\\\" } }\\n\", \"stream\"=\u003e\"stdout\", \"docker\"=\u003e{}, \"kubernetes\"=\u003e{\"container_name\"=\u003e\"mongodb-replicaset\", \"namespace_name\"=\u003e\"production\", \"pod_name\"=\u003e\"schedule-db-mongodb-replicaset-0\", \"container_image\"=\u003e\"mongo:latest\", \"labels\"=\u003e{\"app\"=\u003e\"mongodb-replicaset\", \"controller-revision-hash\"=\u003e\"schedule-db-mongodb-replicaset-5cb445f849\", \"release\"=\u003e\"schedule-db\", \"statefulset_kubernetes_io/pod-name\"=\u003e\"schedule-db-mongodb-replicaset-0\"},\"host\"=\u003e\"ip-10-11-42-244.eu-west-1.compute.internal\", \"master_url\"=\u003e\"https://100.64.0.1:443/api\", \"namespace_labels\"=\u003e{\"istio-injection\"=\u003e\"enabled\"}}}\n","stream":"stdout","time":"2019-07-01T09:10:50.799715554Z"}
In this case the warning is expected, since not all logs are meant to match this pattern.
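If these expected mismatches turn out to flood the log in the 10-15 minutes before the probe fails, one option could be filter_parser's emit_invalid_record_to_error flag, which stops unmatched records from being emitted as error events. A sketch applied to the auth/schedule filter from the configuration above:

  <filter kubernetes.var.log.containers.auth** kubernetes.var.log.containers.schedule**>
    @type parser
    key_name log
    reserve_data true
    # don't dump "pattern not match" events for records this regexp can't parse
    emit_invalid_record_to_error false
    <parse>
      @type regexp
      expression /^\[(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},{0,1}\d{0,3})\] (?<log_level>[^ ]+) (Request )*(?<request>[^:]+): (?<custom_message>[^\[\]].+) \[in (?<file>.+)\][ ]*$/
      time_format %Y-%m-%d %H:%M:%S
    </parse>
  </filter>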
Additional context
- Elasticsearch version: 6.7, running in a separate k8s cluster with no reported issues. This Elasticsearch receives logs from both Fluentd and legacy Logstash instances.