Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Journald position file got corrupted during multiple fluentd gracefulReloads #3332

Closed
alex-vmw opened this issue Apr 16, 2021 · 2 comments · Fixed by #3335
Closed

Journald position file got corrupted during multiple fluentd gracefulReloads #3332

alex-vmw opened this issue Apr 16, 2021 · 2 comments · Fixed by #3335
Labels

Comments

@alex-vmw
Copy link

Describe the bug
We were performing tests that were issuing multiple /api/config.gracefulReload calls to fluentd. After a few reloads we noticed that on a single node the journald position file got corrupted.

Error that we saw:

2021-04-08 00:06:50 +0000 [error]: [in_systemd_docker] failed to read data from plugin storage file path="/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json" error_class=Yajl::ParseError error="lexical error: invalid char in json text.\n                                     {}journal\":\"s=f32000cba2bf4de4964\n                     (right here) ------^\n"
2021-04-08 00:06:50 +0000 [error]: fluent/log.rb:369:error: config error file="/fluentd/etc/fluent.conf" error_class=Fluent::ConfigError error="Unexpected error: failed to read data from plugin storage file: '/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json'"
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/storage_local.rb:99:in `rescue in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/storage_local.rb:89:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin.rb:173:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:111:in `block in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:103:in `each'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:103:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluent-plugin-systemd-1.0.2/lib/fluent/plugin/in_systemd.rb:51:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin.rb:173:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:317:in `add_source'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:158:in `block in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:152:in `each'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:152:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/engine.rb:105:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/engine.rb:80:in `run_configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/supervisor.rb:551:in `run_supervisor'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/command/fluentd.rb:341:in `<top (required)>'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:72:in `require'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:72:in `require'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/bin/fluentd:8:in `<top (required)>'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/bin/fluentd:23:in `load'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/bin/fluentd:23:in `<main>'

What we saw in the corrupted file was this:

$ cat /var/log/-kube-fluentd-operator--fluentd-journald-cursor.json
{}journal":"s=f32000cba2bf4de4964576140ffd3661;i=19b518;b=b80855c9fe2d410b8b5d6fff5326c16d;m=136787196bf6;t=5bf6a3ba60f5b;x=7c49636868ee757c"}

As you can see, the {} in front was supposed to be {" instead.

To Reproduce
Restart fluentd many times via /api/config.gracefulReload call.

Expected behavior
journald position file shouldn't get corrupted with an invalid character.

Your Environment

  • Fluentd or td-agent version: fluentd 1.11.2
  • Operating system: Ubuntu 20.04.1 LTS
  • Kernel version: 5.4.0-62-generic

If you hit the problem with older fluentd version, try latest version first.
We tested /api/config.gracefulReload calls with different fluentd versions, but only hit this issue once, so I am guessing it is a pretty rare race condition. We decided to report it anyways, just in case.

Your Configuration

<ROOT>
  <system>
    log_level info
    rpc_endpoint "127.0.0.1:24444"
  </system>
  <source>
    @type systemd
    @id in_systemd_docker
    path "/var/log/journal"
    tag "systemd.unit"
    read_from_head false
    <storage>
      @type "local"
      persistent true
      path "/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json"
    </storage>
    <entry>
      field_map {"_SYSTEMD_UNIT":"unit","MESSAGE":"log","_PID":["pid"],"_PRIORITY":"priority","_COMM":"cmd","_HOSTNAME":"hostname"}
      field_map_strict true
      fields_lowercase true
    </entry>
  </source>
  <match systemd.unit>
    @type rewrite_tag_filter
    <rule>
      key "unit"
      pattern ^(.+)$
      tag "systemd.$1"
    </rule>
  </match>
  <filter systemd.kubelet.service>
    @type parser
    reserve_data true
    key_name "log"
    emit_invalid_record_to_error false
    <parse>
      @type "kubernetes"
      time_format "%m%d %H:%M:%S.%N"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
    </parse>
  </filter>
  <source>
    @type tail
    @id in_tail_container_logs
    path "/var/log/containers/*.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-containers.log.pos"
    tag "kubernetes.*"
    read_from_head true
    <parse>
      @type "multiline"
      format1 /^(?<partials>([^\n]+ (stdout|stderr) P [^\n]+\n)*)/
      format2 /(?<time>[^\n]+) (?<stream>stdout|stderr) F (?<log>[^\n]*)/
      format3 /|(?<json>{.*})/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
    </parse>
  </source>
  <filter kubernetes.**>
    @type record_transformer
    @id filter_crio_container_logs
    enable_ruby true
    remove_keys partials
    <record>
      log ${record["partials"]&.gsub(/.+ (stdout|stderr) P (.+)\n/, '\\2')}${record["log"]}
    </record>
  </filter>
  <filter kubernetes.**>
    @type parser
    @id filter_docker_container_logs
    key_name "json"
    remove_key_name_field true
    reserve_data true
    emit_invalid_record_to_error false
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      time_type string
    </parse>
  </filter>
  <source>
    @type tail
    @id in_tail_startupscript
    path "/var/log/startupscript.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-startupscript.log.pos"
    tag "startupscript"
    <parse>
      @type "syslog"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_docker
    path "/var/log/docker.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-docker.log.pos"
    tag "docker"
    <parse>
      @type "regexp"
      expression /^time="(?<time>[^)]*)" level=(?<severity>[^ ]*) msg="(?<message>[^"]*)"( err="(?<error>[^"]*)")?( statusCode=($<status_code>\d+))?/
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_etcd
    path "/var/log/etcd.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-etcd.log.pos"
    tag "k8s.etcd"
    <parse>
      @type "none"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kubelet
    multiline_flush_interval 5s
    path "/var/log/kubelet.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kubelet.log.pos"
    tag "k8s.kubelet"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_proxy
    multiline_flush_interval 5s
    path "/var/log/kube-proxy.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-proxy.log.pos"
    tag "k8s.kube-proxy"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_apiserver
    multiline_flush_interval 5s
    path "/var/log/kube-apiserver.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-apiserver.log.pos"
    tag "k8s.kube-apiserver"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_controller_manager
    multiline_flush_interval 5s
    path "/var/log/kube-controller-manager.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-controller-manager.log.pos"
    tag "k8s.kube-controller-manager"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_scheduler
    multiline_flush_interval 5s
    path "/var/log/kube-scheduler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-scheduler.log.pos"
    tag "k8s.kube-scheduler"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_rescheduler
    multiline_flush_interval 5s
    path "/var/log/rescheduler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-rescheduler.log.pos"
    tag "k8s.rescheduler"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_glbc
    multiline_flush_interval 5s
    path "/var/log/glbc.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-glbc.log.pos"
    tag "k8s.glbc"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_cluster_autoscaler
    multiline_flush_interval 5s
    path "/var/log/cluster-autoscaler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-cluster-autoscaler.log.pos"
    tag "k8s.cluster-autoscaler"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_apiserver_audit
    multiline_flush_interval 5s
    path "/var/log/kubernetes/kube-apiserver-audit.log"
    pos_file "/var/log/-kube-fluentd-operator--kube-apiserver-audit.log.pos"
    tag "k8s.kube-apiserver-audit"
    <parse>
      @type "multiline"
      format_firstline "/^\\S+\\s+AUDIT:/"
      format1 /^(?<time>\S+) AUDIT:(?: (?:id="(?<id>(?:[^"\\]|\\.)*)"|ip="(?<ip>(?:[^"\\]|\\.)*)"|method="(?<method>(?:[^"\\]|\\.)*)"|user="(?<user>(?:[^"\\]                                                                                                                                                                |\\.)*)"|groups="(?<groups>(?:[^"\\]|\\.)*)"|as="(?<as>(?:[^"\\]|\\.)*)"|asgroups="(?<asgroups>(?:[^"\\]|\\.)*)"|namespace="(?<namespace>(?:[^"\\]|\\.)*)"|ur                                                                                                                                                                i="(?<uri>(?:[^"\\]|\\.)*)"|response="(?<response>(?:[^"\\]|\\.)*)"|\w+="(?:[^"\\]|\\.)*"))*/
      time_format "%Y-%m-%dT%T.%L%Z"
    </parse>
  </source>
  <filter kubernetes.**>
    @type kubernetes_metadata
    @id filter_kube_metadata
  </filter>
  <filter kubernetes.**>
    @type kubernetes_metadata
  </filter>
  <filter kubernetes.**>
    @type record_transformer
    enable_ruby true
    <record>
      kubernetes_namespace_container_name ${record["kubernetes"]["namespace_name"]}.${record["kubernetes"]["pod_name"]}.${record["kubernetes"]["container_nam                                                                                                                                                                e"]}
      container_info ${record["docker"]["container_id"]}-${record["stream"]}
    </record>
  </filter>
  <match kubernetes.**>
    @type rewrite_tag_filter
    <rule>
      key "kubernetes_namespace_container_name"
      pattern ^(.+)$
      tag "kube.$1"
    </rule>
  </match>
  <filter kube.*.*.*>
    @type record_modifier
    remove_keys "dummy_"
    <record>
      dummy_ ${record["kubernetes"]&.delete("master_url"); record["kubernetes"]&.delete("namespace_id"); if record["kubernetes"]&.has_key?("labels"); record[                                                                                                                                                                "kubernetes"]["labels"].delete("pod-template-generation"); record["kubernetes"]["labels"].delete("controller-revision-hash");  record["kubernetes"]["labels"]                                                                                                                                                                .delete("pod-template-hash"); end; nil}
    </record>
  </filter>
  <filter kube.*.*.*>
    @type record_transformer
    remove_keys kubernetes_namespace_container_name
  </filter>
  <filter kube.kube-system.**>
    @type parser
    reserve_data true
    key_name "log"
    emit_invalid_record_to_error false
    <parse>
      @type "kubernetes"
      time_format "%m%d %H:%M:%S.%N"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
    </parse>
  </filter>
  <source>
    @type prometheus
  </source>
  <source>
    @type prometheus_monitor
  </source>
  <source>
    @type prometheus_output_monitor
  </source>
  <match systemd.** kube.kube-system.**>
    @type copy
    <store>
      @id X
      @type "vmware_loginsight"
      host "X.X.X.X"
      include_tag_key true
      log_text_keys ["log","msg","message"]
      port 9000
      scheme "http"
      serializer "json"
      ssl_verify false
      tag_key "tag"
    </store>
  </match>
  <match **>
    @type null
  </match>
</ROOT>

Your Error Log

2021-04-08 00:06:50 +0000 [error]: [in_systemd_docker] failed to read data from plugin storage file path="/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json" error_class=Yajl::ParseError error="lexical error: invalid char in json text.\n                                     {}journal\":\"s=f32000cba2bf4de4964\n                     (right here) ------^\n"
2021-04-08 00:06:50 +0000 [error]: fluent/log.rb:369:error: config error file="/fluentd/etc/fluent.conf" error_class=Fluent::ConfigError error="Unexpected error: failed to read data from plugin storage file: '/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json'"
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/storage_local.rb:99:in `rescue in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/storage_local.rb:89:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin.rb:173:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:111:in `block in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:103:in `each'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:103:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluent-plugin-systemd-1.0.2/lib/fluent/plugin/in_systemd.rb:51:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin.rb:173:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:317:in `add_source'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:158:in `block in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:152:in `each'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:152:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/engine.rb:105:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/engine.rb:80:in `run_configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/supervisor.rb:551:in `run_supervisor'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/command/fluentd.rb:341:in `<top (required)>'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:72:in `require'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:72:in `require'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/bin/fluentd:8:in `<top (required)>'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/bin/fluentd:23:in `load'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/bin/fluentd:23:in `<main>'
@fujimotos
Copy link
Member

fujimotos commented Apr 19, 2021

We were performing tests that were issuing multiple /api/config.gracefulReload calls to fluentd. After a few reloads we noticed that on a single node the journald position file got corrupted.

I spent a few hours trying to reproduce this issue this morning, but I couldn't.

I set up the following configuration:

<system>
  log_level error
  rpc_endpoint "127.0.0.1:24444"
</system>
<source>
  @type systemd
  @id in_systemd_docker
  path "/var/log/journal"
  tag "systemd.unit"
  read_from_head false
  <storage>
    @type "local"
    persistent true
    path "kube-fluentd-operator--fluentd-journald-cursor.json"
  </storage>
  <entry>
    field_map {"_SYSTEMD_UNIT":"unit","MESSAGE":"log","_PID":["pid"],"_PRIORITY":"priority","_COMM":"cmd","_HOSTNAME":"hostname"}
    field_map_strict true
    fields_lowercase true
  </entry>
</source>
<match systemd.**>
  @type stdout
</match>

and let the following bash script running for a while :

$ while true; do curl http://localhost:24444/api/config.gracefulReload; n=$((n+1)); echo " `date` $n"; sleep 1; done

I reloaded Fluentd 1000 times in total, but did not see any file corruption issue like the
one OP reported. So the issue that @alex-vmw encountered seems to be pretty much,
much rare issue :(

I'll try to check if reloading another 1000 times reproduces the issue, but with no
reproduction, it is hard to debug this root cause.

@fujimotos
Copy link
Member

fujimotos commented Apr 19, 2021

It occured to me that it might be concurrency issue. So I set up two
curl processes rapid-firing at /api/config.gracefulReload:

$ while true; do curl http://localhost:24444/api/config.gracefulReload; n=$((n+1)); sleep 0.1; done &
$ while true; do curl http://localhost:24444/api/config.gracefulReload; n=$((n+1)); sleep 0.1; done &

It turned out that Journald plugin is indeed racy on gracefulReload.
When two requests arrive at the same time, Fluentd start spitting suspicious
errors like the following, suggesting some unsafe writes:

2021-04-19 03:59:57 +0000 [error]: [in_systemd_docker] failed to save data for plugin storage to
file path="kube-fluentd-operator--fluentd-journald-cursor.json" tmp="kube-fluentd-operator--fluentd-journald-cursor.json.tmp"
error_class=Errno::ENOENT error="No such file or directory @ rb_file_s_rename -
(kube-fluentd-operator--fluentd-journald-cursor.json.tmp, kube-fluentd-operator--fluentd-journald-cursor.json)"

#3335 is my fix for the race issue. It resolves the issue by chaniging
LocalStorage.save() to be atomic, so proof of concurrent writes.

While I did not reproduce the exact error of @alex-vmw's report, but I
believe there is a good chance that the issue is resolved by the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants