
Docker daemon dies when the fluentd driver cannot send messages to a saturated fluentd server #29

Closed
nikolaiderzhak opened this issue Mar 4, 2016 · 31 comments

@nikolaiderzhak

We run a very chatty, 25-node dockerised Elasticsearch cluster.

We use the fluentd log driver for the Elasticsearch containers.

So saturation/slowness on the fluentd server (which stores data on a standalone Elasticsearch node in logstash format for viewing with Kibana) led to a panic and the death of random Elasticsearch containers.

From /var/log/syslog:

Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: panic: runtime error: invalid memory address or nil pointer dereference
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: [signal 0xb code=0x1 addr=0x0 pc=0x70808c]
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: goroutine 125885 [running]:
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: net.(*TCPConn).Write(0x0, 0xc8235a4000, 0x11857, 0x30000, 0x0, 0x0, 0x0)
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: <autogenerated>:75 +0x2c
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: github.com/fluent/fluent-logger-golang/fluent.(*Fluent).send(0xc820eecb00, 0x0, 0x0)
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: /usr/src/docker/vendor/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:242 +0x1d6
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostRawData(0xc820eecb00, 0xc8230d6000, 0x118, 0x146)
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: /usr/src/docker/vendor/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:154 +0x171

From fluent forwarder log:

2016-03-04 00:54:49 +0000 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error="queue size exceeds limit" tag="docker.49c656b5827d"
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/buffer.rb:198:in `block in emit'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/2.2.0/monitor.rb:211:in `mon_synchronize'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/buffer.rb:187:in `emit'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/output.rb:448:in `emit'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/event_router.rb:88:in `emit_stream'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/event_router.rb:79:in `emit'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:186:in `on_message'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:276:in `call'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:276:in `block in on_read_msgpack'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:275:in `feed_each'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:275:in `on_read_msgpack'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:261:in `call'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:261:in `on_read'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/io.rb:123:in `on_readable'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/io.rb:186:in `on_readable'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/loop.rb:88:in `run_once'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/loop.rb:88:in `run'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:98:in `run'
2016-03-04 00:54:49 +0000 [error]: forward error error=#<Fluent::BufferQueueLimitError: queue size exceeds limit> error_class=Fluent::BufferQueueLimitError
  2016-03-04 00:54:49 +0000 [error]: suppressed same stacktrace

Let me know if you need the full stack trace or more details. We are really eager to fix this.

Even if I provision a bigger fluentd server and standalone Elasticsearch node, I want to make sure we will not start losing data nodes due to panics during log spikes.

Related Docker ticket: moby/moby#20960

@tagomoris
Member

Can you create a minimal code example and environment to reproduce it?
Using Docker containers for reproduction is too heavyweight, both for identifying the core problem and for fixing it.

@nikolaiderzhak
Author

I can't disagree. I will try and update you.

@nikolaiderzhak
Author

While I look for an easier way to reproduce the panic, let us discuss possible solutions.

It seems we need to check that the net connection object is not nil before calling a method on it. Isn't that the main reason for the panic?

@tagomoris
Member

@nikolaiderzhak your point looks correct. IMO it is particularly problematic in situations with many communication errors.
The options are:

  • add a check that f.conn exists just before Write
  • check and fix the exclusive control (locking) around f.conn, especially between send and reconnect

The former will work, but it makes the situation a bit more complex (what should we do when f.conn is missing?); a sketch of that option is below.
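
To make the first option concrete, here is a minimal sketch of a guarded send. It assumes illustrative field names (mu, conn, pending) rather than the library's actual ones: the nil check and the Write happen under the same mutex the reconnect path takes, and a missing connection is reported as an error for the caller to handle.

package fluent

import (
    "errors"
    "net"
    "sync"
)

type Fluent struct {
    mu      sync.Mutex
    conn    net.Conn
    pending []byte
}

// send writes the pending buffer, but only while holding the same mutex the
// reconnect path takes, and only if a connection actually exists.
func (f *Fluent) send() error {
    f.mu.Lock()
    defer f.mu.Unlock()
    if f.conn == nil {
        // "What should we do when f.conn is missing?": report an error to
        // the caller instead of panicking; the caller can retry or drop.
        return errors.New("fluent#send: connection is not established")
    }
    _, err := f.conn.Write(f.pending)
    return err
}

In this form the two options converge: the nil check is only meaningful if it shares the exclusive control used by reconnect.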

@nikolaiderzhak
Author

The former will work, but it makes the situation a bit more complex (what should we do when f.conn is missing?).

Something safe. For me, as a DevOps engineer, the goal is to make sure the Docker daemon does not die :)

I am setting up a staging environment to reproduce it now. I will try to emulate slow aggregator responses to trigger that chain, so we can try some fixes.

If you propose one, I could quickly test whether it works.

@tagomoris
Member

I'm sorry that I don't have enough time to work on this right now... but I will merge a fix (and try to get it into Docker) ASAP once reproduction steps and patches are proposed.

@nikolaiderzhak
Author

Thanks!

I was not able to reproduce it locally, even with Docker containers running a chatty service and lowered buffer limits. So the next thing I will try is replicating the production setup with smaller nodes in AWS. We need that done anyway for testing purposes.

A question for you: I see f.conn.Write() is already guarded by an if f.conn == nil check (since 2013).
How did we still get this exception then? Sorry, I am not familiar enough with the code.

@ypjin @rdong Can you please check it too?

@tagomoris
Member

@nikolaiderzhak the check of f.conn must be done under exclusive control (in other words, while holding the lock); see the sketch below.
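
To spell out the interleaving with a simplified, illustrative sketch (not the library's actual code): a nil check that does not hold the same lock as reconnect can pass, and the connection can still be gone by the time Write runs.

package fluent

import (
    "errors"
    "net"
    "sync"
)

type Fluent struct {
    mu   sync.Mutex
    conn net.Conn
}

// sendRacy shows why a bare nil check is not enough: the check and the Write
// are two separate steps, and reconnect can clear f.conn in between them.
func (f *Fluent) sendRacy(buf []byte) error {
    if f.conn == nil { // goroutine A: the check passes, conn is non-nil here
        return errors.New("fluent#send: not connected")
    }
    // ... goroutine B (the reconnect path) may now clear f.conn under f.mu ...
    _, err := f.conn.Write(buf) // goroutine A: nil pointer dereference, panic
    return err
}

// reconnectish sketches the other side of the race: it holds f.mu, but that
// only helps if send also takes f.mu around both the check and the Write.
func (f *Fluent) reconnectish() {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.conn = nil // connection considered broken until a new dial succeeds
}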

@nikolaiderzhak
Author

I see. It can be accessed before it is initialised....

@nikolai-derzhak-distillery

I have some critical things on my plate and have not created a pre-prod environment to reproduce it yet. It seems I will start working on this next week. I will update you.

@nikolai-derzhak-distillery

I have synced with the R&D team and we plan to use Logstash with the gelf Docker logging driver as an interim solution (the long-term plan is the Kafka driver, moby/moby#18604).

So I will stick with the gelf driver for now.

I hope you or another contributor will find time to fix this production-critical bug.

Otherwise people will go the same route I did.

@rorysavage77

Any updates on this?

@tagomoris
Member

Nothing yet. I couldn't create tests or scripts to reproduce it.
@rcsavage do you have a reproduction?

@toddlers

@tagomoris we have reproduced it. We will provide the data soon.

@tagomoris
Member

Great to hear that! Then I'll fix it.

@rorysavage77

That is good news. How soon for a fix?

@spanktar

spanktar commented Aug 2, 2016

This is a show-stopper for our fluentd adoption. Any word here on progress?

@tagomoris
Member

Please give me some code showing how to reproduce it...

@EricFortin

EricFortin commented Sep 14, 2016

@tagomoris

I managed to reproduce it, since we have been hitting this more and more recently. I am not able to reproduce it on my machine, but I can make it crash every time on an AWS instance running Amazon Linux and Docker 1.11.2. The time it takes to crash is variable, though. My test setup uses Docker, however, since I am not that good in Go (and Ruby), but I guess it will be easy enough for someone on the project to replicate the setup outside of Docker.

fluent.conf

<source>
  type forward
  port 24224
  bind 0.0.0.0
</source>

<match docker.*>
  type down

  buffer_type memory
  buffer_chunk_limit 5k
  buffer_queue_limit 32
</match>

out_down.rb plugin

module Fluent

class DownOutput < BufferedOutput
  Fluent::Plugin.register_output('down', self)

  unless method_defined?(:log)
    define_method(:log) { $log }
  end

  def initialize
    super
  end

  def configure(conf)
    # override default buffer_chunk_limit so it overflows very fast.
    conf['buffer_chunk_limit'] ||= '10k'
    super
  end

  def start
    super
  end

  def shutdown
    super
  end

  def format(tag, time, record)
    [tag, time, record].to_msgpack
  end

  def write(chunk)
    count = 0
    chunk.msgpack_each { |tag, time, record|

        count += 1
    }
    log.info "Processed #{count} entries in chunk."
  end
end

end

I spin up one container running Fluentd and another container with this command:

docker run -d --log-driver fluentd --log-opt fluentd-async-connect=true --name logger alpine:3.4 sh -c 'while true; do echo {\"time\":\"2016-09-13T14:49:32.082Z\", \"mylog\":\"This is my log!\"}; done'

I tried reproducing on CentOS 7 with Docker 1.11.2 as well but didn't have any luck. I tried CentOS because installing Docker 1.12 seems impossible on Amazon Linux and I wanted to compare both versions.

Update: I also managed to crash docker 1.11.2 on CentOS 7.
Update#2: I also managed to crash Docker 1.12.

I found a workaround that may be acceptable for people hitting the problem. In fluentd-0.12.26, a config option named buffer_queue_full_action was introduced. Setting it to drop_oldest_chunk has avoided the problem for us so far, but with the obvious consequence of dropping some logs. I'll let you decide whether that is acceptable.

@EricFortin

@tagomoris I tried writing a program in Go to reproduce it and succeeded. I used the same fluent.conf as before. Here's the Go source code:

package main

import (
    "time"
    "github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
    f, err := fluent.New(fluent.Config{})
    if err != nil {
        panic(err)
    }

    data := map[string]string{
        "mylog": "This is my json log!",
    }
    tm := time.Now()
    for {
        if err := f.PostWithTime("docker.logs", tm, data); err != nil {
            panic(err)
        }
    }
}

It generated the same call stack as the crash within Docker:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4e785c]

goroutine 1 [running]:
panic(0x5a95c0, 0xc82000a110)
    /usr/lib/go-1.6/src/runtime/panic.go:481 +0x3e6
net.(*TCPConn).Write(0x0, 0xc82037a000, 0x34e0, 0x78000, 0x0, 0x0, 0x0)
    <autogenerated>:69 +0x2c
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).send(0xc82007e000, 0x0, 0x0)
    /home/eric/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:287 +0x1ec
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostRawData(0xc82007e000, 0xc8202ddc70, 0x2f, 0x47)
    /home/eric/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:172 +0x189
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).EncodeAndPostData(0xc82007e000, 0x5e7640, 0xb, 0xecf6cd5f0, 0xc80c58991b, 0x6d8bc0, 0x567f00, 0xc8202deed0, 0x0, 0x0)
    /home/eric/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:163 +0x257
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostWithTime(0xc82007e000, 0x5e7640, 0xb, 0xecf6cd5f0, 0xc80c58991b, 0x6d8bc0, 0x568000, 0xc820082060, 0x0, 0x0)
    /home/eric/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:155 +0x998
main.main()
    /home/eric/go/src/github.com/EricFortin/fluentdCrash/logger.go:20 +0x1ea

@tagomoris
Member

I'm trying the same code & configuration on my Mac, but I haven't reproduced it yet (after 10 minutes).
Is it a Linux-specific problem?
@EricFortin How long does it take you to reproduce it?

@EricFortin

EricFortin commented Sep 16, 2016

@tagomoris So far, I have reproduced it only in a VM on AWS. I also run Linux in a VM on my local machine and wasn't able to reproduce it there either. Since we only experienced it in AWS, where we run our stuff, I developed the test case there. It usually takes between 2 and 10 minutes to crash.

@tagomoris
Member

@EricFortin Thank you for the great report! I'll try it on AWS too, and try some patches, next week.

@tagomoris tagomoris self-assigned this Sep 16, 2016
@tagomoris
Member

@EricFortin Finally, I reproduced it! (w/ Amazon Linux 2016.03.3)

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4e959c]

goroutine 1 [running]:
net.(*TCPConn).Write(0x0, 0xc820500000, 0x242b, 0x60000, 0x0, 0x0, 0x0)
    <autogenerated>:75 +0x2c
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).send(0xc82007e000, 0x0, 0x0)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:287 +0x1ec
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostRawData(0xc82007e000, 0xc8203efea0, 0x2f, 0x47)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:172 +0x189
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).EncodeAndPostData(0xc82007e000, 0x5ea680, 0xb, 0xecf72d1cc, 0xc8336109e0, 0x6dea00, 0x568880, 0xc8203e7b30, 0x0, 0x0)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:163 +0x257
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostWithTime(0xc82007e000, 0x5ea680, 0xb, 0xecf72d1cc, 0xc8336109e0, 0x6dea00, 0x568980, 0xc8200106f0, 0x0, 0x0)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:155 +0x996
main.main()
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/test/main.go:19 +0x1e9

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/lib/golang/src/runtime/asm_amd64.s:1721 +0x1

goroutine 32359 [semacquire]:
sync.runtime_Semacquire(0xc82007e0a0)
    /usr/lib/golang/src/runtime/sema.go:43 +0x26
sync.(*Mutex).Lock(0xc82007e09c)
    /usr/lib/golang/src/sync/mutex.go:82 +0x1c4
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).reconnect.func1(0xc82007e000)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:255 +0x6a
created by github.com/fluent/fluent-logger-golang/fluent.(*Fluent).reconnect
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:267 +0x35

@nikolai-derzhak-distillery

👍

@tagomoris
Member

tagomoris commented Sep 20, 2016

I updated the test code so that it does not panic, but instead retries sending the message, printing the err content.

package main

import (
    "fmt"
    "time"
    "github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
    f, err := fluent.New(fluent.Config{})
    if err != nil {
        panic(err)
    }

    data := map[string]string{
        "mylog": "This is my json log!",
    }
    for {
        tm := time.Now()
        if err := f.PostWithTime("docker.logs", tm, data); err != nil {
            fmt.Printf("error: %s\n", err)
            continue
        }
    }
}

With the change of #30, I got:

  1. "write tcp 127.0.0.1:36746->127.0.0.1:24224: write: connection reset by peer" at first
  2. "fluent#send: can't send logs, client is reconnecting" for a while
  3. "Buffer full" at last

With this behavior, fluent-logger-golang (and the fluentd logging driver in Docker) doesn't stop Docker itself.
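
For callers consuming these errors (the test program above, or a logging driver), a short backoff keeps the loop from spinning while the client reconnects. A minimal sketch under that assumption, using the same fluent-logger-golang calls as the test program; the 100 ms delay is an arbitrary illustrative value.

package main

import (
    "fmt"
    "time"

    "github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
    f, err := fluent.New(fluent.Config{})
    if err != nil {
        panic(err)
    }

    data := map[string]string{
        "mylog": "This is my json log!",
    }
    for {
        tm := time.Now()
        if err := f.PostWithTime("docker.logs", tm, data); err != nil {
            // With the #30 change, err reports a reset connection, an
            // in-progress reconnect, or a full buffer; back off briefly
            // instead of spinning (or crashing the whole daemon).
            fmt.Printf("error: %s -- backing off\n", err)
            time.Sleep(100 * time.Millisecond)
        }
    }
}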

@EricFortin

Nice !

@tagomoris
Member

I created a pull request for Docker to solve this problem!
moby/moby#26740

@tagomoris
Member

That was merged.
Look forward to the Docker 1.13.0 release!

@hridyeshpant

@EricFortin can you share your configuration for using buffer_queue_full_action?
Currently I am using the configuration below in my fluentd settings:

<match docker.**>
  type elasticsearch
  logstash_format true
  include_tag_key true
  tag_key image
  flush_interval 10s
</match>

Do you mean adding the parameter below?
buffer_queue_full_action drop_oldest_chunk

@EricFortin

@hridyeshpant Yes, you add that to your elasticsearch plugin config. It is available because that plugin inherits from BufferedOutput. I am using flume on my side, but since it also inherits from BufferedOutput, I can use the same setting. Here's my config:

<match docker.*>
  type flume
  host FLUME_HOST
  port FLUME_PORT
  remove_prefix docker

  buffer_queue_full_action drop_oldest_chunk
</match>
