
Docker daemon dies when the fluentd driver cannot send messages to a saturated fluentd server #29

Closed
nikolaiderzhak opened this issue Mar 4, 2016 · 31 comments

@nikolaiderzhak

We run a very chatty, 25-node dockerised Elasticsearch cluster.

We use the fluentd log driver for the Elasticsearch containers.

So saturation/slowness on the fluentd server (which stores data on a standalone Elasticsearch node in logstash format for viewing with Kibana) led to a panic and the death of random Elasticsearch containers.

From /var/log/syslog:

Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: panic: runtime error: invalid memory address or nil pointer dereference
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: [signal 0xb code=0x1 addr=0x0 pc=0x70808c]
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: goroutine 125885 [running]:
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: net.(*TCPConn).Write(0x0, 0xc8235a4000, 0x11857, 0x30000, 0x0, 0x0, 0x0)
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: <autogenerated>:75 +0x2c
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: github.com/fluent/fluent-logger-golang/fluent.(*Fluent).send(0xc820eecb00, 0x0, 0x0)
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: /usr/src/docker/vendor/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:242 +0x1d6
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostRawData(0xc820eecb00, 0xc8230d6000, 0x118, 0x146)
Mar  4 01:41:02 es-100-pem-300-data-1 docker[21750]: /usr/src/docker/vendor/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:154 +0x171

From fluent forwarder log:

2016-03-04 00:54:49 +0000 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error="queue size exceeds limit" tag="docker.49c656b5827d"
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/buffer.rb:198:in `block in emit'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/2.2.0/monitor.rb:211:in `mon_synchronize'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/buffer.rb:187:in `emit'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/output.rb:448:in `emit'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/event_router.rb:88:in `emit_stream'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/event_router.rb:79:in `emit'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:186:in `on_message'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:276:in `call'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:276:in `block in on_read_msgpack'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:275:in `feed_each'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:275:in `on_read_msgpack'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:261:in `call'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:261:in `on_read'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/io.rb:123:in `on_readable'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/io.rb:186:in `on_readable'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/loop.rb:88:in `run_once'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/cool.io-1.4.3/lib/cool.io/loop.rb:88:in `run'
  2016-03-04 00:54:49 +0000 [warn]: /usr/lib/ruby/gems/2.2.0/gems/fluentd-0.12.20/lib/fluent/plugin/in_forward.rb:98:in `run'
2016-03-04 00:54:49 +0000 [error]: forward error error=#<Fluent::BufferQueueLimitError: queue size exceeds limit> error_class=Fluent::BufferQueueLimitError
  2016-03-04 00:54:49 +0000 [error]: suppressed same stacktrace

Let me know if you need the full stack trace or more details. We are really eager to fix this.

Even if I provision a bigger fluentd server and standalone Elasticsearch node, I want to make sure we will not start losing data nodes due to panics during log spikes.

Related Docker ticket: moby/moby#20960

@tagomoris
Member

Can you create a minimal code example and environment to reproduce it?
Using Docker containers for reproduction is too heavyweight, both for identifying the core problem and for fixing it.

@nikolaiderzhak
Author

I can't disagree. I will try and update you.

@nikolaiderzhak
Author

While I look for an easier way to reproduce the panic, let us discuss possible solutions.

It seems we need to check that the net connection object is not nil before calling a method on it. Isn't that the main reason for the panic?

@tagomoris
Member

@nikolaiderzhak your point looks correct. IMO it is particularly problematic in situations with many communication errors.
The options are:

  • add a check that f.conn exists just before Write
  • check and fix the exclusive control (locking) around f.conn, especially between send and reconnect

The former will work, but it makes the situation a bit more complex (what should we do when f.conn is missing?); a sketch of that option is below.
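
To make the first option concrete, here is a minimal sketch of a guarded send. It assumes illustrative field names (mu, conn, pending) rather than the library's actual ones: the nil check and the Write happen under the same mutex the reconnect path takes, and a missing connection is reported as an error for the caller to handle.

package fluent

import (
    "errors"
    "net"
    "sync"
)

type Fluent struct {
    mu      sync.Mutex
    conn    net.Conn
    pending []byte
}

// send writes the pending buffer, but only while holding the same mutex the
// reconnect path takes, and only if a connection actually exists.
func (f *Fluent) send() error {
    f.mu.Lock()
    defer f.mu.Unlock()
    if f.conn == nil {
        // "What should we do when f.conn is missing?": report an error to
        // the caller instead of panicking; the caller can retry or drop.
        return errors.New("fluent#send: connection is not established")
    }
    _, err := f.conn.Write(f.pending)
    return err
}

In this form the two options converge: the nil check is only meaningful if it shares the exclusive control used by reconnect.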

@nikolaiderzhak
Author

The former will work, but it makes the situation a bit more complex (what should we do when f.conn is missing?).

Something safe. For me, as a DevOps engineer, the goal is to make sure the Docker daemon does not die :)

I am setting up a staging environment to reproduce it now. I will try to emulate slow aggregator responses to trigger that chain, so we can try some fixes.

If you propose one, I could quickly test whether it works.

@tagomoris
Member

I'm sorry that I don't have enough time to work on this right now... but I will merge a fix (and try to get it into Docker) ASAP once reproduction steps and patches are proposed.

@nikolaiderzhak
Author

Thanks!

I was not able to reproduce it locally, even with Docker containers running a chatty service and lowered buffer limits. So the next thing I will try is replicating the production setup with smaller nodes in AWS. We need that done anyway for testing purposes.

A question for you: I see f.conn.Write() is already guarded by an if f.conn == nil check (since 2013).
How did we still get this exception then? Sorry, I am not familiar enough with the code.

@ypjin @rdong Can you please check it too?

@tagomoris
Member

@nikolaiderzhak the check of f.conn must be done under exclusive control (in other words, while holding the lock); see the sketch below.
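
To spell out the interleaving with a simplified, illustrative sketch (not the library's actual code): a nil check that does not hold the same lock as reconnect can pass, and the connection can still be gone by the time Write runs.

package fluent

import (
    "errors"
    "net"
    "sync"
)

type Fluent struct {
    mu   sync.Mutex
    conn net.Conn
}

// sendRacy shows why a bare nil check is not enough: the check and the Write
// are two separate steps, and reconnect can clear f.conn in between them.
func (f *Fluent) sendRacy(buf []byte) error {
    if f.conn == nil { // goroutine A: the check passes, conn is non-nil here
        return errors.New("fluent#send: not connected")
    }
    // ... goroutine B (the reconnect path) may now clear f.conn under f.mu ...
    _, err := f.conn.Write(buf) // goroutine A: nil pointer dereference, panic
    return err
}

// reconnectish sketches the other side of the race: it holds f.mu, but that
// only helps if send also takes f.mu around both the check and the Write.
func (f *Fluent) reconnectish() {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.conn = nil // connection considered broken until a new dial succeeds
}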

@nikolaiderzhak
Author

I see. It can be accessed before it is initialised....

@nikolai-derzhak-distillery

I have some critical things on my plate and have not created a pre-prod environment to reproduce it yet. It seems I will start working on this next week. I will update you.

@nikolai-derzhak-distillery

I have synced with the R&D team and we plan to use Logstash with the gelf Docker logging driver as an interim solution (the long-term plan is the Kafka driver, moby/moby#18604).

So I will stick with the gelf driver for now.

I hope you or another contributor will find time to fix this production-critical bug.

Otherwise people will go the same route I did.

@rorysavage77

Any updates on this?

@tagomoris
Member

Nothing yet. I couldn't create tests or scripts to reproduce it.
@rcsavage do you have a reproduction?

@toddlers

@tagomoris we have reproduced it. We will provide the data soon.

@tagomoris
Member

Great to hear that! Then I'll fix it.

@rorysavage77

That is good news. How soon for a fix?

@spanktar

spanktar commented Aug 2, 2016

This is a show-stopper for our fluentd adoption. Any word here on progress?

@tagomoris
Member

Please give me some code showing how to reproduce it...

@EricFortin

EricFortin commented Sep 14, 2016

@tagomoris

I managed to reproduce it, since we have been hitting this more and more recently. I am not able to reproduce it on my machine, but I can make it crash every time on an AWS instance running Amazon Linux and Docker 1.11.2. The time it takes to crash is variable, though. My test setup uses Docker, however, since I am not that good in Go (and Ruby), but I guess it will be easy enough for someone on the project to replicate the setup outside of Docker.

fluent.conf

<source>
  type forward
  port 24224
  bind 0.0.0.0
</source>

<match docker.*>
  type down

  buffer_type memory
  buffer_chunk_limit 5k
  buffer_queue_limit 32
</match>

out_down.rb plugin

module Fluent

class DownOutput < BufferedOutput
  Fluent::Plugin.register_output('down', self)

  unless method_defined?(:log)
    define_method(:log) { $log }
  end

  def initialize
    super
  end

  def configure(conf)
    # override default buffer_chunk_limit so it overflows very fast.
    conf['buffer_chunk_limit'] ||= '10k'
    super
  end

  def start
    super
  end

  def shutdown
    super
  end

  def format(tag, time, record)
    [tag, time, record].to_msgpack
  end

  def write(chunk)
    count = 0
    chunk.msgpack_each { |tag, time, record|

        count += 1
    }
    log.info "Processed #{count} entries in chunk."
  end
end

end

I spin up one container running Fluentd and another container with this command:

docker run -d --log-driver fluentd --log-opt fluentd-async-connect=true --name logger alpine:3.4 sh -c 'while true; do echo {\"time\":\"2016-09-13T14:49:32.082Z\", \"mylog\":\"This is my log!\"}; done'

I tried reproducing on CentOS 7 with Docker 1.11.2 as well but didn't have any luck. I tried CentOS because installing Docker 1.12 seems impossible on Amazon Linux and I wanted to compare both versions.

Update: I also managed to crash docker 1.11.2 on CentOS 7.
Update#2: I also managed to crash Docker 1.12.

I found a workaround that may be acceptable for people hitting the problem. In fluentd-0.12.26, a config option named buffer_queue_full_action was introduced. Setting it to drop_oldest_chunk has avoided the problem for us so far, but with the obvious consequence of dropping some logs. I'll let you decide whether that is acceptable.

@EricFortin

@tagomoris I tried writing a program in Go to reproduce it and succeeded. I used the same fluent.conf as before. Here's the Go source code:

package main

import (
    "time"
    "github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
    f, err := fluent.New(fluent.Config{})
    if err != nil {
        panic(err)
    }

    data := map[string]string{
        "mylog": "This is my json log!",
    }
    tm := time.Now()
    for {
        if err := f.PostWithTime("docker.logs", tm, data); err != nil {
            panic(err)
        }
    }
}

It generated the same call stack as the crash within Docker:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4e785c]

goroutine 1 [running]:
panic(0x5a95c0, 0xc82000a110)
    /usr/lib/go-1.6/src/runtime/panic.go:481 +0x3e6
net.(*TCPConn).Write(0x0, 0xc82037a000, 0x34e0, 0x78000, 0x0, 0x0, 0x0)
    <autogenerated>:69 +0x2c
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).send(0xc82007e000, 0x0, 0x0)
    /home/eric/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:287 +0x1ec
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostRawData(0xc82007e000, 0xc8202ddc70, 0x2f, 0x47)
    /home/eric/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:172 +0x189
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).EncodeAndPostData(0xc82007e000, 0x5e7640, 0xb, 0xecf6cd5f0, 0xc80c58991b, 0x6d8bc0, 0x567f00, 0xc8202deed0, 0x0, 0x0)
    /home/eric/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:163 +0x257
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostWithTime(0xc82007e000, 0x5e7640, 0xb, 0xecf6cd5f0, 0xc80c58991b, 0x6d8bc0, 0x568000, 0xc820082060, 0x0, 0x0)
    /home/eric/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:155 +0x998
main.main()
    /home/eric/go/src/github.com/EricFortin/fluentdCrash/logger.go:20 +0x1ea

@tagomoris
Member

I'm trying the same code & configuration on my Mac, but I haven't reproduced it yet (after 10 minutes).
Is it a Linux-specific problem?
@EricFortin How long does it take you to reproduce it?

@EricFortin

EricFortin commented Sep 16, 2016

@tagomoris So far, I have reproduced it only in a VM on AWS. I also run Linux in a VM on my local machine and wasn't able to reproduce it there either. Since we only experienced it in AWS, where we run our stuff, I developed the test case there. It usually takes between 2 and 10 minutes to crash.

@tagomoris
Member

@EricFortin Thank you for the great report! I'll try it on AWS too, and try some patches, next week.

@tagomoris tagomoris self-assigned this Sep 16, 2016
@tagomoris
Member

@EricFortin Finally, I reproduced it! (w/ Amazon Linux 2016.03.3)

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4e959c]

goroutine 1 [running]:
net.(*TCPConn).Write(0x0, 0xc820500000, 0x242b, 0x60000, 0x0, 0x0, 0x0)
    <autogenerated>:75 +0x2c
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).send(0xc82007e000, 0x0, 0x0)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:287 +0x1ec
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostRawData(0xc82007e000, 0xc8203efea0, 0x2f, 0x47)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:172 +0x189
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).EncodeAndPostData(0xc82007e000, 0x5ea680, 0xb, 0xecf72d1cc, 0xc8336109e0, 0x6dea00, 0x568880, 0xc8203e7b30, 0x0, 0x0)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:163 +0x257
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).PostWithTime(0xc82007e000, 0x5ea680, 0xb, 0xecf72d1cc, 0xc8336109e0, 0x6dea00, 0x568980, 0xc8200106f0, 0x0, 0x0)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:155 +0x996
main.main()
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/test/main.go:19 +0x1e9

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/lib/golang/src/runtime/asm_amd64.s:1721 +0x1

goroutine 32359 [semacquire]:
sync.runtime_Semacquire(0xc82007e0a0)
    /usr/lib/golang/src/runtime/sema.go:43 +0x26
sync.(*Mutex).Lock(0xc82007e09c)
    /usr/lib/golang/src/sync/mutex.go:82 +0x1c4
github.com/fluent/fluent-logger-golang/fluent.(*Fluent).reconnect.func1(0xc82007e000)
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:255 +0x6a
created by github.com/fluent/fluent-logger-golang/fluent.(*Fluent).reconnect
    /home/ec2-user/go/src/github.com/fluent/fluent-logger-golang/fluent/fluent.go:267 +0x35

@nikolai-derzhak-distillery

👍

@tagomoris
Member

tagomoris commented Sep 20, 2016

I updated the test code so that it does not panic, but instead retries sending the message, printing the err content.

package main

import (
    "fmt"
    "time"
    "github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
    f, err := fluent.New(fluent.Config{})
    if err != nil {
        panic(err)
    }

    data := map[string]string{
        "mylog": "This is my json log!",
    }
    for {
        tm := time.Now()
        if err := f.PostWithTime("docker.logs", tm, data); err != nil {
            fmt.Printf("error: %s\n", err)
            continue
        }
    }
}

With the change of #30, I got:

  1. "write tcp 127.0.0.1:36746->127.0.0.1:24224: write: connection reset by peer" at first
  2. "fluent#send: can't send logs, client is reconnecting" for a while
  3. "Buffer full" at last

With this behavior, fluent-logger-golang (and the fluentd logging driver in Docker) doesn't stop Docker itself.
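
For callers consuming these errors (the test program above, or a logging driver), a short backoff keeps the loop from spinning while the client reconnects. A minimal sketch under that assumption, using the same fluent-logger-golang calls as the test program; the 100 ms delay is an arbitrary illustrative value.

package main

import (
    "fmt"
    "time"

    "github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
    f, err := fluent.New(fluent.Config{})
    if err != nil {
        panic(err)
    }

    data := map[string]string{
        "mylog": "This is my json log!",
    }
    for {
        tm := time.Now()
        if err := f.PostWithTime("docker.logs", tm, data); err != nil {
            // With the #30 change, err reports a reset connection, an
            // in-progress reconnect, or a full buffer; back off briefly
            // instead of spinning (or crashing the whole daemon).
            fmt.Printf("error: %s -- backing off\n", err)
            time.Sleep(100 * time.Millisecond)
        }
    }
}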

@EricFortin

Nice !

@tagomoris
Member

I created a pull request for Docker to solve this problem!
moby/moby#26740

@tagomoris
Member

That was merged.
Look forward to the Docker 1.13.0 release!

@hridyeshpant

@EricFortin can you share your configuration for using buffer_queue_full_action?
Currently I am using the configuration below in my fluentd settings:

<match docker.**>
  type elasticsearch
  logstash_format true
  include_tag_key true
  tag_key image
  flush_interval 10s
</match>

Do you mean adding the parameter below?
buffer_queue_full_action drop_oldest_chunk

@EricFortin

@hridyeshpant Yes, you add that to your elasticsearch plugin config. It is available because that plugin inherits from BufferedOutput. I am using flume on my side, but since it also inherits from BufferedOutput, I can use the same setting. Here's my config:

<match docker.*>
  type flume
  host FLUME_HOST
  port FLUME_PORT
  remove_prefix docker

  buffer_queue_full_action drop_oldest_chunk
</match>
