Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread Errors - Deadlock? Other source? #143

Closed
GLStephen opened this issue Aug 24, 2017 · 7 comments
Closed

Thread Errors - Deadlock? Other source? #143

GLStephen opened this issue Aug 24, 2017 · 7 comments

Comments

@GLStephen
Copy link
Contributor

GLStephen commented Aug 24, 2017

I'm seeing errors in fluent-plugin-bigquery that seem similar to deadlock issues that have been reporting in general with FluentD 14. However, since the errors are specifically referencing fluent-plugin-bigquery I wanted to drop them in here. Things to know: Restart takes a long time and finally dies with these errors.

2017-08-24 16:41:41 +0000 [warn]: #1 [bqPumpBidWon] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007efd4a26fc20@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 [bqPumpBidResponse] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007f264245c210@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #1 [bqPumpBidRequested] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007efd4a25e3f8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 [bqPumpBidTimeout] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007f2642466a08@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #1 [bqauctionInit] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:enqueue_thread thread=#<Thread:0x007efd4a23f0e8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 [bqauctionInit] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_2 thread=#<Thread:0x007f264242faa8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #1 [bqauctionInit] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_2 thread=#<Thread:0x007efd4a245718@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #1 [bqauctionInit] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_3 thread=#<Thread:0x007efd4a244368@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #1 thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::ForwardInput title=:event_loop thread=#<Thread:0x007efd4a237c08@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #1 [bqPumpBidResponse] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007efd4a267c78@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #1 [bqauctionInit] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007efd4a247450@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #1 [bqPumpBidTimeout] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007efd4a27b958@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #1 [bqauctionInit] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007efd4a24e1d8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::ForwardInput title=:event_loop thread=#<Thread:0x007f26424136f0@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 [bqauctionInit] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007f2642435020@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 [bqauctionInit] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_3 thread=#<Thread:0x007f264242eec8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 [bqauctionInit] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007f2642434508@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 [bqPumpBidRequested] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007f264244dcd8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 [bqPumpBidWon] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007f26424647d0@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-08-24 16:41:41 +0000 [warn]: #0 [bqPumpBidRequested] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:enqueue_thread thread=#<Thread:0x007f264244d3a0@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
@joker1007
Copy link
Collaborator

I have never seen this error.
I think that fluent-plugin-bigquery does not cause this.
If bigquery itself is unstable, flush thread of fluent-plugin-bigquery has long wait. It cause this error maybe.
Would you give me your configuration and plugin version.

@GLStephen
Copy link
Contributor Author

GLStephen commented Aug 28, 2017

fluent-plugin-biquery version v1.0.0 with this manually patched eaec2d1

With FluentD 14 (td-agent 3)

Failue happens whether using multiple workers or just one. If I remove the customer_id buffer param then it runs fine.

I have 5 eventType handlers like these. One streams, the others load. If flush interval is set higher the system fails faster. It eventually locks up and Forwarders see the Aggregator (config below) as down.

<match prod.push.eventType>
  @type copy

  <store>
    @type bigquery
    @id eventType
    auth_method application_default    
    
    project project
    dataset dataset
    table eventType_${customer_id}$%Y%m%d
    
    ignore_unknown_values true
    schema_path /etc/td-agent/eventType.json
        
    <buffer customer_id, time>          
      @type file
      timekey 1d
      flush_interval 10s
      flush_thread_count 4
      total_limit_size 10GB    
      timekey_use_utc   
    </buffer>
    
    <secondary bqguid>
        @type file # or forward
        path /var/log/td-agent/buffer/failed/bigquery.buffer.eventType.${customer_id}.*.log
     </secondary>            
  </store>
  <store>
    @type flowcounter
    tag prod.metric.eventType
    unit minute
    aggregate all
  </store>
</match>

4 identical load types

<match prod.push.eventType>
  @type copy
  
  <store>
    @type bigquery
    @id eventTypeBQ
    auth_method application_default

    method load

    project project
    dataset dataset
    table table_${customer_id}$%Y%m%d
    
    #fetch_schema true
    schema_path /etc/td-agent/table.json
    max_bad_records 5
    ignore_unknown_values true

    slow_flush_log_threshold 600s

    <buffer customer_id, time>
      @type file
      timekey 1d
      flush_interval 1200s
      flush_at_shutdown true      
      flush_thread_count 2
      timekey_use_utc
    </buffer>
     <secondary bqguid>
        @type file # or forward
        path /var/log/td-agent/buffer/failed/bigquery.buffer.table.${customer_id}.*.log
     </secondary>
  </store>
  <store>
    @type flowcounter
    tag prod.metric.eventType
    unit minute
    aggregate all
  </store>
</match>

Forwarder looks like this. The aggregator seizes up and fails regardless of flush_interval and thread setting.

# Log Forwarding
<match prod.**>
  type forward

  <server>
    host x.x.x.x    
    port 24225
    weight 60
  </server>
  
  <server>
    host x.x.x.y
    port 24225
    weight 60
  </server>
  
  <server>
    host x.x.x.z
    port 24225
    weight 60
  </server>

  #
  # use udp for heartbeat
  heartbeat_type udp

  flush_interval .1s
  buffer_chunk_limit 999k
  buffer_queue_limit 60240

  # use multi-threading to send buffered data in parallel
  num_threads 32
  expire_dns_cache 600

  # use file buffer to buffer events on disks.
  buffer_type file
  buffer_path /var/log/td-agent/buffer/forward

  # in case buffer becomes full, have local backup
  <secondary>
    type file
    path /var/log/td-agent/buffer/secondary
    compress gzip
  </secondary>
</match>

@GLStephen
Copy link
Contributor Author

GLStephen commented Sep 13, 2017

To provide some more info. I changed messages to have the customer_id in the tag and to route based on that. So, instead of using message data only the tag is used for processing. This issue has since gone away. It seems that the deadlock occurs when the messages are forwarded undifferentiated and have to be split at the aggregator level.

@GLStephen
Copy link
Contributor Author

I believe I've found the underlying source of the thread issues, I'm just not sure if it is a fluentd issue or this plugin. What I've found is that if you have a lot of reorganization of forwarded chunks then fluend becomes unstable and has many issues. When I have my chunks all aligned to 999k everything runs very well. I tried reducing the forwarded chunks to 100k on a few test servers and the system became unstable. Changing back to 999k all around restored stability. It seems something in the rechunking logic is causing deadlocks or other issues.

@joker1007
Copy link
Collaborator

Hmm..
Thanks for your report!

@GLStephen
Copy link
Contributor Author

I believe this may have been the source of this issue: fluent/fluentd#1549

@joker1007
Copy link
Collaborator

We had similar issue, but fluentd-0.14.22 resolves it.
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants