
Reconnect if there's a DRB error in IPC #39

Closed · wants to merge 1 commit

Conversation

@thijsc (Member) commented May 4, 2015

No description provided.

@serioushaircut

Hi,
okay, we'll try to build a simple setup with a long-running worker and see what happens and whether it depends on the runtime duration.

But anyway, my guess is that a systematic error occurring during the write to this file stream would result in an infinite loop? Do I have that right?
I'd really like to figure out why the unix socket peer (= Appsignal::IPC::Server?) is reporting a broken pipe to the calling side (= Appsignal::IPC::Client?). Long idle times shouldn't pose a problem afaik.

Cheers from Hamburg
SH

@serioushaircut

Update: a simple job with a long runtime does not seem to trigger the broken pipe error.
I just tested a Resque job that sleeps for 15, 30, 45, and 50 minutes (the latter two being the runtimes of the crashing jobs), and the DRb error could not be observed.
Investigating...
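
Roughly, the test job looked like this (a sketch only; the class name and sleep durations are illustrative, not taken from the original report):

class SleepTestJob < BaseJob
  def self.queue
    :stats
  end

  def self.perform(minutes)
    # Idle for the given duration to mimic a long-running job without
    # generating any instrumented events.
    sleep(minutes * 60)
  end
end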

@thijsc (Member, Author) commented May 5, 2015

Thanks for the update and for testing it! We're currently wrapping up work on a new release that will replace the IPC system entirely, so I'm happy we have a non-invasive fix in the meantime :-). Do I understand correctly that this fix does work for you?

@serioushaircut

No, not yet. The DRbConnError happens regularly in production, with a job that handles a lot of data and runs for a long time.
We just forked appsignal, applied the tentative fix, and are deploying that version on one of our worker servers to trigger the problematic job again. Then we'll see whether the fix works.
Honestly, I doubt it, as a retry on a unix domain socket that has been closed by the other peer will certainly result in the same error. But we'll see shortly. I hope I'm wrong there :)
Cheers
SH

@thijsc (Member, Author) commented May 5, 2015

I think the fix should work since the listener lives on the filesystem and the reconnect is a full reconnect. Let's see :-).
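
In rough terms the pattern is: build a fresh DRb proxy for the socket that still lives on the filesystem and retry the call once when the connection errors out. A minimal standalone sketch, not the gem's actual internals (the URI and the enqueue method are placeholders):

require "drb/drb"
require "drb/unix" # needed for drbunix: URIs

DRb.start_service

# Placeholder URI; in the gem the socket path lives under the app's tmp dir.
APPSIGNAL_URI = "drbunix:/app/tmp/appsignal-24".freeze

def enqueue_with_reconnect(payload)
  server = DRbObject.new_with_uri(APPSIGNAL_URI)
  server.enqueue(payload)
rescue DRb::DRbConnError
  # Full reconnect: build a fresh proxy and retry once. The socket file is
  # still on disk, so the retry reaches the same listener process.
  server = DRbObject.new_with_uri(APPSIGNAL_URI)
  server.enqueue(payload)
end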

@serioushaircut

Hey Thijs,
production test done; we got the very same error, this time raised from the enqueue line in the rescue block :(.

DRb::DRbConnError
Error Broken pipe
/usr/lib/ruby/2.1.0/drb/drb.rb:611:in `write'
/usr/lib/ruby/2.1.0/drb/drb.rb:611:in `send_request'
/usr/lib/ruby/2.1.0/drb/drb.rb:921:in `send_request'
/usr/lib/ruby/2.1.0/drb/drb.rb:1221:in `send_message'
/usr/lib/ruby/2.1.0/drb/drb.rb:1110:in `block (2 levels) in method_missing'
/usr/lib/ruby/2.1.0/drb/drb.rb:1197:in `open'
/usr/lib/ruby/2.1.0/drb/drb.rb:1109:in `block in method_missing'
/usr/lib/ruby/2.1.0/drb/drb.rb:1128:in `with_friend'
/usr/lib/ruby/2.1.0/drb/drb.rb:1108:in `method_missing'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/ipc.rb:64:in `rescue in enqueue'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/ipc.rb:59:in `enqueue'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/transaction.rb:174:in `complete!'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/transaction.rb:26:in `complete_current!'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal.rb:92:in `monitor_transaction'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/integrations/resque.rb:8:in `around_perform_resque_plugin'

I'm trying to figure out what is wrong with the socket peer, but it seems that any write into the stream from the Appsignal::IPC::Client side fails with a broken pipe.
Cheers
SH

@serioushaircut

And a bit more information:
We have seen the problematic jobs fail with this error just twice. We observed the unix domain sockets on our Ubuntu instance via:

> cat /proc/net/unix
Num       RefCount Protocol Flags    Type St Inode Path
...
0000000000000000: 00000002 00000000 00010000 0001 01 201748 /app/tmp/appsignal-24
...
0000000000000000: 00000002 00000000 00010000 0001 01 201150 /app/tmp/appsignal-21
...

24 and 21 were the ones used by the problematic jobs that failed with a broken pipe. And we could verify that other jobs afterwards were processed successfully by the same workers on the same sockets (24, 21).

Mmh, not much wiser. I have to dig into the two jobs that cause the problem again. Perhaps it's something to do with serializing the messages into the socket stream buffer. Still investigating. Any ideas from your side?

Cheers
SH

@thijsc (Member, Author) commented May 5, 2015

I'm starting to think your issue might not be a disconnection, but something that cannot be serialized by DRb. Are you using any hashes with default options, for example?
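
For context, DRb marshals call arguments with Marshal, and a Hash with a default proc is one common thing Marshal refuses to dump. A quick check in plain Ruby, nothing gem-specific:

# A default *value* marshals fine; a default *proc* does not.
Marshal.dump(Hash.new(0)) # works

begin
  Marshal.dump(Hash.new { |hash, key| hash[key] = [] })
rescue TypeError => e
  puts e.message # => "can't dump hash with default proc"
end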

@serioushaircut

Moin,
that's in line with what we guessed about serializing the transaction.
Here we have a large transaction - the job works with a lot of db entries - and when we monkey-patched the appsignal version and logged the transaction object's #to_s, it came out at 71 MB.
When we clear the events before enqueuing the transaction object, it works. But I'm going to test the size hypothesis with a simple job that does a lot of queries => lots of appsignal events and nothing else.

As to your hint about hashes with default options:
We do not use such a hash within the job ourselves. But the code within the job is not obviously subject to any marshalling via the socket anyway - that depends more on whatever is instrumented by appsignal and hence put into the transaction object.
(btw, we grepped for 'Hash' in the object's #to_s output and didn't find anything)

I'll give it a try with a job that queries a lot and see whether it depends on the pure size of the transaction object.

Cheers
SH
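
One way to test the size hypothesis directly from the job side would be to measure the marshaled payload, since DRb writes a Marshal dump of the call onto the socket. A sketch, assuming the current transaction is reachable via Appsignal::Transaction.current (the exact accessor may differ per gem version):

transaction = Appsignal::Transaction.current

begin
  payload = Marshal.dump(transaction)
  Appsignal.logger.info("transaction payload: #{payload.bytesize} bytes")
rescue TypeError => e
  # If the transaction holds something Marshal refuses (a proc, an IO, ...),
  # this surfaces it directly instead of failing deep inside DRb.
  Appsignal.logger.error("transaction not marshalable: #{e.message}")
end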

@serioushaircut

Hey,
test successful - the following job triggers the stack trace.

class LargeTransactionJob < BaseJob
  def self.queue
    :stats
  end

  def self.perform
    50_000.times { p User.find_by(first_name: "dontexist") }
  end
end

As we don't do anything special in the job, we are quite certain it's just the size of the transaction object, which keeps track of all queries done in the job - too large for the uds stream?!
What do you say?
Cheers
SH

@thijsc (Member, Author) commented May 6, 2015

I'm going to run some tests on this in the morning, will get back to you then.

@thijsc (Member, Author) commented May 7, 2015

I've been doing some testing locally and haven't hit any size limits so far. Still not exactly sure what's going wrong in this situation. We're getting close to a beta of the new gem version which completely overhauls this. Are you interested in participating in the beta? If so I'll ping you via e-mail with details.

@serioushaircut

Hi, mmh, strange.
Perhaps it's a good idea to keep us in the loop on the beta. Until then we're going to live with clearing the events field of the current thread's transaction as the last instruction in the jobs that accumulate too many of them.
We'll gladly give the beta a try.
Cheers and thanks for your help
SH
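
For reference, the interim workaround looks roughly like this in a job (the events accessor on the transaction is an assumption and may differ between gem versions):

class HeavyStatsJob < BaseJob
  def self.queue
    :stats
  end

  def self.perform
    # ... work that triggers many instrumented queries ...

    # Workaround sketch: drop the accumulated instrumentation events before
    # the transaction gets marshalled over the IPC socket on complete!.
    transaction = Appsignal::Transaction.current
    transaction.events.clear if transaction
  end
end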

@thijsc (Member, Author) commented May 7, 2015

You're welcome. I'll be in touch.
