
Reconnect if there's a DRB error in IPC #39

Closed · wants to merge 1 commit

Conversation

@thijsc (Member) commented May 4, 2015

No description provided.

@serioushaircut

Hi,
okay, we'll try to build a simple setup with a long-running worker and see what happens and whether it depends on the runtime duration.

But anyway, my guess is that a systematic error occurring during the write to this file stream would result in an infinite loop? Do I have that right?
I'd really like to figure out why the unix socket peer (= Appsignal::IPC::Server?) is reporting a broken pipe to the calling side (= Appsignal::IPC::Client?). Long idle times shouldn't pose a problem afaik.

Cheers from Hamburg
SH

@serioushaircut

Update: a simple job with a long runtime does not seem to trigger the broken pipe error.
I just tested a Resque job that sleeps for 15, 30, 45, and 50 minutes (the latter two being the runtimes of the crashing jobs), and the DRb error could not be observed.
Investigating...
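
Roughly, the test job looked like this (a sketch only; the class name and sleep durations are illustrative, not taken from the original report):

class SleepTestJob < BaseJob
  def self.queue
    :stats
  end

  def self.perform(minutes)
    # Idle for the given duration to mimic a long-running job without
    # generating any instrumented events.
    sleep(minutes * 60)
  end
end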

@thijsc (Member, Author) commented May 5, 2015

Thanks for the update and for testing it! We're currently wrapping up work on a new release that will replace the IPC system entirely, so I'm happy we have a non-invasive fix in the meantime :-). Do I understand correctly that this fix does work for you?

@serioushaircut

No, not yet. The DRbConnError happens regularly in production, with a job that handles a lot of data and runs for a long time.
We just forked appsignal, applied the tentative fix, and are deploying that version on one of our worker servers to trigger the problematic job again. Then we'll see whether the fix works.
Honestly, I doubt it, as a retry on a unix domain socket that has been closed by the other peer will certainly result in the same error. But we'll see shortly. I hope I'm wrong there :)
Cheers
SH

@thijsc (Member, Author) commented May 5, 2015

I think the fix should work since the listener lives on the filesystem and the reconnect is a full reconnect. Let's see :-).
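
In rough terms the pattern is: build a fresh DRb proxy for the socket that still lives on the filesystem and retry the call once when the connection errors out. A minimal standalone sketch, not the gem's actual internals (the URI and the enqueue method are placeholders):

require "drb/drb"
require "drb/unix" # needed for drbunix: URIs

DRb.start_service

# Placeholder URI; in the gem the socket path lives under the app's tmp dir.
APPSIGNAL_URI = "drbunix:/app/tmp/appsignal-24".freeze

def enqueue_with_reconnect(payload)
  server = DRbObject.new_with_uri(APPSIGNAL_URI)
  server.enqueue(payload)
rescue DRb::DRbConnError
  # Full reconnect: build a fresh proxy and retry once. The socket file is
  # still on disk, so the retry reaches the same listener process.
  server = DRbObject.new_with_uri(APPSIGNAL_URI)
  server.enqueue(payload)
end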

@serioushaircut

Hey Thijs,
production test done; we got the very same error, this time raised from the enqueue line in the rescue block :(.

DRb::DRbConnError
Error Broken pipe
/usr/lib/ruby/2.1.0/drb/drb.rb:611:in `write'
/usr/lib/ruby/2.1.0/drb/drb.rb:611:in `send_request'
/usr/lib/ruby/2.1.0/drb/drb.rb:921:in `send_request'
/usr/lib/ruby/2.1.0/drb/drb.rb:1221:in `send_message'
/usr/lib/ruby/2.1.0/drb/drb.rb:1110:in `block (2 levels) in method_missing'
/usr/lib/ruby/2.1.0/drb/drb.rb:1197:in `open'
/usr/lib/ruby/2.1.0/drb/drb.rb:1109:in `block in method_missing'
/usr/lib/ruby/2.1.0/drb/drb.rb:1128:in `with_friend'
/usr/lib/ruby/2.1.0/drb/drb.rb:1108:in `method_missing'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/ipc.rb:64:in `rescue in enqueue'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/ipc.rb:59:in `enqueue'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/transaction.rb:174:in `complete!'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/transaction.rb:26:in `complete_current!'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal.rb:92:in `monitor_transaction'
/app/.bundle/ruby/2.1.0/bundler/gems/appsignal-199271a156d4/lib/appsignal/integrations/resque.rb:8:in `around_perform_resque_plugin'

I'm trying to figure out what is wrong with the socket peer, but it seems that any write into the stream from the Appsignal::IPC::Client side fails with a broken pipe.
Cheers
SH

@serioushaircut

And a bit more information:
We have seen the problematic jobs fail with this error just twice. We observed the unix domain sockets on our Ubuntu instance via:

> cat /proc/net/unix
Num       RefCount Protocol Flags    Type St Inode Path
...
0000000000000000: 00000002 00000000 00010000 0001 01 201748 /app/tmp/appsignal-24
...
0000000000000000: 00000002 00000000 00010000 0001 01 201150 /app/tmp/appsignal-21
...

24 and 21 were the ones used by the problematic jobs that failed with a broken pipe. And we could verify that other jobs afterwards were processed successfully by the same workers on the same sockets (24, 21).

Mmh, not much wiser. I have to dig into the two jobs that cause the problem again. Perhaps it's something to do with serializing the messages into the socket stream buffer. Still investigating. Any ideas from your side?

Cheers
SH

@thijsc (Member, Author) commented May 5, 2015

I'm starting to think your issue might not be a disconnection, but something that cannot be serialized by DRb. Are you using any hashes with default options, for example?
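
For context, DRb marshals call arguments with Marshal, and a Hash with a default proc is one common thing Marshal refuses to dump. A quick check in plain Ruby, nothing gem-specific:

# A default *value* marshals fine; a default *proc* does not.
Marshal.dump(Hash.new(0)) # works

begin
  Marshal.dump(Hash.new { |hash, key| hash[key] = [] })
rescue TypeError => e
  puts e.message # => "can't dump hash with default proc"
end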

@serioushaircut

Moin,
that's in line with what we guessed about serializing the transaction.
Here we have a large transaction - the job works with a lot of db entries - and when we monkey-patched the appsignal version and logged the transaction object's #to_s, it came out at 71 MB.
When we clear the events before enqueuing the transaction object, it works. But I'm going to test the size hypothesis with a simple job that does a lot of queries => lots of appsignal events and nothing else.

As to your hint about hashes with default options:
We do not use such a hash within the job ourselves. But the code within the job is not obviously subject to any marshalling via the socket anyway - that depends more on whatever is instrumented by appsignal and hence put into the transaction object.
(btw, we grepped for 'Hash' in the object's #to_s output and didn't find anything)

I'll give it a try with a job that queries a lot and see whether it depends on the pure size of the transaction object.

Cheers
SH
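
One way to test the size hypothesis directly from the job side would be to measure the marshaled payload, since DRb writes a Marshal dump of the call onto the socket. A sketch, assuming the current transaction is reachable via Appsignal::Transaction.current (the exact accessor may differ per gem version):

transaction = Appsignal::Transaction.current

begin
  payload = Marshal.dump(transaction)
  Appsignal.logger.info("transaction payload: #{payload.bytesize} bytes")
rescue TypeError => e
  # If the transaction holds something Marshal refuses (a proc, an IO, ...),
  # this surfaces it directly instead of failing deep inside DRb.
  Appsignal.logger.error("transaction not marshalable: #{e.message}")
end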

@serioushaircut

Hey,
test successful - the following job triggers the stack trace.

class LargeTransactionJob < BaseJob
  def self.queue
    :stats
  end

  def self.perform
    50_000.times { p User.find_by(first_name: "dontexist") }
  end
end

As we don't do anything special in the job, we are quite certain it's just the size of the transaction object, which keeps track of all queries done in the job - too large for the uds stream?!
What do you say?
Cheers
SH

@thijsc (Member, Author) commented May 6, 2015

I'm going to run some tests on this in the morning, will get back to you then.

@thijsc (Member, Author) commented May 7, 2015

I've been doing some testing locally and haven't hit any size limits so far. Still not exactly sure what's going wrong in this situation. We're getting close to a beta of the new gem version which completely overhauls this. Are you interested in participating in the beta? If so I'll ping you via e-mail with details.

@serioushaircut

Hi, mmh, strange.
Perhaps it's a good idea to keep us in the loop on the beta. Until then we're going to live with clearing the events field of the current thread's transaction as the last instruction in the jobs that accumulate too many of them.
We'll gladly give the beta a try.
Cheers and thanks for your help
SH
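
For reference, the interim workaround looks roughly like this in a job (the events accessor on the transaction is an assumption and may differ between gem versions):

class HeavyStatsJob < BaseJob
  def self.queue
    :stats
  end

  def self.perform
    # ... work that triggers many instrumented queries ...

    # Workaround sketch: drop the accumulated instrumentation events before
    # the transaction gets marshalled over the IPC socket on complete!.
    transaction = Appsignal::Transaction.current
    transaction.events.clear if transaction
  end
end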

@thijsc (Member, Author) commented May 7, 2015

You're welcome. I'll be in touch.
