_broker_main() crashed #300

Closed
gservat opened this Issue Jul 12, 2018 · 21 comments

gservat commented Jul 12, 2018

I'm mind-blown by the speed increase we've had with mitogen for some of our playbooks (going from ~40 minutes to ~7 minutes!!). The only issue we're facing is that when the playbook ends (seemingly successfully), it blurts out:

ERROR! [pid 12472] 14:59:12.774005 E mitogen: _broker_main() crashed
Traceback (most recent call last):
  File "/Users/gservat/Downloads/mitogen-stable/mitogen/core.py", line 1788, in _broker_main
    self._loop_once(max(0, deadline - time.time()))
  File "/Users/gservat/Downloads/mitogen-stable/mitogen/core.py", line 1774, in _loop_once
    for (side, func) in self.poller.poll(timeout):
  File "/Users/gservat/Downloads/mitogen-stable/mitogen/parent.py", line 560, in poll
    changelist, 32, timeout)
  File "/Users/gservat/Downloads/mitogen-stable/mitogen/core.py", line 287, in io_op
    return func(*args), False
OSError: [Errno 2] No such file or directory

Any ideas?

Python: 2.7.14

dw (Owner) commented Jul 12, 2018

gservat commented Jul 12, 2018

Versions:

  • OS: macOS 10.13.5
  • Python: 2.7.14
  • Ansible: 2.5.4

Seems to happen every time, yeah. If I only do one node (-l <host>) then it doesn't error out, but if I do a whole bunch of hosts then it does end with the _broker_main() crashed error.
The only thing we enable in ansible.cfg is redis fact caching.
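
For reference, redis fact caching in ansible.cfg usually amounts to just a few settings; a minimal sketch, assuming a local redis instance (the connection string and timeout below are illustrative, not taken from this report):

[defaults]
# Store gathered facts in redis rather than in memory; the connection
# string format is host:port:db and the timeout is in seconds.
gathering = smart
fact_caching = redis
fact_caching_connection = localhost:6379:0
fact_caching_timeout = 86400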

Where should I send the logs? Just attach here or mail it somewhere?

dw (Owner) commented Jul 12, 2018

dw (Owner) commented Jul 12, 2018

gservat commented Jul 12, 2018

They look innocent to me (here you go).

Intel i7 CPU so 4 cores. CPUs are running hard but sadly not making any money with them :)

In terms of target size, there's 185 hosts. If I run it against a smaller batch (I tried with 23 hosts) then it seems to finish without errors.

dw (Owner) commented Jul 12, 2018

Please leave this with me for a day or two. I need to set up a test environment to see if those 'no route' errors are related -- that is usually a harmless race manifesting once or twice, but in this log it is appearing a ton of times. There may be a nasty shutdown ordering bug in here somewhere.

gservat commented Jul 12, 2018

Sure, no worries. The only other issue I've noticed (I can open a different issue if you like) is that when I ran the same playbook today against the same number of targets (180+ hosts), it just stopped at one point. I had 180+ SSH processes on my workstation, and after several minutes the playbook was going nowhere, so I ended up cancelling the run.

dw (Owner) commented Jul 12, 2018

That definitely sounds like another race. Let me get an environment similar to yours and reproduce it -- I haven't tested much higher than the 80 target mark. These are usually really simple to fix once identified, but sometimes they can be difficult to tickle.

It's entirely possible "-vvv" output will reveal another source of the hang; however, if it is a race, enabling "-vvv" has a very high probability of hiding it. Incredibly frustrating :) So let me try finding it first on my end.

gservat commented Jul 12, 2018

Thanks very much for looking into this.

RonnyPfannschmidt commented Jul 12, 2018

Wouldn't still trying with -vvv at least enable looking for an indication of another reason?

gservat commented Jul 12, 2018

I also tried setting - serial: 25% in the playbook and ran it against the 183 hosts. The number of SSH processes went from 25% of 183 to 50% and eventually to 183 SSH processes. It's like it's not closing the old ones when they're done?
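
For context, serial is a play-level keyword; a minimal sketch of where it sits (the hosts pattern and task are placeholders, not from the playbook discussed here):

# Sketch of a play using percentage-based batching.
- hosts: all
  serial: "25%"
  tasks:
    - name: check connectivity
      ping: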

dw (Owner) commented Jul 12, 2018

@gservat maintaining the SSH connections is part of the design -- they're a persistent link to a fixed remote process; closing them would kill a lot of the perf benefit.

Indeed @RonnyPfannschmidt :) That wasn't very clear -- of course feel free to look at "-vvv" output, in particular any red "E" lines. They are hidden by default because the library emits so many soft errors (such as those 'no route' messages); however, it's entirely possible something might stick out like a sore thumb.

dw (Owner) commented Jul 12, 2018

Just for @RonnyPfannschmidt's benefit, since I know you're evaluating this presently: the 'no route' messages relate exclusively to handle 102, a.k.a. mitogen.core.FORWARD_LOG. There is some annoying ordering issue during disconnection where sometimes one or two log lines get dropped. It hasn't been a priority to investigate, since those logs are always captured on disk anyway via MITOGEN_ROUTER_DEBUG (Router.enable_debug()), so it's mostly an inconvenience just now.
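
As an illustration of the on-disk capture mentioned above (assuming the variable only needs to be set for the run; the playbook and inventory names here are placeholders):

# Enable mitogen router debug logging for a single ansible-playbook run.
MITOGEN_ROUTER_DEBUG=1 ansible-playbook -i inventory/hosts site.yml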

gservat commented Jul 17, 2018

@dw any luck? If you had time to have a look, that is.

dw (Owner) commented Jul 17, 2018

Hi @gservat, I hope to have time to set up a test environment before the end of the week. Sorry, preparing to travel just now, and it's usually quite involved to find a way to tickle it :)

Sorry for the delay!

gservat commented Jul 17, 2018

No worries @dw! Thanks for the update.

dw (Owner) commented Jul 24, 2018

Still haven't gotten to this -- real soon now, I promise ;)

gservat commented Jul 24, 2018

dw (Owner) commented Jul 26, 2018

This is now on the master branch and will make it into the next release. To be notified when a new release is made, subscribe to https://www.freelists.org/list/mitogen-announce

Thanks again for reporting this!

gservat commented Jul 27, 2018

Thanks @dw !!

gservat commented Aug 7, 2018

Hey @dw, just wanted to confirm that v0.2.2 of mitogen seems to solve the issue for me 😄 Thanks again!
