mitogen tracebacks with 0.2.4 and 0.2.5 but works fine with 0.2.3 #545

Closed
arrfab opened this Issue Feb 16, 2019 · 9 comments

@arrfab

commented Feb 16, 2019

  • Which version of Ansible are you running?
    ansible-2.6.5-1.el7.noarch

  • Is your version of Ansible patched in any way?
    default upstream

  • Are you running with any custom modules, or module_utils loaded?
    no

  • Have you tried the latest master version from Git?
    no

  • Do you have some idea of what the underlying problem may be?
    https://mitogen.rtfd.io/en/stable/ansible.html#common-problems has
    instructions to help figure out the likely cause and how to gather relevant
    logs.

The pattern seems to be the target arch: it fails on armhfp, while it works fine on other arches like x86_64, ppc64e, aarch64.

  • Mention your host and target OS and versions
    CentOS 7 everywhere

  • Mention your host and target Python versions
    same : python-2.7.5-76.el7.x86_64

Here is the log from trying to "ping" some armhfp nodes with Ansible:

armhfp-01.sub.domain.com | UNREACHABLE! => {
"changed": false,
"msg": "EOF on stream; last 300 bytes received: u'n\n File "", line 552, in _profile_hook\n File "", line 3081, in _dispatch_calls\n File "", line 1035, in iter\n File "", line 1021, in get\n File "", line 2185, in get\n File "", line 2150, in _make_cookie\nerror: integer out of range for \'l\' format code'",
"unreachable": true
}
armhfp-03.sub.domain.com | UNREACHABLE! => {
"changed": false,
"msg": "EOF on stream; last 300 bytes received: u'n\n File "", line 552, in _profile_hook\n File "", line 3081, in _dispatch_calls\n File "", line 1035, in iter\n File "", line 1021, in get\n File "", line 2185, in get\n File "", line 2150, in _make_cookie\nerror: integer out of range for \'l\' format code'",
"unreachable": true
}
ERROR! [pid 25435] 16:49:32.020977 E mitogen.ctx.ssh.armhfp-01.domain.com: mitogen: ExternalContext.main() crashed
Traceback (most recent call last):
File "", line 3354, in main
File "", line 3093, in run
File "", line 552, in _profile_hook
File "", line 3081, in _dispatch_calls
File "", line 1035, in iter
File "", line 1021, in get
File "", line 2185, in get
File "", line 2150, in _make_cookie
error: integer out of range for 'l' format code
armhfp-02.domain.com | UNREACHABLE! => {
"changed": false,
"msg": "EOF on stream; last 300 bytes received: u'n\n File "", line 552, in _profile_hook\n File "", line 3081, in _dispatch_calls\n File "", line 1035, in iter\n File "", line 1021, in get\n File "", line 2185, in get\n File "", line 2150, in _make_cookie\nerror: integer out of range for \'l\' format code'",
"unreachable": true
}
armhfp-02.sub.domain.com | UNREACHABLE! => {
"changed": false,
"msg": "EOF on stream; last 300 bytes received: u'n\n File "", line 552, in _profile_hook\n File "", line 3081, in _dispatch_calls\n File "", line 1035, in iter\n File "", line 1021, in get\n File "", line 2185, in get\n File "", line 2150, in _make_cookie\nerror: integer out of range for \'l\' format code'",
"unreachable": true
}
armhfp-01.domain.com | UNREACHABLE! => {
"changed": false,
"msg": "Channel was disconnected while connection attempt was in progress; this may be caused by an abnormal Ansible exit, or due to an unreliable target.",
"unreachable": true

Worth noting that switching back to the normal strategy through ansible.cfg lets me ping these nodes fine.

Happy to give more details if needed.

@dw

Owner

commented Feb 16, 2019

Oh heck, thanks so much for this. Yes, there was performance work where a large text string was replaced with a binary encoding.

So it looks like an incorrect type is used for one of the encoded fields in https://github.com/dw/mitogen/blob/master/mitogen/core.py#L2143. Narrowing this down to a particular architecture is a huge help!

My guess is that because the struct format is not prefixed with '>', some size varies on ARM. I'll set up a local Qemu and reproduce.
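
As a rough illustration of that guess: Python's `struct` module sizes fields differently in native mode (no prefix) and standard mode (a prefix such as '>'). The format strings below are examples only and are not taken from `core.py`.

```python
import struct

# Native mode (no prefix, or '@'): sizes follow the platform's C types,
# so 'l' (C long) is typically 4 bytes on 32-bit ARM and 8 on 64-bit Linux.
native_long = struct.calcsize('l')

# Standard mode ('<', '>', '=' or '!'): sizes are fixed on every platform.
standard_long = struct.calcsize('>l')       # 4 bytes
standard_long_long = struct.calcsize('>q')  # 8 bytes

print('native l:', native_long)
print('standard >l:', standard_long)
print('standard >q:', standard_long_long)
```

Note that a '>' prefix alone pins 'l' at 4 bytes everywhere, so a wider code such as 'q' is also needed if a field can hold values beyond 32 bits.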

Thanks a ton for finding this!

@dw

Owner

commented Feb 16, 2019

IIRC armhfp is 32-bit, right? Looks like we're trying to stuff a pointer (probably thread.get_ident()) that's been hashed with a 64-bit seed into a 32-bit word.

@arrfab

Author

commented Feb 16, 2019

Yes, armhfp (aka ARMv7) is 32-bit only, but the kernel running on those armhfp nodes is PAE (I don't know if that makes a difference).

@dw

Owner

commented Feb 16, 2019

I am struggling to find a Qemu ARM image to hand. If convenient, would you mind starting a Python interpreter on one of those nodes and pasting:

import struct, thread
print('L:', struct.calcsize('L'))
print('l:', struct.calcsize('l'))
print('get_ident: ', thread.get_ident())
print('id: ', id(None))
@dw

Owner

commented Feb 16, 2019

looks like scaleway.com's C1 server is a match, at least I can test this :)

dw added a commit that referenced this issue Feb 16, 2019

@dw

Owner

commented Feb 16, 2019

Was able to reproduce on the C1; it just needed the field lengths to be increased. This will be on master after CI runs.
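
A minimal sketch of what widening the fields can look like, assuming a cookie built from the process ID, an object identity and a thread ID. The names `COOKIE_FMT`, `COOKIE_MAGIC`, `make_cookie` and `parse_cookie` are hypothetical here, and the layout is not necessarily the one used in the actual commit.

```python
import os
import struct
import threading

# Hypothetical layout: one unsigned 64-bit magic value followed by three
# signed 64-bit fields. The '>' prefix selects standard sizes, so the
# packed cookie is 32 bytes on both 32-bit and 64-bit hosts.
COOKIE_FMT = '>Qqqq'
COOKIE_MAGIC = 0x4C544348  # arbitrary marker chosen for this example
COOKIE_SIZE = struct.calcsize(COOKIE_FMT)


def make_cookie(obj):
    """Pack the PID, the object's id() and the thread ID into fixed-width fields."""
    return struct.pack(
        COOKIE_FMT,
        COOKIE_MAGIC,
        os.getpid(),
        id(obj),
        threading.get_ident(),
    )


def parse_cookie(data):
    """Unpack a cookie and return (pid, object id, thread id)."""
    magic, pid, obj_id, thread_id = struct.unpack(COOKIE_FMT, data)
    if magic != COOKIE_MAGIC:
        raise ValueError('not a cookie')
    return pid, obj_id, thread_id


marker = object()
cookie = make_cookie(marker)
print(len(cookie), parse_cookie(cookie))
```

With 8-byte fields, an `id()` above 2**31 - 1 on a 32-bit host still fits, which a 4-byte signed field cannot hold.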

@arrfab

Author

commented Feb 16, 2019

You're really too fast: I just had a quick dinner and wanted to paste the results:

>>> import struct, thread
>>> print('L:', struct.calcsize('L'))
('L:', 4)
>>> print('l:', struct.calcsize('l'))
('l:', 4)
>>> print('get_ident: ', thread.get_ident())
('get_ident: ', -1225694400)
>>> print('id: ', id(None))
('id: ', 3068956472L)
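
Reading those numbers: both 'L' and 'l' are 4 bytes on armhfp, the `id(None)` value is larger than 2**31 - 1, and the thread ident is negative. A small sketch of what that means for packing, using the standard-size formats '<l' and '<L' to mimic the 4-byte armhfp sizes on any host (the helper `fits` is made up for this example):

```python
import struct

id_none = 3068956472        # id(None) reported above
thread_ident = -1225694400  # thread.get_ident() reported above


def fits(fmt, value):
    """Return True if value can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


# '<l' is a signed 4-byte field, '<L' an unsigned 4-byte field,
# and '<q' a signed 8-byte field, regardless of the host platform.
print('id fits signed 32-bit:     ', fits('<l', id_none))
print('id fits unsigned 32-bit:   ', fits('<L', id_none))
print('ident fits signed 32-bit:  ', fits('<l', thread_ident))
print('ident fits unsigned 32-bit:', fits('<L', thread_ident))
print('both fit signed 64-bit:    ', fits('<q', id_none) and fits('<q', thread_ident))
```

Neither 4-byte code holds both values, while a signed 64-bit field holds either.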

@arrfab

Author

commented Feb 16, 2019

I applied the diff and I confirm it now works just fine! Thanks a lot!
Btw, this line (de5c22d#diff-b0508adba031c447ece5a495a136a3d1R160) should probably say arrfab instead of that typo, but I don't need to be mentioned either ;-)

@dw

Owner

commented Feb 16, 2019

it's fixed in the subsequent force push already :) Thanks for spotting!

@dw dw closed this in 7d0480e Feb 16, 2019

dw added a commit that referenced this issue Feb 16, 2019

Merge remote-tracking branch 'origin/dmw'
* origin/dmw:
  core: increase cookie field lengths to 64-bit; closes #545.
  tests: ensure serialization restrictions are in effect
  tests/bench: set process affinity in throughput.py.
  docs: update copyright year.
  docs: update Changelog.
  core: Make Latch.put(obj=) optional.

dw added a commit that referenced this issue Mar 6, 2019

Merge remote-tracking branch 'origin/026' into stable
* origin/026:
  docs: update Changelog for release.
  Bump version for release.
  issue #555: ansible: workaround ancient reload(sys) hack.
  issue #554: mitogen_action_script fix
  issue #554: fix Ansible 2.4 compatibility
  issue #554: don't rely on tmp_path autoremoval in test.
  issue #554: track and remove multiple make_tmp_path() calls.
  docs: update Changelog.
  docs: drastically simplify install/changelog.
  issue #552: include process identity in log messages.
  issue #550: update Changelog.
  issue #550: parent: add explanatory comment.
  issue #550: fix up TTY ioctls on WSL 2016 Anniversary Update
  docs: update Changelog.
  service: make service list optional.
  docs: update Changelog; closes #548.
  issue #548: always treat transport=smart as 'ssh' for mitogen_via=.
  docs: better intro paragraph.
  .ci: copy private key file to tempdir.
  os_fork: more doc tweaks
  os_fork: more doc tweaks
  os_fork: yet more doc tidyup
  os_fork: more doc tweaks
  os_fork: clean up docs
  .ci: import soak scripts.
  .ci: allow containers for different jobs to run simultaneously
  os_fork: python 3 fixes and tests.
  issue #535: activate Corker on 2.4 in master too.
  issue #535: update Changelog.
  issue #535: wire mitogen.os_fork into Broker and Pool.
  issue #535: parent: add create_socketpair(size=..) parameter.
  issue #535: introduce mitogen.os_fork module and Corker class.
  issue #535: docs: update Changelog
  issue #535: service: support Pool.defer() like Broker.defer()
  issue #535: core: unicode.encode() may take importer lock on 2.x
  issue #535: docs: fix up Select doc
  issue #535: docs: update Changelog.
  issue #535: core/select: support selecting from Latches.
  core: increase cookie field lengths to 64-bit; closes #545.
  tests: ensure serialization restrictions are in effect
  tests/bench: set process affinity in throughput.py.
  docs: update copyright year.
  docs: update Changelog.
  core: Make Latch.put(obj=) optional.
  docs: change 'unreleased' Changelog format and add a hint.
  docs: update Changelog; closes #542.
  issue #542: return of select poller, new selection logic
  issue #542: .ci: move some tests to Azure and enable Mac job.
  ansible: create stub __init__.py for sdist.