Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python 3: ~100MB/sec vs. 800MB/sec FileService throughput #485

Open
dw opened this issue Jan 26, 2019 · 31 comments
Open

Python 3: ~100MB/sec vs. 800MB/sec FileService throughput #485

dw opened this issue Jan 26, 2019 · 31 comments
Labels
bug Code feature that hinders desired execution outcome performance Performance problems

Comments

@dw
Copy link
Member

dw commented Jan 26, 2019

Recorded using tests/bench/throughput.py , either running with python_path=["perf", "record", "-g", "python"] / perf report, or with enable_profiling=True to capture the cProfile output. All methods except .local() are commented out


18:07:51 eldil!3 mitogen] python tests/bench/throughput.py 
Testing local()...
local() took 553.16 ms to transfer 512.00 MiB, 925.59 MiB/s
Testing sudo()...
sudo() took 523.43 ms to transfer 512.00 MiB, 978.17 MiB/s
Testing ssh(compression=False)...
ssh(compression=False) took 2313.06 ms to transfer 512.00 MiB, 221.35 MiB/s
Testing ssh(compression=True)...
ssh(compression=True) took 20117.23 ms to transfer 512.00 MiB, 25.45 MiB/s

[18:08:20 eldil!4 mitogen] PYTHONPATH=. python3 tests/bench/throughput.py 
Testing local()...
local() took 4608.32 ms to transfer 512.00 MiB, 111.10 MiB/s
Testing sudo()...
sudo() took 5030.14 ms to transfer 512.00 MiB, 101.79 MiB/s
Testing ssh(compression=False)...
ssh(compression=False) took 8944.76 ms to transfer 512.00 MiB, 57.24 MiB/s
Testing ssh(compression=True)...
ssh(compression=True) took 75505.49 ms to transfer 512.00 MiB, 6.78 MiB/s

Some insane string encoding somewhere. Py2:

screenshot from 2019-01-26 19-08-33

Py3:

screenshot from 2019-01-26 19-08-35

Traces are from receiver only

@dw dw added the bug Code feature that hinders desired execution outcome label Jan 26, 2019
@dw
Copy link
Member Author

dw commented Jan 26, 2019

Main thread (Py3):

Log

         395829 function calls (395708 primitive calls) in 4.936 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.936    4.936 <stdin>:2991(_dispatch_calls)
        2    0.000    0.000    4.934    2.467 <stdin>:2974(_dispatch_one)
        1    0.000    0.000    4.887    4.887 master:tests/bench/throughput.py:18(transfer)
        1    0.102    0.102    4.887    4.887 master:/home/dmw/src/mitogen/mitogen/service.py:987(get)
     4103    0.041    0.000    3.354    0.001 <stdin>:764(unpickle)
     4103    3.195    0.001    3.310    0.001 {method 'load' of '_pickle.Unpickler' objects}
     4104    0.008    0.000    1.136    0.000 <stdin>:998(__iter__)
     4105    0.017    0.000    1.129    0.000 <stdin>:971(get)
     4105    0.038    0.000    1.110    0.000 <stdin>:2102(get)
     4022    0.052    0.000    1.008    0.000 <stdin>:2147(_get_sleep)
     8001    0.009    0.000    0.898    0.000 <stdin>:442(io_op)
     8044    0.026    0.000    0.897    0.000 <stdin>:1953(_poll)
     4022    0.862    0.000    0.862    0.000 {method 'poll' of 'select.poll' objects}
     4101    0.017    0.000    0.219    0.000 <stdin>:1773(call_service_async)
     4101    0.019    0.000    0.117    0.000 <stdin>:1744(send_async)
     4100    0.005    0.000    0.103    0.000 <stdin>:667(_unpickle_bytes)
     4100    0.098    0.000    0.098    0.000 {built-in method _codecs.latin_1_encode}
     4104    0.008    0.000    0.070    0.000 <stdin>:711(pickled)
     4102    0.003    0.000    0.066    0.000 <stdin>:1784(send)
     4104    0.004    0.000    0.063    0.000 <stdin>:2715(route)
     4104    0.011    0.000    0.059    0.000 <stdin>:2307(defer)
     4104    0.047    0.000    0.050    0.000 {built-in method _pickle.dumps}

.

Main thread (Py2)

Log

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.922    0.922 <stdin>:2991(_dispatch_calls)
        2    0.000    0.000    0.920    0.460 <stdin>:2974(_dispatch_one)
        1    0.000    0.000    0.878    0.878 master:tests/bench/throughput.py:18(transfer)
        1    0.033    0.033    0.877    0.877 master:/home/dmw/src/mitogen/mitogen/service.py:987(get)
     4102    0.004    0.000    0.397    0.000 <stdin>:998(__iter__)
     4103    0.010    0.000    0.394    0.000 <stdin>:971(get)
     4103    0.024    0.000    0.383    0.000 <stdin>:2102(get)
     2889    0.024    0.000    0.318    0.000 <stdin>:2147(_get_sleep)
     5689    0.006    0.000    0.270    0.000 <stdin>:442(io_op)
     5778    0.010    0.000    0.246    0.000 <stdin>:1953(_poll)
     2889    0.233    0.000    0.233    0.000 {built-in method poll}
     4099    0.014    0.000    0.218    0.000 <stdin>:1773(call_service_async)
     4099    0.015    0.000    0.126    0.000 <stdin>:1744(send_async)
     4101    0.015    0.000    0.109    0.000 <stdin>:764(unpickle)
     4101    0.083    0.000    0.086    0.000 {built-in method load}
     4100    0.003    0.000    0.082    0.000 <stdin>:1784(send)
     4102    0.004    0.000    0.079    0.000 <stdin>:2715(route)
     4102    0.013    0.000    0.075    0.000 <stdin>:2307(defer)
     4098    0.073    0.000    0.073    0.000 {method 'write' of 'file' objects}
    15201    0.066    0.000    0.066    0.000 {method 'acquire' of 'thread.lock' objects}
     4102    0.008    0.000    0.061    0.000 <stdin>:711(pickled)
     2800    0.005    0.000    0.043    0.000 <stdin>:2289(_wake)
        2    0.000    0.000    0.043    0.021 <stdin>:2958(_parse_request)
        2    0.000    0.000    0.043    0.021 <stdin>:553(import_module)
      3/2    0.000    0.000    0.043    0.021 {__import__}
      3/1    0.000    0.000    0.043    0.043 <stdin>:1274(load_module)
     4102    0.038    0.000    0.042    0.000 {cPickle.dumps}
     2800    0.005    0.000    0.039    0.000 <stdin>:1454(write)
     2889    0.037    0.000    0.037    0.000 {method 'recv' of '_socket.socket' objects}
        3    0.000    0.000    0.032    0.011 /usr/lib/python2.7/threading.py:597(wait)
        1    0.000    0.000    0.032    0.032 /usr/lib/python2.7/threading.py:309(wait)
     2800    0.031    0.000    0.031    0.000 {posix.write}
     4098    0.006    0.000    0.028    0.000 <stdin>:950(close)
     4100    0.009    0.000    0.025    0.000 <stdin>:893(__init__)
     2889    0.015    0.000    0.017    0.000 <stdin>:2091(_make_cookie)
     4129    0.004    0.000    0.016    0.000 <stdin>:321(to_text)
     4098    0.008    0.000    0.015    0.000 <stdin>:2038(close)
     4131    0.004    0.000    0.011    0.000 {method 'decode' of 'str' objects}

.

Broker thread (Py3):

Log

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.006    0.006    4.949    4.949 <stdin>:2891(_broker_main)
     9840    0.032    0.000    4.941    0.001 <stdin>:2845(_loop_once)
    26130    0.018    0.000    4.307    0.000 <stdin>:442(io_op)
    16294    0.015    0.000    2.661    0.000 <stdin>:2834(_call)
    26130    0.060    0.000    2.242    0.000 <stdin>:1953(_poll)
     9840    2.172    0.000    2.172    0.000 {method 'poll' of 'select.poll' objects}
     8203    0.018    0.000    1.828    0.000 <stdin>:1573(on_receive)
    12186    0.016    0.000    1.506    0.000 <stdin>:1431(read)
    12186    1.482    0.000    1.482    0.000 {built-in method posix.read}
     4104    0.016    0.000    0.684    0.000 <stdin>:1655(on_transmit)
     8126    0.657    0.000    0.657    0.000 {built-in method posix.write}
     4104    0.009    0.000    0.650    0.000 <stdin>:1454(write)
     8203    0.027    0.000    0.336    0.000 <stdin>:1563(_internal_receive)
    12311    0.112    0.000    0.307    0.000 <stdin>:1595(_receive_one)
     8214    0.024    0.000    0.142    0.000 <stdin>:2651(_async_route)
     3981    0.013    0.000    0.128    0.000 <stdin>:2266(on_receive)
     4108    0.006    0.000    0.079    0.000 <stdin>:2626(_invoke)
     4105    0.005    0.000    0.073    0.000 <stdin>:938(_on_receive)
     4105    0.013    0.000    0.068    0.000 <stdin>:2189(put)
     8091    0.061    0.000    0.061    0.000 {method 'acquire' of '_thread.lock' objects}
     4108    0.060    0.000    0.060    0.000 {method 'join' of 'bytes' objects}
     4105    0.013    0.000    0.034    0.000 <stdin>:1675(_send)

.

Broker thread (Py2):

Log

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.003    0.003    0.933    0.933 <stdin>:2891(_broker_main)
     4430    0.016    0.000    0.927    0.000 <stdin>:2845(_loop_once)
    11020    0.010    0.000    0.710    0.000 <stdin>:2834(_call)
    15446    0.015    0.000    0.611    0.000 <stdin>:442(io_op)
     4110    0.006    0.000    0.393    0.000 <stdin>:1573(on_receive)
     6914    0.008    0.000    0.264    0.000 <stdin>:1431(read)
     6914    0.249    0.000    0.249    0.000 {posix.read}
    15446    0.024    0.000    0.201    0.000 <stdin>:1953(_poll)
     4102    0.011    0.000    0.199    0.000 <stdin>:1655(on_transmit)
     6991    0.188    0.000    0.188    0.000 {posix.write}
     4102    0.006    0.000    0.186    0.000 <stdin>:1454(write)
     4110    0.011    0.000    0.180    0.000 <stdin>:1563(_internal_receive)
     4430    0.171    0.000    0.171    0.000 {built-in method poll}
     8216    0.073    0.000    0.167    0.000 <stdin>:1595(_receive_one)
     2802    0.011    0.000    0.102    0.000 <stdin>:2266(on_receive)
     8210    0.022    0.000    0.102    0.000 <stdin>:2651(_async_route)
     4106    0.006    0.000    0.060    0.000 <stdin>:2626(_invoke)
     4103    0.005    0.000    0.054    0.000 <stdin>:938(_on_receive)
     4103    0.012    0.000    0.049    0.000 <stdin>:2189(put)
     6912    0.031    0.000    0.031    0.000 {method 'acquire' of 'thread.lock' objects}
     4103    0.011    0.000    0.016    0.000 <stdin>:1675(_send)
     2889    0.003    0.000    0.015    0.000 <stdin>:2213(_wake)
     4119    0.008    0.000    0.013    0.000 <stdin>:651(__init__)
        1    0.000    0.000    0.006    0.006 <stdin>:2871(_broker_shutdown)
    36955    0.004    0.000    0.004    0.000 {len}
     4127    0.003    0.000    0.003    0.000 {vars}
    23354    0.003    0.000    0.003    0.000 {method 'get' of 'dict' objects}
     4103    0.003    0.000    0.003    0.000 {_struct.pack}
     4430    0.003    0.000    0.003    0.000 <stdin>:1970(poll)
        2    0.000    0.000    0.003    0.001 <stdin>:2373(on_receive)
       10    0.000    0.000    0.003    0.000 <stdin>:1421(close)
        8    0.003    0.000    0.003    0.000 {posix.close}
     6912    0.002    0.000    0.002    0.000 {method 'release' of 'thread.lock' objects}
     8215    0.002    0.000    0.002    0.000 {method 'append' of 'list' objects}
        4    0.000    0.000    0.002    0.000 <stdin>:1516(on_disconnect)
     4119    0.002    0.000    0.002    0.000 {method 'update' of 'dict' objects}
     4112    0.002    0.000    0.002    0.000 {_struct.unpack}

@dw
Copy link
Member Author

dw commented Jan 26, 2019

Broker thread WTFs:

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
Py2:     6914    0.249    0.000    0.249    0.000 {posix.read}
Py3:    12186    1.482    0.000    1.482    0.000 {built-in method posix.read}

Main thread WTFs:

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
Py2:     4101    0.015    0.000    0.109    0.000 <stdin>:764(unpickle)
Py3:     4103    0.041    0.000    3.354    0.001 <stdin>:764(unpickle)

@dw
Copy link
Member Author

dw commented Jan 26, 2019

For the read() mess, this measures no obvious difference in perf output or runtime.

[19:58:22 eldil!17 mitogen] cat wtf.py 

import mitogen.core
import os

try:
    xrange
except NameError:
    xrange = range


fd = os.open('/dev/zero', os.O_RDONLY)
for x in xrange(200000):
    os.read(fd, mitogen.core.CHUNK_SIZE)

Could /dev/zero really be slow enough to hide the problem? I think not. I'm suspecting this may be due to the GIL

@dw
Copy link
Member Author

dw commented Jan 26, 2019

Python 3 slows down a little bit when an extra thread is present, but only by ~10%ish. Python 3's base overhead is significantly higher, which is obvious by changing the read size to 64/128, but still nowhere nearly enough to explain the profile output

dw added a commit that referenced this issue Jan 26, 2019
dw added a commit that referenced this issue Jan 26, 2019
* python24:
  WIP first run of py24 CI
  issue #477: initial Python 2.4.6 build for CI.
  issue #477: enable git-lfs for tests/data/*.tar.bz2.
  issue #477: import build script for Python 2.4.6.
  issue #477: add mitogen_py24 CI test type.
  issue #477: disable Django parts of module_finder_test on 2.4.
  issue #477: clean up globals after unix_test.
  issue #477: remove unused pytest bits from importer_test.
  issue #477: remove fork use from unix_test.
  parent: don't kill child when profiling=True
  issue #485: import new throuhgput bench
  issue #477: more fork removal
  issue #477: Py2.4 startswith() did not support tuples.
  issue #477: util/fakessh/two_three_compat fixes.
  issue #477: call_function_test fixes for 2.4.
  issue #477: promote setup_gil() to mitogen.utils
  issue #477: fix lxc_test any polyfill import.
  issue #477: stop using fork in responder_test.
  issue #477: stop using fork in service_test.
  issue #477: Python<2.5 ioctl() request parameter was signed.
  issue #477: stop using fork() in parent_test, compatible enumerate().
  issue #477: Popen.terminate() polyfill for Py2.4.
  issue #477: stop using .fork() in router_test, one small 2.4 fix.
  issue #477: document master.Router.max_message_size.
  issue #477: old Py zlib did not include extended exception text.
  issue #477: stop using router.fork() in receiver_test
  issue #477: any() polyfill for lxc_test.
  issue #477: replace type(e) -> __class__ for an exception
  issue #477: old Mock does not throw side_effect exceptions from a list
  issue #477: 2.4 stat() returned int timestamps not float.
  issue #477: set().union(a, b, ..) unsupported on Py2.4.
  issue #477: Logger.log(extra=) unsupported on Py2.4.
  issue #477: fix another Threading.getName() call.
  issue #477: %f date format requires Py2.6 or newer.
  issue #477: make mitogen.fork unsupported on Py<2.6.
  issue #477: Py2.4 dep scanner bytecode difference
  Drop 'alpha' trove classifier
  issue #477: fix another str/bytes mixup.
  issue #477: blacklist 'thread' module to avoid roundtrip on 2.x->3.x
  issue #477: fix 3.x failure in new target.set_file_mode() function.
  issue #477: fix 3.x failure in new target.set_file_mode() function.
  issue #477: fix 2 runner tests on Ansible 2.7.
  issue #477: fix 3.x test regressions.
  issue #477: fix new KwargsTest on Python 3.x.
  issue #477: ModuleFinder now returns Unicode module names.
  issue #477: Python3 does not have Pickler.dispatch.
  issue #477: ModuleFinder test fixes.
  issue #477: Ansible 2.3 compatible regression/all.yml.
  issue #477: Ansible 2.3 requires placeholder module for assert_equals
  issue #477: build a CentOS 5/Py2.4 container + playbook compat fixes.
  issue #477: use PY24 constant rather than explicit test.
  issue #477: backport mitogen.master to Python 2.4.
  issue #477: parent: make iter_read() log disconnect reason.
  issue #477: backport ansible_mitogen.runner to 2.4.
  issue #477: backport various test modules to Python 2.4.
  issue #477: backport ansible_mitogen/target.py to Python2.4
  issue #477: add all() polyfill to custom_python_detect_environmnet
  issue #477: polyfill partition() use in mitogen.parent.
  issue #477: polyfill partition() use in mitogen.service.
  issue #477: polyfill partition() use in mitogen.ssh.
  issue #477: vendorize the last 2.4-compatible simplejson
  issue #477: _update_linecache() must append newlines.
  issue #415, #477: Poller must handle POLLHUP too.
  issue #477: Python 2.5 needs next() polyfill too.
  issue #477: explicitly populate Py2.4 linecache from Importer.
  issue #477: rename and add tests for polyfill functions.
  issue #477: various core.py docstring cleanups.
  issue #477: Ansible 2.3 module output format difference.
  issue #477: Ansible 2.3 cannot use when: on an include.
  issue #477: tests: use Ansible 2.3-compatible include rather than import
  issue #477: serve up junk ansible/__init__.py just like Ansible.
  issue #477: testlib: Py2.4 did not have BaseException.
  issue #477: master: ability to override ModuleResponder output.
  issue #477: yet another bug in core._partition().
  issue #477: 2.4.x compat fixes for mitogen.service.
  issue #477: Py2.4 lacks all().
  issue #477: Ansible 2.3 had stricter arg spec format.
  issue #477: make CallError serializable on 2.4.
  issue #477: log full module name when SyntaxError occurs.
  issue #477: more Py2.4 (str|unicode).partition().
  issue #477: Py2.4 cannot tolerate unicode kwargs.
  issue #477: Py2.4 lacks BaseException.
  issue #477: Py2.4: enumerate() may return stopped threads.
  issue #477: Py2.4: more unicode.rpartition() usage.
  issue #477: Python 2.4 type(exc) returns old-style instance.
  issue #477: Python 2.4 lacked str.partition.
  issue #477: Python 2.4 lacked Thread.name.
  issue #477: Python 2.4 lacked context managers.
  issue #477: Python <2.5 did not have combined try/finally/except.
  issue #477: older Ansibles had no vars plugin base class.
  issue #477: Python <2.5 lacked any().
  issue #477: Python <2.6 lacked rpartition().
  issue #477: make CallError inherit from object for 2.4/2.5.
  issue #477: 2.4/2.5 had no better poller than poll().
dw added a commit that referenced this issue Jan 27, 2019
* origin/dmw: (135 commits)
  tests: just disable the test.
  tests: hopefully fix this dumb test for the final time
  docs: update Changelog; closes #477.
  issue #477: use MITOGEN_INVENTORY_FILE everywhere.
  issue #477: hacksmash weird 2.3 inventory_file var issue.
  issue #477: travis.yml typo.
  issue #477: fix sudo_args selection.
  issue #477: one more conditional test.
  issue #477: enable Ansible 2.3.3 CI.
  issue #477: some more conditional tests.
  docs: update Changelog.
  issue #477 / ansible: avoid a race in async job startup.
  issue #477: use assert_equal for nicer debug.
  issue #477: fix source of become_flags on 2.3.
  issue #477: add Connection.homedir test.
  core: docstring tidyups.
  core: ensure early debug messages are logged correctly.
  core: log disconnection reason.
  issue #477: target.file_exists() wrapper.
  issue #477: introduce subprocess isolation.
  ansible: docstring fixes.
  issue #477: paper over Ansible 2.3 flag handling difference
  issue #477: update forking_correct_parent for subprocess isolation
  issue #477: shlex.split() in 2.4 required bytes input.
  issue #477: get rid of perl JSON module requirement.
  issue #477: Ansible 2.3 did not support gather_facts min subset.
  issue #477: CentOS 5 image requires perl installed too.
  issue #477: missing stub-su.py from 137f5fa
  issue #477: 2.4-compatible syntax.
  issue #477: clearing glibc caches is not possible on Py2.4.
  parent: --with-pydebug bootstrap could fail due to corrupted stream
  issue #477: install simplejson for vanilla tests.
  docs: update Changelog.
  ansible: synchronize module needs '.docker_cmd' attr for Docker plugin.
  issue #477: add basic su_test and Py2.4 polyfill.
  issue #477: import updated Python build scripts
  ci: don't use the TTY->pipe hack except on Travis where it's needed.
  WIP first run of py24 CI
  issue #477: initial Python 2.4.6 build for CI.
  issue #477: enable git-lfs for tests/data/*.tar.bz2.
  issue #477: import build script for Python 2.4.6.
  issue #477: add mitogen_py24 CI test type.
  issue #477: disable Django parts of module_finder_test on 2.4.
  issue #477: clean up globals after unix_test.
  issue #477: remove unused pytest bits from importer_test.
  issue #477: remove fork use from unix_test.
  parent: don't kill child when profiling=True
  issue #485: import new throuhgput bench
  issue #477: more fork removal
  issue #477: Py2.4 startswith() did not support tuples.
  issue #477: util/fakessh/two_three_compat fixes.
  issue #477: call_function_test fixes for 2.4.
  issue #477: promote setup_gil() to mitogen.utils
  issue #477: fix lxc_test any polyfill import.
  issue #477: stop using fork in responder_test.
  issue #477: stop using fork in service_test.
  issue #477: Python<2.5 ioctl() request parameter was signed.
  issue #477: stop using fork() in parent_test, compatible enumerate().
  issue #477: Popen.terminate() polyfill for Py2.4.
  issue #477: stop using .fork() in router_test, one small 2.4 fix.
  issue #477: document master.Router.max_message_size.
  issue #477: old Py zlib did not include extended exception text.
  issue #477: stop using router.fork() in receiver_test
  issue #477: any() polyfill for lxc_test.
  issue #477: replace type(e) -> __class__ for an exception
  issue #477: old Mock does not throw side_effect exceptions from a list
  issue #477: 2.4 stat() returned int timestamps not float.
  issue #477: set().union(a, b, ..) unsupported on Py2.4.
  issue #477: Logger.log(extra=) unsupported on Py2.4.
  issue #477: fix another Threading.getName() call.
  issue #477: %f date format requires Py2.6 or newer.
  issue #477: make mitogen.fork unsupported on Py<2.6.
  issue #477: Py2.4 dep scanner bytecode difference
  Drop 'alpha' trove classifier
  issue #477: fix another str/bytes mixup.
  issue #477: blacklist 'thread' module to avoid roundtrip on 2.x->3.x
  issue #477: fix 3.x failure in new target.set_file_mode() function.
  issue #477: fix 3.x failure in new target.set_file_mode() function.
  issue #477: fix 2 runner tests on Ansible 2.7.
  issue #477: fix 3.x test regressions.
  issue #477: fix new KwargsTest on Python 3.x.
  issue #477: ModuleFinder now returns Unicode module names.
  issue #477: Python3 does not have Pickler.dispatch.
  issue #477: ModuleFinder test fixes.
  issue #477: Ansible 2.3 compatible regression/all.yml.
  issue #477: Ansible 2.3 requires placeholder module for assert_equals
  issue #477: build a CentOS 5/Py2.4 container + playbook compat fixes.
  issue #477: use PY24 constant rather than explicit test.
  issue #477: backport mitogen.master to Python 2.4.
  issue #477: parent: make iter_read() log disconnect reason.
  issue #477: backport ansible_mitogen.runner to 2.4.
  issue #477: backport various test modules to Python 2.4.
  issue #477: backport ansible_mitogen/target.py to Python2.4
  issue #477: add all() polyfill to custom_python_detect_environmnet
  issue #477: polyfill partition() use in mitogen.parent.
  issue #477: polyfill partition() use in mitogen.service.
  issue #477: polyfill partition() use in mitogen.ssh.
  issue #477: vendorize the last 2.4-compatible simplejson
  issue #477: _update_linecache() must append newlines.
  issue #415, #477: Poller must handle POLLHUP too.
  ...
@dw dw added the performance Performance problems label Feb 1, 2019
@ghost
Copy link

ghost commented Jun 12, 2019

I have a feeling that this could be related: paramiko/paramiko#1141

@lsh-0
Copy link

lsh-0 commented Jun 18, 2019

I have a feeling that this could be related: paramiko/paramiko#1141

I updated the timings in that issue for the latest paramiko but there is little/no difference. I'd love to know if you guys get to the bottom of what this massive difference in performance is.

@ghost
Copy link

ghost commented Jun 18, 2019

Great, thank you! I would like to find out if we have a non-linear performance curve here with mitogen just like Paramiko. I think that would be very telling. If this is only a linear decrease, it seems different (though maybe still partly related).

@ghost
Copy link

ghost commented Jun 18, 2019

@dw
Copy link
Member Author

dw commented Jul 18, 2019

The Paramiko problem was separate, Mitogen doesn't actually use Paramiko at all.

The problem here is fundamental to a permanent compatibility hack in Python 3.x, that basically renders cPickle completely unsuitable for transporting binary data any more.

The problem is that when communicating using Pickle protocol v2 or older (i.e. with Python 2), Python 3 represents all 8-bit bytestrings in the pickle by transliterating them to Unicode using the LATIN-1 codec, and undoing the result in the remote side. Unicode in turn is represented in pickle as UTF-8, so every chunk of raw data is doubly transcoded (and copied) during serialization and deserialization.

Basically this can't be fixed -- the compatibility hack is needed to support yet more compatibility hacks elsewhere in pickle to make it appear like 2.x and 3.x pickles are compatible.

That means every 8-bit bytestring experiences on average 1.5x size expansion on the wire, and that size is variable depending on the actual data bytes being transmitted (thus making it unpredictable, and impossible to correctly align to frame boundaries), and every bytestring must roundtrip first through the UTF-8 decoder then the LATIN-1 decoder every time it is deserialiezd.

It looks like this ticket killed Mitogen's use of pickle. Now we need to consider replacing it with a custom serializer with much higher priority than before.

@RonnyPfannschmidt
Copy link

i recall that this was the reason why exec-net eventually made a own serializer, pickle just turned out to be a liability

personally i would like to see some kind of layering, so each channel/transport endpoint can have a own serializer setup, allowing to configure for raw data, json, pickle, and others instead of enforcing a specific serializer

@dw
Copy link
Member Author

dw commented Jul 18, 2019

I'm generally anti-modularity unless a use case makes absolute sense. One of the problems discovered here, broken frame alignment, is alone likely responsible for 25%+ of the slowdown. Optimizations like that require precise knowledge of lower layers, which can be hard to maintain through an abstraction. In other words, particularly in this case, I believe it is much better to have a single option that works very well for many cases, than many options catering poorly to every case

@ghost
Copy link

ghost commented Jul 18, 2019

Wow, thank you for digging into this so much. I appreciate it!

@moreati
Copy link
Member

moreati commented Jul 18, 2019

The problem here is fundamental to a permanent compatibility hack in Python 3.x, that basically renders cPickle completely unsuitable for transporting binary data any more.

I stumbled across this too. As a shot in the dark, since Mitogen controls both the pickling and unpickling, the extension registry could be used to implement something without throwing out all of pickle. I need to think about that.

It looks like this ticket killed Mitogen's use of pickle. Now we need to consider replacing it with a custom serializer with much higher priority than before.

AFAICT the options aren't great. Whatever is chosen needs to be implemented in C, for any hope of speed, and that means binary dependencies. That will be hard enough across 2.7, 3.5, 3.6. Adding 2.6 makes it harder, adding 2.4 or 2.5 verges on impossible.

I'll try an integration of pikl, but currently it will suffer the same problem as discussed here. If I can get the time and perseverance I think a Rust based implementation of pikl, that implemented protocol 3 (or higher) on all supported Python versions would be a win. I already tried backporting the _pickle module from 3.6 and 3.7 to 3.5 or earlier, without success. Alternatively I might be able to get a Cython module, derived from pickle.py within 10x speed of _pickle.c

Finally, I've been meaning to look at the format CBOR more closely (and the package cbor2) as possible basis to work on. However IIRC, it can't round trip tuples in it's present form.

@moreati
Copy link
Member

moreati commented Jul 18, 2019

In case anyone needs some for comparison, I made pickles with various CPython versions https://github.com/moreati/pickle-fuzz/tree/master/pickles

@moreati
Copy link
Member

moreati commented Jul 18, 2019

the extension registry could be used to implement something without throwing out all of pickle

I don't think this buys us anything.

However, zodbpickle may be good enough, at least as a stopgap. The latest release supports CPython 2.7 & 3.4-3.7, a release from 2015 supports CPython 2.6. Critically it supports Pickle protocol 3 on Python 2.x, and it has a mechanism for pickling Python 2.x str as a byte string, rather than a textual string. E.g.

bpython version 0.18 on top of Python 2.7.15+ /home/alex/src/pikl/v27/bin/python2
>>> import zodbpickle, zodbpickle.fastpickle, zodbpickle.pickletools_2
>>> zodbpickle.pickletools_2.dis(b'\x80\x03C\x03abc.')
    0: \x80 PROTO      3
    2: C    SHORT_BINBYTES 'abc'
    7: .    STOP
highest protocol among opcodes = 3
>>> zodbpickle.fastpickle.loads(b'\x80\x03C\x03abc.')
'abc'
>>> zodbpickle.fastpickle.dumps('abc', protocol=3)
'\x80\x03U\x03abcq\x01.'
>>> zodbpickle.fastpickle.dumps(zodbpickle.binary('abc'), protocol=3)
'\x80\x03C\x03abc.'

Edit 1: pikl is currently forked from zodbpickle
Edit 2: zodpickle 1.0.4 has pre-compiled wheels for CPython {2.7, 3.4, 3.5, 3.6, 3.7} x {i686, x86_64} x {linux, windows}; and MacOS build(s) covering the same CPython versions, but not every permutation of processor architecture.
Edit 3: zodbpickle also includes the intimate knowledge about module renames from Python 2.x -> 3.x, that pickle has.

@dw
Copy link
Member Author

dw commented Jul 18, 2019

I'm having a play with your pickle-fuzz repo at the moment :) Feeling slightly more optimistic after a few hours to think about it, at least in the case of file transfer, it is possible to simply remove the use of pickle -- it is only there to allow the file sender to abort the transfer, which could simply be signalled some other way. That would fix the most egregious inefficiency, but only as a stopgap measure

Regarding a replacement for pickle, losing e.g. tuples might not be the end of the world, we could simply bump the major version number. It's more important at least for me that whatever serialization used is subjectively "convenient" and does not regularly feature prominently in profile output. Really the bar isn't too high at this point -- we could potentially implement some serialization in bash and it'd still beat what we have on Python 3 right now 😃

@dw
Copy link
Member Author

dw commented Jul 18, 2019

I haven't profiled yet, but this transcoding madness is possibly a large part of why Mitogen for Ansible has always been slower on 3.x -- we rely on efficient binary serialization for transfer of all kinds of data (like Ansible modules) in the extension

@moreati
Copy link
Member

moreati commented Jul 18, 2019

To be clear, pickle-fuzz is a dumping ground of experiments and half finished prototypes. The output of all that is pikl.

dw added a commit that referenced this issue Nov 25, 2019
@dw
Copy link
Member Author

dw commented Nov 25, 2019

Hey @moreati, I pushed a branch with a first pass at this, if you fancy taking a look. The semantics change likely needs a version number bump, so fixing this could be the first patch to 0.3.x.

Random thoughts:

  • Primary goal is efficient binary transfer, wire size isn't quite as important.
  • Memoization / cyclical structures: expensive to do in Python, creates DoS issues and seemingly never needed (by me) so far
  • Needs much better tests for truncated / corrupt input, and probably quite a bit more input validation
  • Per this old post eliminating BytesIO from decode is likely to help a lot, but it means bootstrap size must increase some more (maybe that's not a bad thing, the current draft is pretty YOLO)

dw added a commit that referenced this issue Nov 25, 2019
dw added a commit that referenced this issue Nov 25, 2019
dw added a commit that referenced this issue Nov 25, 2019
@dw
Copy link
Member Author

dw commented Nov 25, 2019

Although the draft is fairly efficient on CPU, on the wire it's pretty wasteful:

>>> len(mitogen.core.encode((1, 2, 3, 4)))
25
>>> len(pickle.dumps((1, 2, 3, 4), protocol=2))
15

It's because all the container types (and string) use fixed 32bit integers for lengths, and there is nothing smaller than a 32bit integer. The more specializations that are added the slower everything will get, but it is tempting (either now or later) to add e.g.

  • (LIST|TUPLE|SET|DICT)(8|16|32): containers with smaller size prefix
  • (BYTES|UNICODE)(8|16|32): same but for strings
  • (POS|NEG)(8|16|32): same for integers

Thatt'll likely blow out the bootstrap a little more. I guess the question is use cases. For the Ansible extension since SSH compression was always used, the fixed integer types will have basically no effect on the wire, but what about other cases?

@moreati
Copy link
Member

moreati commented Nov 25, 2019

Looking

@dw
Copy link
Member Author

dw commented Nov 25, 2019

FWIW I think this draft is ugly as sin :) It's motivated as usual by file size, but I don't even think that's a pressing concern.. getting serialization right is important enough to spend bytes on. I think the size bar for core.py is probably something like 40kb (hopefully it never happens), then it's time to start really worrying, so there is plenty of room to play with if it produces a nicer solution.

@moreati
Copy link
Member

moreati commented Nov 26, 2019

Initial thoughts

Benchmark

  • Encoding many small objects takes roughly 10 times longer than cpickle. Which is roughly par for the course with every pure Python serializer I've tried.
  • pencode (part of https://github.com/lordmauve/chopsticks, Python 3.x only) is faster to encode, but slower to decode.
  • If you're content to pay the CPU cost of a pure Python serializer, perhaps pikl would be a contender. It would preserve the ability to serialize large integers, frozenset etc and since it is based on pickle protocol 3 it shouldn't incur the cost of _codes.encode. I'll do a mockup this weekend

Bike shed

  • I'd prefer dumps/loads, or serialize/deserialize. I think encode/decode would cause too much confusion with character encodings such as UTF-8.

@dw
Copy link
Member Author

dw commented Nov 26, 2019

+1 on dumps/loads.

I forgot about pencode! It preserves serialization of cyclical structures, handles long ints and is already the right size.

Could we overcome this LATIN1 transcoding problem with pikl? There would need to be some way to ship any solution to 3.x targets since they are the root of the problem, so I think it either needs to be part of core.py, or there must be a better way to ship additional modules alongside core.py during bootstrap.

@moreati
Copy link
Member

moreati commented Nov 29, 2019

I forgot about pencode! It preserves serialization of cyclical structures, handles long ints and is already the right size.

pencode supports Python 2.7 (I mistakenly thought not), adding Python 2.4 support looks doable.

Could we overcome this LATIN1 transcoding problem with pikl?

I need to investigate further. Until this moment I thought that would require upgrading pikl to pickle protocol 4. Looking again, it might be fine with the existing protocol 3 support. If protocol 4 were necessary it would require a pure Python implementation on Python 2.x, with the associated CPU hit.

Research progress:

Benchmark CPython 2.7 CPython 3.7
cpickle,proto=2,dumps 21.5 ms +- 0.7 ms 8.82 ms +- 0.20 ms
cpickle,proto=2,loads 18.1 ms +- 0.5 ms 8.94 ms +- 0.25 ms
cpickle,proto=3,dumps 7.99 ms +- 0.17 ms
cpickle,proto=3,loads 8.94 ms +- 0.25 ms
pencode,proto=None,dumps 50.3 ms +- 1.3 ms 51.1 ms +- 1.2 ms
pencode,proto=None,loads 142 ms +- 7 ms 132 ms +- 7 ms
pencode_read,proto=None,dumps 50.2 ms +- 0.8 ms 51.6 ms +- 0.7 ms
pencode_read,proto=None,loads 78.9 ms +- 1.8 ms 79.0 ms +- 1.2 ms
marshal,proto=1,dumps 4.69 ms +- 0.05 ms 4.97 ms +- 0.07 ms
marshal,proto=1,loads 12.7 ms +- 0.3 ms 6.84 ms +- 0.24 ms
mrna,proto=None,dumps 85.0 ms +- 0.8 ms 75.2 ms +- 1.3 ms
mrna,proto=None,loads 138 ms +- 3 ms 99.7 ms +- 1.9 ms

marshal is very intriguing.
Pros

  • Very fast, faster even than cPickle/_pickle
  • Very available, builtin to stdlib
  • Stable wire format across Python versions, protocol 1 is supported by Python 2.4
  • bytes clean, binary strings go in and binary strings go out, no LATIN1 transcoding AFAICT

Cons

  • Very undocumented
  • Dangerous, from a remote code execution (RCE) perspective. Probably more so than pickle.
  • Doesn't handle self referencing structures (TBC)

@dw
Copy link
Member Author

dw commented Nov 29, 2019

I love the name mrna :)

I'd like to do some archeology and see what kept execnet away from marshal. It's definitely had historical security issues like stack/integer overflows, the question is whether they're relevant here.

If we support talking to Python 2.4, and 2.4 has a marshal module with some kind of RCE in it, it's fine so long as any trusted parent is running a patched marshal, since parent->child RCE is the whole point of the library. But e.g. going host A -> B -> C where bastion B is running an ancient Python, this cannot work where any response from C is interpreted by B (e.g. something like REQUEST_MODULE).

It'd be great to use this opportunity to eliminate any fear surrounding the serializer entirely. I think the strongest contender so far is pencode -- but its support for cyclical structures opens trusted code to algorithmic attacks (infinite recursion or loops), so the icing on the cake (I think) would be pencode minus memoization.

@dw
Copy link
Member Author

dw commented Nov 29, 2019

An ominous message from the execnet sources (earliest commit in the current Git):

        # XXX marshal.dumps doesn't work for exchanging data across Python
        # version :-(((  There is no sane solution, short of a custom
        # pure Python marshaller

This may be referring to pre-2.4 Pythons, though

@moreati
Copy link
Member

moreati commented Nov 29, 2019

Very alpha pickle protocol 4 backport https://github.com/moreati/pickle4

@moreati
Copy link
Member

moreati commented Dec 6, 2019

Very alpha pickle protocol 4 backport https://github.com/moreati/pickle4

Fixed up the pickle4 backport enough to benchmark on Python 2.7 and 3.7. It's slow. There's a lot of branching, in pure python. Learned some ancient Python lore

Next, I'm going to look at pencode some more.

@moreati
Copy link
Member

moreati commented Dec 7, 2019

pencode_read5 is a variant of pencode. It decodes between 2 and 3 times faster than plain pencode. This brings it within 10x of cpickle.

Benchmark CPython 2.7 CPython 3.6
cpickle,proto=2,dumps 21.1 ms +- 0.5 ms 9.09 ms +- 0.14 ms
cpickle,proto=2,loads 19.0 ms +- 0.9 ms 9.71 ms +- 0.22 ms
pencode,proto=None,dumps 49.7 ms +- 0.5 ms 64.0 ms +- 0.7 ms
pencode,proto=None,loads 141 ms +- 7 ms 194 ms +- 13 ms
pencode_read5,proto=None,dumps: 49.0 ms +- 0.8 ms 64.1 ms +- 0.8 ms
pencode_read5,proto=None,loads 59.9 ms +- 0.9 ms 78.0 ms +- 8.1 ms

To do this pencode_read5 removes the dedicated the op codes for None, True, and False. The uncompressed wire size becomes slightly larger as a result

moreati added a commit to moreati/mitogen that referenced this issue Jul 31, 2023
This commit is not intended to be merged. It exists to help
characterise/diagnose performance regression(s) of Mitogen × Python 3.x.

It will not work with Python 2.x.

See mitogen-hq#485
@moreati
Copy link
Member

moreati commented Jul 31, 2023

Short to medium term ideas

  • Pickle protocol negotiation. e.g. if all parties are Python 3.x then use protocol 3, or a per-path negotiation ala TCP path MTU.
  • Configurable pickle protocol. e.g. allow master to dictate protocol=N, if any node can't talk that then throw an error.

PatrickCoakley23 pushed a commit to cyara/mitogen that referenced this issue Nov 10, 2023
…oplefonesk-2-new-number

15558-slovakia-peoplefonesk-2-new-number
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Code feature that hinders desired execution outcome performance Performance problems
Projects
None yet
Development

No branches or pull requests

4 participants