Greenlet tree #755

Closed
mahmoud opened this Issue Feb 29, 2016 · 12 comments

5 participants

mahmoud commented Feb 29, 2016

At PayPal, we've found it very useful to extend the gevent.Greenlet class to track the Greenlet that spawned it. This creates a tree that facilitates many logging and instrumentation scenarios: it helps stitch together stack traces and generally navigate complex Greenlet-spawning paths.

That brings us to the best news: we already have a concrete implementation, production-tested for almost a year now. Overhead is negligible, even in our highest-performance use cases (1000+ requests per second), and references are properly handled.

We're happy to contribute it, if there's interest. If there are questions, or anything we should know before submitting the pull requests, just let us know in the comments, or contact me directly (my work email is in my profile).

(cc @jayalane @doublereedkurt)

kurtbrose commented Mar 1, 2016

Just for reference, this is the additional functionality we wrap around Greenlet in our live environment:

https://github.com/paypal/support/blob/master/support/async.py#L38-L60

In addition to the attribute spawn_parent, which is a weakref back to the spawning greenlet, it also keeps a lightweight stack trace of the spawn point (a list of code objects and line numbers) and creates a dictionary shared among the "spawn tree" of greenlets spawned from the same parent.
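Recording the spawn point cheaply means storing only each frame's code object and line number, not the frames themselves (which would keep their locals alive). A minimal sketch, assuming CPython's sys._getframe; capture_spawn_stack and format_spawn_stack are hypothetical helper names, not part of gevent or PayPal's support library.

```python
import sys


def capture_spawn_stack(limit=10):
    """Return up to `limit` (code, lineno) pairs for the caller's stack, innermost first."""
    frame = sys._getframe(1)  # start at the caller, skipping this helper
    stack = []
    while frame is not None and len(stack) < limit:
        # Keep only the code object and line number, never the frame itself.
        stack.append((frame.f_code, frame.f_lineno))
        frame = frame.f_back
    return stack


def format_spawn_stack(stack):
    """Render the pairs only when someone actually wants to read them."""
    return ["%s:%d in %s" % (code.co_filename, lineno, code.co_name)
            for code, lineno in stack]


def spawn_point():
    return capture_spawn_stack()


for line in format_spawn_stack(spawn_point()):
    print(line)
```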

(The shared spawn-tree local dictionary is used to share state among all the children of a particular request handling greenlet from a streamserver. This part definitely doesn't belong on the root greenlet class -- it relies on the transition from gevent.greenlet.Greenlet to support.async.Greenlet to partition the spawn-trees from each other.)
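A gevent-free model of the shared spawn-tree dictionary: each child reuses its parent's dict object, and a new tree starts wherever a request-handling root is created. In PayPal's code that partition comes from the switch between Greenlet classes described above; here it is modeled with a hypothetical new_tree flag, and TreeTask, run() and tree_locals are illustrative names only.

```python
class TreeTask:
    """Hypothetical task whose tree_locals dict is shared with its spawn tree."""

    _current = None  # stand-in for "the currently running greenlet"

    def __init__(self, new_tree=False):
        parent = TreeTask._current
        if new_tree or parent is None:
            self.tree_locals = {}  # root of a new spawn tree
        else:
            self.tree_locals = parent.tree_locals  # the very same dict object

    def run(self, fn):
        """Simulate switching into this task while fn runs."""
        previous, TreeTask._current = TreeTask._current, self
        try:
            return fn()
        finally:
            TreeTask._current = previous


server = TreeTask()
request_a = server.run(lambda: TreeTask(new_tree=True))
request_b = server.run(lambda: TreeTask(new_tree=True))
worker_a = request_a.run(lambda: TreeTask())

worker_a.tree_locals["request_id"] = "a-1"
print(request_a.tree_locals)  # {'request_id': 'a-1'}
print(request_b.tree_locals)  # {}
```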

Here's an isolated form of the spawn-parent-tracking greenlet:

https://gist.github.com/kurtbrose/edc5968124c6554a47b6

Something else that we've found helpful:

https://gist.github.com/kurtbrose/25b48114de216a5e55df

We use this to assign minimized, non-overlapping IDs to greenlets. This is needed to integrate with existing PayPal infrastructure built around the idea of a small, fixed number of processes or threads as the units of concurrency. This way the first greenlet to reserve an ID becomes "thread" 1 and the second becomes "thread" 2. If "thread" 1 finishes, the next greenlet will be assigned "thread" 1 again, since that ID is now free.

It took a couple of iterations to get that right (log-n time insert + removal; hooking weakref callbacks properly).
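The ID-reuse scheme described above can be sketched with a min-heap of released IDs (O(log n) push and pop) plus a weakref.finalize callback that returns an ID when its owner is garbage-collected. This is a hypothetical reconstruction for illustration (MinIdAllocator, reserve and _release are made-up names), not the code from the gist.

```python
import heapq
import weakref


class MinIdAllocator:
    """Hand out the smallest free positive integer ID; reclaim it when its owner dies."""

    def __init__(self):
        self._free = []  # min-heap of IDs released by dead owners
        self._next = 1   # smallest ID never handed out; every freed ID is below it

    def reserve(self, owner):
        if self._free:
            new_id = heapq.heappop(self._free)  # O(log n)
        else:
            new_id = self._next
            self._next += 1
        # The finalizer holds only a weak reference to owner; the callback
        # must not capture owner itself, or owner could never be collected.
        weakref.finalize(owner, self._release, new_id)
        return new_id

    def _release(self, old_id):
        heapq.heappush(self._free, old_id)  # O(log n)


class Worker:
    """Any object that supports weak references can own an ID."""


ids = MinIdAllocator()
first, second = Worker(), Worker()
print(ids.reserve(first), ids.reserve(second))  # 1 2
del first  # on CPython the finalizer runs immediately and frees ID 1
third = Worker()
print(ids.reserve(third))  # 1
```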

[edit: fixed links]

munro commented Apr 16, 2016

I would love to see this contributed! We get lots of exceptions from get() in gevent/greenlet.py at line 274, and without the full stack trace they are hard to debug.

jamadden commented Apr 16, 2016

Is that with 1.1? 1.1 keeps and re-raises the original stack in that scenario.
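What keeping the original stack buys you can be shown in plain Python 3: store the exception object when the task fails, and re-raise that same object from get(), so its __traceback__ still points at the failing code rather than only at the get() call site. This is a hypothetical illustration (Result, run and get are made-up names), not gevent's implementation.

```python
import traceback


class Result:
    """Hypothetical holder for the outcome of a background task."""

    def __init__(self):
        self.value = None
        self.exception = None

    def run(self, fn):
        try:
            self.value = fn()
        except Exception as exc:
            # The exception object keeps its __traceback__, which still
            # references the frames where the error was raised.
            self.exception = exc

    def get(self):
        if self.exception is not None:
            # Re-raise the stored object with its original traceback attached;
            # Python adds the get() frame to it instead of starting over.
            raise self.exception.with_traceback(self.exception.__traceback__)
        return self.value


def failing_task():
    return 1 / 0


result = Result()
result.run(failing_task)
try:
    result.get()
except ZeroDivisionError:
    traceback.print_exc()  # the report includes the failing_task frame
```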

munro commented Apr 16, 2016

Awesome! I'm on 1.0.2. I knew I saw a commit adding that to master a while ago XD, but I didn't see it when I searched the changelog. Did I misunderstand this issue, then?

jamadden commented Apr 16, 2016

It was all the way back in 1.1a1: http://www.gevent.org/changelog.html#a1-jun-29-2015

  • (Experimental.) Waiting on or getting results from greenlets that raised exceptions now usually raises the original traceback. This should assist things like Sentry to track the original problem. See issue #450 and issue #528 by Rodolfo and Eddi Linder and issue #240

It was extended in 1.1a2, and maybe more later, but those were the big ones:

  • (Experimental) Exceptions raised from iterating using the ThreadPool or Group mapping/application functions should now have the original traceback.

cookkkie commented Jan 22, 2018

Hello @kurtbrose, the gist links are not working anymore. Do you have a copy please?

EDIT: Never mind, I just changed doublereedkurt to kurtbrose in the links 👍

jamadden added a commit that referenced this issue Feb 21, 2018

Add spawn_tree_locals, spawning_greenlet and spawning_stack to Greenlet
Based on #755.

A comment in the code goes into detail about the timing. Here it is
again:

 Timings taken Feb 21 2018 prior to integration of #755
 python -m perf timeit -s 'import gevent' 'gevent.Greenlet()'
 3.6.4       : Mean +- std dev: 1.08 us +- 0.05 us
 2.7.14      : Mean +- std dev: 1.44 us +- 0.06 us
 PyPy2 5.10.0: Mean +- std dev: 2.14 us +- 0.08 us

 After the integration of spawning_stack, spawning_greenlet,
 and spawn_tree_locals on that same date:
 3.6.4       : Mean +- std dev: 8.92 us +- 0.36 us ->  8.2x
 2.7.14      : Mean +- std dev: 14.8 us +- 0.5 us  -> 10.2x
 PyPy2 5.10.0: Mean +- std dev: 3.24 us +- 0.17 us ->  1.5x

Selected bench_spawn output on 3.6.4 before:

//gevent36/bin/python src/greentest/bench_spawn.py eventlet --ignore-import-errors
using eventlet from //gevent36/lib/python3.6/site-packages/eventlet/__init__.py
spawning: 11.93 microseconds per greenlet
sleep(0): 23.49 microseconds per greenlet

//gevent36/bin/python src/greentest/bench_spawn.py gevent --ignore-import-errors
using gevent from //src/gevent/__init__.py
spawning: 3.39 microseconds per greenlet
sleep(0): 17.59 microseconds per greenlet

//gevent36/bin/python src/greentest/bench_spawn.py geventpool --ignore-import-errors
using gevent from //src/gevent/__init__.py
spawning: 8.71 microseconds per greenlet

//gevent36/bin/python src/greentest/bench_spawn.py geventraw --ignore-import-errors
using gevent from //src/gevent/__init__.py
spawning: 2.09 microseconds per greenlet

//gevent36/bin/python src/greentest/bench_spawn.py none --ignore-import-errors
    noop: 0.33 microseconds per greenlet

And after:

//gevent36/bin/python bench_spawn.py gevent --ignore-import-errors
using gevent from //src/gevent/__init__.py
spawning: 12.99 microseconds per greenlet -> 3.8x

//gevent36/bin/python bench_spawn.py geventpool --ignore-import-errors
using gevent from //src/gevent/__init__.py
spawning: 19.49 microseconds per greenlet -> 2.2x

//gevent36/bin/python bench_spawn.py geventraw --ignore-import-errors
using gevent from //src/gevent/__init__.py
spawning: 4.57 microseconds per greenlet -> 2.2x

We're approximately the speed of eventlet now.

Refs #755

jamadden added a commit that referenced this issue Feb 22, 2018

mahmoud commented Feb 23, 2018

@jamadden That's amazing! Thanks for the follow-up! Regarding the comments about performance in the commit message, at this low of a level, it can really pay to avoid creating instances of objects. That's why we landed on creating simple lists and then providing utilities for manipulating them. Few application developers work at this level directly anyway, so that bit of friction is rarely a concern.

In any case, very pleasantly surprised to see #1115 and #1116 land! Thanks for all your hard work!

jamadden commented Feb 23, 2018

Quoting @mahmoud: "Regarding the comments about performance in the commit message, at this low of a level, it can really pay to avoid creating instances of objects. That's why we landed on creating simple lists and then providing utilities for manipulating them."

I did benchmark that and I found that once I had everything compiled with Cython, the bottleneck was actually in getting frame.f_code and frame.f_lineno. Everything else compiled into direct C function calls, but that still had to go through generic getattr operations.

jamadden commented Feb 23, 2018

@mahmoud Is #1116 going to fit your needs, do you think?

jamadden commented Feb 23, 2018

One possible way it could be a bit faster: tuples use a freelist, objects don't. If the greenlet and its stack are short-lived, a freelist could be useful. Cython has a decorator to make objects use freelists, but I wasn't able to get it to work in the .pxd file and quit before I tried an import-mock dance in the .py file.
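The tuple-versus-object trade-off can be measured roughly in pure Python with the standard timeit module. This sketch (FrameRecord and the make_* helpers are hypothetical) compares building a (code, lineno) tuple with building a small __slots__ object. It runs uncompiled Python rather than Cython, so absolute and relative numbers will differ from the gevent measurements in this thread and between interpreters.

```python
import timeit


class FrameRecord:
    """Hypothetical small-object alternative to a (code, lineno) tuple."""
    __slots__ = ("code", "lineno")

    def __init__(self, code, lineno):
        self.code = code
        self.lineno = lineno


def make_tuple(code, lineno):
    return (code, lineno)


def make_object(code, lineno):
    return FrameRecord(code, lineno)


code = make_tuple.__code__  # any code object will do for the comparison
n = 100_000
# Take the best of a few runs to reduce noise from other activity.
t_tuple = min(timeit.repeat(lambda: make_tuple(code, 42), number=n, repeat=3))
t_object = min(timeit.repeat(lambda: make_object(code, 42), number=n, repeat=3))
print("tuple:  %.0f ns per record" % (t_tuple / n * 1e9))
print("object: %.0f ns per record" % (t_object / n * 1e9))
```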

jamadden commented Feb 24, 2018

FWIW, with python -m perf timeit -s 'from gevent import Greenlet' 'Greenlet()' I get:

  • 3.65us +- 0.17us for the current code
  • 3.43us +- 0.19us using a freelist (7% faster)
  • 3.04us +- 0.13us using just tuples (12% faster than freelist, 20% faster than current)
  • 2.33us +- 0.16us if I stop accessing f_code and f_lineno from the frame altogether, i.e., that accounts for about 30% of the runtime all by itself

Using a freelist (what I originally had in mind) turns out to be complicated. The extra 0.4us saved by using tuples is small, but may be worth the added complexity when it comes to accessing the stack (if we accept the premise that such access is rare).
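The premise that stack access is rare suggests a split: keep cheap (code, lineno) tuples at creation time and build friendlier objects only when someone asks for them. A sketch under that assumption; Spawned, SpawnFrame and _spawning_frames are hypothetical names and do not describe gevent's actual internals.

```python
import sys
from collections import namedtuple

# Hypothetical richer record, built only on demand.
SpawnFrame = namedtuple("SpawnFrame", ["filename", "lineno", "name"])


class Spawned:
    def __init__(self, limit=10):
        # Creation-time cost: a handful of plain tuples.
        frame = sys._getframe(1)  # the caller that created this object
        frames = []
        while frame is not None and len(frames) < limit:
            frames.append((frame.f_code, frame.f_lineno))
            frame = frame.f_back
        self._spawning_frames = tuple(frames)

    @property
    def spawning_stack(self):
        # Access-time cost: convert the stored tuples into named records.
        return [SpawnFrame(code.co_filename, lineno, code.co_name)
                for code, lineno in self._spawning_frames]


def spawn_something():
    return Spawned()


print(spawn_something().spawning_stack[0].name)  # spawn_something
```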

mahmoud commented Feb 24, 2018

@jamadden yeah! Talking to @kurtbrose, it does fit the bill; we would have totally used this instead of the code from the gists. I do recommend tuples and util functions for the performance boost, but it's not dealbreaking material, IMO.

jamadden added a commit that referenced this issue Feb 24, 2018

Speed up Greenlet creation on CPython
Two ways: store tuples instead of _frame objects and use direct access
to two of the attributes of the CPython frame objects.

Benchmarks:

+------------------------+-----------------+------------------------------+
| Benchmark              | spawn_27_master | spawn_27_tuple2              |
+========================+=================+==============================+
| eventlet sleep         | 9.12 us         | 8.77 us: 1.04x faster (-4%)  |
+------------------------+-----------------+------------------------------+
| gevent spawn           | 14.5 us         | 13.2 us: 1.10x faster (-9%)  |
+------------------------+-----------------+------------------------------+
| gevent sleep           | 1.63 us         | 1.86 us: 1.14x slower (+14%) |
+------------------------+-----------------+------------------------------+
| geventpool spawn       | 30.4 us         | 23.6 us: 1.29x faster (-22%) |
+------------------------+-----------------+------------------------------+
| geventpool sleep       | 4.30 us         | 4.55 us: 1.06x slower (+6%)  |
+------------------------+-----------------+------------------------------+
| geventpool join        | 1.70 us         | 1.83 us: 1.08x slower (+8%)  |
+------------------------+-----------------+------------------------------+
| gevent spawn kwarg     | 16.5 us         | 13.5 us: 1.22x faster (-18%) |
+------------------------+-----------------+------------------------------+
| geventpool spawn kwarg | 30.5 us         | 23.9 us: 1.27x faster (-22%) |
+------------------------+-----------------+------------------------------+

Not significant (7): eventlet spawn; geventraw spawn; geventraw sleep;
none spawn; eventlet spawn kwarg; geventraw spawn kwarg; none spawn
kwarg

+------------------------+-----------------+------------------------------+
| Benchmark              | spawn_36_master | spawn_36_tuple2              |
+========================+=================+==============================+
| gevent spawn           | 13.2 us         | 11.9 us: 1.12x faster (-10%) |
+------------------------+-----------------+------------------------------+
| gevent sleep           | 1.71 us         | 1.90 us: 1.11x slower (+11%) |
+------------------------+-----------------+------------------------------+
| geventpool spawn       | 19.9 us         | 15.9 us: 1.25x faster (-20%) |
+------------------------+-----------------+------------------------------+
| geventpool sleep       | 3.54 us         | 3.75 us: 1.06x slower (+6%)  |
+------------------------+-----------------+------------------------------+
| geventpool spawn kwarg | 20.3 us         | 15.9 us: 1.27x faster (-22%) |
+------------------------+-----------------+------------------------------+
| geventraw spawn kwarg  | 5.80 us         | 6.10 us: 1.05x slower (+5%)  |
+------------------------+-----------------+------------------------------+

Not significant (9): eventlet spawn; eventlet sleep; geventraw spawn;
geventraw sleep; none spawn; geventpool join; eventlet spawn kwarg;
gevent spawn kwarg; none spawn kwarg

+------------------+-------------------+------------------------------+
| Benchmark        | spawn_pypy_master | spawn_pypy_tuple2            |
+==================+===================+==============================+
| eventlet spawn   | 30.5 us           | 28.9 us: 1.05x faster (-5%)  |
+------------------+-------------------+------------------------------+
| eventlet sleep   | 3.39 us           | 3.19 us: 1.06x faster (-6%)  |
+------------------+-------------------+------------------------------+
| gevent spawn     | 9.89 us           | 17.2 us: 1.73x slower (+73%) |
+------------------+-------------------+------------------------------+
| gevent sleep     | 3.14 us           | 3.99 us: 1.27x slower (+27%) |
+------------------+-------------------+------------------------------+
| geventpool spawn | 12.3 us           | 20.1 us: 1.63x slower (+63%) |
+------------------+-------------------+------------------------------+

Not significant (1): geventpool sleep

+------------------------+---------------+-------------------------------+
| Benchmark              | spawn_36_13a1 | spawn_36_tuple2               |
+========================+===============+===============================+
| eventlet spawn         | 14.0 us       | 13.2 us: 1.06x faster (-6%)   |
+------------------------+---------------+-------------------------------+
| gevent spawn           | 4.25 us       | 11.9 us: 2.79x slower (+179%) |
+------------------------+---------------+-------------------------------+
| gevent sleep           | 2.78 us       | 1.90 us: 1.46x faster (-32%)  |
+------------------------+---------------+-------------------------------+
| geventpool spawn       | 10.4 us       | 15.9 us: 1.52x slower (+52%)  |
+------------------------+---------------+-------------------------------+
| geventpool sleep       | 5.52 us       | 3.75 us: 1.47x faster (-32%)  |
+------------------------+---------------+-------------------------------+
| geventraw spawn        | 2.56 us       | 5.09 us: 1.99x slower (+99%)  |
+------------------------+---------------+-------------------------------+
| geventraw sleep        | 738 ns        | 838 ns: 1.14x slower (+14%)   |
+------------------------+---------------+-------------------------------+
| geventpool join        | 3.94 us       | 1.75 us: 2.25x faster (-56%)  |
+------------------------+---------------+-------------------------------+
| gevent spawn kwarg     | 5.50 us       | 12.1 us: 2.19x slower (+119%) |
+------------------------+---------------+-------------------------------+
| geventpool spawn kwarg | 11.3 us       | 15.9 us: 1.41x slower (+41%)  |
+------------------------+---------------+-------------------------------+
| geventraw spawn kwarg  | 3.90 us       | 6.10 us: 1.56x slower (+56%)  |
+------------------------+---------------+-------------------------------+

Not significant (4): eventlet sleep; none spawn; eventlet spawn kwarg; none spawn kwarg

The eventlet, sleep, join and raw tests serve as controls, so we can see
that there's up to ~10% variance between most runs anyway.

CPython 3.6 shows the least variance so those 10-20% improvement
numbers are probably fairly close.
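The "Nx faster (-P%)" entries in these tables combine the ratio of the two means with the relative change against the baseline. A small helper (compare is a hypothetical name) reproduces them from the displayed values; because those values are themselves rounded, recomputing can be off by one in the last digit for some rows (for example the PyPy "gevent spawn" row).

```python
def compare(old_us, new_us):
    """Format a before/after pair of means the way the tables above do."""
    change = (new_us - old_us) / old_us  # relative change vs. the baseline
    if new_us <= old_us:
        return "%.2fx faster (%+.0f%%)" % (old_us / new_us, change * 100)
    return "%.2fx slower (%+.0f%%)" % (new_us / old_us, change * 100)


print(compare(14.5, 13.2))  # 1.10x faster (-9%)
print(compare(1.63, 1.86))  # 1.14x slower (+14%)
print(compare(30.4, 23.6))  # 1.29x faster (-22%)
```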

PyPy sadly gets *slower* with this change for reasons that are utterly
unclear.

Compared to 1.3a1 (last benchmark) we're still up to 2-3x slower.

Creation of a raw greenlet shows 2.66us on CPython 3.6.4 vs the 3.65us
I reported in #755.

hashbrowncipher pushed a commit to hashbrowncipher/gevent that referenced this issue Oct 20, 2018
