
Handle long tail of specialization for LOAD_ATTR and CALL with trampolines. #447

Open
markshannon opened this issue Aug 15, 2022 · 9 comments
Labels
3.12 Things we intend to do for 3.12 epic-specialization More specialization work for 3.12

Comments

@markshannon (Member)

We are adding increasingly uncommon specializations of the LOAD_ATTR and CALL instructions.
Each of these specializations increases our hit rate, improving the potential for tier-2 optimizations, but often has a negligible or even negative impact on tier-1 (PEP 659) performance, because the growing size of the interpreter causes more icache misses.

Instead of adding more and more specializations, we can add trampoline functions (written in bytecode, but not full Python functions) to perform some of the work that would otherwise be done in a specialized instruction.

These trampolines are unlikely to boost tier-1 performance (they may even make it slightly worse), but they allow much more open-ended specialization, and they expose the details of each specialization to higher-tier optimizers in a form those optimizers can understand without custom code per specialization.

This idea isn't limited to CALL and LOAD_ATTR, but those have the most complex and varied semantics.

Examples:

  1. Calling a Python function with complex arguments or parameters.
    When specializing a call, we know the shape of both the arguments and the parameters, allowing us to compute (or look up, for the most common cases) a sequence of operations that moves the arguments into the right slots in the locals. Having an individual specialization for each of these cases would be silly. We could pack a custom format describing the transformation into the cache, but interpreting it would need a custom mini-interpreter.
    Better to create a trampoline function that does the argument shuffle and then tail-calls into the function.

  2. Calling a Python class.
    GH-91095: Specialize calls to normal python classes python/cpython#93221 specializes calling Python classes using a custom specialized instruction and a stub function to clean up after the call. This could be replaced with a more general trampoline specialized instruction. Doing so would allow trampolines to handle differing shapes of arguments, without a proliferation of instructions.
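The argument-shuffle idea from example 1 can be sketched in pure Python. This is a toy model, not CPython code: the function and parameter names are hypothetical, and the "plan" stands in for the precomputed operation sequence a trampoline would execute before tail-calling the callee.

```python
# Hypothetical sketch: given a fixed call shape (nargs, kwnames) and a
# function's parameters, precompute where each local slot gets its value.
# A real trampoline would bake this plan into bytecode.

def plan_argument_shuffle(param_names, defaults, nargs, kwnames):
    """Return, per parameter slot, the value's source:
    ("pos", i) for positional argument i, ("kw", i) for keyword
    argument i, or ("default", value) for a default."""
    plan = []
    first_default = len(param_names) - len(defaults)
    for slot, name in enumerate(param_names):
        if slot < nargs:
            plan.append(("pos", slot))
        elif name in kwnames:
            plan.append(("kw", kwnames.index(name)))
        elif slot >= first_default:
            plan.append(("default", defaults[slot - first_default]))
        else:
            raise TypeError(f"missing argument {name!r}")
    return plan

def run_shuffle(plan, args, kwvalues):
    """Apply a precomputed plan to concrete argument values."""
    out = []
    for kind, x in plan:
        if kind == "pos":
            out.append(args[x])
        elif kind == "kw":
            out.append(kwvalues[x])
        else:
            out.append(x)
    return out
```

For `def g(a, b, c=10)` called as `g(1, b=2)`, the plan is computed once at specialization time and `run_shuffle` models the cheap per-call work: `run_shuffle(plan, (1,), (2,))` yields `[1, 2, 10]`.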

@markshannon markshannon added 3.12 Things we intend to do for 3.12 epic-specialization More specialization work for 3.12 labels Aug 15, 2022
@markshannon (Member Author)

Another use case for this is avoiding jumping in and out of the bytecode interpreter, e.g. tuple(f(v) for v in seq).
The call to tuple goes into C code, which then makes repeated calls back into _PyEval_EvalFrameDefault() via several layers of gen_next().
So, for generators and comprehensions, we might want to replace tuple with something like:

def tuple(seq):
    return LIST_TO_TUPLE(   # Not an actual call, but the instruction
        [ val for val in seq ] # No need for the extra scope, this can be inlined.
    )

One problem with schemes like this is that they prevent specialization, as all calls to tuple end up in the same code.
So we might want several specialized versions depending on the kind of seq.
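The effect of that rewrite can be shown directly in Python (function names here are hypothetical; the real transformation would happen in bytecode, with the comprehension inlined and LIST_TO_TUPLE used instead of a call):

```python
# Sketch of the two shapes being compared. Today, tuple(f(v) for v in seq)
# creates a generator object, and tuple's C code re-enters the interpreter
# once per item via gen_next(). The trampoline version builds a list in a
# single interpreter pass, then converts it in one step (as the
# LIST_TO_TUPLE instruction would).

def f(v):
    return v * v

def via_generator(seq):
    return tuple(f(v) for v in seq)    # C -> interpreter round trip per item

def via_inlined_trampoline(seq):
    return tuple([f(v) for v in seq])  # one pass, then a single list->tuple copy
```

Both produce the same tuple; the difference is purely in how many interpreter entries and exits occur.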

@ericsnowcurrently (Collaborator)

> the increasing size of the interpreter makes for more icache misses

Not to side-track from your focus in this issue, but how much of this could be mitigated with special-case eval loops (e.g. via our code generation tooling)?

@mdboom (Contributor) commented Feb 28, 2023

@markshannon: What's the status of this? Is this still targeted for 3.12?

@markshannon (Member Author)

Probably not, but we might get some of this done for 3.12.

@markshannon (Member Author)

python/cpython#107788 changes the output of LOAD_ATTR with oparg & 1 from NULL, attr to attr, NULL.

If we want to replace LOAD_ATTR with oparg & 1 by a call, we will need to push an additional shim frame that pushes the NULL after the call, instead of pushing the NULL before the call. It will be a bit more expensive, but we already do something similar in CALL_NO_KW_ALLOC_AND_ENTER_INIT.
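A toy model of the stack effect makes the ordering change concrete (the evaluation stack is modeled as a Python list, with None standing in for NULL; the function names are illustrative only):

```python
# Before python/cpython#107788, LOAD_ATTR with oparg & 1 left [NULL, attr]
# on the stack; after it, the order is [attr, NULL]. A call-based
# replacement produces attr as the call's result, so the NULL can no
# longer be pushed up front: a shim frame must push it on return.

def load_attr_old(stack, attr):
    stack.append(None)   # NULL pushed before the attribute
    stack.append(attr)

def load_attr_new(stack, attr):
    stack.append(attr)   # attr arrives first (e.g. as a call result)
    stack.append(None)   # NULL pushed after -- the shim frame's job
```

The shim-frame cost is one extra frame push/pop per call, which is why the comment notes it is "a bit more expensive".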

@gvanrossum (Collaborator) commented Aug 9, 2023

@markshannon Speaking of CALL_NO_KW_ALLOC_AND_ENTER_INIT, how would we translate that into Tier 2 uops? There is no Tier 2 equivalent of goto start_frame (the executor can only choose between goto resume_frame and goto resume_with_error). I'm asking because splitting CALL_PY_EXACT_ARGS also needs a way to goto start_frame; currently it ends up just skipping the call to _Py_EnterRecursivePy, which is obviously wrong. Without copying all of _Py_CheckRecursiveCallPy from ceval.c, I fear I will just have to do tstate->py_recursion_remaining-- and pray.

@gvanrossum (Collaborator) commented Aug 9, 2023

(And why does the start_frame block of code use goto exit_unwind instead of goto error on error?)

@gvanrossum (Collaborator)

Also, perhaps more pertinent to this issue, when the superblock projector gains the ability to project through a call (soon: gh-107793), will it need to trace through trampolines too? How will it find the bytecode for the trampoline? (I guess the trampoline needs to be an actual function object, not just a code object -- the design of my draft PR caches function objects indexed by func_version.)

@markshannon (Member Author)

Reviewing python/cpython#110317 got me thinking that we could handle this by defining the long-tail specializations in terms of micro-ops, much as we do for common ones like LOAD_ATTR_INSTANCE_VALUE, but generating a single tier-1 instruction that performs additional dispatch on one of its inline cache entries.

Let's use LOAD_ATTR_WITH_HINT as an example.
Currently we define it as:

macro(LOAD_ATTR_WITH_HINT) =
            unused/1 +
            _GUARD_TYPE_VERSION +
            _CHECK_ATTR_WITH_HINT +
            _LOAD_ATTR_WITH_HINT +
            unused/5;

so there is plenty of unused space for an additional operand.
For tier 1 we could generate LOAD_ATTR_EXTRA, which would dispatch to the long-tail specializations, something like:

TARGET(LOAD_ATTR_EXTRA) {
    switch (inner_opcode) {
        case LOAD_ATTR_WITH_HINT:
            /* Generated code for LOAD_ATTR_WITH_HINT goes here */
            break;
        ...
    }
}

For tier 2 we simply emit the micro-ops as we do now.

We would need a special name, e.g. inner_opcode, to indicate that an inline cache entry is to be used for additional dispatching:

macro(LOAD_ATTR_WITH_HINT) =
            unused/1 +
            inner_opcode/1 +
            _GUARD_TYPE_VERSION +
            _CHECK_ATTR_WITH_HINT +
            _LOAD_ATTR_WITH_HINT +
            unused/4;

Maybe with an additional qualifier as well:

uncommon macro(LOAD_ATTR_WITH_HINT) =
            unused/1 +
            inner_opcode/1 +
            _GUARD_TYPE_VERSION +
            _CHECK_ATTR_WITH_HINT +
            _LOAD_ATTR_WITH_HINT +
            unused/4;
