
Consider FOR_ITER family for specialization? #67

Closed
Fidget-Spinner opened this issue Jul 13, 2021 · 29 comments
Labels
epic-specialization More specialization work for 3.12

Comments

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Jul 13, 2021

I haven't properly profiled anything. Throwing this idea here to see what people think.

Consider this fairly common pattern in Python:

for i in range(len(x)):
    ...

Bytecode snippet:

           6 GET_ITER
     >>    8 FOR_ITER                 2 (to 14)

Anecdotally, some users are surprised at how much overhead this has. In most simple for loops, users intend the range object to act like the equivalent for (int i = 0; i < x; i++) loop in C. The range objects are created and then thrown away immediately.

In those cases, calling next(range_iterator) in FOR_ITER is unnecessary overhead. We can unbox this into a simple PyLong object and then call PyLong_Add on it. This could be a FOR_ITER_RANGE opcode.
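
Roughly, the idea in Python terms (a minimal sketch; the opcode name and structure here are hypothetical, and the real version would live in the C eval loop):

def for_iter_range_sketch(start, stop, step):
    # Keep the loop state as plain integers instead of a range_iterator:
    # each iteration is then just a compare and an add (PyLong_Add in C).
    i = start
    while (step > 0 and i < stop) or (step < 0 and i > stop):
        yield i       # the value handed to the loop body
        i += step     # advance without going through the iterator protocol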

This will have to be extremely conservative. We can only implement this on range objects with a reference count of 1 (i.e. used only by the for loop), and they must be the builtin range and not some monkeypatched version. Without the compiler’s help, the following would be very dangerous to optimize:

x = iter(range(10))
for _ in x:
    next(x)

We can also do the same with lists and tuples (FOR_ITER_LIST and FOR_ITER_TUPLE), but use the native PyTuple/List_GetItem instead of the iterator protocol. But I'm not sure how common something like for x in (1,2,3) is, so maybe those aren't worth it (they're also much harder to roll back if the optimization breaks halfway).
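
For concreteness, the list variant in Python terms (a sketch; the real specialization would use PyList_GET_ITEM in C):

def for_iter_list_sketch(lst):
    # Walk the list by index instead of through its iterator, trading
    # next() calls for a bounds check and a direct fetch.
    i = 0
    while i < len(lst):
        yield lst[i]
        i += 1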

FOR_ITER isn't a common instruction, but I think it matters when it is used, because it tends to sit in loop-heavy code.

What do y'all think?

@Fidget-Spinner changed the title from "Consider FOR_ITER family of specialization?" to "Consider FOR_ITER family for specialization?" on Jul 13, 2021
@gvanrossum
Collaborator

I definitely don't think that for x in (1, 2, 3) or for x in [1, 2, 3] are common enough to bother. I have been eyeing for i in range() myself, but before the specialization framework was implemented it seemed tricky.

There may have been some previous hacks; see the comments by user @heeres in #42. IIRC the idea was to reuse the integer object rather than allocating a new one. However, making sure that there are no other users of that same object is tricky.

Of course, if we ever get tagged integers, that would be the right approach.

Also, in case you hadn't looked yet, range() already has two entirely distinct implementations depending on whether the end points fit in a machine word. But of course this just saves doing the increment using integer objects; the result still has to be allocated. (A free list might help too?)
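
The split is even visible from Python, since CPython picks the iterator type based on whether the values fit in a C long:

print(type(iter(range(10))))      # <class 'range_iterator'>: machine-word fast path
print(type(iter(range(2**100))))  # <class 'longrange_iterator'>: PyLong path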

@Fidget-Spinner
Collaborator Author

Fidget-Spinner commented Jul 15, 2021

Oops, I didn't see that issue. It's cool that the problems and solutions there somewhat converge with this issue. Also, those are some really nice hacks by Heeres!

You mentioned trying out a FOR_ITER/{STORE_NAME|STORE_FAST} combined instruction. This makes me wonder whether the super instruction would be faster than individually specializing FOR_ITER and STORE. Or maybe the fastest would be specializing that combined instruction ;)?

@gvanrossum
Collaborator

Well, there's not much to specialize about STORE_FAST, and STORE_NAME isn't worth it (a for loop at the module level?!). So the super-instruction seems a good idea; specializing it for range() seems good too.

You might try to measure how often ‘for i in range(…)’ actually occurs. I think in my own code I use enumerate() more than range(), but that is harder to specialize (or is it?). It would be a nice win to avoid needing the tuple there (this would need help from the compiler).
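
To illustrate the tuple overhead: each step of an enumerate() loop builds a pair that the very next instruction tears apart.

for i, name in enumerate(["a", "b", "c"]):
    # The loop header compiles to FOR_ITER followed by UNPACK_SEQUENCE 2,
    # so the (i, name) tuple exists only to be unpacked immediately.
    print(i, name)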

@markshannon
Member

One possible way to speed up tight for loops is to add a FOR_END opcode, which would be like FOR_ITER but with the branch direction reversed: it branches back while the iterator yields values, and does not branch when the iterator is exhausted.
So for x in it: body would compile to:

    FOR_ITER end
loop:
    body 
    FOR_END loop
end:

instead of

loop:
    FOR_ITER end
    body 
    JUMP_ABSOLUTE loop
end:

I don't know if this would be profitable, but it might be worth an experiment.

@Fidget-Spinner
Collaborator Author

Fidget-Spinner commented Jul 20, 2021

After some hacking around, I think my original approach was pretty naive -- next(iter) adds very little overhead (it just calls tp_iternext, which increments and creates a PyLong).

User heeres is right that most of the overhead comes from the memory allocation. I'm surprised though -- given how pymalloc works, I expected _PyObject_Malloc to be very cheap.

I will attempt to use the specializing interpreter features to apply their optimizations in a safer manner (with full credits given to them of course). And hopefully, it will have better results.

@Fidget-Spinner
Collaborator Author

Fidget-Spinner commented Jul 21, 2021

I'm abandoning this idea. I integrated the PyLong reuse hacks above into a specialized instruction, and it produces a significant pessimization on ranges with small longs [1].

Since there's a small int cache in CPython, any long from -5 to 256 is essentially free. So the overhead of an adaptive instruction is significant. I would expect such small ranges to cover most use cases.
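
A quick demonstration of the cache (int("...") is used here so the compiler can't fold or share constants):

a = int("256")
b = int("256")
print(a is b)   # True: both names refer to the one cached small-int object

c = int("257")
d = int("257")
print(c is d)   # False (typically): each call allocates a fresh PyLong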

Additionally, to get the full test suite to pass, I had to be extremely conservative. This means more than half the time it still had to allocate a new long. I don't see this getting far without the compiler's help.

There is some speedup on very large ranges, but it was only 10% on a microbench [2]. I'm not sure what settings user heeres was testing with (e.g. debug, release, PGO?) to get 50%, and I might have implemented things wrongly, so I can't explain the difference. One guess is that pymalloc is quite efficient after PGO. Also, if you're dealing with such large ranges, you're probably better off using numpy, cython, numba, etc. Overall, even if we can fix the other shortcomings, it's not profitable for the complexity it introduces.

TLDR: Combined instructions or Mark's rearranged instructions look like a better fit. I would expect FOR_ITER + LOAD_FAST (whoops, STORE_FAST) to show benefits. But they cross basic blocks, so optimization is more difficult. Maybe interpreter specialization can be revisited for late-stage optimizations.

Note: All results were obtained on Windows with PGO. Also I know I should use pyperf instead, but I was short on time 😉 . Sorry.

[1]

python.exe -m timeit "for _ in range(0, 256): pass"
Before:
100000 loops, best of 5: 2.17 usec per loop
Specialized:
100000 loops, best of 5: 2.45 usec per loop

[2]

python.exe -m timeit "for _ in range(256, 10000): pass"
Before:
2000 loops, best of 5: 138 usec per loop
Specialized:
2000 loops, best of 5: 126 usec per loop

Commit: Fidget-Spinner/cpython@e5393d6

@gvanrossum
Collaborator

Thanks for doing all the research!

If you want to combine FOR_ITER and LOAD_FAST in the compiler, that could be a super-instruction (#16) added in the assembly phase. I have done a few of these, but never merged any upstream (https://github.com/faster-cpython/cpython/tree/super-instr). Also check out Mark's recent attempt (https://github.com/faster-cpython/cpython/tree/super-instructions).

@Fidget-Spinner
Collaborator Author

Fidget-Spinner commented Jul 25, 2021

I'm on a roll producing things with 0 speedups 😉 . FOR_ITER + STORE_FAST showed no speedup either. It seems like the existing PREDICT macro is doing a really good job. On Windows with PGO:

python.exe -m timeit -s
>> "def f(n):
>>  for _ in range(n):
>>   pass
>> " "f(10_000)"

Before:
2000 loops, best of 5: 138 usec per loop
2000 loops, best of 5: 141 usec per loop
2000 loops, best of 5: 138 usec per loop

Superinstruction:
2000 loops, best of 5: 140 usec per loop
2000 loops, best of 5: 137 usec per loop
2000 loops, best of 5: 137 usec per loop

Commit: Fidget-Spinner/cpython@c8a1338

@Fidget-Spinner
Collaborator Author

To close the loop:

It seems that Larry tried something similar (though with a different approach) back in 2015: https://bugs.python.org/issue24138. He decided it wasn't worth it in the end.

That spawned some attempts at a long freelist in https://bugs.python.org/issue24165. But the pyperformance numbers were a little murky, so that was abandoned too.

@gvanrossum
Collaborator

gvanrossum commented Sep 18, 2021 via email

@Fidget-Spinner
Collaborator Author

Once other operations are faster, the overhead of the integer allocation will be a bigger percentage. So maybe it will be noticeable in 3.11 :).

Oh, and one more really cool idea, by Serhiy, which I think will probably get merged: https://bugs.python.org/issue45026. In short, the PyLong to return is currently calculated as (start + index * step) (index is the internal index of the iterator, not the return value). The change stores the previous PyLong in start, so the new PyLong is just (start + step). Iterating over ranges is 2x faster in his PR!
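
The two strategies, sketched in Python (function names here are illustrative; the real change is in rangeobject.c):

def next_value_old(start, step, index):
    # recompute from scratch every iteration: a multiply plus an add
    return start + index * step

def next_value_new(prev, step):
    # reuse the previous value: a single add per iteration
    return prev + step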

@markshannon
Member

I'm reopening this. Specialization of FOR_ITER is worthwhile, if only for generators and coroutines. A specialization of FOR_ITER for generators can avoid consuming C stack in the same way as we currently do for calls to Python functions.
If we are going to specialize for generators anyway, additional specializations may be worthwhile.

A specialization for list iterators might be worthwhile.
Handling range is probably out of scope for PEP 659 specialization, as we want to perform scalar replacement of the iterator which is going to be a job for a higher tier optimizer.

@markshannon reopened this Nov 8, 2021
@markshannon
Member

Specialization of generators

Currently FOR_ITER for a generator has to go through the following C call stack:

  • gen_iternext which can't be inlined as it is called through a function pointer
  • gen_send_ex2 which probably isn't inlined as it is too big.
  • _PyEval_EvalFrameDefault which obviously isn't going to be inlined.

This is a lot of overhead for what is just a link-and-branch.

A specialization of FOR_ITER for a generator would need to do the following:

  • Push None to the generator's stack.
  • Push the generator's frame and resume.

Which should be much faster than the current rigmarole.

The only tricky part is returning from the generator, which will need to be handled differently from a normal return.
The obvious way to do this is to add a new opcode, GEN_RETURN, which would pop the frame, jump to the FOR_ITER target, and resume.
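
A rough Python model of that control flow (every name here is illustrative; the real implementation manipulates C frame structs inside the eval loop):

class Frame:
    def __init__(self):
        self.stack = []       # value stack
        self.ip = 0           # next instruction offset
        self.previous = None  # frame to resume when this one finishes

def for_iter_gen(caller, gen_frame):
    # Specialized FOR_ITER: no C-level call into the generator at all.
    gen_frame.stack.append(None)  # what send(None) would deliver
    gen_frame.previous = caller   # link the frame...
    return gen_frame              # ...and branch: execution continues here

def gen_return(gen_frame, for_iter_exit):
    # New GEN_RETURN opcode, run when the generator returns.
    caller = gen_frame.previous   # pop the generator's frame
    caller.ip = for_iter_exit     # jump to FOR_ITER's exhaustion target
    return caller                 # resume the caller inline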

@markshannon
Member

If we are going to specialize for generators, then we will have to pay the price of PEP 659, so we should also specialize for range, list, and any other common cases.

@markshannon
Member

markshannon commented Feb 2, 2022

Type classification stats for FOR_ITER running the standard benchmarks:

Class                        Count  Percentage
other                  111,048,830        7.4%
generator               30,044,759        2.0%
list                   555,532,957       37.0%
tuple                   59,327,730        4.0%
set                     66,675,780        4.4%
str                        957,966        0.1%
bytes                      646,850        0.0%
range                  516,693,401       34.4%
itertools (all types)    7,129,894        0.5%
dict (keys)              1,024,633        0.1%
dict (items)            61,149,814        4.1%
dict (values)              225,081        0.0%
enumerate               90,033,090        6.0%

FOR_ITER represents 2% of the total bytecodes executed.

There are no hits for coroutines and async generators.

@markshannon
Member

Stats gathered with python/cpython#31079

@sweeneyde

sweeneyde commented Feb 20, 2022

    FOR_ITER end
loop:
    body 
    FOR_END loop
end:

instead of

loop:
    FOR_ITER end
    body 
    JUMP_ABSOLUTE loop
end:

I tried out a prototype of this here: python/cpython@main...sweeneyde:for_end

Pyperformance: https://gist.github.com/sweeneyde/17e6e2c426d6a77034e551534a82a3d4
(Geometric mean: 1.01x faster)

Microbenchmark:
./python -m pyperf timeit -s "A = [0.0] * 1000" "for a in A: +a" -o "pr_micro.json"
Mean +- std dev: [main_micro] 11.6 us +- 0.0 us -> [pr_micro] 10.6 us +- 0.2 us: 1.09x faster

Maybe architecturally it really ought to be a peephole optimization like the one below, but something is wrong and it wasn't working:

case (JUMP_ABSOLUTE targeted at FOR_ITER):
    assert(i == bb->b_iused - 1);             /* the jump is the block's last instruction */
    inst->i_opcode = FOR_END;                 /* rewrite the backward jump in place */
    inst->i_target = inst->i_target->b_next;  /* FOR_END branches back to the loop body */
    bb->b_next = target->i_target;            /* fall through to the old FOR_ITER's exit */
    bb->b_nofallthrough = 0;                  /* the block can now fall through */
    COPY_INSTR_LOC(*inst, *target);           /* reuse FOR_ITER's source location */
    --i;
    break;

@brandtbucher
Member

bb->b_nofallthrough = 0;

This should become 1, right? Maybe that's why it didn't work?

@sweeneyde

bb->b_nofallthrough = 0;

This should become 1, right? Maybe that's why it didn't work?

FOR_END does fall through in one of the two branches, so nofallthrough should become false I believe (an unfortunate double-negative). Since the instruction was a JUMP_ABSOLUTE, nofallthrough should be true already I think? Without that line, some assertion about nofallthrough failed IIRC.

Is it somehow problematic to take a basicblock that had no fallthrough and then add a fallthrough to it?

@brandtbucher
Member

brandtbucher commented Mar 1, 2022

FOR_END does fall through in one of the two branches, so nofallthrough should become false I believe (an unfortunate double-negative).

Ha, yep. Time for me to go home.

It may be the bb->b_next = target->i_target; move, then. You might have gotten yourself into a situation where two different blocks fall through to the same target or something:

if condition:
    # This FOR_ITER will be peepholed first, and the jump will be threaded 
    # straight to baz()...
    for i in range(42):
        foo()
        # ...then, I *think* your new peephole optimization will stick a
        # FOR_END here that tries to fall through to baz()...
else:
    bar()
    # ...but we already fall through to it here!
baz()

If so, then adding an explicit JUMP_FORWARD instead might fix it.

@sbrunthaler

Hi everyone,

my Python implementation did some specialization of the FOR_ITER instructions, which is documented in my PhD thesis and also in the ECOOP paper. So maybe it's worth taking a look and doing some simple measurements to see whether this still makes sense on much more modern hardware.

For generator optimization, we had a nifty idea that we prototyped in our ZipPy VM, which resulted in massive speedups. Some of the details may not directly apply to CPython, since this was done in Truffle/Graal, but there were some interpreter benefits as well, AFAIR. This optimization was accepted at OOPSLA'14, and if there is interest, I can look it up and provide the necessary details for discussion.

All the best,
--stefan

@sweeneyde

My current attempt at replacing JUMP_ABSOLUTE/FOR_ITER with FOR_END at python/cpython@main...sweeneyde:for_end has a flaw in how it interacts with tracing. For example, I have a failing test case, test_05_no_pop_tops (test.test_sys_settrace.TraceTestCase), which contains the following:

def no_pop_tops():      # 0
    x = 1               # 1
    for a in range(2):  # 2
        if a:           # 3
            x = 1       # 4
        else:           # 5
            x = 1       # 6

AssertionError: events did not match expectation:
  (0, 'call')
  (1, 'line')
  (2, 'line')
+ (2, 'line')
  (3, 'line')
  (6, 'line')
+ (2, 'line')
  (2, 'line')
  (3, 'line')
  (4, 'line')
  (2, 'line')
  (2, 'return')

The dis output, with the two differing lines labelled:

>>> from test.test_sys_settrace import no_pop_tops
>>> from dis import dis
>>> dis(no_pop_tops)
116           0 RESUME                   0

117           2 LOAD_CONST               1 (1)
              4 STORE_FAST               0 (x)

118           6 LOAD_GLOBAL              1 (NULL + range)
             18 LOAD_CONST               2 (2)
             20 PRECALL                  1
             24 CALL                     1
             34 GET_ITER
             36 JUMP_FORWARD             8 (to 54)    # <=========== THIS IS NEW
        >>   38 STORE_FAST               1 (a)

119          40 LOAD_FAST                1 (a)
             42 POP_JUMP_IF_FALSE       25 (to 50)

120          44 LOAD_CONST               1 (1)
             46 STORE_FAST               0 (x)
             48 JUMP_FORWARD             2 (to 54)

122     >>   50 LOAD_CONST               1 (1)
             52 STORE_FAST               0 (x)

118     >>   54 FOR_END                 19 (to 38)    # <=========== THIS IS NEW
             56 LOAD_CONST               0 (None)
             58 RETURN_VALUE

So either there's some fix to the tracing logic that I'm missing (I'm not super familiar with it), or FOR_ITER needs to stick around and duplicate most of the FOR_END logic, which isn't ideal IMO. Any suggestions?

@brandtbucher
Member

Perhaps use ADDOP_JUMP_NOLINE for FOR_END? You might need to manually insert a NOP with the correct line number just before it (in the previous block, at the end of the loop), too. Not sure.

This is where the magic happens, so you'll want to start there if you're going to be debugging weird traces. (Looking at it now, maybe that i->i_lineno should really be 0 < i->i_lineno? It seems wrong in its current form.)

@iritkatriel
Collaborator

I broke the trace in a similar way with a JUMP_BACK opcode. It turned out that I needed to patch this function in frameobject.c to add the new jump type to the switch. Maybe your problem is there as well.

@sweeneyde

sweeneyde commented Apr 12, 2022

Even if range() microbenchmarks are hard to budge, the code path for list iterators is a lot smaller, so there's probably a decent speedup possible from specializing there. Dict items and maybe enumerate (the next most common couple of types in pyperformance) could be made to skip constructing a tuple at all, unpacking directly onto the stack (see the sketch below). And then generators could be sped up as well, if desired.
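
A sketch of what "unpacking directly onto the stack" would mean for dict items (the specialized opcode name here is hypothetical):

d = {"a": 1, "b": 2}
for k, v in d.items():
    # Today: FOR_ITER yields a (k, v) tuple and UNPACK_SEQUENCE 2 splits it.
    # A hypothetical FOR_ITER_DICT_ITEMS could push k and v directly,
    # skipping the intermediate tuple.
    print(k, v)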

A FOR_ITER-->FOR_END change seems to make only microscopic improvements at best, but it would be good to decide whether that change is appropriate before trying to specialize. I opened a FOR_END issue at python/cpython#91432 and a PR at python/cpython#70016 that got the tracing bugs ironed out.

On the other hand, one wild idea would be to quicken JUMP_BACKWARD to FOR_END_ADAPTIVE when its target is a FOR_ITER. That way, we keep the same old readable unquickened bytecode but get the benefit of one less opcode per loop iteration. It would make JUMP_BACKWARD take some cache entries, but maybe that's okay, or maybe we could even make it take up cache entries only when it's in that situation.

@brandtbucher
Member

Dict items and maybe enumerate (the next most common couple types in pyperformance) could be made to not have to construct a tuple at all, unpacking directly onto the stack.

Just a heads-up: almost all built-in iterators that yield tuples already typically avoid constructing more than one (for exactly this reason), so this may not pay off much.

@sweeneyde

With the recent NOTRACE_DISPATCH_SAME_OPARG() fix, I was able to get a plain-old FOR_ITER specialization working for lists and ranges. Code is here: python/cpython#91713

Microbenchmarks show ~1.1x speedup for lists, ~1.5x speedup for small ranges, but ~2.9x speedup for larger ranges! This comes from modifying the target local in place. It's almost like taking two of @Fidget-Spinner's patches and squishing them together.

Pyperformance numbers coming soon.

@mdboom
Contributor

mdboom commented Aug 2, 2022

@sweeneyde, @markshannon: With python/cpython#91713 merged, is this issue finished?

@markshannon
Member

I think this is done.
