[BUG] Large files compile very slowly in the C compiler #4425
Looking around a bit, it seems that variable tracking is mostly a debugging feature. It would be nice to disable it locally somehow, but I can't find a way to do that. In any case, the fact that large files break that feature shouldn't trouble us. I feel inclined to give a low priority to this issue. Not zero – if we can reduce the C compiler runtime, cool (also in terms of CO2 emissions), but it's only really an issue for large and very large modules, which are rare for simply practical reasons.
Yeah - I also looked and couldn't find a way to disable it locally.
I agree - I was only really interested because it's causing the Python 2.7 `-all` jobs to time out. I mainly created the issue because I wasn't planning to do any more on it for now and wanted to record where I'd got to before forgetting it. So low priority!
To add some more data points, the slowest test files to compile in clang are actually not those that have the most complex module init functions but those with a large number of types (including closures) and functions:
https://github.com/cython/cython/runs/3996856486?check_suite_focus=true
Can't say why that is. clang seems really slow here in comparison to gcc:
https://github.com/cython/cython/runs/3996853937?check_suite_focus=true
I checked the CFLAGS and they should say `-O0` in both cases, which should be honoured by both gcc and clang alike as saying "do not spend time optimising the code".
On my laptop, just running clang: […] vs. gcc: […] (that's with ccache). I'd say there's […]
Ah, yeah, right. We don't have ccache on macOS. Probably something we should change.
BTW, I agree that class creations should give a good granularity for a split. I think that's worth trying out.
For python-flint I notice a 50% slowdown in compilation time for Cython 3.0.0a11 compared to Cython 0.29.32 (flintlib/python-flint#29 (comment)). It goes from ~1 minute to ~1.5 minutes for a full build.
@oscarbenjamin do you know if that's in the pyx->c stage or the c->so stage?
Is there an easy way to measure that? It looks to me like the timing breaks down roughly like this: with Cython 0.29.32 it's about 5 seconds cythonising and then 50 seconds in the C compiler. With Cython 3.0.0a11 it's more like 5 seconds cythonising and then 70 seconds in the C compiler.
The C file generated by 3.0.0a11 is about 30% bigger in terms of lines (293186 lines vs 224736), which matches the increased compilation time.
Thanks @oscarbenjamin - your rough guess at where the cythonization ends and the C compilation begins is probably right. Cython on its own can be run with the `cython` command line script.
I have a few ideas about how to speed this up that I'd like to try. I wouldn't expect them to make 3.0 though - they're potentially fairly big changes.
It takes about 10 seconds to cythonise:

```
# cython 3.0.0a11
$ time cython -2 --module-name flint._flint flint/pyflint.pyx
warning: flint/_flint.pxd:18:0: The 'DEF' statement is deprecated and will be removed in a future Cython version. Consider using global variables, constants, and in-place literals instead. See https://github.com/cython/cython/issues/4310
<snip more warnings>

real	0m10.466s
user	0m10.356s
sys	0m0.104s
```

That part is only slightly faster with 0.29:

```
# cython 0.29.32
$ time cython -2 --module-name flint._flint flint/pyflint.pyx

real	0m9.191s
user	0m9.076s
sys	0m0.112s
```
One more thing to try if you have time - can you try the full compilation with the `binding=False` directive set? If you don't have time I can have a look myself in a few days, but it looks like I'd need to work out a few dependencies first, which I'm trying to avoid.
That makes a big difference actually. With Cython 3.0.0a11 and `binding=False` the compile time comes down substantially. What exactly does the binding parameter do? (In the other timings reported below I didn't set the binding parameter at all, which I guess means that it defaults to True.)
It's not as bad as it looks because there is a script to download and build the dependencies that takes about 15 minutes, so the full repro is:

```
git clone https://github.com/fredrik-johansson/python-flint.git
cd python-flint/
source bin/activate
bin/build_dependencies_unix.sh
# Need numpy and cython at this point
time python setup.py build_ext --inplace
```

(I'm assuming that you're on a Unixy system with compilers etc. Don't try this at home on Windows, although the problem is worse there for mingw64.) The dependency build is faster if you comment out this part: […]

Having just repeated the repro steps above I found that it actually took 3 minutes rather than the 1.5 minutes I previously reported:

```
$ time python setup.py build_ext --inplace
...

real	3m10.113s
user	3m8.504s
sys	0m1.468s
```

I guess the difference is because some parts of the compilation are maybe being cached, which makes it a bit difficult to reason about timings. I've just tried repeating this a couple of times with cython 3.0.0a11 and 0.29.32:
I'm now consistently getting ~3 minutes for 3.0.0a11 and ~1 minute for 0.29.32, so a 3x slowdown. I'm not sure why this is different from previous timings. I remember now that the reason I noticed a slowdown with cython 3.0.0a11 in the first place was because I was trying to make a coverage-enabled build, which is slower in both cases. With cython 0.29:

```
$ time PYTHON_FLINT_COVERAGE=true python setup.py build_ext --inplace
...

real	1m36.786s
user	1m34.848s
sys	0m1.072s
```

And with 3.0.0a11 (actually master):

```
$ time PYTHON_FLINT_COVERAGE=true python setup.py build_ext --inplace
...

real	3m50.796s
user	3m48.324s
sys	0m1.460s
```
It wraps functions in a CyFunction object, which adds a whole load of introspectable attributes - for example things like `__defaults__`, `__code__` and `__module__`. It's expected to slow things down because it does generate a lot more code to initialize these at start up. So it isn't worrying that it has an effect, although obviously it would be good if we could minimise the effect.
It defaults to True in Cython 3 and False in 0.29.x, so it's likely one of the more significant differences.
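A minimal sketch of the difference (not Cython's actual output; `hello` and `make_plain` are invented names, and the CyFunction call is paraphrased from the generated code shown later in this thread):

```c
#include <Python.h>

static PyObject *hello(PyObject *self, PyObject *args) { Py_RETURN_NONE; }
static PyMethodDef hello_def = {"hello", hello, METH_VARARGS, "docstring"};

/* binding=False: the function becomes a plain builtin-function object;
   one cheap call at module init, almost no per-function state. */
static PyObject *make_plain(PyObject *module_name) {
    return PyCFunction_NewEx(&hello_def, NULL, module_name);
}

/* binding=True: each function instead becomes a CyFunction carrying
   qualname, module, globals, a code object, defaults, etc., so the
   module init function gains a multi-line setup sequence per function:

   f = __Pyx_CyFunction_New(&hello_def, 0, qualname, NULL,
                            module_name, module_dict, code_object);
*/
```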
Passing `binding=False` seems like a good workaround for now then. Thanks
It might also be interesting to see the differences in symbol sizes that the C compiler generates from 0.29.x and 3.0 C sources. You can list those for a given extension module with something like `nm --print-size --size-sort module.so`.
Given the compilation time problems we're having on PyArrow, I looked at our main workhorse Cython-compiled extension module and get the following symbol size numbers (edit: for the record, this is using gcc): […]
The initialization functions are positively huge...
Looking at one of the initialization functions, I see a lot of repeated snippets for creating function objects, for example:

```c
/* "pyarrow/memory.pxi":227
 *
 *
 * def log_memory_allocations(enable=True):             # <<<<<<<<<<<<<<
 *     """
 *     Enable or disable memory allocator logging for debugging purposes
 */
  __pyx_t_7 = __Pyx_CyFunction_New(&__pyx_mdef_7pyarrow_3lib_37log_memory_allocations, 0, __pyx_n_s_log_memory_allocations, NULL, __pyx_n_s_pyarrow_lib, __pyx_d, ((PyObject *)__pyx_codeobj__302)); if (unlikely(!__pyx_t_7)) __PYX_ERR(14, 227, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __Pyx_CyFunction_SetDefaultsTuple(__pyx_t_7, __pyx_tuple__303);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_log_memory_allocations, __pyx_t_7) < 0) __PYX_ERR(14, 227, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
```

It seems that those could generate much shorter code by walking an array of constant-initialized structures, for example (not sure that's flexible enough):

```c
typedef struct {
    PyMethodDef* ml;
    int flags;
    PyObject* qualname;
    PyObject* closure;
    PyObject* module;
    PyObject* globals;
    PyObject* code;
    PyObject* defaults_tuple;
    int f_index;
    int lineno;
} GlobalFunctionDef;

void init_func() {
    GlobalFunctionDef function_defs[] = {
        {&__pyx_mdef_7pyarrow_3lib_37log_memory_allocations, 0,
         __pyx_n_s_log_memory_allocations, NULL, __pyx_n_s_pyarrow_lib,
         __pyx_d, ((PyObject *)__pyx_codeobj__302), __pyx_tuple__303,
         14, 227},
        // ... other functions ...
    };
    PyObject* __pyx_t_7;
    for (size_t i = 0; i < sizeof(function_defs) / sizeof(function_defs[0]); ++i) {
        GlobalFunctionDef* def = &function_defs[i];
        __pyx_t_7 = __Pyx_CyFunction_New(def->ml, def->flags, def->qualname,
                                         def->closure, def->module, def->globals,
                                         def->code);
        if (unlikely(!__pyx_t_7)) __PYX_ERR(def->f_index, def->lineno, __pyx_L1_error);
        __Pyx_GOTREF(__pyx_t_7);
        if (def->defaults_tuple) {
            __Pyx_CyFunction_SetDefaultsTuple(__pyx_t_7, def->defaults_tuple);
        }
        if (PyDict_SetItem(__pyx_d, def->qualname, __pyx_t_7) < 0) {
            __PYX_ERR(def->f_index, def->lineno, __pyx_L1_error);
        }
        __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
    }
}
```
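One catch with making that table truly constant-initialized: values like `__pyx_n_s_log_memory_allocations` and `__pyx_codeobj__302` are themselves only created during module init, so the table either has to be built at run time (as in the sketch above) or store the addresses of those globals rather than their values.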
I'm looking at the disassembly of `__pyx_pymod_exec_lib` and I see long runs like this:

```
   aa2a4:	31 c0                	xor    %eax,%eax
   aa2a6:	31 d2                	xor    %edx,%edx
   aa2a8:	31 f6                	xor    %esi,%esi
   aa2aa:	45 31 ff             	xor    %r15d,%r15d
   aa2ad:	48 89 85 b0 fe ff ff 	mov    %rax,-0x150(%rbp)
   aa2b4:	31 c0                	xor    %eax,%eax
   aa2b6:	31 db                	xor    %ebx,%ebx
   aa2b8:	31 c9                	xor    %ecx,%ecx
   aa2ba:	48 89 85 c0 fe ff ff 	mov    %rax,-0x140(%rbp)
   aa2c1:	45 31 db             	xor    %r11d,%r11d
   aa2c4:	45 31 d2             	xor    %r10d,%r10d
   aa2c7:	45 31 f6             	xor    %r14d,%r14d
   aa2ca:	48 89 95 a0 fe ff ff 	mov    %rdx,-0x160(%rbp)
   aa2d1:	45 31 e4             	xor    %r12d,%r12d
   aa2d4:	31 d2                	xor    %edx,%edx
   aa2d6:	45 31 ed             	xor    %r13d,%r13d
   aa2d9:	48 89 b5 a8 fe ff ff 	mov    %rsi,-0x158(%rbp)
   aa2e0:	41 b8 fe d2 04 00    	mov    $0x4d2fe,%r8d
   aa2e6:	48 8d 35 aa 5c 1f 00 	lea    0x1f5caa(%rip),%rsi        # 29ff97 <_fini+0x19db>
   aa2ed:	b8 01 00 00 00       	mov    $0x1,%eax
   aa2f2:	41 b9 01 00 00 00    	mov    $0x1,%r9d
   aa2f8:	e9 17 e7 03 00       	jmp    e8a14 <__pyx_pymod_exec_lib(_object*)+0x670de>
   aa2fd:	45 31 c9             	xor    %r9d,%r9d
   aa300:	45 31 d2             	xor    %r10d,%r10d
   aa303:	45 31 db             	xor    %r11d,%r11d
   aa306:	45 31 e4             	xor    %r12d,%r12d
   aa309:	4c 89 8d b0 fe ff ff 	mov    %r9,-0x150(%rbp)
   aa310:	45 31 ff             	xor    %r15d,%r15d
   aa313:	31 db                	xor    %ebx,%ebx
   aa315:	31 c9                	xor    %ecx,%ecx
   aa317:	4c 89 95 c0 fe ff ff 	mov    %r10,-0x140(%rbp)
   aa31e:	31 d2                	xor    %edx,%edx
   aa320:	45 31 d2             	xor    %r10d,%r10d
   aa323:	45 31 f6             	xor    %r14d,%r14d
   aa326:	4c 89 9d a0 fe ff ff 	mov    %r11,-0x160(%rbp)
   aa32d:	45 31 ed             	xor    %r13d,%r13d
   aa330:	45 31 db             	xor    %r11d,%r11d
   aa333:	41 b8 00 d3 04 00    	mov    $0x4d300,%r8d
   aa339:	4c 89 a5 a8 fe ff ff 	mov    %r12,-0x158(%rbp)
   aa340:	48 8d 35 50 5c 1f 00 	lea    0x1f5c50(%rip),%rsi        # 29ff97 <_fini+0x19db>
   aa347:	45 31 e4             	xor    %r12d,%r12d
   aa34a:	b8 01 00 00 00       	mov    $0x1,%eax
   aa34f:	41 b9 01 00 00 00    	mov    $0x1,%r9d
   aa355:	e9 ba e6 03 00       	jmp    e8a14 <__pyx_pymod_exec_lib(_object*)+0x670de>
```

and, looking at the sites where these code snippets are jumped from, it seems each of them corresponds to the expansion of `__PYX_ERR`.
Repeating what I said on the mailing list: it isn't quite as simple as that because the function objects need to be created at the right time in the execution, which might be interspersed with other lines. But there may well still be useful optimizations there. (I think it might also be tricky based on how we represent these internally, which is "something that makes a function object, followed by an assignment". That's more of an excuse than a good technical reason.)
Is it possible to refactor `__PYX_ERR` into a function? I tried to compare the result size for Cython's own modules: […]
> it seems each of them corresponds to the expansion of `__PYX_ERR`.

That macro is marked as cold code, which is probably why you see the bulk of them stuffed together in a cold code section, apart from the rest of the init function implementation (which, BTW, is marked as "prefer short code over speed").
That said, it might be worth trying whether wrapping this tiny bit of code in a function makes a visible difference. It's cold code, so it probably won't get inlined and might save a few bytes of code space per call. And as you noted, there can be loads of calls. We can still limit this to the scope of the module init function, where it hurts most in large modules.
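For the record, a sketch of the macro-to-function refactor under discussion (the `__PYX_ERR` definition is paraphrased from the generated C; the `__Pyx_SetErrInfo` helper is invented):

```c
/* Roughly the current shape: a statement macro that records the error
   position in module-level globals and jumps to the error label. */
#define __PYX_ERR(f_index, lineno, Ln_error) \
    { __pyx_filename = __pyx_f[f_index]; __pyx_lineno = lineno; \
      __pyx_clineno = __LINE__; goto Ln_error; }

/* Possible refactor: move the stores into a cold, non-inlined helper so
   each expansion shrinks to one call plus the goto (the goto itself
   cannot move into a function). */
static void __Pyx_SetErrInfo(int f_index, int lineno, int clineno) {
    __pyx_filename = __pyx_f[f_index];
    __pyx_lineno = lineno;
    __pyx_clineno = clineno;
}
#undef __PYX_ERR
#define __PYX_ERR(f_index, lineno, Ln_error) \
    { __Pyx_SetErrInfo(f_index, lineno, __LINE__); goto Ln_error; }
```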
Doesn't quite work: […]
I know, that's why I called the refactor stupid and broken. I was just curious how much space can be saved by moving from a macro to a function. My only goal was to get compilable code.
I think it's also worth having another look at the string table generation. I think I might have made it significantly worse as part of the "module state" code refactor, and I can see a fairly easy way out of it probably. I'll give that a go later this week.
Oh, I see. I was probably being a bit simplistic here :-)
By the way, something much simpler might be to move each type initialization to its own function, so as to avoid having a single extremely large init function.
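A sketch of the shape this could take (the type and helper names here are invented):

```c
/* Instead of one huge module init body, emit one small function per type... */
static int __pyx_init_type_SomeType(PyObject *module) {
    /* ...create and register one extension type... */
    return 0;  /* 0 on success, -1 on error, as CPython init helpers do */
}

static int __pyx_init_type_OtherType(PyObject *module) {
    /* ... */
    return 0;
}

/* ...and have the module init function just call them in order. Each small
   function is cheap for the C compiler to analyse in isolation, instead of
   one enormous body blowing up the optimizer's per-function costs. */
static int __pyx_init_all_types(PyObject *module) {
    if (__pyx_init_type_SomeType(module) < 0) return -1;
    if (__pyx_init_type_OtherType(module) < 0) return -1;
    return 0;
}
```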
See the ticket description. |
I'll try to have a look at implementing a "tuple table" and "code object table". I suspect optimizing the int initialization isn't worth it. I think that may be all the metaphorical low-hanging fruit here though |
Hopefully helps with cython#4425. It reduces binary object size by ~0.5% (module-dependent of course). Code objects can be initialized after all other Python constants because no other literals can depend on them. Unfortunately I wasn't able to use a module-level constant global table for them because I was fighting with the module name defines. That's an improvement for the future when we've got further with module-state.
I played a little, and it turned out that by compiling with `c_line_in_traceback=False`:

```python
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="My hello app",
    ext_modules=cythonize("ParseTreeTransforms.py", c_line_in_traceback=False),
)
```

I was able to reduce the resulting binary size by 4%. Note I tried it only in clang, not sure if it will be the same in GCC/MSVC though.
> `c_line_in_traceback=False`
>
> I was able to reduce the resulting binary size by 4%.

That's quite possible, because it reduces the number of different error handling cases from (more or less) the number of executed C source lines to (more or less) the number of Python source lines. That allows the C compiler to collapse them into substantially less code.
Maybe we can reorganise the `__LINE__` setting somehow, to allow collapsing at least the rest of the error position setup?
That said, many libraries are probably quite happy without C code lines in stack traces.
So it can also be controlled by the `CYTHON_CLINE_IN_TRACEBACK` C macro. I'm not sure quite why there's two controls for it, but that suggests the `cline = __LINE__` assignment should probably be guarded by the macro too.
Edit: It's more complicated than this - there's three controls: […]
There's a related traceback generation change that also seems to get a decent reduction in size: #6032. (This issue is more about compile time than size, but I suspect what improves one will improve the other in this case.)
Can you confirm that […]?
Yes I think I can confirm.
Again it seems to be an undocumented parameter of the cython command... :-( (see `Cython/Compiler/CmdLine.py`, line 157 at e6c621a)
Moreover it is a mess. We have […]
> I'm not sure quite why there's two controls for it,

For historic reasons, as always. It used to add a lot of C code overhead before we changed the ERR macro to include it; that's why there is a code generation option for it (or more than one, also for historic reasons of increasing usability). And the feature itself can be turned on and off with a C compile time macro, which seems the most reasonable to me. It's a debug option for people interested in the C code, after all.

> but that suggests the `cline = __LINE__` assignment should probably be guarded by the macro too

+1
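For concreteness, the guard would look something like this (variable names as in the generated C; the exact spelling of the guard macro is from memory and may differ):

```c
#if CYTHON_CLINE_IN_TRACEBACK
    /* Only record the C source line when the feature is compiled in; without
       this store, all error paths in a function set the same constants and
       the C compiler can merge them into far fewer code paths. */
    __pyx_clineno = __LINE__;
#endif
```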
> Hopefully helps with #4425. It reduces binary object size by ~0.5% (module-dependent of course). Code objects can be initialized after all other Python constants because no other literals can depend on them. Unfortunately I wasn't able to use a module-level constant global table for them because I was fighting with the module name defines. That's an improvement for the future when we've got further with module-state.

Thanks a lot @scoder. The feedback and improvements here are really appreciated.
CyFunction generation: I actually think that most of them could just be pre-generated in a loop. I think the ones that can't would: […]
We'd only be able to generate them in a loop (see the sketch below) - actually putting them in the module or class dicts would have to be done later, since order really does matter. But potentially there's improvements there.
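A rough sketch, reusing the hypothetical `GlobalFunctionDef` table from earlier in the thread:

```c
/* Phase 1 (order-independent): create every CyFunction up front in a loop. */
#define N_CYFUNCS 3  /* hypothetical count */
static PyObject *__pyx_cyfunctions[N_CYFUNCS];

static int __pyx_create_cyfunctions(const GlobalFunctionDef *defs) {
    for (size_t i = 0; i < N_CYFUNCS; ++i) {
        __pyx_cyfunctions[i] = __Pyx_CyFunction_New(defs[i].ml, defs[i].flags,
                                                    defs[i].qualname, defs[i].closure,
                                                    defs[i].module, defs[i].globals,
                                                    defs[i].code);
        if (!__pyx_cyfunctions[i]) return -1;
    }
    return 0;
}

/* Phase 2 (order matters): later, interleaved with the other module-level
   statements in source order, each function is assigned where it appears:
       PyDict_SetItem(__pyx_d, defs[i].qualname, __pyx_cyfunctions[i]);     */
```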
More marginal size reductions. See #4425
Describe the bug
It takes a very long time for the C compiler to run on large files like ExprNodes.py and Nodes.py. This is seen in the CI `-all` tests repeatedly timing out (although partly this is because we can't compile them in parallel on Python 2.7). One possible indicator is the warning from gcc about the variable tracking size limit being exceeded (although that could potentially be toggled independently or the limit increased).
Obviously it's inevitable that a big Cython file will generate a big C file and that a big C file will take a while to compile, but potentially we could do better.
There are a few gcc flags to try to profile compilation time (https://stackoverflow.com/questions/13559818/profiling-the-c-compilation-process), and they suggest that the module init function (where all the module-level user code goes) is the main culprit (unsurprisingly).
Environment (please complete the following information):
Additional context
I tried a couple of approaches to fix the problem.
I think a variant of the second approach is probably worthwhile. My current thought is that we shouldn't do it on a "per-stat" basis but maybe give each class creation a separate function (that's easy to do for `cdef` classes, slightly harder for regular classes). That would likely give the appropriate granularity and keep things grouped in logical units.

Improvements made to mitigate this in Cython 3.1
- More efficient string constant storage
- Shorter code generation
- More efficient code object creation that no longer permanently stores tuples in the global module state