
[BUG] Large files compile very slowly in the C compiler #4425

Open
da-woods opened this issue Oct 23, 2021 · 39 comments

Comments

@da-woods
Contributor

da-woods commented Oct 23, 2021

Describe the bug

It takes a very long time for the C compiler to run on large files like ExprNodes.py and Nodes.py. This shows up in CI as the -all tests repeatedly timing out (although that is partly because we can't compile them in parallel on Python 2.7).

One possible indicator is the warning from gcc:

note: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’, retrying without

(although that could potentially be toggled independently or the limit increased)

Obviously it's inevitable that a big Cython file will generate a big C file and that a big C file will take a while to compile, but potentially we could do better.

There are a few gcc flags for profiling the compilation (https://stackoverflow.com/questions/13559818/profiling-the-c-compilation-process), and they suggest that the module init function (where all the module-level user code goes) is, unsurprisingly, the main culprit.
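As a concrete starting point, one of the options discussed in that Stack Overflow thread is gcc's `-ftime-report`, which prints a per-pass timing breakdown to stderr. A minimal sketch (the file paths here are just placeholders):

```shell
# Create a trivial C file and ask gcc for a per-pass timing breakdown.
# -ftime-report writes its report to stderr.
echo 'int main(void) { return 0; }' > /tmp/timing_demo.c
gcc -O0 -ftime-report -c /tmp/timing_demo.c -o /tmp/timing_demo.o 2> /tmp/timing_report.txt
# The report lists compiler phases with their wall/user times:
head -n 5 /tmp/timing_report.txt
```

Running the same invocation against the generated Cython C file shows which passes dominate.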

Environment (please complete the following information):

  • Linux, CI, most obviously in Python 2.7

Additional context

I tried a couple of approaches to fix the problem.

  1. First I created small sub-scopes within the module init function: https://github.com/da-woods/cython/tree/morelocaltemps. This didn't achieve any speedup or get rid of the warning, but some of the change may be worth keeping for other reasons (da-woods@c901416#commitcomment-57239451)
  2. Second, I tried to split each stat at module level into a separate function (Split up module init function #4386). This gave appreciable speed-ups for large modules. However, the PR was intended purely as a proof of concept, with little attention to code quality.

I think a variant of the second approach is probably worthwhile. My current thought is that we shouldn't do it on a "per-stat" basis, but maybe give each class creation a separate function (that's easy to do for cdef classes, slightly harder for regular classes). That would likely give the appropriate granularity and keep things grouped in logical units.

Improvements made to mitigate this in Cython 3.1:

  • More efficient string constant storage
  • Shorter code generation
  • More efficient code object creation that no longer permanently stores tuples in the global module state

@scoder
Contributor

scoder commented Oct 24, 2021

Looking around a bit, it seems that variable tracking is mostly a debugging feature. Would be nice to locally disable it somehow, but I can't find a way to do that. In any case, the fact that large files break that feature shouldn't trouble us.

I feel inclined to give a low priority to this issue. Not zero – if we can reduce the C compiler runtime, cool (also in terms of CO2 emissions), but it's only really an issue for large and very large modules, which are rare for simple practical reasons.

@da-woods
Contributor Author

Looking around a bit, it seems that variable tracking is mostly a debugging feature. Would be nice to locally disable it somehow, but I can't find a way to do that. In any case, the fact that large files break that feature shouldn't trouble us.

Yeah - I also looked and couldn't find a way to disable it locally

I feel inclined to give a low priority to this issue. Not zero – if we can reduce the C compiler runtime, cool (also in terms of CO2 emissions), but it's only really an issue for very large modules, which are rare for simple practical reasons.

I agree - I was only really interested because it's causing the python 2.7 -all jobs to time out. I mainly created the issue because I wasn't planning to do any more on it for now and wanted to record where I'd got to before forgetting it. So low priority!

@scoder
Contributor

scoder commented Oct 25, 2021

To add some more data points, the slowest test files to compile with clang are actually not those with the most complex module init functions, but those with a large number of types (including closures) and functions:

compile-c   :  1425.27 sec  (1071,  1.331 / run) - slowest: 'c:test_unicode' (28.38s), 'c:test_coroutines_pep492' (27.53s), 'c:test_asyncgen' (14.11s), 'c:memslice' (13.31s), 'c:exttype_total_ordering' (13.10s), 'c:test_grammar' (12.91s), 'c:fused_def' (12.85s), 'c:numpy_test' (12.06s)

https://github.com/cython/cython/runs/3996856486?check_suite_focus=true

Can't say why that is. clang seems really slow here in comparison to gcc:

compile-c   :   161.93 sec  (1078,  0.150 / run) - slowest: 'c:numpy_test' (20.83s), 'c:test_unicode' (0.48s), 'c:relaxed_strides' (0.47s), 'c:test_coroutines_pep492' (0.46s), 'c:libc_stdio' (0.42s), 'c:behnel4' (0.40s), 'c:embedded' (0.38s), 'c:a_capi' (0.34s)

https://github.com/cython/cython/runs/3996853937?check_suite_focus=true

I checked the CFLAGS and they should say `-O0` in both cases, which should be honoured by both gcc and clang alike as saying "do not spend time optimising the code".

@da-woods
Contributor Author

On my laptop (just running test_unicode) I get

clang:

compile-c   :    45.93 sec  (   1, 45.928 / run) - slowest: 'c:test_unicode' (45.93s)
compile-cpp :    42.76 sec  (   1, 42.757 / run) - slowest: 'cpp:test_unicode' (42.76s)

gcc

compile-c   :    57.95 sec  (   1, 57.948 / run) - slowest: 'c:test_unicode' (57.95s)
compile-cpp :    48.91 sec  (   1, 48.909 / run) - slowest: 'cpp:test_unicode' (48.91s)

(that's with CFLAGS=-Og, just because -O0 produces a lot of warnings when I use it).

I'd say ccache is probably doing something here?

@scoder
Contributor

scoder commented Oct 25, 2021 via email

@scoder
Contributor

scoder commented Dec 18, 2021

BTW, I agree that class creations should give a good granularity for a split. I think that's worth trying out.

@oscarbenjamin
Contributor

For python-flint I notice a 50% slowdown in compilation time for Cython 3.0.0a11 compared to Cython 0.29.32 (flintlib/python-flint#29 (comment)).

It goes from ~1 minute to ~1.5 minutes for a full setup.py build_ext --inplace so it's not a huge problem in python-flint's case but it seems like a regression that I guess could have bigger impact for some other projects.

@da-woods
Contributor Author

@oscarbenjamin do you know if that's in the pyx->c stage or the c-->so stage?

@oscarbenjamin
Contributor

@oscarbenjamin do you know if that's in the pyx->c stage or the c-->so stage?

Is there an easy way to measure that?

It looks to me like the c-->so stage because I'm only compiling a single extension module and I can see warnings from the C compiler after around 5 seconds i.e.:

$ time python setup.py build_ext --inplace
<snip>
Compiling src/flint/pyflint.pyx because it depends on src/flint/functions.pyx.
[1/1] Cythonizing src/flint/pyflint.pyx
running build_ext

## ----------- takes 5 seconds to get here

building 'flint._flint' extension
INFO: C compiler: x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC

INFO: compile options: '-I/usr/local/include -I/home/oscar/current/sympy/python-flint/.local/venv/include -I/usr/local/include/flint -I/home/oscar/current/sympy/python-flint/.local/venv/include/flint -I/home/oscar/current/sympy/python-flint/.local/venv/include -I/usr/include/python3.8 -c'
INFO: x86_64-linux-gnu-gcc: src/flint/pyflint.c
src/flint/pyflint.c: In function ‘__pyx_pf_5flint_6_flint_14dirichlet_char_4__init__’:
src/flint/pyflint.c:194837:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
       if (((__pyx_t_5 > __pyx_t_6) != 0)) {
                       ^
src/flint/pyflint.c: At top level:
src/flint/pyflint.c:29053:18: warning: ‘__pyx_f_5flint_6_flint_any_as_fmpz_mpoly’ defined but not used [-Wunused-function]
 static PyObject *__pyx_f_5flint_6_flint_any_as_fmpz_mpoly(CYTHON_UNUSED PyObject *__pyx_v_x) {
                  ^
INFO: x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -DNDEBUG -g -fwrapv -O2 -Wall build/temp.linux-x86_64-3.8/src/flint/pyflint.o -L/home/oscar/current/sympy/python-flint/.local/venv/lib -L/usr/local/lib -larb -lflint -o /home/oscar/current/sympy/python-flint/src/flint/_flint.cpython-38-x86_64-linux-gnu.so

# ----------- takes 1 minute to get here:

real	0m54.163s
user	0m53.124s
sys	0m0.828s

That's with Cython 0.29.32 and it looks like about 5 seconds cythonising and then 50 seconds in the C compiler.

With Cython 3.0.0a11 it's more like about 5 seconds cythonising and then 70 seconds in the C compiler:

real	1m26.277s
user	1m24.480s
sys	0m1.568s

The C file generated by 3.0.0a11 is about 30% bigger in terms of lines (293186 lines vs 224736) which matches the increased compilation time.

@da-woods
Contributor Author

Thanks @oscarbenjamin - your rough guess at where the cythonization ends and the C compilation begins is probably right. Cython on its own can be run with cython <source_file_name>, and that might be the easiest way to measure Cython vs C.

A few notes:

  • There was a bit of a slowdown in building the lexer(?) due to unicode support - it just vastly increased the size of a certain look-up table we had to build. See cython3.0: 3x compile-time regression compared to 0.29 ; steady growth of compilation time from release to release #3646. We think we clawed most of that back though by changing what was compiled.
  • A big chunk of the extra file size is probably limited API support. I wouldn't expect that to impact compilation time much - the preprocessor should probably be able to eliminate large chunks of code pretty quickly.
  • Personally I'd guess the biggest difference might be because we now default to binding=True (which generates a big chunk more C code, in exchange for more introspectability). That might be a useful thing to check (if you have time to test it).

I have a few ideas about how to speed this up that I'd like to try. I wouldn't expect them to make 3.0 though - they're potentially fairly big changes.

@oscarbenjamin
Contributor

It takes about 10 seconds to cythonise:

# cython 3.0.0a11
$ time cython -2 --module-name flint._flint flint/pyflint.pyx
warning: flint/_flint.pxd:18:0: The 'DEF' statement is deprecated and will be removed in a future Cython version. Consider using global variables, constants, and in-place literals instead. See https://github.com/cython/cython/issues/4310
<snip more warnings>
real	0m10.466s
user	0m10.356s
sys	0m0.104s

That part is only slightly faster with 0.29:

# cython 0.29.32
$ time cython -2 --module-name flint._flint flint/pyflint.pyx

real	0m9.191s
user	0m9.076s
sys	0m0.112s

@da-woods
Contributor Author

One more thing to try if you have time - can you try the full compilation with the binding compiler directive set to True and then False (I think you can add it to https://github.com/fredrik-johansson/python-flint/blob/d6e239d6bf4871a4d90f389177b805fec5368e98/setup.py#L48).

If you don't have time I can have a look myself in a few days, but it looks like I'd need to work out a few dependencies first, which I'm trying to avoid.

@oscarbenjamin
Contributor

One more thing to try if you have time - can you try the full compilation with the binding compiler directive set to True and then False

That makes a big difference actually. With Cython 3.0.0a11 and binding:True I get a 3 minute build. With binding:False it's a 1.5 minute build so that's a 2x speedup.

What exactly does the binding parameter do?

(In the other timings reported below I didn't set the binding parameter at all which I guess means that it defaults to True)

If you don't have time I can have a look myself in a few days, but it looks like I'd need to work out a few dependencies first, which I'm trying to avoid.

It's not as bad as it looks, because there is a script to download and build the dependencies; that takes about 15 minutes, so the full repro is:

git clone https://github.com/fredrik-johansson/python-flint.git
cd python-flint/
source bin/activate 
bin/build_dependencies_unix.sh
# Need numpy and cython at this point
time python setup.py build_ext --inplace

(I'm assuming that you're on a Unixy system with compilers etc. Don't try this at home on Windows, although the problem is worse there for mingw64.)

The dependency build is faster if you comment out this part:
https://github.com/fredrik-johansson/python-flint/blob/d6e239d6bf4871a4d90f389177b805fec5368e98/bin/build_dependencies_unix.sh#L183-L185

Having just repeated the repro steps above I found that it actually took 3 minutes rather than the 1.5 minutes I previously reported:

$ time python setup.py build_ext --inplace
...
real	3m10.113s
user	3m8.504s
sys	0m1.468s

I guess the difference is because some parts of the compilation are maybe being cached which makes it a bit difficult to reason about timings. I've just tried repeating this a couple of times with cython 3.0.0a11 and 0.29.32:

touch src/flint/*.pyx
time python setup.py build_ext --inplace

I'm now consistently getting ~3 minutes for 3.0.0a11 and ~1 minute for 0.29.32 so a 3x slowdown. I'm not sure why this is different from previous timings.

I remember now that the reason I noticed a slowdown with cython 3.0.0a11 in the first place was because I was trying to make a coverage-enabled build which is slower in both cases. With cython 0.29:

$ time PYTHON_FLINT_COVERAGE=true python setup.py build_ext --inplace
...
real	1m36.786s
user	1m34.848s
sys	0m1.072s

And with 3.0.0a11 (actually master):

$ time PYTHON_FLINT_COVERAGE=true python setup.py build_ext --inplace
...
real	3m50.796s
user	3m48.324s
sys	0m1.460s

@da-woods
Contributor Author

What exactly does the binding parameter do?

It wraps functions in a CyFunction object, which adds a whole load of introspectable attributes. For example things like __annotations__, __defaults__. It makes them look a lot more like Python functions.

It's expected to slow things down because it does generate a lot more code to initialize these at start up. So it isn't worrying that it has an effect, although obviously it would be good if we could minimise the effect.
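As a plain-Python analogy (ordinary CPython functions, not Cython-generated code), these are the kinds of attributes that binding=True makes available on compiled functions via CyFunction:

```python
# Ordinary Python function, shown only as an analogy for the metadata
# that a CyFunction exposes when binding=True.
def scale(x, factor=2.0):
    """Multiply x by factor."""
    return x * factor

print(scale.__name__)      # 'scale'
print(scale.__defaults__)  # (2.0,)
print(scale.__doc__)       # 'Multiply x by factor.'
```

With binding=False, compiled functions behave more like builtin functions and most of this metadata (and the startup code that initializes it) is absent.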

(In the other timings reported below I didn't set the binding parameter at all which I guess means that it defaults to True)

It defaults to True in Cython 3 and False in 0.29.x, so it's likely one of the more significant differences.

I guess the difference is because some parts of the compilation are maybe being cached which makes it a bit difficult to reason about timings.

Passing --force to setup.py should eliminate most of the caching. No need to rerun everything though.

Thanks

@scoder
Contributor

scoder commented Dec 15, 2022

It might also be interesting to see the differences in symbol sizes that the C compiler generates from 0.29.x and 3.0 C sources. You can list those for a given extension module with
`nm --print-size --size-sort --radix=d some_module.*so`
The largest single symbol is usually the module init function (__pyx_pymod_exec_…). We already make gcc optimise it for size rather than speed, because it's normally only executed once on module import. It would be interesting to see the size differences of the largest symbols for a real project.

@pitrou
Contributor

pitrou commented Feb 21, 2024

Given the compilation time problems we're having on PyArrow, I looked at our main workhorse Cython-compiled extension module and get the following symbol size numbers:

(edit: for the record, this is using gcc)

  • in debug mode:
$ nm  --demangle --print-size --size-sort --radix=d  pyarrow/lib.cpython-310-x86_64-linux-gnu.so | tail -n20
0000000002003445 0000000000010149 t __pyx_pf_7pyarrow_3lib_15SparseCOOTensor_8from_scipy(_object*, _object*)
0000000001912935 0000000000010317 t __pyx_pf_7pyarrow_3lib_204table(_object*, _object*, _object*, _object*, _object*, _object*)
0000000002271687 0000000000010507 t __pyx_pf_7pyarrow_3lib_10PythonFile___cinit__(__pyx_obj_7pyarrow_3lib_PythonFile*, _object*, _object*)
0000000000485015 0000000000010534 t __pyx_pf_7pyarrow_3lib_14_PandasAPIShim_34__reduce_cython__(__pyx_obj_7pyarrow_3lib__PandasAPIShim*)
0000000000402299 0000000000010703 t __pyx_f_7pyarrow_3lib__build_info()
0000000002242800 0000000000011019 t __pyx_pf_7pyarrow_3lib_10NativeFile_68download(__pyx_obj_7pyarrow_3lib_NativeFile*, _object*, _object*)
0000000001957916 0000000000011321 t __pyx_pf_7pyarrow_3lib_12TableGroupBy_2aggregate(_object*, _object*, _object*)
0000000001273059 0000000000011418 t __pyx_pf_7pyarrow_3lib_5Array_85__arrow_c_array__(__pyx_obj_7pyarrow_3lib_Array*, _object*)
0000000001931883 0000000000011660 t __pyx_pf_7pyarrow_3lib_208_from_pydict(_object*, _object*, _object*, _object*, _object*)
0000000001758680 0000000000012254 t __pyx_pf_7pyarrow_3lib_11RecordBatch_40__arrow_c_array__(__pyx_obj_7pyarrow_3lib_RecordBatch*, _object*)
0000000000419829 0000000000013497 t __pyx_f_7pyarrow_3lib_14_PandasAPIShim__import_pandas(__pyx_obj_7pyarrow_3lib__PandasAPIShim*, int)
0000000000366641 0000000000013884 t __pyx_f_7pyarrow_3lib_convert_status(arrow::Status const&)
0000000001432614 0000000000015636 t __pyx_pf_7pyarrow_3lib_11StructArray_6from_arrays(_object*, _object*, _object*, _object*, _object*)
0000000002128423 0000000000018415 t __pyx_pf_7pyarrow_3lib_15SparseCSFTensor_6from_numpy(_object*, _object*, _object*, _object*, _object*, _object*)
0000000001090086 0000000000027905 t __pyx_pf_7pyarrow_3lib_176array(_object*, _object*, _object*, _object*, _object*, _object*, int, __pyx_obj_7pyarrow_3lib_MemoryPool*)
0000000005415008 0000000000035392 b __pyx_mstate_global_static
0000000003208716 0000000000104462 t __Pyx_modinit_type_init_code()
0000000002772160 0000000000167129 t __Pyx_CreateStringTabAndInitStrings()
0000000002941621 0000000000259781 t __Pyx_InitCachedConstants()
0000000003315267 0000000000412803 t __pyx_pymod_exec_lib(_object*)
  • in release mode (-O2):
$ nm  --demangle --print-size --size-sort --radix=d  pyarrow/lib.cpython-310-x86_64-linux-gnu.so | tail -n20
0000000001320832 0000000000007236 t __pyx_pw_7pyarrow_3lib_5Table_25cast(_object*, _object* const*, long, _object*)
0000000003106304 0000000000007332 r __pyx_doc_7pyarrow_3lib_18_PandasConvertible_to_pandas
0000000002621056 0000000000007467 t __pyx_pw_7pyarrow_3lib_5Array_23from_buffers(_object*, _object* const*, long, _object*)
0000000001941360 0000000000007472 t __pyx_f_7pyarrow_3lib_convert_status(arrow::Status const&)
0000000002020464 0000000000007479 t __pyx_pw_7pyarrow_3lib_10NativeFile_69download(_object*, _object* const*, long, _object*)
0000000002679152 0000000000007497 t __pyx_f_7pyarrow_3lib__array_like_to_pandas(_object*, _object*, _object*)
0000000002519488 0000000000007583 t __pyx_pf_7pyarrow_3lib_16KeyValueMetadata___init__(__pyx_obj_7pyarrow_3lib_KeyValueMetadata*, _object*, _object*)
0000000001702608 0000000000007622 t __pyx_pw_7pyarrow_3lib_12TableGroupBy_3aggregate(_object*, _object* const*, long, _object*)
0000000001883792 0000000000007765 t __pyx_pw_7pyarrow_3lib_10PythonFile_1__cinit__(_object*, _object*, _object*)
0000000002607376 0000000000008111 t __pyx_pw_7pyarrow_3lib_15SparseCOOTensor_9from_scipy(_object*, _object* const*, long, _object*)
0000000001372256 0000000000008409 t __pyx_pw_7pyarrow_3lib_21FixedShapeTensorArray_5from_numpy_ndarray(_object*, _object* const*, long, _object*)
0000000002629024 0000000000012271 t __pyx_pw_7pyarrow_3lib_11StructArray_7from_arrays(_object*, _object* const*, long, _object*)
0000000001776128 0000000000013574 t __pyx_pw_7pyarrow_3lib_209_from_pydict(_object*, _object* const*, long, _object*)
0000000002593680 0000000000013687 t __pyx_pw_7pyarrow_3lib_15SparseCSFTensor_7from_numpy(_object*, _object* const*, long, _object*)
0000000000491111 0000000000032195 t __Pyx_modinit_type_init_code()
0000000002317424 0000000000033296 t __pyx_pw_7pyarrow_3lib_177array(_object*, _object* const*, long, _object*)
0000000003815264 0000000000035392 b __pyx_mstate_global_static
0000000000251008 0000000000092059 t __Pyx_InitCachedConstants()
0000000000347361 0000000000141382 t __Pyx_CreateStringTabAndInitStrings()
0000000000530742 0000000000423977 t __pyx_pymod_exec_lib(_object*)
  • in release mode, optimized for small size (-Os):
$ nm  --demangle --print-size --size-sort --radix=d  pyarrow/lib.cpython-310-x86_64-linux-gnu.so | tail -n20
0000000002023037 0000000000004957 t __pyx_pw_7pyarrow_3lib_10PythonFile_1__cinit__(_object*, _object*, _object*)
0000000001252361 0000000000005535 t __pyx_pw_7pyarrow_3lib_21FixedShapeTensorArray_5from_numpy_ndarray(_object*, _object* const*, long, _object*)
0000000001981287 0000000000005559 t __pyx_f_7pyarrow_3lib__array_like_to_pandas(_object*, _object*, _object*)
0000000001669293 0000000000005619 t __pyx_pw_7pyarrow_3lib_10NativeFile_69download(_object*, _object* const*, long, _object*)
0000000001534145 0000000000005638 t __pyx_f_7pyarrow_3lib_convert_status(arrow::Status const&)
0000000001875635 0000000000005688 t __pyx_pw_7pyarrow_3lib_5Array_23from_buffers(_object*, _object* const*, long, _object*)
0000000001955185 0000000000005827 t __pyx_pw_7pyarrow_3lib_15SparseCOOTensor_9from_scipy(_object*, _object* const*, long, _object*)
0000000001918580 0000000000005859 t __pyx_pf_7pyarrow_3lib_16KeyValueMetadata___init__(__pyx_obj_7pyarrow_3lib_KeyValueMetadata*, _object*, _object*)
0000000001435273 0000000000006152 t __pyx_pw_7pyarrow_3lib_205table(_object*, _object* const*, long, _object*)
0000000002729056 0000000000006871 r __pyx_k_Convert_to_a_pandas_compatible
0000000002413984 0000000000007332 r __pyx_doc_7pyarrow_3lib_18_PandasConvertible_to_pandas
0000000001488793 0000000000007538 t __pyx_pw_7pyarrow_3lib_209_from_pydict(_object*, _object* const*, long, _object*)
0000000001964279 0000000000008826 t __pyx_pw_7pyarrow_3lib_11StructArray_7from_arrays(_object*, _object* const*, long, _object*)
0000000001924854 0000000000009528 t __pyx_pw_7pyarrow_3lib_15SparseCSFTensor_7from_numpy(_object*, _object* const*, long, _object*)
0000000001831983 0000000000025197 t __pyx_pw_7pyarrow_3lib_177array(_object*, _object* const*, long, _object*)
0000000000480787 0000000000032090 t __Pyx_modinit_type_init_code()
0000000003090272 0000000000035392 b __pyx_mstate_global_static
0000000000250872 0000000000092090 t __Pyx_InitCachedConstants()
0000000000346362 0000000000130626 t __Pyx_CreateStringTabAndInitStrings()
0000000000518452 0000000000456334 t __pyx_pymod_exec_lib(_object*)

The initialization functions are positively huge...

@pitrou
Contributor

pitrou commented Feb 21, 2024

Looking at one of the initialization functions, I see a lot of repeated snippets for creating function objects, for example:

  /* "pyarrow/memory.pxi":227
 *
 *
 * def log_memory_allocations(enable=True):             # <<<<<<<<<<<<<<
 *     """
 *     Enable or disable memory allocator logging for debugging purposes
 */
  __pyx_t_7 = __Pyx_CyFunction_New(&__pyx_mdef_7pyarrow_3lib_37log_memory_allocations, 0, __pyx_n_s_log_memory_allocations, NULL, __pyx_n_s_pyarrow_lib, __pyx_d, ((PyObject *)__pyx_codeobj__302)); if (unlikely(!__pyx_t_7)) __PYX_ERR(14, 227, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __Pyx_CyFunction_SetDefaultsTuple(__pyx_t_7, __pyx_tuple__303);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_log_memory_allocations, __pyx_t_7) < 0) __PYX_ERR(14, 227, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;

It seems that those could generate much shorter code by walking an array of constant-initialized structures, for example (not sure that's flexible enough):

typedef struct {
    PyMethodDef* ml;
    int flags;
    PyObject* qualname;
    PyObject* closure;
    PyObject* module;
    PyObject* globals;
    PyObject* code;
    PyObject* defaults_tuple;
    int f_index;
    int lineno;
} GlobalFunctionDef;

void init_func(void) {
    GlobalFunctionDef function_defs[] = {
        {&__pyx_mdef_7pyarrow_3lib_37log_memory_allocations, 0,
         __pyx_n_s_log_memory_allocations, NULL, __pyx_n_s_pyarrow_lib,
         __pyx_d, ((PyObject *)__pyx_codeobj__302), __pyx_tuple__303,
         14, 227},
        // ... other functions ...
    };
    PyObject *__pyx_t_7;

    /* note: sizeof(function_defs) alone would give the byte size,
       not the element count */
    for (size_t i = 0; i < sizeof(function_defs) / sizeof(function_defs[0]); ++i) {
        GlobalFunctionDef* def = &function_defs[i];
        __pyx_t_7 = __Pyx_CyFunction_New(def->ml, def->flags, def->qualname,
                                         def->closure, def->module, def->globals,
                                         def->code);
        if (unlikely(!__pyx_t_7)) __PYX_ERR(def->f_index, def->lineno, __pyx_L1_error);
        __Pyx_GOTREF(__pyx_t_7);
        if (def->defaults_tuple) {
            __Pyx_CyFunction_SetDefaultsTuple(__pyx_t_7, def->defaults_tuple);
        }
        if (PyDict_SetItem(__pyx_d, def->qualname, __pyx_t_7) < 0) {
            __PYX_ERR(def->f_index, def->lineno, __pyx_L1_error);
        }
        __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
    }
}

@pitrou
Contributor

pitrou commented Feb 21, 2024

I'm looking at the disassembly of __pyx_pymod_exec_lib and something else stands out: more than half of the disassembly output is made up of code snippets like this:

   aa2a4:	31 c0                	xor    %eax,%eax
   aa2a6:	31 d2                	xor    %edx,%edx
   aa2a8:	31 f6                	xor    %esi,%esi
   aa2aa:	45 31 ff             	xor    %r15d,%r15d
   aa2ad:	48 89 85 b0 fe ff ff 	mov    %rax,-0x150(%rbp)
   aa2b4:	31 c0                	xor    %eax,%eax
   aa2b6:	31 db                	xor    %ebx,%ebx
   aa2b8:	31 c9                	xor    %ecx,%ecx
   aa2ba:	48 89 85 c0 fe ff ff 	mov    %rax,-0x140(%rbp)
   aa2c1:	45 31 db             	xor    %r11d,%r11d
   aa2c4:	45 31 d2             	xor    %r10d,%r10d
   aa2c7:	45 31 f6             	xor    %r14d,%r14d
   aa2ca:	48 89 95 a0 fe ff ff 	mov    %rdx,-0x160(%rbp)
   aa2d1:	45 31 e4             	xor    %r12d,%r12d
   aa2d4:	31 d2                	xor    %edx,%edx
   aa2d6:	45 31 ed             	xor    %r13d,%r13d
   aa2d9:	48 89 b5 a8 fe ff ff 	mov    %rsi,-0x158(%rbp)
   aa2e0:	41 b8 fe d2 04 00    	mov    $0x4d2fe,%r8d
   aa2e6:	48 8d 35 aa 5c 1f 00 	lea    0x1f5caa(%rip),%rsi        # 29ff97 <_fini+0x19db>
   aa2ed:	b8 01 00 00 00       	mov    $0x1,%eax
   aa2f2:	41 b9 01 00 00 00    	mov    $0x1,%r9d
   aa2f8:	e9 17 e7 03 00       	jmp    e8a14 <__pyx_pymod_exec_lib(_object*)+0x670de>
   aa2fd:	45 31 c9             	xor    %r9d,%r9d
   aa300:	45 31 d2             	xor    %r10d,%r10d
   aa303:	45 31 db             	xor    %r11d,%r11d
   aa306:	45 31 e4             	xor    %r12d,%r12d
   aa309:	4c 89 8d b0 fe ff ff 	mov    %r9,-0x150(%rbp)
   aa310:	45 31 ff             	xor    %r15d,%r15d
   aa313:	31 db                	xor    %ebx,%ebx
   aa315:	31 c9                	xor    %ecx,%ecx
   aa317:	4c 89 95 c0 fe ff ff 	mov    %r10,-0x140(%rbp)
   aa31e:	31 d2                	xor    %edx,%edx
   aa320:	45 31 d2             	xor    %r10d,%r10d
   aa323:	45 31 f6             	xor    %r14d,%r14d
   aa326:	4c 89 9d a0 fe ff ff 	mov    %r11,-0x160(%rbp)
   aa32d:	45 31 ed             	xor    %r13d,%r13d
   aa330:	45 31 db             	xor    %r11d,%r11d
   aa333:	41 b8 00 d3 04 00    	mov    $0x4d300,%r8d
   aa339:	4c 89 a5 a8 fe ff ff 	mov    %r12,-0x158(%rbp)
   aa340:	48 8d 35 50 5c 1f 00 	lea    0x1f5c50(%rip),%rsi        # 29ff97 <_fini+0x19db>
   aa347:	45 31 e4             	xor    %r12d,%r12d
   aa34a:	b8 01 00 00 00       	mov    $0x1,%eax
   aa34f:	41 b9 01 00 00 00    	mov    $0x1,%r9d
   aa355:	e9 ba e6 03 00       	jmp    e8a14 <__pyx_pymod_exec_lib(_object*)+0x670de>

and, looking at the sites where these code snippets are jumped from, it seems each of them corresponds to the expansion of __PYX_ERR.

@da-woods
Contributor Author

Looking at one of the initialization functions, I see a lot of repeated snippets for creating function objects,
[...]
It seems that those could generate much shorter code by walking an array of constant-initialized structures,

Repeating what I said on the mailing list: it isn't quite as simple as that because the function objects need to be created at the right time in the execution, which might be interspersed with other lines. But there may well still be useful optimizations there.

(I think it might also be tricky based on how we represent these internally which is "something that makes a function object, followed by an assignment". That's more of an excuse than a good technical reason.)

@da-woods
Contributor Author

it seems each of them corresponds to the expansion of __PYX_ERR.

To me __PYX_ERR looks fairly simple - it's difficult to see how it could be shrunk (or factored out into a function call) but it may well be worth a look.

@matusvalo
Contributor

To me __PYX_ERR looks fairly simple - it's difficult to see how it could be shrunk (or factored out into a function call) but it may well be worth a look.

Is it possible to refactor __PYX_ERR into a function call if it contains a goto? What I did instead was the following (stupid and broken) refactor of __PYX_MARK_ERR_POS:

[image: refactored __PYX_MARK_ERR_POS macro]

I tried comparing the result size for Cython's ParseTreeTransforms.py file. I was able to shrink the .so file from 1.8M to 1.6M with the standard clang parameters coming from the cythonize command.

@scoder
Contributor

scoder commented Feb 21, 2024 via email

@da-woods
Contributor Author

But what I did is to do following (stupid and broken) refactor of __PYX_MARK_ERR_POS:

Doesn't quite work: __pyx_filename, __pyx_lineno, and __pyx_clineno are local variables in most functions (but global variables in the initialization functions), so you'd need to take their addresses and pass them into __PYX_MARK_ERR_POS.

@matusvalo
Contributor

matusvalo commented Feb 21, 2024

But what I did is to do following (stupid and broken) refactor of __PYX_MARK_ERR_POS:

Doesn't quite work: __pyx_filename, __pyx_lineno, and __pyx_clineno are local variables in most functions (but global variables in the initialization functions), so you'd need to take their addresses and pass them into __PYX_MARK_ERR_POS.

I know, that's why I called the refactor stupid and broken. I was just curious how much space could be saved by moving from a macro to a function. My only goal was to get compilable code.

@da-woods
Contributor Author

I think it's also worth having another look at the string table generation. I think I might have made it significantly worse as part of the "module state" code refactor, and I can probably see a fairly easy way out of it. I'll give that a go later this week.

@pitrou
Contributor

pitrou commented Feb 22, 2024

Repeating what I said on the mailing list: it isn't quite as simple as that because the function objects need to be created at the right time in the execution, which might be interspersed with other lines.

Oh, I see. I was probably being a bit simplistic here :-)

@pitrou
Contributor

pitrou commented Feb 22, 2024

By the way, something much simpler might be to move each type initialization to its own function, so as to avoid having a single extremely large init function.

@scoder
Contributor

scoder commented Feb 22, 2024

By the way, something much simpler might be to move each type initialization to its own function, so as to avoid having a single extremely large init function.

See the ticket description.

scoder pushed a commit that referenced this issue Feb 22, 2024
…H-6018)

The stringtab can be a global constant rather than a function local variable that's largely initialized at runtime.

Storage of cached strings in the module state is converted to a big array of PyObjects, which means some code can be replaced with loops.

See #4425
@da-woods
Copy link
Contributor Author

I'll try to have a look at implementing a "tuple table" and "code object table". I suspect optimizing the int initialization isn't worth it.

I think that may be all the metaphorical low-hanging fruit here though

da-woods added a commit to da-woods/cython that referenced this issue Feb 24, 2024
Hopefully helps with cython#4425.

It reduces binary object size by ~0.5% (module-dependent of course).

Code objects can be initialized after all other Python constants
because no other literals can depend on them. Unfortunately I
wasn't able to use a module-level constant global table for them
because I was fighting with the module name defines.
That's an improvement for the future when we've got further with
module-state.
@matusvalo
Copy link
Contributor

matusvalo commented Feb 24, 2024

Is it even possible to refactor __PYX_ERR into a function call, given that it contains a goto? What I did instead was the following (stupid and broken) refactor of __PYX_MARK_ERR_POS:

[screenshot of the refactored macro]

I compared the resulting sizes for Cython's ParseTreeTransforms.py file. I was able to shrink the .so file from 1.8M to 1.6M with the standard clang parameters coming from the cythonize command.

I played around a little, and it turned out that refactoring __PYX_MARK_ERR_POS into a static function is not what reduces the resulting binary size. The main culprit is the __LINE__ macro. I found that Cython can be configured not to generate C line numbers (and hence to avoid the __LINE__ macro) by setting c_line_in_traceback=False. This option is not documented and is available only from Cython.Build.cythonize(). E.g. when I compiled ParseTreeTransforms.py using the following setup.py:

from setuptools import setup
from Cython.Build import cythonize

setup(
    name="My hello app",
    ext_modules=cythonize("ParseTreeTransforms.py", c_line_in_traceback=False),
)

I was able to reduce the resulting binary size by 4%.

Note

I tried this only with clang; I'm not sure whether the results will be the same with GCC/MSVC.

@scoder
Copy link
Contributor

scoder commented Feb 24, 2024 via email

@da-woods
Copy link
Contributor Author

da-woods commented Feb 24, 2024

I found out that Cython can be configured to not generate C line number (and hence avoiding __LINE__ macro) by setting c_line_in_traceback=False

So it can also be controlled by the CYTHON_CLINE_IN_TRACEBACK C macro (which is off by default). Both need to be on for it to work. The C macro is also undocumented.

I'm not quite sure why there are two controls for it, but that suggests the cline = __LINE__ assignment should probably also be guarded by the macro, given that it's pointless most of the time.

Edit:

It's more complicated than this - there are three controls:

  • Options.c_line_in_traceback, which can turn it off at the Cythonizing stage
  • CYTHON_CLINE_IN_TRACEBACK, which can be set to either enable or disable it at C compile time, or left unset...
  • If CYTHON_CLINE_IN_TRACEBACK is unset, it becomes a runtime option controlled by an attribute set on the "cython runtime dict"

This should probably all be documented somewhere, and maybe made more efficient if possible.

@da-woods
Copy link
Contributor Author

There's a related traceback generation change that also seems to get a decent reduction in size: #6032. (This issue is more about compile time than size, but I suspect what improves one will improve the other in this case.)

@pitrou
Copy link
Contributor

pitrou commented Feb 24, 2024

Can you confirm that --no-c-in-traceback is the same as Options.c_line_in_traceback?
I've just tried it on PyArrow and it has considerably reduced compile times on Windows:
apache/arrow#40225

@matusvalo
Copy link
Contributor

matusvalo commented Feb 24, 2024

Can you confirm that --no-c-in-traceback is the same as Options.c_line_in_traceback?

Yes, I can confirm: no-c-in-traceback is the logical opposite of c_line_in_traceback.

$ grep --color -RI c_line_in_traceback Cython
Cython/Distutils/old_build_ext.py:                    c_line_in_traceback = not no_c_in_traceback,
Cython/Distutils/build_ext.py:            'c_line_in_traceback': not getattr(ext, 'no_c_in_traceback', 0),
Cython/Compiler/Options.py:                         'c_line_in_traceback', 'gdb_debug',
Cython/Compiler/Options.py:    c_line_in_traceback=True,
Cython/Compiler/ModuleNode.py:        c_line_in_traceback=options.c_line_in_traceback)
Cython/Compiler/ModuleNode.py:        if options.c_line_in_traceback:
Cython/Compiler/CmdLine.py:    parser.add_argument("--no-c-in-traceback", dest='c_line_in_traceback', action='store_false', help=SUPPRESS)
Cython/Compiler/Tests/TestCmdLine.py:        self.assertEqual(options.c_line_in_traceback, False)
Cython/Compiler/Tests/TestCmdLine.py:        self.check_default_options(options, ['c_line_in_traceback'])
Cython/Compiler/Code.py:    # c_line_in_traceback boolean         append the c file and line number to the traceback for exceptions?
Cython/Compiler/Code.py:    def __init__(self, emit_linenums=True, emit_code_comments=True, c_line_in_traceback=True):
Cython/Compiler/Code.py:        self.c_line_in_traceback = c_line_in_traceback

Again, it seems to be an undocumented parameter of the cython command... :-(

parser.add_argument("--no-c-in-traceback", dest='c_line_in_traceback', action='store_false', help=SUPPRESS)

Moreover, it is a mess. We have Options.c_line_in_traceback, no-c-in-traceback which is basically its opposite, plus a magical macro which somehow affects the C line traceback and which, by the way, I was not able to get working (maybe I am just stupid though :-D)

@scoder
Copy link
Contributor

scoder commented Feb 25, 2024 via email

da-woods added a commit that referenced this issue Feb 27, 2024
Hopefully helps with #4425.

It reduces binary object size by ~0.5% (module-dependent of course).

Code objects can be initialized after all other Python constants
because no other literals can depend on them. Unfortunately I
wasn't able to use a module-level constant global table for them
because I was fighting with the module name defines.
That's an improvement for the future when we've got further with
module-state.
@scoder
Copy link
Contributor

scoder commented Feb 27, 2024

Linking to the improvements that we made so far for 3.1:

More efficient string constant storage:
f39526d
368a750 (fixed in f5f83fb)
0d5af7b

Shorter code generation:
e6c621a
6526ecf
9047418

EDIT: Overall, from what I tried, this can reduce the extension module size by 4-10%.

@pitrou
Copy link
Contributor

pitrou commented Feb 27, 2024

Thanks a lot @scoder . The feedback and improvements here are really appreciated.
If there's an easy way to install a development version of Cython, we could give it a quick try on PyArrow.

@da-woods
Copy link
Contributor Author

CyFunction generation: I actually think that most of them could just be pre-generated in a loop. I think the ones that can't would be those that:

  • have closures (doesn't apply to global scope anyway)
  • have non-literal default values (and we already make this distinction in the code generation)

We'd only be able to generate them in a loop - actually putting them in the module or class dicts would have to be done later, since the order really does matter. But potentially there are improvements there.

No branches or pull requests

5 participants