
[Python] Consider splitting _lib module into several parts #40166

Open

pitrou opened this issue Feb 20, 2024 · 9 comments

pitrou commented Feb 20, 2024

Describe the enhancement requested

When reading the logs of a wheel build on Windows I noticed these lines:

  lib.cpp
C:\Python312\Lib\site-packages\numpy\_core\include\numpy\ndarraytypes.h(1250,38): warning C4200: nonstandard extension used: zero-sized array in struct/union [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\Python312\Lib\site-packages\numpy\_core\include\numpy\ndarraytypes.h(1250,38): message : This member will be ignored by a defaulted constructor or copy/move assignment operator [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\arrow\python\build\temp.win-amd64-cpython-312\lib.cpp(335145,17): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\arrow\python\build\temp.win-amd64-cpython-312\lib.cpp(335645,17): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\arrow\python\build\temp.win-amd64-cpython-312\lib.cpp(335855,17): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]

Ignoring what the warnings themselves say, what stands out is that the lib.cpp generated by Cython is at least 335,000 lines long. This is huge and can obviously lead to enormous compile times, especially if there is not enough RAM for the C++ compiler to hold the entire intermediate representation(s) in memory.

We should definitely try to split the _lib module into smaller parts, in order to alleviate this problem.

(It is also a Cython problem that so much C++ code is generated, but I'm not sure we can fix that.)

Component(s)

Python


pitrou commented Feb 20, 2024

@AlenkaF @jorisvandenbossche @danepitkin I think this would be worth looking into, in the "quality of life" department.


pitrou commented Feb 20, 2024

(I also posted on the Cython users ML: https://groups.google.com/g/cython-users/c/hr3cFevY46k)


pitrou commented Feb 20, 2024

Note that our Windows wheel builds routinely take 3 hours, and it may very well be because of this:
https://github.com/ursacomputing/crossbow/actions/runs/7970517386/job/21758260987


pitrou commented Feb 20, 2024

This might also be related to the AppVeyor timeouts.


pitrou commented Feb 26, 2024

Related: #40225

@jorisvandenbossche (Member)

When it comes to "quality of life" while developing pyarrow locally, I would personally prioritize improving our build system to have proper rebuilds (#36411 (comment)), but of course I am also biased because I don't use Windows and don't see this issue locally. That said, improving build times for CI is definitely important as well.

The previous time we worked on splitting pyarrow.lib, I brought up the back-compat issue for people (c)importing from there; see #10162 (comment) and the comments below.
Of course we can decide to break that once in a release, but I would still prefer that we have a clearer story about how we recommend using pyarrow in those cases.

There might also be some smaller things we could already split off that are less controversial / less publicly used (for example benchmark.pxi, although that is only a tiny one-function file and won't help much; a bigger one might be tensor.pxi).

@jorisvandenbossche (Member)

We should maybe also experiment with ways to do this in a less breaking way. For example, can we still include things in lib.pxd so that cimport keeps working, while moving the actual implementations out of pyarrow.lib? In pure Python something like that certainly works, but I don't know offhand how Cython would deal with it.
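For the pure-Python side, the idea above is the usual facade pattern: implementations move to private submodules and the original module re-exports them, so existing imports keep resolving. A minimal, self-contained sketch (package and module names are hypothetical, not the actual pyarrow layout; whether Cython's cimport machinery can mirror this via lib.pxd is exactly the open question):

```python
import os
import sys
import tempfile
import textwrap

# Build a tiny package on disk: the implementation lives in a private
# submodule (_tensor.py), and lib.py re-exports it so the old
# `from pkg.lib import Tensor` import path keeps working.
pkgdir = tempfile.mkdtemp()
pkg = os.path.join(pkgdir, "pkg")
os.makedirs(pkg)

open(os.path.join(pkg, "__init__.py"), "w").close()

# Implementation moved out of lib into a private submodule.
with open(os.path.join(pkg, "_tensor.py"), "w") as f:
    f.write(textwrap.dedent("""\
        class Tensor:
            def __init__(self, shape):
                self.shape = shape
        """))

# The facade: lib simply re-exports the moved name.
with open(os.path.join(pkg, "lib.py"), "w") as f:
    f.write("from pkg._tensor import Tensor\n")

sys.path.insert(0, pkgdir)
from pkg.lib import Tensor  # old-style import path, still works

t = Tensor((2, 3))
print(t.shape)
```

Callers see no change, while each submodule becomes its own (much smaller) compilation unit; for a Cython build, each would be its own generated .cpp file instead of one monolithic lib.cpp.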


pitrou commented Feb 28, 2024

When it comes to "quality of life" while developing pyarrow locally, I would personally prioritize improving our build system to have proper rebuilds (#36411 (comment)),

I think we should do both :-)


pitrou commented Mar 13, 2024

Related to quality of life on incremental builds: cython/cython#6070
