
Parallelism during Cython compilation of third party projects #2841

Closed
mzpqnxow opened this Issue Feb 13, 2019 · 4 comments

mzpqnxow commented Feb 13, 2019

Is there any easy way to parallelize Cython compilation? I have a system with 192 CPUs and 1 TB of RAM, but it still takes ~7 minutes to build Pandas (even with ccache).

I opened an issue with Pandas and the devs pointed me to Cython, though there wasn’t any certainty that anything could be done to improve the situation within Cython.

Is there any easy way to cause a third party project to Cythonize in parallel? Is this even a Cython issue?

Thanks, and I apologize if this isn’t a Cython issue, or is one that just can’t be trivially addressed by the Cython devs. I’ve not had much opportunity to work with Cython, which I hope explains my lack of direction in trying to chase this down.

I did modify the Pandas setup.py to use ‘nthreads’, but I’m looking more for something like ‘make -j’ behavior. It seems ‘nthreads’ only operates on one file at a time, whereas I’m looking for multiple files to be compiled simultaneously.

I’m wondering if the logic in each third party project would need to be rewritten to look more like this: https://github.com/cython/cython/blob/master/Tools/cystdlib.py
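For reference, the ‘nthreads’ pattern I tried looks roughly like this in a setup.py. This is a minimal sketch with hypothetical package and module names (`pkg.algos`, `pkg.parsers`), not Pandas’ actual build script:

```python
import multiprocessing

from setuptools import Extension

# Plain Extension objects; cythonize() rewrites the .pyx sources to .c.
extensions = [
    Extension("pkg.algos", ["pkg/algos.pyx"]),
    Extension("pkg.parsers", ["pkg/parsers.pyx"]),
]


def cythonize_in_parallel(exts, nthreads=None):
    """Translate .pyx -> .c with up to `nthreads` worker processes."""
    from Cython.Build import cythonize  # lazy import: build-time dependency
    return cythonize(exts, nthreads=nthreads or multiprocessing.cpu_count())


# A real setup.py would then pass the result to setup(), e.g.:
#     setup(name="pkg", ext_modules=cythonize_in_parallel(extensions))
```

Note that `nthreads` here only parallelises the .pyx-to-C translation; the C compilation of the generated files is a separate step driven by `build_ext`.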

mzpqnxow commented Feb 13, 2019

Oh, and just to be clear: I’m not talking about runtime parallelism, the GIL, etc. I’m purely talking about the process of Cythonizing code, which as I understand it takes place at build/install time of the third-party project (in this example Pandas, probably the project using Cython most heavily).

And I understand it’s not for the Cython devs to send PRs to third parties if such functionality exists; I’m happy to do that myself.

scoder commented Feb 13, 2019

In Python 3.4 (or 3.5?) and later, you can use something like python setup.py build_ext -j20 to run 20 parallel build jobs for the extension module builds. Building the whole package afterwards should then normally reuse the already compiled extensions.
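As a sketch of how that maps onto the build machinery: the `-j` flag corresponds to `build_ext`’s `parallel` option (it landed in distutils in Python 3.5). The snippet below only configures the command object to show the knob, rather than running a full build:

```python
# "python setup.py build_ext -j20" sets build_ext's "parallel" option,
# which caps the number of concurrent C compiler jobs.
import setuptools  # ensures the distutils shim on newer Pythons
from distutils.command.build_ext import build_ext
from distutils.dist import Distribution

cmd = build_ext(Distribution())
cmd.initialize_options()
cmd.parallel = 20       # same effect as "-j20" on the command line
cmd.ensure_finalized()  # validate options; a build would now use 20 jobs
```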

Other than that, yes, passing nthreads to cythonize() will also parallelise Cython’s code generation, although that is usually dominated by the C compilation time. Both options together should get you pretty far.

mzpqnxow commented Feb 17, 2019

Fair enough. I think for my usage (Python 2.7, deploying via virtualenv/requirements.txt) there’s probably not much I can do without forking Cython and making the build use a process pool. Thank you for the help; I will close this out now.

@mzpqnxow mzpqnxow closed this Feb 17, 2019

robertwb commented Feb 18, 2019

You shouldn't have to fork Cython: cythonize(..., nthreads=N) already uses a process pool to compile multiple files in parallel. Different logic is needed to get setuptools to do the C compilation in parallel (for older versions of Python).
