
Consider adding '--enable-optimizations' configure argument #160

Closed
hannosch opened this issue Nov 27, 2016 · 42 comments · Fixed by #404
Labels
Request Request for image modification or feature

Comments

@hannosch

hannosch commented Nov 27, 2016

Python has recently added a new configure argument called --enable-optimizations; one of the relevant issues is https://bugs.python.org/issue26359.

The new argument turns on compile-time optimizations relevant to production builds. Currently this means profile-guided optimization (PGO). PGO builds take a lot longer (possibly 20-40 minutes) to produce, but the resulting Python binary is about 10% faster at executing Python code (according to Python's new benchmark suite).

The new configure flag is available in the recently released Python 3.6.0b4 and was backported to the 3.5 and 2.7 branches, so it should become available in Python 2.7.13 and 3.5.3.

In the future the Python developers may decide to turn on further optimizations based on this argument, for example link-time optimizations (LTO), though they haven't worked out all the bugs for that one yet.

I think the docker images should use this argument, to benefit from the added optimizations.
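
For illustration, a minimal sketch of what the change would look like in the build step (the real Dockerfiles pass quite a few more configure options than shown here):

# sketch only, not the actual Dockerfile contents
./configure --enable-shared --enable-optimizations
make -j "$(nproc)"
make install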

Thoughts?

@yosifkit
Member

I think it may be worth considering. I see why it takes so much longer; it forces a run of all the tests:

Running code to generate profile data (this can take a while):
make run_profile_task
make[1]: Entering directory '/usr/src/python'
: # FIXME: can't run for a cross build
LD_LIBRARY_PATH=/usr/src/python ./python -m test.regrtest --pgo || true
Run tests sequentially
0:00:00 [  1/405] test_grammar
....

@hannosch
Author

Yes, PGO means you compile all the code once with profiling enabled, then execute some code that exercises it in a typical way, and finally compile everything again using the generated profile data as guidance. Python for now chose to use its regression test suite as the "typical code" to run and generate the profiling data.
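
For anyone unfamiliar with the mechanics, a generic GCC sketch of those three steps (CPython's Makefile drives all of this for you) looks roughly like:

gcc -fprofile-generate -O2 app.c -o app    # 1. instrumented build
./app < typical-workload.txt               # 2. run representative code; writes .gcda profile data
gcc -fprofile-use -O2 app.c -o app         # 3. rebuild, guided by the collected profile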

Personally I don't think running all their tests is the best thing to do, but that's an issue to bring up with the Python folks and at least in my mind outside the scope of the docker image for it.

@yosifkit
Member

It definitely is much longer:

$ time docker build --no-cache 3.6-rc/
...
real	1m38.180s
user	0m0.157s
sys	0m0.069s

$ # after adding `--enable-optimizations`
$ time docker build 3.6-rc/
real	34m21.418s
user	0m0.267s
sys	0m0.124s

$ # the first build even includes the `apt-get update && apt-get install tcl tk` layer while the second used cache

@fjorgemota

fjorgemota commented Jan 6, 2017

Well, we have another problem (besides the performance issue) to solve before enabling that flag.

Just out of curiosity: when building on Debian (modifying the Dockerfile for python:3.6 in the usual way), everything went OK.

But when changing the Dockerfile for python:3.6-alpine, a segfault occurs in the ctypes tests while Python runs the test suite used to generate the profile:

...
0:05:02 [ 85/405] test_crypt
0:05:03 [ 86/405] test_csv
0:05:03 [ 87/405] test_ctypes
Fatal Python error: Segmentation fault

Current thread 0x00007fb99acd4b28 (most recent call first):
  File "/usr/src/python/Lib/ctypes/test/test_as_parameter.py", line 85 in test_callbacks
  File "/usr/src/python/Lib/unittest/case.py", line 601 in run
  File "/usr/src/python/Lib/unittest/case.py", line 649 in __call__
  File "/usr/src/python/Lib/unittest/suite.py", line 122 in run
  File "/usr/src/python/Lib/unittest/suite.py", line 84 in __call__
  File "/usr/src/python/Lib/unittest/suite.py", line 122 in run
  File "/usr/src/python/Lib/unittest/suite.py", line 84 in __call__
  File "/usr/src/python/Lib/unittest/suite.py", line 122 in run
  File "/usr/src/python/Lib/unittest/suite.py", line 84 in __call__
  File "/usr/src/python/Lib/unittest/suite.py", line 122 in run
  File "/usr/src/python/Lib/unittest/suite.py", line 84 in __call__
  File "/usr/src/python/Lib/unittest/suite.py", line 122 in run
  File "/usr/src/python/Lib/unittest/suite.py", line 84 in __call__
  File "/usr/src/python/Lib/test/support/__init__.py", line 1746 in run
  File "/usr/src/python/Lib/test/support/__init__.py", line 1870 in _run_suite
  File "/usr/src/python/Lib/test/support/__init__.py", line 1904 in run_unittest
  File "/usr/src/python/Lib/test/libregrtest/runtest.py", line 164 in test_runner
  File "/usr/src/python/Lib/test/libregrtest/runtest.py", line 165 in runtest_inner
  File "/usr/src/python/Lib/test/libregrtest/runtest.py", line 129 in runtest
  File "/usr/src/python/Lib/test/libregrtest/main.py", line 343 in run_tests_sequential
  File "/usr/src/python/Lib/test/libregrtest/main.py", line 418 in run_tests
  File "/usr/src/python/Lib/test/libregrtest/main.py", line 490 in _main
  File "/usr/src/python/Lib/test/libregrtest/main.py", line 468 in main
  File "/usr/src/python/Lib/test/libregrtest/main.py", line 532 in main
  File "/usr/src/python/Lib/test/regrtest.py", line 46 in _main
  File "/usr/src/python/Lib/test/regrtest.py", line 50 in <module>
  File "/usr/src/python/Lib/runpy.py", line 85 in _run_code
  File "/usr/src/python/Lib/runpy.py", line 193 in _run_module_as_main
Segmentation fault (core dumped)
...

After that segfault the tests stop, and so, I presume, does generation of the profile needed for the optimizations. I did not dig deeper into the problem, but I think it is worth mentioning. :)

@philtay

philtay commented Jan 18, 2017

Please also consider adding the --with-lto flag. It enables link-time optimizations (not implied by --enable-optimizations).

@Kentzo

Kentzo commented Jan 18, 2017

@philtay From what I read in the release notes of the related issue and in the resulting patch, LTO is enabled, except on macOS.

@hannosch
Author

hannosch commented Jan 18, 2017

@Kentzo the release notes and bugs are a bit hard to follow, but --with-lto is not implied by --enable-optimizations. If you look at configure.ac itself, it's stated fairly clearly (https://hg.python.org/cpython/file/3.5/configure.ac#l1249):

# Intentionally not forcing Py_LTO='true' here.  Too many toolchains do not
# compile working code using it and both test_distutils and test_gdb are
# broken when you do managed to get a toolchain that works with it.  People
# who want LTO need to use --with-lto themselves.

And the way that comment sounds, I think it's too early for the general purpose Python docker image to adopt --with-lto. --enable-optimizations is meant to enable all optimizations that the Python core developers think are safe to use, which seems a more reasonable scope to me. If in the future Python's core developers change their mind, they might change what is implied by the flag.
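
In other words, a build that does want LTO on top of the optimized build has to request it explicitly, roughly:

./configure --enable-optimizations --with-lto    # --with-lto must be given separately; it is not implied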

@Kentzo

Kentzo commented Jan 18, 2017

@hannosch Interesting, missed that. Thank you.

@philtay

philtay commented Jan 19, 2017

@hannosch LTO is pretty safe. It's not enabled by default simply because it's not widely supported by build toolchains. For instance, Debian & friends have been enabling it on Python for years.

@hannosch
Author

@philtay ah, I didn't know that.

Since most of the docker images are based on Debian it stands to reason that at least those variants should be compatible and support the same options as the Debian Python packages.

It does look as though the Debian Python packages do a couple of conditional checks before enabling PGO and LTO, especially checking the architecture and gcc version. They also seem to change a bunch of other options (LTO_CFLAGS, EXTRA_OPT_CFLAGS, AR, RANLIB) as seen in http://bazaar.launchpad.net/~doko/python/pkg2.7-debian/view/head:/rules#L156

So it's probably not as easy as just adding --with-lto to the Dockerfile to get the same result.

@philtay

philtay commented Jan 20, 2017

@hannosch Debian enables LTO on the following architectures: amd64, arm64, armel, armhf, i386, powerpc, ppc64, ppc64el, s390x, x32. Docker supports only a subset of them. The LTO_CFLAGS variable is there because they enabled LTO even before official Python support (the --with-lto option). Anyway, this is the original patch from the Server Scripting Languages Optimization team at Intel Corporation (yes, they have such a team!):

http://bugs.python.org/issue25702

@yosifkit
Member

As far as I can tell, Debian does not enable LTO on arm64 for python 3.4 or 3.5 (see python2.7, python3.4, and python3.5).

I am also still very hesitant to add --enable-optimizations, since that would significantly increase the build time of the 18 versions/variants that we currently build. Plus, the optimized build doesn't even succeed on the Alpine variant.

python:2.7.13
python:2.7.13-slim
python:2.7.13-alpine
python:2.7.13-wheezy
python:3.3.6
python:3.3.6-slim
python:3.3.6-alpine
python:3.3.6-wheezy
python:3.4.5
python:3.4.5-slim
python:3.4.5-alpine
python:3.4.5-wheezy
python:3.5.2
python:3.5.2-slim
python:3.5.2-alpine
python:3.6.0
python:3.6.0-slim
python:3.6.0-alpine

@roadsideseb

I've had the same issue with the Alpine build and Python 3.6. Alpine Edge actually has a release candidate build of 3.6.1. Their build differs slightly in that it uses the system packages for libffi and expat, which fixes the segfaults in the ctypes-related tests. The build script can be found on the official packages site.
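
Roughly speaking, the change amounts to something like this (a sketch based on the Alpine Edge build script; the actual package names and configure invocation may differ):

# use the system libraries instead of the bundled copies
apk add --no-cache libffi-dev expat-dev
./configure --enable-optimizations --with-system-ffi --with-system-expat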

I've opened a PR with the changes that fix it for me. This also adds the optimization flags discussed above.

@tianon
Member

tianon commented Apr 25, 2017

Given the huge time and resource increase that --enable-optimizations implies, I'm still -1 on including it unless the real-world benefits can somehow be shown to warrant the massive increase in build times. 😞

@ksze

ksze commented Sep 19, 2018

Actually, do we even know how PGO and LTO work? Would the optimizations be CPU-specific? E.g. if you build the image on an 8th-gen Intel Core i7 CPU, could it result in a Python executable that sucks on an AMD Ryzen CPU or on older generations of Intel Core CPUs?

That's something to consider, given that Docker images are supposed to be run on any x86-64 machine.

@tianon
Member

tianon commented Sep 19, 2018

@ksze that is a really excellent point we hadn't considered! ❤️

Given that pretty compelling argument, I'm going to close this. 👍

@tianon tianon closed this as completed Sep 19, 2018
@JayH5
Contributor

JayH5 commented Sep 20, 2018

I don't think PGO is architecture-specific. At least, I can't find any reference that says it is.

Popular software such as Chrome and Firefox is compiled with PGO on some platforms, and I'm pretty sure there aren't architecture-specific (e.g. Haswell/Ryzen/Skylake) binaries of those.

@ksze

ksze commented Sep 20, 2018

@JayH5 That sounds reasonable. My initial question was also partly rhetorical. I didn't mean to completely shoot down and close the issue. I was just pointing out something we should investigate.

If we want a clear answer, we may need to ask the GCC/Clang people. I have also heard suggestions that PGO mostly just collects stats about branch hits and rearranges the code to be more cache-efficient, which shouldn't be particularly CPU-specific. An optimization flag like -O3, however, can apparently generate CPU-specific optimizations; I guess that depends on -march and -mtune. Again, best to ask the GCC/Clang gurus.
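
To illustrate the distinction (generic compiler invocations, nothing specific to the CPython build):

gcc -O2 -fprofile-use app.c -o app      # PGO: layout/branch optimizations, still generic x86-64 code
gcc -O3 -march=skylake app.c -o app     # CPU-specific: may emit AVX2 etc. and fail on older processors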

@tianon Could you re-open the issue?

@Kentzo

Kentzo commented Sep 20, 2018

LTO by itself is definitely not CPU-specific.

@tianon
Member

tianon commented Sep 20, 2018

That's fair, but my objections from #160 (comment) definitely still stand.

@tianon tianon reopened this Sep 20, 2018
@blopker
Contributor

blopker commented Dec 2, 2018

I found out why Python 2.7 was segfaulting in Alpine. The fix that was applied to Python 3 wasn't added to Python 2 as well. The PR I have applies the same fix to Python 2.7 and now there's a green build and no segfaults!

I think that, regardless of whether the optimizations are accepted, the fixes for 2.7 should probably be merged. Thoughts?

@blopker
Contributor

blopker commented Dec 2, 2018

Alright! I ran some tests over night and here are the results: https://gist.github.com/blopker/9fff37e67f14143d2757f9f2172c8ea0

The JSON improvements are pretty impressive, in my opinion. I've seen applications where half the time is spent dumping JSON.

I feel like I've beaten this one down enough to have an informed discussion so I'll wait until someone wants to go over the results. Here's the last successful build: https://travis-ci.org/docker-library/python/builds/462346459

Cheers!

@gaby

gaby commented Mar 22, 2019

Any updates regarding this?

@yosifkit
Member

Nothing has changed since #160 (comment) (see also #357 (comment)).

@gpshead

gpshead commented Jul 8, 2019

Every single computer in the world using these docker python packages is wasting 10-20% of their Python process CPU cycles as a result of these being non-optimized builds. Complaining about the build time is missing the entire point. Higher latency and cpu consumption on this single project's end building things is worth ~infinitely less than all of the wasted cpu cycles and carbon consumed from not doing this change around the world.

If you do not like the build time, you can change the profile-training workload. Currently it defaults to effectively the entire test suite, which is overkill: it runs slow tests that are unnecessary and useless as profile data. You'll get a similar efficiency boost from an optimized build if you specify a much smaller profiling workload at build time. After you configure with --enable-optimizations, pass a more limited PROFILE_TASK to your make step, such as:

make PROFILE_TASK="-m test.regrtest --pgo test_array test_base64 test_binascii test_binhex test_binop test_c_locale_coercion test_csv test_json test_hashlib test_unicode test_codecs test_traceback test_decimal test_math test_compile test_threading test_time test_fstring test_re test_float test_class test_cmath test_complex test_iter test_struct test_slice test_set test_dict test_long test_bytes test_memoryview test_io test_pickle"

The fact that Python starts up and runs the test.regrtest launcher at all is already good: that exercises the bulk of the ceval interpreter bytecode evaluation loop. The goal of the profile task, in this case the test-suite selection, is merely to exercise specific portions of CPython internals implemented in C, such as regular expressions, json and pickle, f-string formatting, unicode, the decimal type, math functions, etc. There is no need for the list to be perfect. If particular tests in it take a long time or are flaky, feel free to exclude them; you'll still see a large benefit overall.

The examples above showing super long builds would be reduced to builds taking probably 3-4x as long instead of 20-30x as long. Do it. Every docker python user in the world would see a huge benefit.

Caveats? You may find yourself needing to customize the PROFILE_TASK setting slightly for the oldest CPython builds. Differences should be minor and should only need to change at most once per X.Y release.
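
In Dockerfile terms that boils down to something like the following (a sketch; the test list shown is just an illustrative subset of the one above):

./configure --enable-optimizations
make -j "$(nproc)" PROFILE_TASK='-m test.regrtest --pgo test_json test_re test_unicode test_float'
make install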

@tianon
Member

tianon commented Jul 8, 2019

So if this profiling data doesn't change very often, why isn't it published with the upstream release? Presumably the tests were run during the release process, and the relevant (now "official") profiles could simply be published for downstream consumption? What is the gain in having everyone regenerate the same profiles themselves by running the entire test suite?

@gpshead

gpshead commented Jul 8, 2019

The profiling data changes any time any C code changes and depends on the local build configuration and system (pyconfig.h) and specific compiler versions. It is a binary build artifact that can't meaningfully be published.

@gpshead

gpshead commented Jul 8, 2019

Case in point on Docker being the odd duck here: all noteworthy Linux distros build their distribution CPython packages using profile-guided optimization (and have been doing so since before CPython's configure.ac --enable-optimizations flag was added).

@tianon
Member

tianon commented Jul 8, 2019

Ok thanks for the added background. To be clear, this is something we want to do, but we are very limited on actual compute resources, which has been our largest hesitation (and most noteworthy distros don't have that same constraint).

Is there an official (or even semi-official) recommendation for a sane limited value to supply to PROFILE_TASK? I'm obviously not an expert in the Python test suite, so I wouldn't feel comfortable maintaining that list myself (and I imagine this is the type of thing that's likely to change somewhat over time).

@gpshead

gpshead commented Jul 9, 2019

There's never been anything official; otherwise we'd presumably just put that in as the default. I suggest testing the build speed with that list and, if it still takes whatever you deem to be too long, whittling it down. If you look in the build log, it'll list which tests took the longest to run during the profiling step. For example, test_pickle seems to take a long time; it does way more work than is likely necessary for profiling purposes, so you could omit it if needed.
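
For example, something along these lines against a captured build log should surface the outliers (assuming the PGO run prints regrtest's usual "slowest tests" summary):

docker build . 2>&1 | tee build.log
grep -A 11 'slowest tests' build.log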

@tianon
Member

tianon commented Jul 9, 2019

Nice, the additional background and assistance is definitely appreciated! ❤️ 👍

I did some testing with this list you've provided (on 3.5, 3.7, and 3.8-rc builds), and the build time on my local (fairly beefy) workstation was still on the order of minutes (between 4-6 here, probably a bit more on our build servers but still entirely reasonable IMO). 🤘

@tianon
Member

tianon commented Jul 10, 2019

FWIW, https://bugs.python.org/issue36044 seems relevant too (talking about how the way the tests run isn't exactly indicative of real-world code given the way they force GC, which lends further credence to using a slimmer list). 👍

@gpshead

gpshead commented Jul 10, 2019

For posterity I ran some pyperformance suite benchmarks on a CPython 3.7 built without --enable-optimizations, then again with --enable-optimizations using the PROFILE_TASK in your merged PR:

Benchmarks range from 1.02x to 1.28x faster in the optimized build, with the bulk of them being > 1.11x faster. :)
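
For anyone wanting to reproduce a comparison like that, the rough shape of it is (interpreter paths here are hypothetical, and pyperformance/pyperf need to be installed for each interpreter):

/opt/python-plain/bin/python3 -m pyperformance run -o plain.json   # non-optimized build
/opt/python-pgo/bin/python3 -m pyperformance run -o pgo.json       # --enable-optimizations build
python3 -m pyperf compare_to plain.json pgo.json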

@blopker
Contributor

blopker commented Jul 11, 2019

Whoo! Very happy this worked out. Now we don't have to make our own images :)

Thanks everyone!

@gaby

gaby commented Jul 11, 2019

Awesome! Are these changes now live in the latest published images, e.g. slim-stretch and alpine?

@ranjeetjangra

This comment has been minimized.

@yosifkit
Member

@ranjeetjangra this is the GitHub repo for the Docker image that contains Python. For general build help you'll need to look at Stack Overflow or a Python- (or CentOS-)specific forum or mailing list.
