
Handling various special compilation optimizations/architectures #49

Closed
jakirkham opened this issue Mar 19, 2016 · 38 comments

@jakirkham
Member

Related #27

Building some low-level packages benefits significantly from special compiler options like enabling SSE and/or AVX instructions. These options may or may not exist for different target architectures. Also, in some cases, these features may end up being leveraged by the OS, so smart decisions must be made to make sure we don't incur a penalty. We should really think about how we want to approach this, as it will affect things like BLAS, NumPy, SciPy, and other low-level libraries that do significant computation.

@jakirkham
Member Author

In reply to @msarahan's comment (cc @jjhelmus @aebrahim).

I think it would be better to build multiple features.

I assume you mean for optimizations. I agree. IMHO we should stick to keeping packages here and just add new package variants for optimizations ( though @pelson may disagree :) ). I am leaning towards having the variants be totally separate packages, as was done with numpy and scipy in conda-recipes, but could be convinced otherwise if someone has a good proposal.

Default would be most compatible, but then have others for more optimizations.

Absolutely. Though for packages like OpenBLAS that allow runtime selection I think we should just build all the options and let it make the right choice at runtime.

We should take this into consideration with MKL too. Right now, I am leaning towards providing OpenBLAS as the default as it has some nice properties, though we could discuss other options like BLIS in the future. ATLAS is a bit of a tricky one to ensure it is well optimized, but its stability is an asset. Plus, it seems the larger Python community likes this as a default (cc @ogrisel).

It will be hard to build things like cvxopt, etc. without MKL headers and I don't know that we want to be paying for the MKL developer license for headers for all of conda-forge. Maybe some subset of the membership could have them and be the BLAS developer team. Another possibility might be to include the generic BLAS header and see how that goes. As IANAL, it is not clear to me whether this would be permissible.

The Numba team at Continuum is very interested in this. I'll try to get them involved.

Cool, it would be great to chat with them. We should probably move the optimization discussion to this issue ( #49 ) as it sounds like we all agree with the current build here and making these adjustments will come into play later anyways.

@jjhelmus
Contributor

Just a quick note that Intel does offer a "Community"-licensed MKL. From what I recall, Intel has offered in the past to provide the NumPy core developers access to this library to create binary releases, but this option was rejected due to the licensing terms. Specifically, the binary would no longer be a BSD-only licensed product, could cause issues if combined with GPL or similarly licensed software, and the distributor (us) must indemnify Intel against any lawsuits.

@pelson
Member

pelson commented Mar 25, 2016

IMHO we should stick to keeping packages here and just add new package variants for optimizations ( though @pelson may disagree :) )

To be clear, you are proposing adding variants within existing feedstocks, right? If so, I agree.

Right now, I am leaning towards providing OpenBLAS as the default

Agreed.

without MKL headers and I don't know that we want to be paying for the MKL developer license for headers for all of conda-forge

At this moment in time, we have no option to build against MKL, so that can be ruled out for now (though we can look at this again once we have bottomed out on OpenBLAS).

From the outset we should be labelling all packages which use OpenBLAS with an appropriate feature name. It won't be difficult in the future to turn on an extra build-matrix dimension which builds OpenBLAS/MKL/other.

@patricksnape

Hmm - so if I understand this correctly, we are proposing to have different packages (package-sse2, package-sse3, package-sse4, package-avx) for all the intrinsics? That seems like it could be a real burden, particularly if you have a fairly complicated build and then need to update and test it 4 times, once for each of the intrinsic types. Most build systems just have flags for turning intrinsics on or off - wouldn't this be very similar to features? Or are we suggesting that we also have separate packages for features?

@msarahan
Member

Sadly, I don't think there's any way around this. The build system may be able to set flags based on some feature (@ukoethe's proposal at conda/conda#1959 would probably be the right way to do this) - but we'd still need one package build at each feature level (with notable exceptions for packages like OpenBLAS that do runtime dispatch).

@patricksnape

Hmm - that's fair enough from a building standpoint, I suppose - short of kicking off multiple build profiles per feature? It still seems like it could be the source of many future bugs if very complicated build scripts must be duplicated and updated 5+ times when the majority of the script will likely be identical.

@msarahan
Member

Yeah, @ukoethe's proposal would basically turn that into a one-line Jinja thing. The only other missing part that I'm aware of would be that conda-build-all would need to add this support, and we'd also need to make it so that packages without these features don't get needlessly rebuilt (not sure how many of these there might be)

@jakirkham
Member Author

The build system may be able to set flags based on some feature...

Yes, that would be the way to go for sure. We need to be very clever to keep this simple and maintainable. @ukoethe's proposal is probably the ticket.

...with notable exceptions for packages like OpenBLAS that do runtime dispatch...

This is the easiest case really. 😄 We just need to make sure we are building all the cases into the package so it really has the full parameter range to choose from.

...short of kicking off multiple build profiles per feature

I think we need to decide which ones are actually worth supporting.

...very complicated build scripts must be duplicated...

That's why Jinja templates will be essential to avoid such duplication.

@ukoethe

ukoethe commented Mar 25, 2016

As a first step towards better build customization, I implemented some ideas in PR conda/conda-build#848. Please have a look!

@ukoethe

ukoethe commented Mar 25, 2016

but we'd still need one package build at each feature level (with notable exceptions for packages like OpenBLAS that do runtime dispatch).

We checked that FFTW also does runtime dispatch of SSE and AVX. In fact, any self-respecting numerics library should be able to do it. For those that don't, I agree that features are the most promising solution. However, it needs to be discussed whether it is better to append the feature tag to the package name or to the version number.

@ChrisBarker-NOAA
Contributor

(package-sse2, package-sse3, package-sse4, package-avx) for all the intrinsics?

I'm pretty sure when this was discussed for NumPy that SSE2 is pretty much ubiquitous, and SSE3 and SSE4 didn't make much difference.

Also, it really only matters for LAPACK and BLAS (and maybe FFT), so we don't have a good reason to support a whole matrix.

And hopefully we use a math lib that does it for us anyway.

Oh, and I thought conda-forge was building on Continuum's work anyway, so why not simply follow their lead?

-CHB


@patricksnape

@ChrisBarker-NOAA @ukoethe Runtime switching of intrinsics is true of large numerical libraries such as FFTW/BLAS, but sadly, in my experience, segfaults are more the norm. I agree that SSE2 is probably a fairly safe bet, though I'm willing to bet that eventually someone will crop up with an issue with it! Unfortunately, I have a number of packages I use/want to submit that are significantly improved by higher intrinsics levels, so some way of supporting a matrix of intrinsics would be very, very useful for me. This may not be the case for NumPy - but that is likely because NumPy does not make any use of the instructions added by SSE3/SSE4, rather than because SSE3/SSE4 do not provide a significant improvement for particular kinds of data inputs.

@ukoethe

ukoethe commented Mar 25, 2016

To control SSE and AVX capabilities by features, one needs meta-packages that install a track_features declaration for what is supported. These packages would check the present computer's capabilities upon installation (in the pre-link script) such that the installation fails when the corresponding acceleration is unsupported. However, an additional difficulty with conda's feature design arises: AVX2 implies AVX support etc., but features cannot be organized hierarchically at the moment.

@jakirkham
Member Author

Agreed on the features point. Disagree on magic added into the pre-link step.

Some OSes use these instructions for special things that can collide with other programs in unexpected ways. We shouldn't be assuming what works best for a user's system. They should be able to choose this. Of course, by having sane defaults, we should hopefully not have to worry in the base case. Users confident enough to want to enable these special instructions should know what they are doing IMHO and have to make a conscious choice.

@ukoethe

ukoethe commented Mar 30, 2016

Disagree on magic added into the pre-link step.

I didn't make myself clear. Of course, users should be able to choose AVX acceleration. Assuming there is a meta-package `use-avx` with `track_features: - avx`, users would activate acceleration by installing this package first:

conda install use-avx numpy

The ~~pre-link~~ post-link script of `use-avx` checks if the present system really supports AVX, and refuses to install if not. So an erroneous installation fails as early as possible, instead of crashing later when an unsupported instruction is executed.
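A minimal sketch of such a check, assuming Linux and that Python is available when the link scripts run (the actual post-link.sh/.bat would just invoke something like this); the `use-avx` name is the hypothetical meta-package from above:

```python
# Hypothetical check for the use-avx meta-package (Linux-only sketch).
# Exiting non-zero from the post-link step makes conda treat the installation
# as failed, so the problem surfaces at install time rather than as an
# illegal-instruction crash later.
import sys

def cpu_has_avx():
    # The "flags" line of /proc/cpuinfo lists the instruction-set extensions
    # the kernel exposes to user space.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return "avx" in line.split()
    except OSError:
        return False
    return False

if not cpu_has_avx():
    sys.exit("CPU/OS does not report AVX support; refusing to install use-avx.")
```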

@jakirkham
Member Author

How would you check this? Try to compile a simple program with the appropriate flags?

@ukoethe

ukoethe commented Mar 30, 2016

I'm not an expert on AVX, but wouldn't a little utility program (whose output gives the desired platform information) be the simplest solution? Ah, this must go into the post-link script, but you get the idea. It's the same technique I use in the vc11-runtime script to make sure that the compiler version in the PATH conforms to expectations.

Try to compile is not an option because one cannot rely on a compiler to exist.

@jakirkham
Member Author

Try to compile is not an option because one cannot rely on a compiler to exist.

Agreed. That's what I was getting at. 😉

I'm not an expert on AVX, but wouldn't a little utility program (whose output gives the desired platform information) be the simplest solution?

The simplest approach ends up being to use the compiler, TBMK. Though that is already out. Possibly some compiled program could be used here. Certainly OpenBLAS does this, so that would be a place to look, but it would be good to find a simpler example. It would be nice if it had a Python interface to make it easier to use with conda or other things here.

Ah, this must go into the post-link script, but you get the idea.

Because of installing the program to determine the configuration?

@jakirkham
Member Author

Maybe there is something usable from PeachPy.

@ukoethe

ukoethe commented Mar 30, 2016

Ah, this must go into the post-link script, but you get the idea.

Because of installing the program to determine the configuration?

Exactly. BTW, someone in our Lab is working on AVX for VIGRA, so he knows how to check this.

@jakirkham
Member Author

Maybe I'll let you guys take a crack at it first then. 😄

@patricksnape

Windows has a utility that tells you processor information.

On OS X you can use `sysctl -a | grep machdep.cpu.features`, and `sysctl -a | grep machdep.cpu.leaf7_features` for AVX2.

On Linux you can use `cat /proc/cpuinfo | grep flags`.
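A rough Python sketch tying the commands above together (a hypothetical helper, not an existing package; flag spellings differ per OS, e.g. OS X reports upper-case names such as AVX1.0, so callers would need to normalize them):

```python
# Sketch of collecting CPU feature flags using the OS facilities listed above.
import platform
import subprocess

def cpu_flags():
    system = platform.system()
    if system == "Linux":
        # /proc/cpuinfo repeats the "flags" line for every core; one is enough.
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()
    if system == "Darwin":
        # Equivalent of `sysctl -a | grep machdep.cpu.features` and
        # `sysctl -a | grep machdep.cpu.leaf7_features`.
        out = subprocess.check_output(["sysctl", "-a"]).decode(errors="replace")
        flags = set()
        for line in out.splitlines():
            if line.startswith(("machdep.cpu.features", "machdep.cpu.leaf7_features")):
                flags.update(line.split(":", 1)[1].split())
        return flags
    # Windows would need the utility mentioned above (or a CPUID wrapper).
    return set()

if __name__ == "__main__":
    print(sorted(cpu_flags()))
```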

@svenpeter42

Some OSes use these instructions for special things that can collide with other programs in unexpected ways.

This doesn't make sense. If the OS tells userspace that it will save the xmm/ymm registers during context switches (via the XCR0 register), and the correct flags are set in CPUID, a userspace program is free to use AVX (or SSE, for that matter) however it likes.

The correct solution is runtime dispatch, the way OpenBLAS or FFTW do it. /proc/cpuinfo only parses CPUID, which will however still be enough unless you have a very strangely configured system.
(Relying on a CPUID check alone will break your code if it runs on an OS that is not aware of the xmm or ymm registers, but when Linux shows avx or avx2 in cpuinfo the kernel is aware of those and will set the flags in XCR0 anyway.)

See https://github.com/svenpeter42/fastfilters/blob/master/src/avx.cxx which requires https://github.com/svenpeter42/fastfilters/blob/master/src/xgetbv.hxx and https://github.com/svenpeter42/fastfilters/blob/master/src/cpuid.hxx for a really simple check. GCC even supports __builtin_cpu_supports and function multiversioning

@patricksnape

@svenpeter42 Totally agree with everything you've said. However, we have no control over these projects as they are third party - so runtime dispatching is out of the question.

We are just trying to do the best we can at providing scientific software with intrinsics enabled and unfortunately extremely large projects like OpenCV do not perform runtime dispatching. So, trying to protect users from installing libraries that don't work on the system is the best we can do. If CPUID is lying to you then I suspect all bets are off.

Finally, it's great that GCC has those checks - but we can't use them on Windows which further complicates things. Especially since your processor may support AVX2 - but if you run on Python 2.x and thus build with VS 2008 then those intrinsics are not supported (AVX2 for example is first supported in VS 2013).

@svenpeter42

OpenCV also does runtime dispatching as far as I know :-)
http://docs.opencv.org/2.4/modules/core/doc/utility_and_system_functions_and_macros.html -> checkHardwareSupport

And CPUID doesn't necessarily lie - it merely tells you what the processor is capable of supporting. You need OS support on top of that because the kernel needs to save and restore xmm/ymm registers when a context switch happens.

The code I linked above works fine on Windows as well, FWIW. All you need to do to perform the check is to compile it with something that allows you to query CPUID and XCR0. Then you can run it on whatever outdated software you want to.

Here's my suggestion:

  • Does the package support runtime dispatching? If yes, no need for any further actions. Just compile it with a modern compiler.
  • Otherwise, perform a check during installation to see if the CPU and OS support SSE/AVX/AVX2/whatever, and choose the right package compiled with the right optimizations (a rough sketch of this selection follows below).
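Purely as illustration of the second bullet, a sketch of the selection step with made-up variant names (the real package names and the install-time hook are still to be decided):

```python
# Hypothetical install-time selection: given the CPU flags detected on the
# target machine, pick the most optimized build variant that is supported.
# The package names are invented for illustration only.
PREFERENCE = [
    ("avx2", "somepkg-avx2"),
    ("avx", "somepkg-avx"),
    ("sse2", "somepkg-sse2"),
]

def pick_variant(detected_flags):
    for flag, package in PREFERENCE:
        if flag in detected_flags:
            return package
    return "somepkg-generic"  # safe fallback without special intrinsics

print(pick_variant({"sse2", "sse3", "avx"}))  # -> somepkg-avx
```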

@patricksnape

I'm struggling to find documentation here, but I'm willing to go out on a bit of a limb. Posts like this suggest that use of checkHardwareSupport is not widespread among OpenCV users. Furthermore, that isn't really runtime dispatching, because the setUseOptimized method must be manually called by the user and only performs a boolean action (either all optimisations on, or all off). So if someone has a CPU that supports SSE* but not AVX2, for example (and we build with AVX2), they either get no intrinsics at all or an Illegal Instruction error.

Your CPUID method looks really useful though - I agree that it would be great to have a tiny Python package that just tells you what intrinsics are supported at runtime. However, I agree with @jakirkham that we should not automatically install software with intrinsics enabled by default, as we should allow users to choose whether they want them or not. We should merely prevent them from installing something compiled with an intrinsic set their hardware/OS does not support. How this is implemented is up for discussion here, I guess - since, as mentioned by @ukoethe, features are not hierarchical, and so it could require some changes to the conda infrastructure.

@svenpeter42

@patricksnape I just checked some OpenCV code - I think you are right. The codebase is too convoluted to tell for sure, though.
But I think you can just switch between "no optimization at all" and "highest optimization possible". They also seem to compile all source files with SSE instructions enabled, which will sometimes produce incompatible code when the compiler automatically vectorizes some loops.

Not automatically installing binaries which rely on AVX seems reasonable - I could imagine some users using the same conda environment on different machines (e.g. on a network mount) which may support different features.

FWIW, everything that supports 64-bit automatically supports SSE and SSE2, and gcc at least enables them by default.

@jakirkham
Member Author

Some OSes use these instructions for special things that can collide with other programs in unexpected ways.

This doesn't make sense. If the OS tells the userspace that it will save xmm/ymm registers during context switches (XCR0 register) and if the correct flags are set in the CPUID that userspace program is free to use AVX (or SSE for that matter) however it likes.

I can't remember the particular instance that caused problems ATM, but if I do I will link you to it.

@jakirkham
Member Author

Just another point: we would want to be able to select GPU implementations (or none) in some cases. So I think this again fits in this feature category, where we will want to select at run time which one is used.

@kyleabeauchamp

I wanted to check in on the status of two questions:

  1. What is the baseline "assumed" instruction set available---SSE2?
  2. For more recent instruction sets, are "features" or "recipe copies" the preferred mechanism for handling permutations of builds with more recent instruction sets?

@msarahan
Member

SSE3 for Anaconda's new compilers with default options. I think that means the same for conda-forge soon. Until then, I think SSE2 is a safe assumption.

Features in the build/features sense are not recommended. Instead, use the new CB3 variant mechanism, and create metapackages to install particular matching sets of packages. @jjhelmus made a good mockup, but I can't remember where it is posted.

@kyleabeauchamp

I bet this is what you're talking about: https://github.com/jjhelmus/selectable_pkg_variants_example

@msarahan
Member

Yep, that's exactly it. Hopefully that might work. There's a lot we hope to do to make that whole scheme simpler, but Jonathan's example is the best I've seen so far.

@jakirkham
Member Author

As the Conda compilers do a pretty good job optimizing for both high-end and low-end CPUs, we have mostly addressed this issue in a broad way. For more specific optimizations, Conda does have an archspec metapackage ( conda/conda#9930 ), which allows selecting a package based on a specific CPU (if there are different builds of a package). If we have more specific needs for CPU optimization, it may be worth opening a new issue discussing those use cases and how we want to handle them broadly. Going to go ahead and close this one as largely resolved.

@h-vetinari
Member

@jakirkham
Not sure this is fully ready yet? I'm trying to find out how I could indicate a dependence on a given CPU feature or generation, but conda/conda#9930 doesn't provide much info, and there's nothing I can find in the knowledge base or the archspec-feedstock. Also, @isuruf mentioned less than 2 months ago that this isn't ready yet.

Should we open an issue for documentation?

@beckermr
Member

I thought we had one but idk where it is. It is not ready.

@jakirkham
Member Author

If there are additional features needed from Conda or Conda-Build, please raise issues on those repos.

@ltalirz
Member

ltalirz commented May 3, 2022

Is there now some maintainer-facing documentation on how to publish package variants for multiple microarchitectures?

If not, would that be an interesting topic for a sprint at SciPy 2022? (I will be there)
