
Runtime kernel management #129

Closed
fgvanzee opened this Issue May 17, 2017 · 19 comments

@fgvanzee
Contributor

fgvanzee commented May 17, 2017

BLIS currently only allows building support for one architecture (configuration) at a time. In the future, it should allow the user to build support for multiple architectures into the same library. The "correct" architecture would then be selected at runtime according to some method. That method could be CPUID (or an equivalent), or the architecture could be set to a default in some other way and then manually changed later by the user. Ideally, runtime kernel management should even allow users to link in their own kernel files at link-time, which they could then select via the same procedure used for switching among the pre-defined architectures.

This feature will require substantial changes to the build system, primarily configure, Makefile, and the make_defs.mk files. It will also redefine what we think of as a configuration and require a reorganization (and renaming) of the files in the top-level kernels directory. A registry will be needed to associate actual configuration names (e.g. haswell) with multi-architecture configuration aliases (e.g. intel64). Finally, it will require a change to the reference kernel files (as well as a relocation) so that their names can be mangled according to the targeted architecture, allowing us to build one set of reference kernels per supported configuration.

@fgvanzee fgvanzee self-assigned this May 17, 2017

@ShadenSmith

Contributor

ShadenSmith commented May 20, 2017

When the architecture is decided automatically (e.g., CPUID), will that be done on a per-kernel level, or only the first time a BLIS function is invoked?

I love the idea of this, and my software currently uses BLIS for its BLAS interface when --download-blas-lapack is provided during configuration. My concern is that in many of my use cases I call BLAS functions for small-ish kernels quite regularly, so the query overhead would add up if it must be paid each of the ~100K times that one of my kernels calls BLIS.

@iotamudelta

Contributor

iotamudelta commented May 21, 2017

@ShadenSmith good point! It seems OpenBLAS does the right thing (DTRT) here, so it may be worthwhile to check their solution?

@fgvanzee

Contributor

fgvanzee commented May 21, 2017

@ShadenSmith Thanks for your question. Even after runtime kernel management is implemented (with a CPUID-based heuristic for choosing the kernels at runtime on x86_64-based systems), we will still maintain the ability to choose your configuration (architecture) at configure-time, as BLIS currently allows, which is also done via CPUID.

Could you clarify which application of CPUID you are referring to?

@devinamatthews

Contributor

devinamatthews commented May 21, 2017

@fgvanzee I believe the question is "will BLIS check cpuid every time dgemm is called?", and from previous discussions I believe the answer is no. The only time you'd want to check more than once is on heterogeneous systems like big.LITTLE, and that case calls for a more streamlined solution than checking cpuid each and every time anyway.

The implementation in TBLIS uses (local) static initialization to perform the check and then caches the result. In BLIS, there is already some library initialization code that runs exactly once, which this could piggyback on.
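The check-once-and-cache pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual TBLIS or BLIS code; all names here are hypothetical:

```c
/* Hypothetical sketch of caching a hardware query so it runs only once. */

typedef enum { ARCH_UNSET = -1, ARCH_GENERIC = 0, ARCH_HASWELL = 1 } arch_t;

static int num_queries = 0;  /* counts how often the hardware is actually probed */

static arch_t query_hardware( void )
{
    num_queries++;
    /* Stand-in for a real CPUID-based probe. */
    return ARCH_GENERIC;
}

arch_t bli_arch_query_id( void )
{
    /* Not thread-safe as written; a real implementation would guard this
       with pthread_once() or rely on C++-style local static initialization
       (as TBLIS does). */
    static arch_t cached = ARCH_UNSET;
    if ( cached == ARCH_UNSET ) cached = query_hardware();
    return cached;
}
```

Every call after the first returns the cached value without touching the hardware again.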

@fgvanzee

Contributor

fgvanzee commented May 21, 2017

@devinamatthews You are correct. The value would be queried once, probably at library initialization, and then cached. However, this behavior could become configurable in the future to accommodate heterogeneous architectures.

@ShadenSmith

Contributor

ShadenSmith commented May 23, 2017

I believe the question is "will BLIS check cpuid every time dgemm is called?",

Yes, this is what I meant. Thanks for the clarification and the quick response.

@fgvanzee calling just once is great. A very exciting feature!

@fgvanzee

Contributor

fgvanzee commented Jul 21, 2017

Just a quick update. I haven't made much progress on this issue since early June, but I have recently resumed working on it. Thanks for your patience.

@civodul

civodul commented Sep 4, 2017

Systems using the GNU libc support indirect functions, notably via the GCC ifunc function attribute. This allows ld.so to select the right variant at load time according to an application-provided resolver, which typically checks for cpuid.

@fgvanzee, is this what you had in mind?

Regardless, thanks a lot for looking into it!
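The ifunc mechanism mentioned above can be sketched as follows. This is illustrative only (it requires GCC and glibc, and all names are hypothetical); the resolver runs once at load time, so no per-call overhead is incurred:

```c
/* Illustrative GNU ifunc-based dispatch. Requires GCC and glibc. */

typedef double (*dot_fn_t)( const double *, const double *, int );

static double dot_generic( const double *x, const double *y, int n )
{
    double s = 0.0;
    for ( int i = 0; i < n; i++ ) s += x[i] * y[i];
    return s;
}

/* The resolver runs once, at load time, before main() begins. A real
   resolver would inspect CPUID and return an AVX/FMA variant when the
   hardware supports it. */
static dot_fn_t resolve_dot( void )
{
    return dot_generic;
}

double dot( const double *x, const double *y, int n )
    __attribute__(( ifunc( "resolve_dot" ) ));
```

Callers simply call dot(); ld.so has already bound the symbol to whichever variant the resolver chose.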

@fgvanzee

Contributor

fgvanzee commented Sep 6, 2017

@civodul Thanks for chiming in. It's good to know that GNU libc supports this way of selecting functions at runtime (I was unaware). My plans do not depend on this sort of feature, however. I am planning what is hopefully a more portable solution (one that does not rely on GNU libc) that builds all of the necessary object files and symbols into the same library, with all the necessary name-mangling built into the build system and source code. Where applicable, architecture-specific functions will be looked up via arrays of function pointers, indexed by special architecture id values.
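The function-pointer lookup described above might look something like this. It is a minimal sketch, not the actual BLIS implementation; the enum, table, and function names are all illustrative:

```c
/* Hypothetical sketch: a per-operation table of kernel function
   pointers, indexed by a BLIS-style architecture id. */

typedef enum
{
    ARCH_GENERIC = 0,
    ARCH_SANDYBRIDGE,
    ARCH_HASWELL,
    ARCH_NUM_IDS
} arch_id_t;

typedef void (*gemm_ukr_t)( double *c );

/* Toy "kernels" that just tag their output so the dispatch is visible. */
static void gemm_ukr_generic( double *c ) { *c = 0.0; }
static void gemm_ukr_haswell( double *c ) { *c = 2.0; }

static const gemm_ukr_t gemm_ukr_table[ ARCH_NUM_IDS ] =
{
    [ ARCH_GENERIC ]     = gemm_ukr_generic,
    [ ARCH_SANDYBRIDGE ] = gemm_ukr_generic, /* no dedicated kernel; fall back */
    [ ARCH_HASWELL ]     = gemm_ukr_haswell,
};

gemm_ukr_t bli_query_gemm_ukr( arch_id_t id )
{
    return gemm_ukr_table[ id ];
}
```

Since the table is dense and the id is computed once at initialization, each lookup is a single array index.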

The last step of this project, which is not too far off, will be to write the CPUID-based code that maps CPUID return values to the BLIS-specific architecture ids. Other architecture families (e.g. ARM, Power, etc.) will need their own solutions if we are to support auto-detection at runtime (or configure-time for that matter).
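The mapping step described above (from CPUID results to architecture ids) could be structured along these lines. The feature struct stands in for real CPUID output, and the decision rules and names are illustrative, not the actual BLIS logic:

```c
#include <stdbool.h>

/* Hypothetical sketch: map hardware features (as CPUID would report
   them) to a BLIS-style architecture id. */

typedef enum { ARCH_GENERIC, ARCH_SANDYBRIDGE, ARCH_HASWELL } arch_id_t;

typedef struct
{
    bool avx;   /* AVX supported?  */
    bool avx2;  /* AVX2 supported? */
    bool fma3;  /* FMA3 supported? */
} cpu_features_t;

arch_id_t map_features_to_arch( cpu_features_t f )
{
    /* Prefer the most capable configuration the hardware supports,
       falling back toward the reference configuration. */
    if ( f.avx2 && f.fma3 ) return ARCH_HASWELL;
    if ( f.avx )            return ARCH_SANDYBRIDGE;
    return ARCH_GENERIC;
}
```

Keeping the mapping separate from the raw CPUID query also makes it easier to share logic between configure-time and runtime detection.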

@devinamatthews

Contributor

devinamatthews commented Sep 6, 2017

The last step of this project, which is not too far off, will be to write the CPUID-based code that maps CPUID return values to the BLIS-specific architecture ids.

Please pull the CPUID code from TBLIS, as that should be close to a ready-made solution.

@fgvanzee

Contributor

fgvanzee commented Sep 6, 2017

@devinamatthews Thanks, Devin. I'll definitely take a look when the time comes.

@fgvanzee

Contributor

fgvanzee commented Oct 18, 2017

Quick update.

I've recently pushed a new commit (453deb2) and branch (named 'rt') that finally implements the feature mentioned in this issue. The only thing(s) missing are the heuristics (e.g. CPUID-based code) that allow us to choose an architecture at run-time when multiple configurations (microarchitectures) are included in the build. I'll be working on that next, along with updating the wiki documentation to describe how to add support for a new configuration, either permanently or as a prototype. But for now, using the configure script as you did before should work.

@fgvanzee

Contributor

fgvanzee commented Nov 1, 2017

Additional update.

Commit 2c51356 implements the remainder of the work I had planned for this issue. There will inevitably be cleanup and tweaks going forward, but the core of the effort (support for runtime kernel management, as well as the multi-architecture builds and runtime hardware detection that many have asked for) has been implemented. If you are so inclined, feel free to give it a try. (For example, you could try configuring with the intel64 configuration family and linking against the resulting library on a haswell system and then on a sandybridge system.)

Notice that, for now, the configure-time hardware detection uses different code than that used at runtime. Ideally, I would merge the two so that there is just one set of code to maintain, and also so that the two follow the same rules.

@iotamudelta

Contributor

iotamudelta commented Nov 2, 2017

@fgvanzee thanks so much for your work on this! Excited to test. What's the best way to compile a BLIS library that contains all possible kernels for a given architecture family (i.e., i386 or arm64 or x86_64)? I tried to go through your commits, but I'm not sure there is a config_registry selector that would do that out of the box.

@fgvanzee

Contributor

fgvanzee commented Nov 2, 2017

@iotamudelta You're welcome! I'm excited to provide this feature to the community.

Some of your questions may be answered once I write documentation on the new configuration system. (Several of the wikis need to be updated.) I'll get the new documentation set up first, and then merge the rt branch into master.

The short answer is that you need to define a configuration family in the configuration registry (config_registry) that contains the sub-configurations that you want included/supported in the build (and then target that family at configure time, e.g. ./configure familyname). I've already defined two families: amd64 and intel64. However, you can define your own! You only need to make sure that the configuration family has a directory set up inside the config directory. Just take a look at the sub-directories for amd64 or intel64 to get an idea of what is expected of configuration family directories (they require less than a full-fledged sub-configuration).
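For illustration, a family entry in config_registry pairs the family name with the sub-configurations it includes. The entries below are hypothetical (check the actual config_registry file in the source tree for the real ones):

```
# A family is a name followed by the sub-configurations it includes.
intel64: haswell sandybridge penryn generic
amd64:   zen excavator steamroller piledriver bulldozer generic
```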

Also, you may have noticed a confusing syntax for some sub-configurations in config_registry, such as for the zen sub-configuration:

zen: zen/haswell

This means that the zen family includes only itself (it is a singleton family), but it requires that haswell kernels be available and built into the library. (This is because zen microkernels are so similar to those of haswell that you might as well reuse the haswell code.) If some sandybridge kernels were also needed by the zen singleton family, you would see it written as:

zen: zen/haswell/sandybridge

I'll explain all of this in the updated wikis too.

@iotamudelta

Contributor

iotamudelta commented Nov 3, 2017

@fgvanzee for the FreeBSD port, it'd be great if we could have repositories that include all applicable configurations for x86, x86_64, power, arm64, respectively. I'm happy to help out. Will first need to get the current state to work for FreeBSD. Either way: this is very nice.

Unrelated, but you may be interested: on FreeBSD-HEAD and an AMD Carrizo, BLIS is competitive with / slightly faster than OpenBLAS for dgemm in my tests.

@fgvanzee

Contributor

fgvanzee commented Nov 3, 2017

@iotamudelta I'm not that surprised that BLIS is highly competitive on an AMD Excavator core, as I vaguely remember observing this myself when I was first writing that microkernel. But, good to know that others are seeing the same thing.

@devinamatthews

Contributor

devinamatthews commented Dec 11, 2017

AFAIK this is done now; closing.

@fgvanzee

Contributor

fgvanzee commented Dec 11, 2017

Agreed, this is done.
