Performance issue in C++ standalone (probably platform/compiler-specific) #803
While running a standard HH model (see code at the end), I noted that the example slowed down dramatically when I switched to standalone mode. I could reproduce this issue on two Linux machines, but not on a Linux cluster where I had SSH access. The issue basically goes away when re-introducing
I guess this is Linux/gcc/whatever-specific, or can you confirm this on Windows?
Here are quick benchmark values for various compiler options (1 s of the model below, with basically all time spent in the state updater):
defaultclock.dt = 0.01*ms
El = 10.613*mV
ENa = 115*mV
EK = -12*mV
gl = 0.3*msiemens/cm**2
gNa0 = 120*msiemens/cm**2
gK = 36*msiemens/cm**2
C = 1*uF/cm**2

eqs = '''
dv/dt = (gl * (El-v) + gNa * m**3 * h * (ENa-v) + gK * n**4 * (EK-v) + I) / C : volt
I : amp/meter**2
gNa : siemens/meter**2
dm/dt = alpham * (1-m) - betam * m : 1
dn/dt = alphan * (1-n) - betan * n : 1
dh/dt = alphah * (1-h) - betah * h : 1
alpham = (0.1/mV) * (-v+25*mV) / (exp((-v+25*mV) / (10*mV)) - 1)/ms : Hz
# alpham = alpha_fun(v) : Hz
betam = 4 * exp(-v/(18*mV))/ms : Hz
alphah = 0.07 * exp(-v/(20*mV))/ms : Hz
betah = 1/(exp((-v+30*mV) / (10*mV)) + 1)/ms : Hz
alphan = (0.01/mV) * (-v+10*mV) / (exp((-v+10*mV) / (10*mV)) - 1)/ms : Hz
betan = 0.125*exp(-v/(80*mV))/ms : Hz
'''

axon = NeuronGroup(500, eqs, method='exponential_euler',
                   threshold='v>50*mV', refractory='v>50*mV')
A couple of ideas.
Did you check the exact flags sent to the compiler for both runtime and standalone? It may be that weave adds some flags that aren't there in standalone?
Do the values in this model decay to zero during the run? If so, it could be the old issue with denormal/subnormal numbers. We have a bit of code somewhere for gcc that forces them to round to zero, which can speed things up a lot. Perhaps this isn't making it into the standalone code?
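For readers unfamiliar with the term: subnormal (denormal) doubles are the positive values below the smallest normal double, and on most x86 CPUs arithmetic on them falls onto a slow microcode path. A minimal sketch of a value decaying into the subnormal range:

```python
import sys

# The smallest *normal* positive double; any positive value below it is
# subnormal (denormal) and is handled by a slow path on most x86 CPUs.
smallest_normal = sys.float_info.min

x = smallest_normal
for _ in range(5):
    x /= 2.0  # each halving pushes the value deeper into the subnormal range

assert 0.0 < x < smallest_normal  # x is now subnormal but not yet zero
```

This is exactly the situation a decaying state variable can end up in, which is why flush-to-zero can speed such models up.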
Ok, I'm giving up... I did not manage to pin down the true cause of this strange behaviour. Thanks for your suggestions, though.
weave does indeed add a few more flags (e.g.
Also a good point, but we don't enable that code by default (it's a preference), neither for weave nor for standalone. Enabling it did not change anything.
So I think it is some gcc quirk where optimisations go wrong for some combination of equations/parameters. I can reproduce it with two different gcc versions, though (4.8.5 and 5.4.0), so it's not a bug introduced with the recent release. However, it occurs very rarely; I do not see the same issue with any other example I tried, including the very similar
To end on a positive note, I did find a way to work around it on my machine: using clang instead of gcc. Conveniently, we can switch this with a preference:
prefs.devices.cpp_standalone.extra_make_args_unix += ['CC=clang++']
added a commit (Jan 23, 2017)
Hi guys! I was not able to reproduce the issue with the latest (2.1+git) brian2 and gcc 4.8.4/5.4.1/6.2.0 on my Ubuntu 14.04 machine. Both python2 and python3 sessions produce the expected results (numpy > cython > weave (python2) > standalone). Standalone builds produced by gcc (versions mentioned above) are close to the clang-3.9 build in terms of performance. Marcel, could you upload the full project (including the generated output/ folder) somewhere?
Hi @xj8z, I'd be more than happy if you could shed some light on this issue! I uploaded my generated code here: http://s000.tinyupload.com/index.php?file_id=53613789273553615794
from brian2 import *
set_device('cpp_standalone')
# ... (code from above)
run(1*second, report='text')
It might also be relevant what machine you compile it on (e.g. whether your CPU supports AVX/AVX2).
This issue is not compiler related. Identical 'main' binaries compiled for AVX (AVX2 is not required to reproduce the issue) demonstrate a 7x performance difference when running on different ld/libc/libm software stacks. To be more specific: if you move the dynamically compiled 'main' binary from Ubuntu 16.04 (glibc 2.23) to Ubuntu 14.04 (glibc 2.19), you can observe the 7x acceleration mentioned above. My current understanding is that the glibc packaged with Ubuntu 16.04 fails to detect CPU flags correctly (it does this when the application starts) and falls back to non-optimized versions of the math functions (e.g. libm's exp), while the glibc from Ubuntu 14.04 detects the CPU flags correctly and runs the optimized versions (e.g. __ieee754_exp_avx) instead. That's why the same binary is 7x faster on Ubuntu 14.04: it just uses the optimized libm routines. To avoid this dynamic misbehavior of the glibc packaged with 16.04, you may try to compile the 'main' binary statically (manually add '-static' to LFLAGS in the makefile). We may use this dirty fix while I'm trying to fix the real problem. Stay tuned.
Great, many thanks for looking into this! Statically linking does indeed solve the problem on my machine, and it's a much better workaround than switching the compiler (not everyone has
prefs.codegen.cpp.extra_link_args += ['-static']
Here two links about the same problem, the Ubuntu bug report did not get any response, though:
That's a bug in Glibc. It was introduced in Glibc 2.23 and fixed in Glibc 2.25. Both exp() and pow() are impacted. Ubuntu 16.04, 16.10 and the not-yet-released 17.04 use a codebase with the bug. A standalone binary will suffer from performance degradation when running on these versions of Ubuntu. I'm not sure that the proper fix will be backported from Glibc 2.25 (the impact/risk ratio needs to be estimated first), which means that we need to come up with a suitable workaround.
This bug hits the AVX-SSE transition penalty. The 256-bit YMM registers used by AVX-256 instructions extend the 128-bit registers used by SSE (XMM0 is the low half of YMM0, and so on). Every time the CPU executes an SSE instruction after an AVX-256 instruction, it has to store the upper half of the YMM register to an internal buffer and then restore it when execution returns to AVX instructions. This operation is time consuming (40-80 cycles). The store/restore is required because old-fashioned SSE knows nothing about the upper halves of its registers and may damage them. To avoid this issue, Intel introduced AVX-128 instructions, which operate on the same 128-bit XMM registers as SSE but take the upper halves of the YMM registers into account. Hence, no store/restore is required. Practically speaking, AVX-128 instructions are a smarter form of SSE instructions which can be used together with full-size AVX-256 instructions without any penalty. Intel recommends using AVX-128 instructions instead of SSE instructions wherever possible.

To sum things up: it's okay to mix SSE with AVX-128, and AVX-128 with AVX-256. Mixing AVX-128 with AVX-256 is allowed because both types of instructions are aware of the 256-bit YMM registers. Mixing SSE with AVX-128 is okay because the CPU can guarantee that the upper halves of the YMM registers don't contain any meaningful data (how could one put anything there without using AVX-256 instructions?) and avoid the store/restore operation (why care about random trash in the upper halves of the YMM registers?). It's not okay to mix SSE with AVX-256, due to the transition penalty. But Glibc does exactly that.
You may ask why we care about vector instructions (SSE/AVX-128/AVX-256) if what we do in the example program is scalar computation. We care because scalar floating-point instructions are implemented as a subset of the SSE and AVX-128 instructions. They operate on a small fraction of a 128-bit register but are still considered SSE/AVX-128 instructions. And they suffer from the SSE/AVX transition penalty as well.
Now let's see what happens inside libm library when we call exp() from the main executable.
(a) When we call exp() from the binary compiled with -static -ffinite-math-only
This is the simplest scenario. The executable is statically linked, which means that no on-the-fly actions are required from the loader. Using finite-math-only operations lets the binary call __ieee754_exp_avx directly without handling inf corner cases.
main (floating point code compiled as AVX-128)
All the code is compiled as AVX-128. No penalty.
(b) When we call exp() from the binary compiled with -static -fno-finite-math-only
We asked libm to handle inf corner cases, so it calls __ieee754_exp_avx() from an additional inf-processing wrapper.
main (floating point code compiled as AVX-128)
Note that the __exp() inf-processing wrapper uses SSE instructions. Glibc contains multiple variations of the __ieee754_exp function (__ieee754_exp_sse, __ieee754_exp_avx) but only a single __exp(). To keep Glibc usable on non-AVX machines, __exp() is compiled as SSE code.
Still no penalty, because we only mix SSE with AVX-128. No AVX-256 means no problems.
(c) When we call exp() from the binary compiled without -static but with -fno-finite-math-only
The main binary is compiled as a dynamic executable, so on-the-fly actions are required from the loader to resolve exp() from libm during the first call to it (known as lazy linking). The inf-handling wrapper is also required.
main (floating point code compiled as AVX-128)
The dynamic linker uses AVX-256 instructions to push the AVX registers onto the stack at the very beginning of its operation and to pop them back at the end. This is done to allow the symbol resolver to use (overwrite) some of these registers while looking for a requested symbol.
Now we mix SSE, AVX-128 and AVX-256 instructions. And observe SSE/AVX transition penalty.
Please note that the route mentioned in (c) is hit by the main executable only once. Once the exp() symbol has been resolved by the dynamic loader, its address is cached and subsequent accesses to exp() don't require a call to _dl_runtime_resolve(); they follow route (b) instead.
But the key point here is that it doesn't really matter: all future accesses to exp() still suffer from the SSE/AVX transition penalty, even though they follow route (b). When _dl_runtime_resolve() uses AVX-256 instructions once during symbol lookup, it marks the upper (non-XMM) halves of the YMM registers as dirty. Dirty means that the upper half of a YMM register may contain meaningful data, so the CPU needs to store/restore these bits when transitioning to SSE. But nothing ever clears this dirty flag during the whole program execution. With the flag set, even a transition from SSE to AVX-128 (and back) requires storing/restoring the upper part of the register. The CPU faithfully stores the upper halves of the YMM registers when switching from AVX-128 to SSE (main -> __exp() and back) and restores them when switching from SSE to AVX-128 (__exp() -> __ieee754_exp_avx() and back). This transition happens multiple times for every exp() call, and it really hurts performance.
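The dynamic symbol resolution being described here can also be made visible from Python with ctypes: the first attribute access resolves exp in libm, much like the standalone binary's first call through the PLT does (a Linux-only sketch; 'libm.so.6' is the usual glibc soname and is an assumption here):

```python
import math
from ctypes import CDLL, c_double
from ctypes.util import find_library

# Load libm and resolve exp() dynamically, mirroring what the standalone
# binary's first call through the PLT triggers via _dl_runtime_resolve().
libm = CDLL(find_library('m') or 'libm.so.6')
libm.exp.argtypes = [c_double]
libm.exp.restype = c_double

result = libm.exp(1.0)
assert math.isclose(result, math.e)
```

After this first lookup the function pointer is cached by ctypes, just as the PLT entry is patched after the loader's first resolution.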
Fixing Ubuntu and other distros
As I said before, this issue has been fixed in Glibc 2.25. The fix is quite straightforward: the implementation of _dl_runtime_resolve() avoids using AVX-256 instructions where possible. Unfortunately, many distros (including Ubuntu 16.04/16.10/17.04) use older versions of Glibc (2.23 and 2.24) which don't contain this fix. Ubuntu 16.04 is the key player here because it is a widely used LTS release. There is no way for an LTS to update its Glibc from 2.23 to 2.25, because that could break binary compatibility and would probably introduce new bugs. The only thing we can do is backport this specific fix from Glibc 2.25 to Glibc 2.23 and request an SRU (stable release update). The maintainers will decide whether the impact/risk ratio of the fix is high enough to update a stable release that is in production use.
I'll try to discuss the possibility of an SRU with the distro maintainers, but I can't guarantee that their decision will be positive. I suppose that the issue with exp() has been around for so long because most developers compile their code with -ffast-math, which automatically sets -ffinite-math-only. And as we discussed before, -fno-finite-math-only is absolutely required to reproduce the SSE/AVX transition issue with exp(). It means that in the case of exp() the 'impact' part of the impact/risk ratio is moderately small, because just a few of us will feel the difference. The situation with pow() is much worse, though. According to my findings, it suffers from performance degradation even if the application has been compiled with -ffinite-math-only. This happens because the whole pow() procedure is implemented with SSE inside libm, so the SSE instructions are not confined to an inf-wrapper. So, basically, we have pow(), which demonstrates performance degradation all the time, and exp(), which demonstrates it only when compiled with -fno-finite-math-only. Is fixing these issues worth touching a mission-critical component of a stable release? I don't know. I'd rather commit a suitable workaround and then wait for an SRU decision.
There are many different ways to workaround the issue:
(1) building main executable statically with -static compilation flag
By doing static compilation we avoid on-the-fly (lazy) linking and prevent buggy _dl_runtime_resolve() from running. No AVX-256 instructions get executed. No penalty.
(2) running main executable with LD_BIND_NOW=1 flag
By passing this flag we force dynamic linker to resolve all the symbols during application startup. Again, no lazy linking and no call to buggy _dl_runtime_resolve().
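Workaround (2) can be applied without touching the build at all, e.g. from a small Python launcher (the path './main' for the standalone executable is an assumption here, matching the generated output folder layout):

```python
import os
import subprocess

# Force the dynamic linker to resolve every symbol at startup, so the buggy
# lazy-resolution path (_dl_runtime_resolve using AVX-256) is never taken.
env = dict(os.environ, LD_BIND_NOW='1')

# './main' is the (assumed) standalone executable in the generated output
# folder; only launch it if it actually exists.
if os.path.exists('./main'):
    subprocess.run(['./main'], env=env, check=True)
```

The environment is only modified for the child process, so the rest of the Python session is unaffected.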
(3) compiling the main binary with -ffinite-math-only instead of -fno-finite-math-only
By using finite math only, we prevent the inf-wrapper from running. This solves the issue for exp(), but pow() still suffers from performance degradation because its whole body consists of SSE instructions, not just an inf-handling wrapper.
(4) manually dropping the 'dirty' flag after calls to exp() and pow()
The special intrinsic __builtin_ia32_vzeroupper() can be used after a call to exp() or pow() to let the CPU know that we don't care about the upper halves of the YMM registers. It basically clears the dirty flag set by _dl_runtime_resolve(). The intrinsic is translated into the VZEROUPPER instruction, which is pretty lightweight to execute. Theoretically, you need to call it just once, after the first call to exp() or pow(), when _dl_runtime_resolve() has been involved. In practice, you may call it after every call to these functions; it shouldn't hurt performance and is much simpler to implement this way. The problem with this approach is that you need to know whether it's safe to call VZEROUPPER or not: if your code (the main binary) uses AVX-256 instructions, you may destroy meaningful data by manually zeroing the upper halves of all YMM registers.
(5) coming up with your own inf-processing wrapper and calling the finite version of exp() from it
The inf-processing wrapper implemented in libm is quite simple. By writing your own alternative, you can make sure that it gets compiled the same way as the main binary (as AVX code or as SSE code). All elements of the call chain (main -> inf-wrapper -> __ieee754_exp_avx) will then use AVX instructions; hence, no penalty for exp(). Calls to pow() will still suffer.
I'd probably stick to (1) or (2). Or maybe (3) if you're sure that pow() is not used by popular models.
(1) Only HH model suffers from performance degradation. Why does this happen?
The HH model heavily relies on exp() from libm. That's why the performance degradation of exp() seriously affects the overall performance of the model. For other models the overall performance impact may be much smaller, sometimes even negligible.
(2) Main binary compiled with Clang doesn't suffer from performance degradation. Why does this happen?
Clang ignores the -fno-finite-math-only flag if it is passed together with -ffast-math. Hence, no inf-handling wrapper gets called and there is no penalty. That's probably a bug in Clang.
(3) Why does the -mno-avx build demonstrate better performance than the -mavx build, though not as good as runtime?
The number of SSE/AVX transitions differs. In the case of the -mavx build you have two in each direction (AVX in main -> SSE in the inf-wrapper -> AVX in the finite exp handler). In the case of -mno-avx you have just one in each direction (SSE in main -> SSE in the inf-wrapper -> AVX in the finite exp handler). Since every exp() call has to return to the main binary, you need to multiply the number of transitions by 2. So we have either 4 transitions (-mavx) or 2 transitions (-mno-avx). With runtime we have 0 transitions.
(4) Runtime build generated by weave/cython doesn't suffer from performance degradation. Why does this happen?
Weave (and probably Cython) compiles the target binary differently than standalone. Instead of compiling an executable, it compiles a shared object (library) and then links it into the running Python instance (via an import statement, which is translated into dlopen). I suppose this leads to a different symbol resolution path which doesn't trigger the buggy _dl_runtime_resolve(). For instance, if the exp() symbol has already been resolved in the parent Python instance, the dynamic linker won't look for it a second time; it will simply copy the previously found address without a call to _dl_runtime_resolve().
Wow, I did not expect such an in-depth analysis! Many thanks, this clears it up nicely (did you also add all this info somewhere else, e.g. in a Debian/Ubuntu bug report?). So let's consider the possible workarounds:
In general, (1) would be the easiest solution, since we could just add
So, maybe (2) is the best option. We could have a new general preference to set environment variables during execution which we would set by default to
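If Brian ends up shipping such a preference, its usage might look like the following sketch. Note that the preference name run_environment_variables is my assumption for illustration, not a confirmed API:

```python
from brian2 import prefs

# Hypothetical preference (name is an assumption, not a confirmed API):
# environment variables applied when the standalone binary is executed.
# LD_BIND_NOW=1 disables lazy symbol resolution, avoiding the buggy
# _dl_runtime_resolve() path.
prefs.devices.cpp_standalone.run_environment_variables = {'LD_BIND_NOW': '1'}
```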
As a side note, replacing glibc's libm by openlibm leads to a performance increase of ~30% for the example we discussed here (but without
BTW: when we acknowledge your work on this in the release notes, should we use your github handle or your real name? In the latter case, you'd have to give it to us
Either way, many thanks again -- my workaround to use clang was not a good workaround, it seems...
I just opened a bug against the glibc package in Ubuntu: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280 The situation seems to be much worse than we initially thought. A huge number of the math functions provided by libm are affected by the AVX/SSE transition penalty in one way or another. Routines which have an AVX-optimized implementation (exp, log, sin/cos/tan) experience a slowdown when called from SSE-only code (generated by gcc -march=x86-64, which is the default for Ubuntu packages). Routines which don't have an AVX-optimized implementation and rely on the general-purpose SSE implementation (pow, exp2/exp10, log2/log10, sincos, asin/acos, sinh/cosh/tanh, asinh/acosh/atanh) experience a slowdown when called from AVX-optimized code (generated by gcc -march=native on AVX-capable machines). I believe this issue is worth fixing in 16.04. Thanks a lot for discovering this bug and making the information public.
I didn't mention -Wl,-z,now in the workarounds section because it didn't work for me either. It seems that -Wl,-z,now and LD_BIND_NOW=1 try to achieve the same goal but do it differently. I can see (using gdb) that -Wl,-z,now indeed does the job: the first call to exp() goes directly to the __exp() wrapper without a hop to _dl_runtime_resolve(). But I suspect that -Wl,-z,now does its pre-run symbol resolution using our good friend _dl_runtime_resolve(). While it does the job (symbols are resolved before the application starts), it doesn't fix the bug, because no matter when _dl_runtime_resolve() gets called, it provokes the AVX/SSE transition penalty. LD_BIND_NOW probably uses some other routine to do the symbol resolution.
I updated my GitHub profile. Feel free to use my real name instead of a nickname in the release notes. Thanks!
Ok, great, I marked myself as affected by the Ubuntu bug, so its status is now "confirmed". Hopefully it gets attention from the maintainers soon.
I'll try to update Brian's documentation and add the
Fedora 24/25 are also affected by this bug. I just filed a bug there as well: https://bugzilla.redhat.com/show_bug.cgi?id=1421121 RHEL 7 contains quite an old Glibc version and doesn't suffer from the performance degradation. The new RHEL 8 (ETA 2018) will probably use Glibc 2.25 (or newer), which already contains the fix.
added a commit (Feb 15, 2017)
Since I have close to no knowledge about low level instruction stuff, just to make sure:
Yes, this is strictly about CPU code (and additionally needs the combination of specific hardware and a specific version of glibc on a Linux system).