
disable use of -ftree-vectorize for OpenFOAM v2112 with foss/2021b #15495

Merged

Conversation

Micket
Contributor

@Micket Micket commented May 12, 2022

(created using eb --new-pr)

No test suite, but some trusted users indicated that there were some serious errors in the results when building this for our cluster as is.

I think it's best to keep the vectorize toolchain option turned off (see the sketch below); this might apply to older versions as well, but I don't know how to verify that.
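For reference, the change boils down to disabling EasyBuild's vectorize toolchain option in the easyconfig (the exact file contents are in this PR's diff); a minimal sketch of the relevant lines, with the surrounding parameters assumed rather than copied verbatim:

    # Sketch of the relevant easyconfig lines, not the full actual file.
    # Setting 'vectorize': False tells the GCC-based foss toolchain to pass
    # -fno-tree-vectorize instead of the default -ftree-vectorize.
    name = 'OpenFOAM'
    version = 'v2112'

    toolchain = {'name': 'foss', 'version': '2021b'}
    toolchainopts = {'cstd': 'c++11', 'vectorize': False}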

@Micket Micket added the bug fix label May 12, 2022
@Micket
Contributor Author

Micket commented May 12, 2022

Test report by @Micket
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
vera-c1 - Linux CentOS Linux 7.9.2009, x86_64, Intel Xeon Processor (Skylake), Python 3.6.8
See https://gist.github.com/ad258fb5762cce59ad028f6f30efe092 for a full test report.

@jfgrimm
Member

jfgrimm commented May 12, 2022

Test report by @jfgrimm
SUCCESS
Build succeeded for 10 out of 10 (1 easyconfigs in total)
node134.pri.viking.alces.network - Linux CentOS Linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/8e296687f4c1b960e3496e0f62657671 for a full test report.

@jfgrimm jfgrimm added this to the next release (4.5.5?) milestone May 13, 2022
@drhakannilsson

I'm Håkan Nilsson, referred to by Mikael (Micket). I am preparing the different tests and my post-docs are doing the tests. Today I have also compiled OpenFOAM-v2112 the standard way and with the flags used by EasyBuild on the cluster Tetralith at nsc.liu.se. I have asked my colleagues to run the test there as well, just to confirm that the problem does not only appear on our cluster Vera. It is the weekend, so I expect them to do it on Monday. I would also like to test it on Ubuntu, since OpenFOAM is commonly used on that platform.

As seen in our tests, there was no problem with the EasyBuild installations for the two previous versions of OpenFOAM. Perhaps there is some detail in v2112 that has this particular problem. I guess the tests by the developers at ESI are not done with the EasyBuild configuration, so it was not seen. Also, the simulations do not fail but just give incorrect results, and I am not sure they would catch this error. My intention is to open an issue in the OpenFOAM bug reporting system, but first I would like to do a few more tests.

We are also working on a new TestHarness for OpenFOAM, see https://sourceforge.net/p/turbowg/TestHarness/ci/master/tree/. Our experiences from this problem will feed into that design, so that we can hopefully identify these problems before releases in the future. We will set up some open case to test this particular issue. The one we use right now is an assignment in a course, so we don't want to release it (since we would then have to construct a new assignment). However, not all combinations of compiler flags can be tested in a test harness. But I guess the EasyBuild combination should be part of a test loop, since I guess there is a reason why it has been decided that the flags in EasyBuild should be different than the ones used in the original release(?)

@Micket
Contributor Author

Micket commented May 16, 2022

But I guess the EasyBuild combination should be part of a test loop

There are so many combinations of compilers (GCC, Clang, Intel, Cray) with different default optimization flags, whose defaults change from version to version, combined with just as many CPU architectures, that there really is no alternative but to ship a test suite that can be run by the person building the software.

What upstream can do is make sure it builds cleanly without warnings, then possibly complement that with static analysis or by putting the test suite through something like ASan (-fsanitize=address) for extra insurance.

, since I guess there is a reason why it has been decided that the flags in EasyBuild should be different than the ones used in the original release(?)

Well, to be blunt, the default flags OF specifies aren't good. -march=native should essentially be a requirement when building things for clusters. We don't need portable binaries, and not specifying flags like this means the build won't be using the newer instructions available on the CPUs, which is where they get most of their FLOPs.
We are hardly the only ones doing this, and some old blog posts and presentations seem to indicate it gives OF about a 20% speedup (and I would expect this number to grow, since even better vector instructions have been introduced in the meantime).

The difference between -O2 and -O3 is more often than not negligible, so EB is actually making the conservative choice of -O2 -ftree-vectorize.
None of these flags should be destructive, and IMHO I would always consider it a bug if a code can't handle changing them.
The fact that the error only occurs with -march=native -O2 -ftree-vectorize and not with -march=native -O3 (which should include the tree vectorizer plus even more aggressive optimization flags) is the most unexpected result.
Without an MWE (and some time), any explanation of how this could have occurred is just speculation though.

@migueldiascosta
Member

The fact that the error only occurs with -march=native -O2 -ftree-vectorize and not with -march=native -O3 (which should include the tree vectorizer plus even more aggressive optimization flags) is the most unexpected result. Without an [MWE] (and some time), any explanation of how this could have occurred is just speculation though.

Indeed, but speculating anyway, it's possible that -O2 -ftree-vectorize actually vectorizes more loops than -O3 by using a "cheaper" cost model?

In any case, starting with GCC 12, -ftree-vectorize is included by default with -O2, so whatever the problem is, it won't be specific to EasyBuild; if anything, EB is the canary in the coal mine.
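One cheap way to probe that speculation, without touching OpenFOAM itself, would be to compile a toy kernel under the different flag sets and count the loops reported by -fopt-info-vec-optimized. A minimal sketch (the kernel and flag sets are illustrative assumptions, not taken from OpenFOAM):

    # Count how many loops GCC vectorizes under different flag sets, using the
    # -fopt-info-vec-optimized report (one "loop vectorized" line per loop,
    # printed to stderr). Toy kernel only; this is not OpenFOAM code.
    import os
    import subprocess
    import tempfile
    import textwrap

    KERNEL = textwrap.dedent("""\
        void axpy(int n, double a, const double *x, double *y) {
            for (int i = 0; i < n; i++)
                y[i] += a * x[i];
        }
        """)

    FLAG_SETS = [
        ["-O2"],
        ["-O2", "-ftree-vectorize"],
        ["-O2", "-ftree-vectorize", "-march=native"],
        ["-O3", "-march=native"],
    ]

    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "kernel.c")
        with open(src, "w") as f:
            f.write(KERNEL)
        for flags in FLAG_SETS:
            cmd = ["gcc", *flags, "-fopt-info-vec-optimized", "-c", src,
                   "-o", os.path.join(tmpdir, "kernel.o")]
            result = subprocess.run(cmd, capture_output=True, text=True)
            nvec = result.stderr.count("loop vectorized")
            print(f"{' '.join(flags):40s} -> {nvec} loop(s) vectorized")

Running something like this with the exact GCC from foss/2021b on the affected nodes would at least show whether the cheap cost model makes different vectorization decisions there than -O3 does.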

@drhakannilsson

We have now done the tests at Tetralith, at nsc.liu.se (https://www.nsc.liu.se/systems/tetralith/):

My own standard OpenFOAM installation with -O3 (using buildtool-easybuild/4.5.3-nsce8837e7 foss/2020b): PASS
My own standard OpenFOAM installation with default EasyBuild flags (using buildtool-easybuild/4.5.3-nsce8837e7 foss/2020b): PASS
Regular EasyBuild administrator installation with OpenFOAM/2112-nsc1-intel-2018b-eb: PASS

I have asked my post-doc to do similar tests in Ubuntu as well. It probably needs to fail in Ubuntu if the developers are to have a look at it, since they need to be able to reproduce the problem. They can't look at it if they need access to a particular cluster.

Any input to why this may happen is gratefully accepted.

@drhakannilsson

If you are interested in seeing how the results differ, I attach some figures. You don't need to understand exactly what they show, only that all curves except the dash-dotted green curve give the same results. In particular, the friction factor (cf) shows large oscillations. Not all tests are included here, since it becomes messy, but all installations with the EasyBuild flags at Vera give those incorrect results, and all other installations give the "correct" results (not matching the experiment is another thing). In particular, the EasyBuild flags at Tetralith also give "correct" results (not shown in these figures).
[Figures attached: cf_upperWall, Diffuser-Ux, cp_upperWall]

@drhakannilsson

The test has now also been done in Ubuntu 18.04.6 LTS.
Both the original compilation and the compilation with EasyBuild flags PASS.
One note is that the residuals do not go below 1e-5 with the EasyBuild flags, while they reduce to 1e-8 with the standard compilation. So there is still some effect.
This makes it quite hard to ask the developers to have a look at this problem, since it only occurs at Vera with the EasyBuild modifications of the flags. They don't have access to Vera.
Any suggestions are more than welcome, in particular on how I could formulate an issue upstream under these circumstances.

@drhakannilsson

Although I am now out of my comfort zone, I am attaching a comment from a software engineer collaboration partner of mine, which you may also comment on:
###
The only true documentation for those flags will always be the gcc source code, which I went to explore (using the latest git release).

The "-ftree-vectorize" option is in fact an alias for enabling both -ftree-loop-vectorize and -ftree-slp-vectorize.

When requesting the -O2 optimization level, those switches will toggle the "cost model used for vectorization" from "very-cheap" to "cheap".
When requesting the -O3 optimization level, the "cost model used for vectorization" is set to dynamic by default.

From the gcc file ./gcc/opts.cc:

  /* Use -fvect-cost-model=cheap instead of -fvect-cost-mode=very-cheap
     by default with explicit -ftree-{loop,slp}-vectorize.  */
  if (opts->x_optimize == 2
      && (opts_set->x_flag_tree_loop_vectorize
          || opts_set->x_flag_tree_vectorize))
    SET_OPTION_IF_UNSET (opts, opts_set, flag_vect_cost_model,
                         VECT_COST_MODEL_CHEAP);

So basically, it looks like those -ftree-loop-vectorize and -ftree-slp-vectorize options are only useful when using -O2.
###
A FOLLOW-UP MESSAGE:
###
According to the source code, -O2 alone uses VECT_COST_MODEL_VERY_CHEAP. Adding the flags switches it to VECT_COST_MODEL_CHEAP, whatever that means.
And when using -O3, the source code will by default use VECT_COST_MODEL_DYNAMIC.

There is a man page entry that briefly discusses those models:

   -fvect-cost-model=model
       Alter the cost model used for vectorization.  The model argument should be one of unlimited, dynamic or cheap.  With the unlimited model the vectorized code-path is assumed to be
       profitable while with the dynamic model a runtime check guards the vectorized code-path to enable it only for iteration counts that will likely execute faster than when executing
       the original scalar loop.  The cheap model disables vectorization of loops where doing so would be cost prohibitive for example due to required runtime checks for data dependence
       or alignment but otherwise is equal to the dynamic model.  The default cost model depends on other optimization flags and is either dynamic or cheap.

In the gcc source code, there is not much about VECT_COST_MODEL_DYNAMIC when searching for that string. The compiler mostly just checks whether the model is "smaller" or "greater" than VECT_COST_MODEL_CHEAP (those are enums in the source code); that's about it.
###

@drhakannilsson

Following the notes in my previous post, I have now verified that all of the following PASS at Vera:
-O3 -march=native
-O3 -ftree-vectorize
-O3 -ftree-vectorize -march=native
-O3 -march=native -fvect-cost-model=cheap
The results typically differ (from pure -O3) at the 7th digit at most, and in several cases not until the 12th digit. Perhaps some of the flags actually don't change anything, depending on how gcc takes those flags into consideration in these combinations.
I will sum up these findings and post an upstream issue to make the developers aware that there is some problem in v2112 that may appear in some specific situations.

@akesandgren
Contributor

@drhakannilsson Does it produce correct results when built with -O0, -O1, -O2 without extra flags? All three with and without -march=native. Those are fairly important check points.

@drhakannilsson

@drhakannilsson Does it produce correct results when built with -O0, -O1, -O2 without extra flags? All three with and without -march=native. Those are fairly important check points.

That would take some time to test, since that is 6 full compilations and my disk is full. If I could just submit the compilation of all of those at once it would be doable. @Micket - can I have access to more disk, so I can keep more installations at the same time? Right now I have to delete or tar installations if I want to test a new one. I am installing in C3SE 2021/3-1.

We have now started to set up the same tests using a standard tutorial, which can be shared. The aim is to include it in our TestHarness. If you like, I can share it in its current state (although I would like to add wallShearStress as well first). I attach the figures it currently produces:

At Vera (C3SE), where Reference is default flags in Ubuntu, of is default flags, eb is easybuild flags and C3SE is easybuild flags but with -novectorize:
[Figures attached: C3SE_U, C3SE_V]

At Tetralith (NSC), where Reference is default flags in Ubuntu, of is default flags, eb is easybuild flags and NSC is Intel compilation:
[Figures attached: NSC_U, NSC_V]

It is particularly interesting to zoom in on the V-profiles just after the backward-facing step. It can be seen that all results differ at Vera, while only the eb result differs at Tetralith:

Vera: [Figure attached: C3SE_V_Zoom]

Tetralith: [Figure attached: NSC_V_Zoom]

We have also encountered another effect. This particular tutorial puts the sampling points just at the boundary of the computational domain. That is fine with the default flags, but not with the easybuild flags, and not even with the present easybuild installation at Vera with -novectorize. It is the same at Tetralith, where there is a problem with the easybuild flags but not with the original flags. In fact, the Intel installation also fails due to this, so this is a more severe problem that should definitely be reported upstream. What happens is that the code searches for the cell labels of the sampling points and gets stuck in an infinite loop. It is just strange that it depends on compiler flags. It can be solved by making sure that the sampling points are not placed exactly at the boundary.

@boegel boegel changed the title Disable vectorize on OpenFOAM v2112 foss/2021b disable use of -ftree-vectorize for OpenFOAM v2112 with foss/2021b May 24, 2022
@boegel
Member

boegel commented May 24, 2022

@boegelbot please test @ generoso
CORE_CNT=16

@boegel
Member

boegel commented May 24, 2022

I'm fine with being a bit cautious here, and disabling -ftree-vectorize, as it has been confirmed that not using it circumvents the problem.
It would be nice to have some way of catching a problem like this easily though...

Is there any way that we could run one of the OpenFOAM tutorial examples on a small number of cores (single-node) to catch this?
Whatever the answer is, there's no need to block the PR over this, we can follow-up in another issue/PR...

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=15495 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_15495 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8566

Test results coming soon (I hope)...

- notification for comment with ID 1135509542 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/d10c07025b06a6117075227c06ae73ff for a full test report.

@Micket
Contributor Author

Micket commented May 24, 2022

@boegel I did make an -O3 build

toolchainopts = {'cstd': 'c++11', 'opt': True}

that @drhakannilsson confirmed did not exhibit these errors, so we can also go for that. I don't know how much this matters performance-wise for OpenFOAM. I'm fine with either option.

@drhakannilsson I haven't followed up on your email yet, so I'll just respond here: as this works on Tetralith (which is the same type of hardware as Vera), and only seems to occur with a very particular combination of flags (and even then, not on all setups), I suspect we have reached the limit of the investigation we can do by just playing around with compiler options.
Next step would probably need to be trying to isolate the problem.
My first step is always to use -fsanitize=address as it can spot issues with the code (it has very few false positives).
Or one will need to basically break down the code, stepping through line by line with a debugger or by creating unit tests of the problematic functions. This is of course extremely tedious, and the first step would definitely be to create a minimal example. I can't emphasize minimal enough here. If one requires a plot to spot the problem, then it's way too big.

And even if the cause of this issue is found, as @boegel indicates, we welcome tests (with automatic comparison to correct values), as that's the only way to gain any confidence when building software. Heck, things like this could be compiler bugs!

@drhakannilsson

@Micket I have run a number of simulations to test the computational speed. I report the settings and numbers below.

EasyBuild installation with -O3 -march=native -fno-math-errno -std=c++11 -fuse-ld=bfd

ExecutionTime = 116.01 s ClockTime = 118 s
ExecutionTime = 113.53 s ClockTime = 116 s
ExecutionTime = 120.21 s ClockTime = 122 s
ExecutionTime = 93.5 s ClockTime = 95 s
ExecutionTime = 112.99 s ClockTime = 115 s

EasyBuild installation with -O2 -fno-tree-vectorize -march=native -fno-math-errno -std=c++11 -fuse-ld=bfd

ExecutionTime = 134.11 s ClockTime = 136 s
ExecutionTime = 138.57 s ClockTime = 141 s
ExecutionTime = 136.13 s ClockTime = 138 s
ExecutionTime = 118.7 s ClockTime = 121 s

We can see that -O3 seems to be slightly faster. We can also see some variation in speed. This simulation runs on a single core, submitted using -n 1 -c 2. Can the variation be due to other jobs running on the same node? It has been shown that OpenFOAM is sensitive to the local memory bus bandwidth. A further, related question follows below.

I also present times from my own installations, since I want you to comment on why they are faster:

Standard OpenFOAM Installation with -O3

ExecutionTime = 92.73 s ClockTime = 95 s
ExecutionTime = 92.19 s ClockTime = 94 s
ExecutionTime = 96.58 s ClockTime = 134 s

Standard OpenFOAM installation with -O2 -ftree-vectorize -march=native -fno-math-errno -std=c++11 -fuse-ld=bfd

ExecutionTime = 137.28 s ClockTime = 182 s
ExecutionTime = 99.49 s ClockTime = 101 s
ExecutionTime = 102.11 s ClockTime = 104 s
ExecutionTime = 85.68 s ClockTime = 87 s
ExecutionTime = 90.24 s ClockTime = 289 s (tail -f)
ExecutionTime = 96.16 s ClockTime = 306 s (tail -f)
ExecutionTime = 76.12 s ClockTime = 284 s (tail -f)
ExecutionTime = 82.91 s ClockTime = 85 s

In all tests you can see quite a large variation, which may be due to interference from other users' jobs running on the same node, sharing the same memory bus (or something else?). I have also marked some tests with "(tail -f)", for which I watched the log file using that command during the entire simulation. It is quite clear that it influences the computational speed A LOT. I did not expect this. Should I have expected this?

@drhakannilsson

@boegel, @Micket I attach the test procedure I am using, in case you are interested in having a look at it and perhaps using it (or improving it). There is a README file that describes how to use it. It basically picks up a standard OpenFOAM tutorial and makes a small modification to it to avoid a problem with probes exactly at the boundary (will be reported upstream). It also adds an additional post-processing step so we can plot cf and cp. It is a single-core simulation that takes about 95-140 s (on Vera). Submit scripts for Vera and Tetralith are supplied, although they are not useful for everyone. Plotting is done with python3/Matplotlib. The python script also reports some details in the terminal window, which is work in progress towards automatically reporting PASS/FAIL. If you know a good way to compare two curves and calculate a useful error in Python, please let me know; different scalings seem to be needed for the different curves/errors.
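As a starting point, here is a minimal sketch of one possible comparison: interpolate both curves onto a common abscissa and compute a relative L2 error (the plain two-column file format and the tolerance are assumptions, not what the attached script currently does):

    # Minimal sketch: compare two sampled curves via a relative L2 error.
    # Assumes plain two-column (x, value) text files and monotonically
    # increasing x in the test file (required by np.interp).
    import numpy as np

    def curve_error(file_ref, file_test, tol=1e-3):
        x_ref, y_ref = np.loadtxt(file_ref, unpack=True)
        x_tst, y_tst = np.loadtxt(file_test, unpack=True)
        # interpolate the test curve onto the reference abscissa
        y_tst_on_ref = np.interp(x_ref, x_tst, y_tst)
        # relative L2 error, guarded against a near-zero reference curve
        denom = max(np.linalg.norm(y_ref), 1e-30)
        err = np.linalg.norm(y_tst_on_ref - y_ref) / denom
        return err, err < tol

    # hypothetical usage:
    # err, ok = curve_error("cf_upperWall_ref.xy", "cf_upperWall_eb.xy")
    # print(f"relative L2 error: {err:.3e} -> {'PASS' if ok else 'FAIL'}")

Normalising by the norm of the reference curve gives one dimensionless error per curve, which should sidestep the per-curve scaling issue; the tolerance would still need to be tuned per quantity.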

Test procedure:
backwardFacingStep.zip

As mentioned before, we are working on more tests, which will be distributed through https://sourceforge.net/p/turbowg/TestHarness/ci/master/tree/, with reports at https://my.cdash.org/index.php?project=TurboWG

@akesandgren
Contributor

If the code is memory-bandwidth sensitive then you must run it with a full NUMA node allocated, i.e. one CPU socket in this case. Otherwise other jobs will affect runtime, sometimes by a lot...

@drhakannilsson

If the code is memory-bandwidth sensitive then you must run it with a full NUMA node allocated, i.e. one CPU socket in this case. Otherwise other jobs will affect runtime, sometimes by a lot...

This is a different topic from the original one for this thread, but since computational times were brought up...

Our production simulations run on hundreds of cores, and we always request full nodes so as not to interfere with others. A problem is that we interfere with ourselves, i.e. the processes of the same simulation fight for the same memory bus. This reduces the speed a lot. I have discussed this at SNIC Application Expert meetings and in other forums, but I have not seen any solution to it.

For the particular speed tests done here, which submission flags would you propose? It is a sequential simulation. I read that Vera is set up for hyper-threading, but I am not sure whether that influences sequential jobs. I guess that --exclusive gives me an entire node, and not only an entire CPU socket?

@boegel
Member

boegel commented May 24, 2022

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3513.doduo.os - Linux RHEL 8.4, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/21fe4583565d6e86b350d9f7a6265b26 for a full test report.

@boegel
Member

boegel commented May 24, 2022

I'll go ahead and merge this, since it fixes an important problem...

If it's really worth moving from -O2 to -O3, we can tackle that in a follow-up PR?

@boegel
Member

boegel commented May 24, 2022

Going in, thanks @Micket!

@boegel boegel merged commit 0082dc5 into easybuilders:develop May 24, 2022
@Micket
Contributor Author

Micket commented May 24, 2022

Our production simulations run on hundreds of cores, and we always request full nodes so as not to interfere with others. A problem is that we interfere with ourselves, i.e. the processes of the same simulation fight for the same memory bus. This reduces the speed a lot. I have discussed this at SNIC Application Expert meetings and in other forums, but I have not seen any solution to it.

If your problem is memory bound (and I wouldn't actually expect it to be for typical CFD? I see lots of simulations that manage to keep the cores quite busy), then you simply need to spread the tasks across more CPUs and tell the scheduler to place the tasks appropriately (i.e. set the number of tasks per socket).
You may also find it necessary to tweak things like OPENBLAS_NUM_THREADS so that it doesn't double-parallelize.

Otherwise, it comes down to writing better cache-aware algorithms (perhaps a block sparse matrix), or using smaller floats (thus less memory to copy).

For the particular speed tests done here, which submission flags would you propose? It is a sequential simulation. I read that Vera is set up for hyper-threading, but I am not sure whether that influences sequential jobs. I guess that --exclusive gives me an entire node, and not only an entire CPU socket?

Correct, --exclusive makes the job exclusive on the entire node, and that is fine. What Åke means is to ensure we aren't sharing the NUMA node with anyone else. We can let the other one go to waste while we are benchmarking, just to be sure.
Another thing one can do is copy everything to TMPDIR during testing, to isolate the runs from the shared filesystem.
