Gnu: Different results between Serial/Parallel #5654

Closed

sbenkorichi opened this issue Oct 31, 2017 · 26 comments

@sbenkorichi
Collaborator

sbenkorichi commented Oct 31, 2017

This case was reported in the discussion group:
https://mail.google.com/mail/u/0/#inbox/15f7317a33168390

Sample input file: test.fds.txt

Serial: test_serial
Parallel: test_parallel

@rmcdermo
Contributor

Salah,
Can you please make sure the problem exists with the latest code and that the input file follows the User Guide? If so, I'll take a look. Thanks.

@sbenkorichi
Collaborator Author

Yes, it's with the latest. Weird, the Intel compiler has no issue.
Intel parallel:
test_intel-parallel

@sbenkorichi
Collaborator Author

This could simply be due to certain uninitialised variables.
It's the same behaviour as when the parallel runs were hanging.

@sbenkorichi
Collaborator Author

This looks like it's related to this specific input file. I've done another verification with a different input file, and the results look OK for serial and parallel.
couch .fds.txt

@sbenkorichi
Collaborator Author

Going back to FDS 6.5.3, everything works fine. I will check at which point this issue started.
test_0253

@sbenkorichi
Collaborator Author

@marcosvanella
Can you please review? This bug started with this commit:
fcc0dda

@marcosvanella
Contributor

marcosvanella commented Oct 31, 2017

Salah, it looks like I committed my changes before merging from firemodels for that pull request. Please test whether this commit reproduces the issue:
01ebdc0

This is to make sure that the issue doesn't predate pull request #5053. Thank you.

@sbenkorichi
Collaborator Author

I've tested it. Gnu with that commit (01ebdc0) works. It doesn't produce that behaviour.

@marcosvanella
Contributor

Salah, never mind, 01ebdc0 works fine in parallel.
I'm checking whether the issue is in another commit of pull request #5053.
Interesting: I'm running the case with the gnu debug target and can't reproduce the problem.
I do see it with optimization flags.

@sbenkorichi
Collaborator Author

The issue started with the commit I shared above, which Kevin then merged. From that point on you get this type of behaviour; before it, everything works fine.
I wonder why the debug build didn't complain about it.

@marcosvanella
Contributor

The issue started on commit e91baf6, when we "burned the bridge" and I got rid of the old IBM routines. For that commit I also see the problem with debug flags.
I might have cut too deep there. I'll look at those changes.

@marcosvanella
Contributor

A mystery: I don't see the issue with the latest repo and debug flags. Try it on your Linux box.
It looks very much like something uninitialized.
It's puzzling what this has to do with the old IBM routines.

@sbenkorichi
Collaborator Author

I will check now

@sbenkorichi
Collaborator Author

This is what I get when I try it with the debug build:

 $ mpirun -np 2 ./fds_mpi_gnu_linux_64_db test.fds 
 Mesh 1 is assigned to MPI Process 0
 Mesh 2 is assigned to MPI Process 1
 OpenMP thread   0 of   1 assigned to MPI process      1 of      1
 OpenMP thread   1 of   1 assigned to MPI process      1 of      1
 OpenMP thread   0 of   1 assigned to MPI process      0 of      1
 OpenMP thread   1 of   1 assigned to MPI process      0 of      1
 Completed Initialization Step  1
 Completed Initialization Step  2
 Completed Initialization Step  3
 Completed Initialization Step  4

 Fire Dynamics Simulator

 Current Date     : October 31, 2017  22:13:40
 Version          : FDS 6.6.0
 Revision         : FDS6.6.0-121-g035fed0-master
 Revision Date    : Tue Oct 31 15:54:42 2017 -0400
 Compiler         : Gnu gfortran 7.2.0-6ubuntu1~16.04.york0)
 Compilation Date : Oct 31, 2017  22:11:07

 MPI Enabled;    Number of MPI Processes:       2
 OpenMP Enabled; Number of OpenMP Threads:      2

 MPI version: 3.1
 MPI library version: Open MPI v3.0.0, package: Open MPI salah@salah Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017

 Job TITLE        : 
 Job ID string    : test


Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x2b43295c917a in ???
#1  0x2b43295c83a3 in ???
#2  0x2b432a7094af in ???
#3  0x559579 in __turbulence_MOD_wall_model
	at ../../Source/turb.f90:1220
#4  0x12bbf81 in __velo_MOD_velocity_bc
	at ../../Source/velo.f90:2336
#5  0x141582f in fds
	at ../../Source/main.f90:645
#6  0x1495ea4 in main
	at ../../Source/main.f90:5

@sbenkorichi
Collaborator Author

In optimised mode, I passed these flags: -m64 -O -Wall -fcheck=all -g -fbacktrace -ffpe-summary=none
This made it work. Strange; yes, I feel it has to do with some initialization that the optimised build is complaining about.

@marcosvanella
Contributor

Salah, try using the flag -finit-local-zero in your optimized code. This initializes newly created variables to zero.

I've also noted that getting rid of the N_EDDY = 100, L_EDDY = 0.1, VEL_RMS = 0.2 inputs in the inflow &VENT gets the optimized code to run correctly again.
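For reference, the inflow vent in question would look roughly like the line below. Only the three SEM parameters are taken from this thread; the XB coordinates and SURF_ID are placeholders, not the actual values from test.fds:

   &VENT XB=0.0,0.0,0.0,1.0,0.0,1.0, SURF_ID='INFLOW', N_EDDY=100, L_EDDY=0.1, VEL_RMS=0.2 /

Dropping N_EDDY, L_EDDY and VEL_RMS from that line presumably bypasses the synthetic-eddy setup path where the problem lives, which would explain why the optimized run behaves again without them.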

I'm doing a code-to-code comparison of variables to see where things diverge.

@sbenkorichi
Collaborator Author

OK.
Yes, this issue appears with the N_EDDY input.
That will take some time; it's like looking for a needle in a desert. But I know it will eventually be cornered.
I'm also doing some checks.

@sbenkorichi
Collaborator Author

I also found that -O1 works. So far -O2 and -O3 both trigger the issue.

@marcosvanella
Contributor

After adding the -ggdb -ffpe-trap=invalid,zero,overflow flags to the optimized code I'm getting the error you were getting with the debug version on your computer.
Check the variable values I get with gdb:

(screenshot: gdb variable values, 2017-11-01 1:39:40 PM)

Variable S in this case is SF%ROUGHNESS coming into WALL_MODEL from VELOCITY_BC, when VELOCITY_BC_INDEX==WALL_MODEL_BC.
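As a minimal standalone sketch (not FDS code) of why the trap fires once S arrives as zero: with -ffpe-trap=zero, a floating-point division by zero raises SIGFPE instead of quietly returning Inf, which is exactly what the backtrace shows at turb.f90:1220. For example, the toy program below dies the same way when built with gfortran -O0 -g -fbacktrace -ffpe-trap=invalid,zero,overflow:

   PROGRAM FPE_DEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: EB = SELECTED_REAL_KIND(15)   ! double precision, standing in for FDS's EB kind
      REAL(EB) :: Y_CELL_CENTER, S, Y_PLUS
      Y_CELL_CENTER = 0.025_EB     ! some positive wall distance
      S = 0.0_EB                   ! the zero roughness value that reached WALL_MODEL
      Y_PLUS = Y_CELL_CENTER/S     ! division by zero -> SIGFPE when the trap is enabled
      PRINT *, Y_PLUS
   END PROGRAM FPE_DEMO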

BTW, if you have gdb installed and compile with the -ggdb flag, you can debug in parallel by executing:

$ mpirun -n 2 xterm -e gdb /Users/mnv/Documents/FIREMODELS_ISSUES/fds/Build/mpi_gnu_osx_64/fds_mpi_gnu_osx_64

Then, in each xterm window that pops up (one per MPI process), type:

(gdb) run test.fds

Whatever each process writes to the screen appears in its own window. Also, when the code crashes you can still print the value of a local variable, as I did with S:

(gdb) p S

We are close.

@marcosvanella
Contributor

marcosvanella commented Nov 1, 2017

All right, what's happening is that in SYNTHETIC_EDDY_SETUP (around line 1750 of turb.f90) the routine EDDY_AMPLITUDE is called before a value has been given to VT%A_IJ. That is done later, as:


   EDDY_LOOP: DO NE=1,VT%N_EDDY
      IERROR=1; CALL EDDY_POSITION(NE,NV,NM,IERROR)
      CALL EDDY_AMPLITUDE(NE,NV,NM)
   ENDDO EDDY_LOOP

   ! Cholesky decomposition of Reynolds stress tensor
   A_IJ => VT%A_IJ
   R_IJ => VT%R_IJ
   A_IJ = 0._EB
   A_IJ(1,1) = SQRT(R_IJ(1,1))
   A_IJ(2,1) = R_IJ(2,1)/A_IJ(1,1)
   A_IJ(2,2) = SQRT(R_IJ(2,2)-A_IJ(2,1)**2)
   A_IJ(3,1) = R_IJ(3,1)/A_IJ(1,1)
   A_IJ(3,2) = (R_IJ(3,2)-A_IJ(2,1)*A_IJ(3,1))/A_IJ(2,2)
   A_IJ(3,3) = SQRT(R_IJ(3,3)-A_IJ(3,1)**2-A_IJ(3,2)**2)

Then, in EDDY_AMPLITUDE, NaN values of A_IJ are fed into VT%CV_EDDY, etc. From then on, all hell breaks loose. Our options:

    1. Put the A_IJ calculation before the EDDY_LOOP (see the sketch at the end of this comment).
    2. Set A_IJ to 0._EB in type.f90 in the VENT_TYPE definition.

I'll let @rmcdermo decide how to do this. Also, in EDDY_AMPLITUDE, we can make the comparisons with RN explicitly between 8-byte reals:


EPS_EDDY=-1._EB
CALL RANDOM_NUMBER(RN); IF (REAL(RN,EB)>0.5_EB) EPS_EDDY(1)=1._EB
CALL RANDOM_NUMBER(RN); IF (REAL(RN,EB)>0.5_EB) EPS_EDDY(2)=1._EB
CALL RANDOM_NUMBER(RN); IF (REAL(RN,EB)>0.5_EB) EPS_EDDY(3)=1._EB
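A minimal sketch of option 1, simply reordering the two pieces quoted above so A_IJ is defined before EDDY_AMPLITUDE reads it (not necessarily the final fix that gets committed):

   ! Cholesky decomposition of Reynolds stress tensor, now done first
   A_IJ => VT%A_IJ
   R_IJ => VT%R_IJ
   A_IJ = 0._EB
   A_IJ(1,1) = SQRT(R_IJ(1,1))
   A_IJ(2,1) = R_IJ(2,1)/A_IJ(1,1)
   A_IJ(2,2) = SQRT(R_IJ(2,2)-A_IJ(2,1)**2)
   A_IJ(3,1) = R_IJ(3,1)/A_IJ(1,1)
   A_IJ(3,2) = (R_IJ(3,2)-A_IJ(2,1)*A_IJ(3,1))/A_IJ(2,2)
   A_IJ(3,3) = SQRT(R_IJ(3,3)-A_IJ(3,1)**2-A_IJ(3,2)**2)

   ! Eddies can now be seeded safely: EDDY_AMPLITUDE sees a defined A_IJ
   EDDY_LOOP: DO NE=1,VT%N_EDDY
      IERROR=1; CALL EDDY_POSITION(NE,NV,NM,IERROR)
      CALL EDDY_AMPLITUDE(NE,NV,NM)
   ENDDO EDDY_LOOP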

@sbenkorichi
Collaborator Author

Nice catch, Marcos.
Yes, I found some suggestions about using gdb or valgrind to debug it, although I haven't tried them yet.

@sbenkorichi
Collaborator Author

This is what I get with gdb. You can copy from xterm by selecting the appropriate lines, then paste with the middle (scroll-wheel) mouse button.

[New Thread 0x2aaaacf69700 (LWP 24196)]
[New Thread 0x2aaaad16a700 (LWP 24197)]
[New Thread 0x2aaaae36d700 (LWP 24199)]
[New Thread 0x2aaaae56e700 (LWP 24200)]
[New Thread 0x2aaaae76f700 (LWP 24203)]
 Mesh 1 is assigned to MPI Process 0
 Mesh 2 is assigned to MPI Process 1
 OpenMP thread   2 of   3 assigned to MPI process      0 of      1
 OpenMP thread   1 of   3 assigned to MPI process      0 of      1
 OpenMP thread   0 of   3 assigned to MPI process      0 of      1
 OpenMP thread   3 of   3 assigned to MPI process      0 of      1
 Completed Initialization Step  1
 Completed Initialization Step  2
 Completed Initialization Step  3
 Completed Initialization Step  4

 Fire Dynamics Simulator

 Current Date     : November  1, 2017  19:42:55
 Version          : FDS 6.6.0
 Revision         : FDS6.6.0-125-g2d797a4-dirty-master
 Revision Date    : Wed Nov 1 15:41:20 2017 +0200
 Compiler         : Gnu gfortran 7.2.0-1ubuntu1~16.04)
 Compilation Date : Nov 01, 2017  19:34:20

 MPI Enabled;    Number of MPI Processes:       2
 OpenMP Enabled; Number of OpenMP Threads:      4

 MPI version: 3.1
 MPI library version: Open MPI v3.0.0, package: Open MPI salah@sbenkorichi Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017

 Job TITLE        : 
 Job ID string    : test


Thread 1 "fds_mpi_gnu_lin" received signal SIGFPE, Arithmetic exception.
0x00000000004e0cc4 in turbulence::wall_model (slip_factor=-1, 
    u_tau=nan(0xfffffffffffff), y_plus=nan(0xfffffffffffff), 
    u=<optimised out>, nu=<optimised out>, dy=0.050000000000000003, s=0)
    at ../../Source/turb.f90:1220
1220             Y_PLUS = Y_CELL_CENTER/S
(gdb) p S
$1 = 0
(gdb) bt
#0  0x00000000004e0cc4 in turbulence::wall_model (slip_factor=-1, 
    u_tau=nan(0xfffffffffffff), y_plus=nan(0xfffffffffffff), 
    u=<optimised out>, nu=<optimised out>, dy=0.050000000000000003, s=0)
    at ../../Source/turb.f90:1220
#1  0x000000000079ec53 in velo::velocity_bc (t=0, nm=1)
    at ../../Source/velo.f90:2336
#2  0x000000000081b96b in fds () at ../../Source/main.f90:645
#3  0x0000000000476117 in main (argc=argc@entry=2, argv=0x7fffffffd747)
    at ../../Source/main.f90:5
#4  0x00002aaaac77f830 in __libc_start_main (main=0x4760f0 <main>, argc=2, 
    argv=0x7fffffffd218, init=<optimised out>, fini=<optimised out>, 
    rtld_fini=<optimised out>, stack_end=0x7fffffffd208)
    at ../csu/libc-start.c:291
#5  0x0000000000476149 in _start ()

@marcosvanella
Contributor

All right Salah, do an update and check that you can run the case.

@sbenkorichi
Collaborator Author

Yes Marcos, that helped. It solved the issue.
Two concerns:

  1. Why do these issues arise only when using parallel jobs?
  2. Do we have to do an explicit initialisation each time, or can we pass a flag that takes care of it?

@marcosvanella
Contributor

  1. Uninitialized variables make the code behave in the strangest ways, as we just saw. It looks like the optimized gfortran code assigns the variables memory locations whose contents differ between the serial and parallel runs. In serial the code somehow ends up seeing numbers; in parallel, NaNs.

  2. Bottom line: every time a new variable is defined, it's good to make sure it is initialized to something in the source code (see the sketch below). The typical case we have been seeing these days is variables that are fields of derived types; the exceptions are local variables defined in subroutines where it's obvious they get populated. Easier said than done, though.
    The init-to-zero flag is a workaround for code that does not comply with the Fortran standard, or so the GNU devs claim:
    https://gcc.gnu.org/onlinedocs/gcc-3.3.6/g77/Variables-Assumed-To-Be-Zero.html
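A minimal sketch of what that advice looks like in practice; the module, type, and component names below are illustrative only, not the actual contents of type.f90:

   MODULE DEMO_TYPES
      IMPLICIT NONE
      INTEGER, PARAMETER :: EB = SELECTED_REAL_KIND(15)    ! double precision, standing in for FDS's EB kind
      TYPE DEMO_VENT_TYPE                                   ! illustrative only, not the real VENT_TYPE
         INTEGER  :: N_EDDY = 0                             ! scalar components can take a default value
         REAL(EB) :: A_IJ(3,3) = 0._EB                      ! fixed-size array components too
         REAL(EB), ALLOCATABLE :: CV_EDDY(:,:)              ! allocatables cannot; zero them after ALLOCATE
      END TYPE DEMO_VENT_TYPE
   END MODULE DEMO_TYPES

   PROGRAM DEMO
      USE DEMO_TYPES
      IMPLICIT NONE
      TYPE(DEMO_VENT_TYPE) :: VT
      ALLOCATE(VT%CV_EDDY(MAX(VT%N_EDDY,1),3)); VT%CV_EDDY = 0._EB   ! explicit zeroing after allocation
      PRINT *, VT%N_EDDY, VT%A_IJ(1,1)                               ! guaranteed 0 and 0.0, never garbage
   END PROGRAM DEMO

The -finit-local-zero flag mentioned earlier, as its name says, only covers local variables, which is one more reason to prefer explicit initialization in the source.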

Close the case when you are ready.

@sbenkorichi
Collaborator Author

OK, thanks.
I guess we should set a plan for more code cleaning after the release. At the least, we need to make sure that all the verification cases run with no complaints or weird behaviours. There is always room for improvement.
