Gnu: Different results between Serial/Parallel #5654

Closed

sbenkorichi opened this issue Oct 31, 2017 · 26 comments

@sbenkorichi
Collaborator

sbenkorichi commented Oct 31, 2017

This case was reported in the discussion group:
https://mail.google.com/mail/u/0/#inbox/15f7317a33168390

Sample input file: test.fds.txt

Serial: test_serial
Parallel: test_parallel

@rmcdermo
Contributor

Salah,
Can you please make sure the problem exists with the latest code and that the input file follows the User Guide? If so, I'll take a look. Thanks.

@sbenkorichi
Collaborator Author

Yes, it's with the latest. Weird, the Intel compiler has no issue.
Intel parallel:
test_intel-parallel

@sbenkorichi
Collaborator Author

This could simply be due to certain uninitialised variables.
It's the same behaviour as when the parallel runs were hanging.

@sbenkorichi
Collaborator Author

This looks like it's related to this specific input file. I've done another verification with a different input file, and the results look OK for serial and parallel.
couch .fds.txt

@sbenkorichi
Collaborator Author

Going back to FDS 6.5.3, everything works fine. I will check at which point this issue started.
test_0253

@sbenkorichi
Collaborator Author

@marcosvanella
Can you please review? This bug started with this commit:
fcc0dda

@marcosvanella
Contributor

marcosvanella commented Oct 31, 2017

Salah, it looks like I committed my changes before merging from firemodels for that pull request. Please test whether this commit reproduces the issue:
01ebdc0

This is to make sure that the issue doesn't predate pull request #5053. Thank you.

@sbenkorichi
Collaborator Author

I've tested it. Gnu with that commit (01ebdc0) works. It doesn't produce that behaviour.

@marcosvanella
Contributor

Salah, never mind, 01ebdc0 works fine in parallel.
I'm checking whether the issue is in another commit of pull request #5053.
Interesting: I'm running the case with the gnu debug target and can't reproduce the problem.
I do see it with optimization flags.

@sbenkorichi
Collaborator Author

The issue started with the commit I shared above, which Kevin then merged. From that point on you get this type of behaviour; before it, everything works fine.
I wonder why the debug build didn't complain about it.

@marcosvanella
Contributor

The issue started on commit e91baf6, when we "burned the bridge" and I got rid of the old IBM routines. For that commit I also see the problem with debug flags.
I might have cut too deep there. I'll look at those changes.

@marcosvanella
Contributor

A mystery: I don't see the issue with the latest repo and debug flags. Try it on your Linux box.
It looks very much like something uninitialized.
It's puzzling what this has to do with the old IBM routines.

@sbenkorichi
Collaborator Author

I will check now

@sbenkorichi
Collaborator Author

This is what I get when I try it with the debug build:

 $ mpirun -np 2 ./fds_mpi_gnu_linux_64_db test.fds 
 Mesh 1 is assigned to MPI Process 0
 Mesh 2 is assigned to MPI Process 1
 OpenMP thread   0 of   1 assigned to MPI process      1 of      1
 OpenMP thread   1 of   1 assigned to MPI process      1 of      1
 OpenMP thread   0 of   1 assigned to MPI process      0 of      1
 OpenMP thread   1 of   1 assigned to MPI process      0 of      1
 Completed Initialization Step  1
 Completed Initialization Step  2
 Completed Initialization Step  3
 Completed Initialization Step  4

 Fire Dynamics Simulator

 Current Date     : October 31, 2017  22:13:40
 Version          : FDS 6.6.0
 Revision         : FDS6.6.0-121-g035fed0-master
 Revision Date    : Tue Oct 31 15:54:42 2017 -0400
 Compiler         : Gnu gfortran 7.2.0-6ubuntu1~16.04.york0)
 Compilation Date : Oct 31, 2017  22:11:07

 MPI Enabled;    Number of MPI Processes:       2
 OpenMP Enabled; Number of OpenMP Threads:      2

 MPI version: 3.1
 MPI library version: Open MPI v3.0.0, package: Open MPI salah@salah Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017

 Job TITLE        : 
 Job ID string    : test


Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x2b43295c917a in ???
#1  0x2b43295c83a3 in ???
#2  0x2b432a7094af in ???
#3  0x559579 in __turbulence_MOD_wall_model
	at ../../Source/turb.f90:1220
#4  0x12bbf81 in __velo_MOD_velocity_bc
	at ../../Source/velo.f90:2336
#5  0x141582f in fds
	at ../../Source/main.f90:645
#6  0x1495ea4 in main
	at ../../Source/main.f90:5

@sbenkorichi
Collaborator Author

In optimised mode, I passed these flags: -m64 -O -Wall -fcheck=all -g -fbacktrace -ffpe-summary=none
This made it work. Strange; yes, I feel it has to do with some initialization that the optimised build is complaining about.

@marcosvanella
Contributor

Salah, try using the flag -finit-local-zero in your optimized code. This initializes newly created variables to zero.

I've also noted that getting rid of the N_EDDY = 100, L_EDDY = 0.1, VEL_RMS = 0.2 inputs in the inflow &VENT gets the optimized code to run correctly again.
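For reference, the inflow vent in question would look roughly like the line below. Only the three SEM parameters are taken from this thread; the XB coordinates and SURF_ID are placeholders, not the actual values from test.fds:

   &VENT XB=0.0,0.0,0.0,1.0,0.0,1.0, SURF_ID='INFLOW', N_EDDY=100, L_EDDY=0.1, VEL_RMS=0.2 /

Dropping N_EDDY, L_EDDY and VEL_RMS from that line presumably bypasses the synthetic-eddy setup path where the problem lives, which would explain why the optimized run behaves again without them.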

I'm doing a code-to-code comparison of variables to see where things diverge.

@sbenkorichi
Collaborator Author

OK.
Yes, this issue appears with the N_EDDY input.
That will take some time; it's like looking for a needle in a desert. But I know it will eventually be cornered.
I'm also doing some checks.

@sbenkorichi
Collaborator Author

I also found that -O1 works. So far -O2 and -O3 both trigger the issue.

@marcosvanella
Contributor

After adding the -ggdb -ffpe-trap=invalid,zero,overflow flags to the optimized code I'm getting the error you were getting with the debug version on your computer.
Check the variable values I get with gdb:

(screenshot: gdb variable values, 2017-11-01 1:39:40 PM)

Variable S in this case is SF%ROUGHNESS coming into WALL_MODEL from VELOCITY_BC, when VELOCITY_BC_INDEX==WALL_MODEL_BC.
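As a minimal standalone sketch (not FDS code) of why the trap fires once S arrives as zero: with -ffpe-trap=zero, a floating-point division by zero raises SIGFPE instead of quietly returning Inf, which is exactly what the backtrace shows at turb.f90:1220. For example, the toy program below dies the same way when built with gfortran -O0 -g -fbacktrace -ffpe-trap=invalid,zero,overflow:

   PROGRAM FPE_DEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: EB = SELECTED_REAL_KIND(15)   ! double precision, standing in for FDS's EB kind
      REAL(EB) :: Y_CELL_CENTER, S, Y_PLUS
      Y_CELL_CENTER = 0.025_EB     ! some positive wall distance
      S = 0.0_EB                   ! the zero roughness value that reached WALL_MODEL
      Y_PLUS = Y_CELL_CENTER/S     ! division by zero -> SIGFPE when the trap is enabled
      PRINT *, Y_PLUS
   END PROGRAM FPE_DEMO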

BTW, if you have gdb installed and compile with the -ggdb flag, you can debug in parallel by executing:

$ mpirun -n 2 xterm -e gdb /Users/mnv/Documents/FIREMODELS_ISSUES/fds/Build/mpi_gnu_osx_64/fds_mpi_gnu_osx_64

Then, in each xterm window that pops up (one per MPI process), type:

(gdb) run test.fds

Whatever each process writes to the screen appears in its own window. Also, when the code crashes you can still print the value of a local variable, as I did with S:

(gdb) p S

We are close.

@marcosvanella
Contributor

marcosvanella commented Nov 1, 2017

All right, what's happening is that in SYNTHETIC_EDDY_SETUP (around line 1750 of turb.f90) the routine EDDY_AMPLITUDE is called before a value has been given to VT%A_IJ. That is done later, as:


   EDDY_LOOP: DO NE=1,VT%N_EDDY
      IERROR=1; CALL EDDY_POSITION(NE,NV,NM,IERROR)
      CALL EDDY_AMPLITUDE(NE,NV,NM)
   ENDDO EDDY_LOOP

   ! Cholesky decomposition of Reynolds stress tensor
   A_IJ => VT%A_IJ
   R_IJ => VT%R_IJ
   A_IJ = 0._EB
   A_IJ(1,1) = SQRT(R_IJ(1,1))
   A_IJ(2,1) = R_IJ(2,1)/A_IJ(1,1)
   A_IJ(2,2) = SQRT(R_IJ(2,2)-A_IJ(2,1)**2)
   A_IJ(3,1) = R_IJ(3,1)/A_IJ(1,1)
   A_IJ(3,2) = (R_IJ(3,2)-A_IJ(2,1)*A_IJ(3,1))/A_IJ(2,2)
   A_IJ(3,3) = SQRT(R_IJ(3,3)-A_IJ(3,1)**2-A_IJ(3,2)**2)

Then, in EDDY_AMPLITUDE, NaN values of A_IJ are fed into VT%CV_EDDY, etc. From then on, all hell breaks loose. Our options:

    1. Put the A_IJ calculation before the EDDY_LOOP (see the sketch at the end of this comment).
    2. Set A_IJ to 0._EB in type.f90 in the VENT_TYPE definition.

I'll let @rmcdermo decide how to do this. Also, in EDDY_AMPLITUDE, we can make the comparisons with RN explicitly between 8-byte reals:


EPS_EDDY=-1._EB
CALL RANDOM_NUMBER(RN); IF (REAL(RN,EB)>0.5_EB) EPS_EDDY(1)=1._EB
CALL RANDOM_NUMBER(RN); IF (REAL(RN,EB)>0.5_EB) EPS_EDDY(2)=1._EB
CALL RANDOM_NUMBER(RN); IF (REAL(RN,EB)>0.5_EB) EPS_EDDY(3)=1._EB
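A minimal sketch of option 1, simply reordering the two pieces quoted above so A_IJ is defined before EDDY_AMPLITUDE reads it (not necessarily the final fix that gets committed):

   ! Cholesky decomposition of Reynolds stress tensor, now done first
   A_IJ => VT%A_IJ
   R_IJ => VT%R_IJ
   A_IJ = 0._EB
   A_IJ(1,1) = SQRT(R_IJ(1,1))
   A_IJ(2,1) = R_IJ(2,1)/A_IJ(1,1)
   A_IJ(2,2) = SQRT(R_IJ(2,2)-A_IJ(2,1)**2)
   A_IJ(3,1) = R_IJ(3,1)/A_IJ(1,1)
   A_IJ(3,2) = (R_IJ(3,2)-A_IJ(2,1)*A_IJ(3,1))/A_IJ(2,2)
   A_IJ(3,3) = SQRT(R_IJ(3,3)-A_IJ(3,1)**2-A_IJ(3,2)**2)

   ! Eddies can now be seeded safely: EDDY_AMPLITUDE sees a defined A_IJ
   EDDY_LOOP: DO NE=1,VT%N_EDDY
      IERROR=1; CALL EDDY_POSITION(NE,NV,NM,IERROR)
      CALL EDDY_AMPLITUDE(NE,NV,NM)
   ENDDO EDDY_LOOP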

@sbenkorichi
Collaborator Author

Nice catch, Marcos.
Yes, I found some suggestions about using gdb or valgrind to debug it, although I haven't tried them yet.

@sbenkorichi
Collaborator Author

This is what I get with gdb. You can copy from xterm by selecting the appropriate lines, then paste with the middle (scroll-wheel) mouse button.

[New Thread 0x2aaaacf69700 (LWP 24196)]
[New Thread 0x2aaaad16a700 (LWP 24197)]
[New Thread 0x2aaaae36d700 (LWP 24199)]
[New Thread 0x2aaaae56e700 (LWP 24200)]
[New Thread 0x2aaaae76f700 (LWP 24203)]
 Mesh 1 is assigned to MPI Process 0
 Mesh 2 is assigned to MPI Process 1
 OpenMP thread   2 of   3 assigned to MPI process      0 of      1
 OpenMP thread   1 of   3 assigned to MPI process      0 of      1
 OpenMP thread   0 of   3 assigned to MPI process      0 of      1
 OpenMP thread   3 of   3 assigned to MPI process      0 of      1
 Completed Initialization Step  1
 Completed Initialization Step  2
 Completed Initialization Step  3
 Completed Initialization Step  4

 Fire Dynamics Simulator

 Current Date     : November  1, 2017  19:42:55
 Version          : FDS 6.6.0
 Revision         : FDS6.6.0-125-g2d797a4-dirty-master
 Revision Date    : Wed Nov 1 15:41:20 2017 +0200
 Compiler         : Gnu gfortran 7.2.0-1ubuntu1~16.04)
 Compilation Date : Nov 01, 2017  19:34:20

 MPI Enabled;    Number of MPI Processes:       2
 OpenMP Enabled; Number of OpenMP Threads:      4

 MPI version: 3.1
 MPI library version: Open MPI v3.0.0, package: Open MPI salah@sbenkorichi Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017

 Job TITLE        : 
 Job ID string    : test


Thread 1 "fds_mpi_gnu_lin" received signal SIGFPE, Arithmetic exception.
0x00000000004e0cc4 in turbulence::wall_model (slip_factor=-1, 
    u_tau=nan(0xfffffffffffff), y_plus=nan(0xfffffffffffff), 
    u=<optimised out>, nu=<optimised out>, dy=0.050000000000000003, s=0)
    at ../../Source/turb.f90:1220
1220             Y_PLUS = Y_CELL_CENTER/S
(gdb) p S
$1 = 0
(gdb) bt
#0  0x00000000004e0cc4 in turbulence::wall_model (slip_factor=-1, 
    u_tau=nan(0xfffffffffffff), y_plus=nan(0xfffffffffffff), 
    u=<optimised out>, nu=<optimised out>, dy=0.050000000000000003, s=0)
    at ../../Source/turb.f90:1220
#1  0x000000000079ec53 in velo::velocity_bc (t=0, nm=1)
    at ../../Source/velo.f90:2336
#2  0x000000000081b96b in fds () at ../../Source/main.f90:645
#3  0x0000000000476117 in main (argc=argc@entry=2, argv=0x7fffffffd747)
    at ../../Source/main.f90:5
#4  0x00002aaaac77f830 in __libc_start_main (main=0x4760f0 <main>, argc=2, 
    argv=0x7fffffffd218, init=<optimised out>, fini=<optimised out>, 
    rtld_fini=<optimised out>, stack_end=0x7fffffffd208)
    at ../csu/libc-start.c:291
#5  0x0000000000476149 in _start ()

@marcosvanella
Contributor

All right Salah, do an update and check that you can run the case.

@sbenkorichi
Collaborator Author

Yes Marcos, that helped. It solved the issue.
Two concerns:

  1. Why do these issues arise only when using parallel jobs?
  2. Do we have to do an explicit initialisation each time, or can we pass a flag that takes care of it?

@marcosvanella
Contributor

  1. Uninitialized variables make the code behave in the strangest ways, as we just saw. It looks like the optimized gfortran code assigns the variables memory locations whose contents differ between the serial and parallel runs. In serial the code somehow ends up seeing numbers; in parallel, NaNs.

  2. Bottom line: every time a new variable is defined, it's good to make sure it is initialized to something in the source code (see the sketch below). The typical case we have been seeing these days is variables that are fields of derived types; the exceptions are local variables defined in subroutines where it's obvious they get populated. Easier said than done, though.
    The init-to-zero flag is a workaround for code that does not comply with the Fortran standard, or so the GNU devs claim:
    https://gcc.gnu.org/onlinedocs/gcc-3.3.6/g77/Variables-Assumed-To-Be-Zero.html
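A minimal sketch of what that advice looks like in practice; the module, type, and component names below are illustrative only, not the actual contents of type.f90:

   MODULE DEMO_TYPES
      IMPLICIT NONE
      INTEGER, PARAMETER :: EB = SELECTED_REAL_KIND(15)    ! double precision, standing in for FDS's EB kind
      TYPE DEMO_VENT_TYPE                                   ! illustrative only, not the real VENT_TYPE
         INTEGER  :: N_EDDY = 0                             ! scalar components can take a default value
         REAL(EB) :: A_IJ(3,3) = 0._EB                      ! fixed-size array components too
         REAL(EB), ALLOCATABLE :: CV_EDDY(:,:)              ! allocatables cannot; zero them after ALLOCATE
      END TYPE DEMO_VENT_TYPE
   END MODULE DEMO_TYPES

   PROGRAM DEMO
      USE DEMO_TYPES
      IMPLICIT NONE
      TYPE(DEMO_VENT_TYPE) :: VT
      ALLOCATE(VT%CV_EDDY(MAX(VT%N_EDDY,1),3)); VT%CV_EDDY = 0._EB   ! explicit zeroing after allocation
      PRINT *, VT%N_EDDY, VT%A_IJ(1,1)                               ! guaranteed 0 and 0.0, never garbage
   END PROGRAM DEMO

The -finit-local-zero flag mentioned earlier, as its name says, only covers local variables, which is one more reason to prefer explicit initialization in the source.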

Close the case when you are ready.

@sbenkorichi
Collaborator Author

OK, thanks.
I guess we should set a plan for more code cleaning after the release. At the least, we need to make sure that all the verification cases run with no complaints or weird behaviours. There is always room for improvement.
