Building the fortran model with call_py_fort in debug mode leads to crashes #365

Open
spencerkclark opened this issue Apr 18, 2023 · 1 comment

spencerkclark commented Apr 18, 2023

I was exploring the possibility of addressing #340 (now that we are removing the serialize tests in #364, we might as well explore eliminating testing in docker entirely). This requires running tests in debug mode in the nix environment. In doing so I found that the basic native regression tests crash in call_py_fort-related code:

call set_state("rank", rank)

A workaround would be to build the debug-mode model without call_py_fort so that the debug-mode tests can still run, but ideally these tests would not crash in debug mode even when the model is built with call_py_fort active.

A basic way to reproduce this is to copy the configure.fv3.nix file into a new file within FV3/conf, set DEBUG=Y and REPRO= within it, configure/build the model, and run the tests:

$ cp FV3/conf/configure.fv3.nix FV3/conf/configure.fv3.nix_debug

    <edit configure.fv3.nix_debug>

$ cd FV3
$ configure nix_debug
$ cd ..
$ make build_native
$ pytest -vv -k default --native tests/pytest/test_regression.py
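
The manual edit step above could be scripted. The following is a minimal sketch, assuming the configure file uses make-style `NAME = value` assignments at the start of a line; the helper name `override_make_variables` and the sample text are hypothetical and not taken from the repository.

```python
import re


def override_make_variables(text: str, overrides: dict) -> str:
    """Return `text` with make-style `NAME = value` assignments replaced.

    Assumes each assignment starts at the beginning of a line. Variables
    that are not assigned anywhere in `text` are appended at the end.
    """
    seen = set()
    out = []
    for line in text.splitlines():
        match = re.match(r"^(\w+)\s*[:?]?=", line)
        if match and match.group(1) in overrides:
            name = match.group(1)
            seen.add(name)
            # rstrip() avoids a trailing space when the new value is empty.
            out.append(f"{name} = {overrides[name]}".rstrip())
        else:
            out.append(line)
    for name, value in overrides.items():
        if name not in seen:
            out.append(f"{name} = {value}".rstrip())
    return "\n".join(out) + "\n"


# Hypothetical file contents, for illustration only.
sample = "DEBUG =\nREPRO = Y\nFC = mpif90\n"
debug_config = override_make_variables(sample, {"DEBUG": "Y", "REPRO": ""})
```

Reading `FV3/conf/configure.fv3.nix`, passing its contents through this helper, and writing the result to `FV3/conf/configure.fv3.nix_debug` would replace the `cp` and hand-edit steps, provided the assumed assignment format holds for that file.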

The traceback for one of the failing tests can be found below:

===================================================================== FAILURES ======================================================================
_____________________________________________________ test_regression_native[Linux-default.yml] _____________________________________________________

run_native = <function run_native.<locals>.run_native at 0x7f080f93c3a0>, config_filename = 'default.yml'
tmpdir = local('/tmp/pytest-of-spencerc/pytest-0/test_regression_native_Linux_d0')
system_regtest = <pytest_regtest.RegTestFixture object at 0x7f07e563ec70>

    @pytest.mark.parametrize(
        "config_filename",
        [
            pytest.param("default.yml", marks=pytest.mark.basic),
            pytest.param("model-level-coarse-graining.yml", marks=pytest.mark.coarse),
            pytest.param("pressure-level-coarse-graining.yml", marks=pytest.mark.coarse),
            "baroclinic.yml",
            "restart.yml",
        ],
    )
    def test_regression_native(run_native, config_filename: str, tmpdir, system_regtest):
        config = get_config(config_filename)
        rundir = tmpdir.join("rundir")
>       run_native(config, str(rundir))

test_regression.py:123:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

config = {'data_table': 'default', 'diag_table': default
2000 1 1 0 0 0

"atmos_static", -1, "hours", 1, "hours", "time"
"atmos... "all", "none", "none", 2
, 'experiment_name': 'default', 'forcing': 'gs://vcm-fv3config/data/base_forcing/v1.1/', ...}
run_dir = '/tmp/pytest-of-spencerc/pytest-0/test_regression_native_Linux_d0/rundir', error_expected = False

    def run_native(config, run_dir: str, error_expected=False):
        fv3config.write_run_directory(config, run_dir)
        completed_process = subprocess.run(
            ["mpirun", "-n", "6", exe.absolute().as_posix()],
            cwd=run_dir,
            capture_output=True,
        )
        if completed_process.returncode != 0 and not error_expected:
            print("Tail of Stderr:")
            print(completed_process.stderr[-2000:].decode())
            print("Tail of Stdout:")
            print(completed_process.stdout[-2000:].decode())
>           pytest.fail()
E           Failed

conftest.py:77: Failed
--------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------
Tail of Stderr:
       0  shoc_cld= F  uni_cld= F  ntot3d=           1  ntot2d=           1  shocaftcnv= F  indcld=          -1  shoc_parm=   7000.0000000000000        1.0000000000000000        4.2857143000000004       0.69999999999999996       -999.00000000000000       ncnvw=        -999  ncnvc=        -999
  resetting Model%frac_grid= F

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f6c45875b90 in ???
#1  0x7f6c45874dc5 in ???
#2  0x7f6c3b54b39f in ???
#3  0x7f6c126490cf in ???
#4  0x7f6c3b20d26a in ???
#5  0x7f6c3b1fdb38 in ???
#6  0x7f6c3b1f7ab2 in ???
#7  0x7f6c3b1f8dac in ???
#8  0x7f6c3b1ffe03 in ???
#9  0x7f6c3b3060be in ???
#10  0x7f6c3b30645d in ???
#11  0x7f6c3b30648a in ???
#12  0x7f6c3b302cc8 in ???
#13  0x7f6c3b26c372 in ???
#14  0x7f6c3b22819e in ???
#15  0x7f6c3b200d67 in ???
#16  0x7f6c3b3060be in ???
#17  0x7f6c3b2260e1 in ???
#18  0x7f6c3b1f8dac in ???
#19  0x7f6c3b1ffe03 in ???
#20  0x7f6c3b1f7ab2 in ???
#21  0x7f6c3b1f8dac in ???
#22  0x7f6c3b1fce8b in ???
#23  0x7f6c3b1f7ab2 in ???
#24  0x7f6c3b1f8dac in ???
#25  0x7f6c3b1fc326 in ???
#26  0x7f6c3b1f7ab2 in ???
#27  0x7f6c3b1f8dac in ???
#28  0x7f6c3b1fc326 in ???
#29  0x7f6c3b1f7ab2 in ???
#30  0x7f6c3b2266d4 in ???
#31  0x7f6c3b226a4b in ???
#32  0x7f6c3b32c20e in ???
#33  0x7f6c3b1ff9d5 in ???
#34  0x7f6c3b3060be in ???
#35  0x7f6c3b30645d in ???
#36  0x7f6c3b30648a in ???
#37  0x7f6c45aebd84 in ???
#38  0x7f6c45aec06f in ???
#39  0x7f6c45aeba8d in ???
#40  0x7f6c45aeb466 in ???
#41  0x43c099 in __atmos_model_mod_MOD_update_atmos_physics
	at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/atmos_model.F90:463
#42  0x4431b6 in __atmos_model_mod_MOD_update_atmos_radiation_physics
	at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/atmos_model.F90:280
#43  0x476877 in coupler_main
	at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/coupler_main.F90:192
#44  0x47964c in main
	at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/coupler_main.F90:35

Tail of Stdout:
de=           0 (0=count only, 1=replace)
 performing qc of albm     mode=           0 (0=count only, 1=replace)
 performing qc of zorm     mode=           0 (0=count only, 1=replace)
 performing qc of stc1m    mode=           0 (0=count only, 1=replace)
 performing qc of stc2m    mode=           0 (0=count only, 1=replace)
 performing qc of stc3m    mode=           0 (0=count only, 1=replace)
 performing qc of stc4m    mode=           0 (0=count only, 1=replace)
 performing qc of smc1m    mode=           0 (0=count only, 1=replace)
 performing qc of smc2m    mode=           0 (0=count only, 1=replace)
 performing qc of smc3m    mode=           0 (0=count only, 1=replace)
 performing qc of smc4m    mode=           0 (0=count only, 1=replace)
 performing qc of vegm     mode=           1 (0=count only, 1=replace)
 performing qc of vetm     mode=           1 (0=count only, 1=replace)
 performing qc of sotm     mode=           1 (0=count only, 1=replace)
 performing qc of sihm     mode=           1 (0=count only, 1=replace)
 performing qc of sicm     mode=           1 (0=count only, 1=replace)
 performing qc of vmnm     mode=           1 (0=count only, 1=replace)
 performing qc of vmxm     mode=           1 (0=count only, 1=replace)
 performing qc of slpm     mode=           1 (0=count only, 1=replace)
 performing qc of absm     mode=           1 (0=count only, 1=replace)
 ==============
 final results
 ==============
 dbgx --fixratio: F F F F

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 5415 RUNNING AT spencer-vm
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

@spencerkclark

The particular flag that leads to errors is -ffpe-trap=invalid,zero,overflow; if we change it to -ffpe-trap=invalid,zero then the errors go away, so the underlying issue is apparently some kind of overflow.
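
For context on what the `overflow` trap covers, here is a minimal Python sketch of the three floating-point conditions named in the flag. It is illustrative only: the issue does not establish which operation overflows or where, and the variable names are hypothetical. In a default, non-trapping process these operations quietly produce `inf` or `nan`; in a process with the corresponding trap enabled, the same kind of operation would be expected to raise SIGFPE instead.

```python
import math
import sys

largest = sys.float_info.max  # largest finite double-precision value

# overflow: the result is too large to represent and becomes +inf.
overflowed = largest * 2.0

# invalid: an operation with no defined result, such as inf - inf, gives nan.
invalid = overflowed - overflowed

# zero: division of a finite nonzero number by zero. Python itself raises
# ZeroDivisionError here rather than returning inf.
try:
    1.0 / 0.0
    divide_by_zero_raised = False
except ZeroDivisionError:
    divide_by_zero_raised = True
```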

spencerkclark added a commit that referenced this issue Sep 7, 2023
This PR refactors the build infrastructure in this repo to eliminate the need for the Docker component.  All development and testing is now done in the `nix` shell.  This should be a quality of life improvement for anyone developing the fortran model, as it no longer requires maintaining checksums in two separate build environments.

In so doing it introduces the following changes:
- New `make` rules are provided for compiling the model in different modes:
  - `build` -- build executables in `repro` (our production mode) and `debug` mode.
  - `build_repro` -- build only the `repro` mode executable.
  - `build_debug` -- build only the `debug` mode executable.
- Tests are run with each of the executables available in the local `bin` directory, and are tagged with the associated compile mode.  
- An option, `check_layout_invariance`, is provided to trigger regression tests to be run with a 1x2 domain decomposition instead of a 1x1 domain decomposition to check invariance to the domain decomposition layout; this is used for all the coarse-graining regression tests and replaces the previous `test_run_reproduces_across_layouts` test that would run in the docker image.
- `debug`-mode and `repro`-mode simulations produce different answers, which is something we noticed in #364 when upgrading compiler versions as well, and so require different reference checksums.
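
As a rough illustration of what a layout-invariance check involves, the sketch below compares per-file digests between two run directories. It is not the repository's implementation: the function names, the `*.nc` pattern, and the choice of byte-level SHA-256 comparison are all assumptions made for illustration, and a real check may compare array contents rather than raw file bytes.

```python
import hashlib
from pathlib import Path


def directory_checksums(run_dir, pattern="*.nc"):
    """Map each matching file name to the SHA-256 hex digest of its bytes."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(run_dir).glob(pattern))
    }


def layouts_agree(run_dir_a, run_dir_b, pattern="*.nc"):
    """True if both directories hold the same file names with identical bytes."""
    return directory_checksums(run_dir_a, pattern) == directory_checksums(
        run_dir_b, pattern
    )
```

Under these assumptions, `layouts_agree(rundir_1x1, rundir_1x2)` would return `True` only when every matching output file is byte-identical across the two decompositions.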

In working on this PR, we ran the fortran model in `debug` mode in more contexts than we had previously, some of which turned up errors, which we currently work around by using `pytest.skip` (something we had implicitly already been doing before):
- #365
- #381 

Working on this PR also brought my attention to the fact that `pytest`'s `tmpdir` fixture does not automatically get cleaned up after each test; `pytest` versions older than 7.3.0 keep around directories from the last three runs of `pytest`, which fill up disk space quickly since running these tests requires creating tens of run directories, each with its own initial conditions and input files (#380).  For the time being I manually clean up these run directories after successful tests.
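
One way to express "clean up only after success" is a small context manager, sketched below using only the standard library. The name `remove_on_success` is hypothetical and this is not necessarily how the cleanup is done in the PR.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def remove_on_success(run_dir):
    """Yield run_dir; delete it only if the with-body finishes without raising."""
    yield run_dir
    # Only reached when the body did not raise, so the run directories of
    # failed tests stay on disk for inspection.
    shutil.rmtree(run_dir, ignore_errors=True)


# Usage: a body that completes normally leaves nothing behind.
scratch = Path(tempfile.mkdtemp())
with remove_on_success(scratch):
    (scratch / "input.nml").write_text("placeholder")
```

Wrapping the body of a test in `with remove_on_success(rundir):` keeps the directory whenever an assertion fails, since the exception propagates before the `rmtree` call is reached.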

Resolves #340.