
horiz_interp_conserve_mod:no latitude index found #83

Open · eocene opened this issue Jan 28, 2019 · 14 comments

Labels: idealised (Idealised setup: Single-Column, Held-Suarez, etc.) · infrastructure (Isca infrastructure: installation, CI, HPC setups) · priority:low (Low-priority task)

eocene commented Jan 28, 2019

Hi all,

I have been running Isca on a machine that has recently had new nodes installed. Before the new nodes, all was fine. Now every run returns a fatal error from all PEs, like the one below. It's hard to report this to the sysadmins without a specific request (I suspect something was missed when the new nodes were installed, but I could easily be wrong). Before I go digging into the interpolation module that triggers the error, I just wanted to check whether you have seen this before and/or have an idea what the trigger might be.

Thank you very much indeed for any possible help in advance,

2019-01-28 10:23:49,905 - isca - DEBUG - FATAL from PE 0: horiz_interp_conserve_mod:no latitude index found: n,sph= 1 NaN
2019-01-28 10:23:49,905 - isca - DEBUG -
2019-01-28 10:23:49,905 - isca - DEBUG - application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

sit23 commented Jan 28, 2019

Hi @eocene. I would imagine that the new nodes are using a different version of MPI, or something like that. So that I can help more, could you tell us what kind of model you are running? I.e. are you using grey radiation, RRTM, Held-Suarez, etc.? I would say that your error here is almost certainly not a problem with the interpolation routine, but a symptom of some other problem. (I would also advise against doing too much digging in the interpolation routine: it's very long and not that easy to read!)

eocene commented Jan 29, 2019

Hi Stephen, thank you very much indeed for the (quick) response!

I've checked the MPI versions and apparently they are all the same. I absolutely agree with the symptom-not-cause diagnosis. For completeness, I have run various test cases and, mysteriously, they all fail differently. All of them have been recompiled and tried with various PE counts, etc. (Another thing I've tried is increasing "num_iters" in horiz_interp_conserve.F90, based on reading the code/docs, but to no avail.)

I'm sure there is something amiss/stupid/negligent on my end, but since the root cause seems rather obscure at the moment, any hints really would be very much appreciated! Thank you very much indeed.

axisymmetric fails exactly like my personal configuration:
2019-01-29 12:34:06,399 - isca - DEBUG - NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
2019-01-29 12:34:06,402 - isca - DEBUG - NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 600000.
2019-01-29 12:34:06,410 - isca - DEBUG - starting 1 OpenMP threads per MPI-task
2019-01-29 12:34:06,410 - isca - DEBUG - ATMOS MODEL DOMAIN DECOMPOSITION
2019-01-29 12:34:06,410 - isca - DEBUG - X-AXIS = 128
2019-01-29 12:34:06,411 - isca - DEBUG - Y-AXIS = 8 8 8 8 8 8 8 8
2019-01-29 12:34:06,425 - isca - DEBUG - mean surface pressure= NaN mb
2019-01-29 12:34:06,437 - isca - DEBUG - NOTE from PE 0: idealized_moist_phys: Using Frierson Quasi-Equilibrium convection scheme.
2019-01-29 12:34:06,445 - isca - DEBUG - NOTE from PE 0: interpolator_mod :sn_1.000_sst.nc is a year-independent climatology file
2019-01-29 12:34:06,446 - isca - DEBUG -
2019-01-29 12:34:06,446 - isca - DEBUG - FATAL from PE 1: horiz_interp_conserve_mod:no latitude index found: n,sph= 1 NaN
2019-01-29 12:34:06,446 - isca - DEBUG -
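(As a sanity check on the inputs - a hypothetical snippet, since the exact variable names depend on the file - one could scan the SST climatology that interpolator_mod reports reading for NaNs before the model does:)

```python
# Hypothetical sanity check, not part of the Isca run scripts: the FATAL
# message shows a NaN reaching the interpolation routine, so inspect the
# climatology file named in the log above for NaNs with xarray.
import xarray as xr

ds = xr.open_dataset('sn_1.000_sst.nc')  # file named in the log above
for name, var in ds.data_vars.items():
    n_bad = int(var.isnull().sum())
    if n_bad:
        print(f'{name}: {n_bad} NaN value(s)')
```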

Held-Suarez fails on a segmentation fault:
2019-01-29 12:07:35,307 - isca - DEBUG - /
2019-01-29 12:07:35,308 - isca - DEBUG - NOTE: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
2019-01-29 12:07:35,310 - isca - DEBUG - NOTE: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 600000.
2019-01-29 12:07:35,316 - isca - DEBUG - starting 1 OpenMP threads per MPI-task
2019-01-29 12:07:35,316 - isca - DEBUG - ATMOS MODEL DOMAIN DECOMPOSITION
2019-01-29 12:07:35,316 - isca - DEBUG - X-AXIS = 128
2019-01-29 12:07:35,316 - isca - DEBUG - Y-AXIS = 64
2019-01-29 12:07:35,376 - isca - DEBUG - mean surface pressure= NaN mb
2019-01-29 12:07:35,528 - isca - DEBUG - forrtl: severe (174): SIGSEGV, segmentation fault occurred
2019-01-29 12:07:35,528 - isca - DEBUG - Image              PC                Routine            Line     Source
2019-01-29 12:07:35,528 - isca - DEBUG - libintlc.so.5      00002AB0523DABF1  tbk_trace_stack_i  Unknown  Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libintlc.so.5      00002AB0523D8D2B  tbk_string_stack_  Unknown  Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libifcoremt.so.5   00002AB050A22AC2  Unknown            Unknown  Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libifcoremt.so.5   00002AB050A22916  tbk_stack_trace    Unknown  Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libifcoremt.so.5   00002AB05097BAB0  for__issue_diagno  Unknown  Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libifcoremt.so.5   00002AB05098D658  for__signal_handl  Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - libpthread-2.17.s  00002AB0505005E0  Unknown            Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x      00000000006C4EEC  Unknown            Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x      00000000006BFBA7  Unknown            Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x      00000000006BD426  Unknown            Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x      000000000045197C  Unknown            Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x      0000000000411B40  Unknown            Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x      0000000000468D75  Unknown            Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x      0000000000907BEF  Unknown            Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x      000000000040520E  Unknown            Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - libc-2.17.so       00002AB05264FC05  __libc_start_main  Unknown  Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x      0000000000405119  Unknown            Unknown  Unknown

and Realistic-Continents fails on 'regularize: Failure to converge'
2019-01-29 12:24:46,273 - isca - DEBUG - NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
2019-01-29 12:24:46,277 - isca - DEBUG - NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 600000.
2019-01-29 12:24:46,286 - isca - DEBUG - starting 1 OpenMP threads per MPI-task
2019-01-29 12:24:46,286 - isca - DEBUG - ATMOS MODEL DOMAIN DECOMPOSITION
2019-01-29 12:24:46,286 - isca - DEBUG - X-AXIS = 128
2019-01-29 12:24:46,287 - isca - DEBUG - Y-AXIS = 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
2019-01-29 12:24:46,459 - isca - DEBUG -
2019-01-29 12:24:46,460 - isca - DEBUG - FATAL from PE 1: regularize: Failure to converge
2019-01-29 12:24:46,460 - isca - DEBUG -
2019-01-29 12:24:46,460 - isca - DEBUG - application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
2019-01-29 12:24:46,460 - isca - DEBUG -
2019-01-29 12:24:46,460 - isca - DEBUG - FATAL from PE 2: regularize: Failure to converge
2019-01-29 12:24:46,460 - isca - DEBUG -
2019-01-29 12:24:46,460 - isca - DEBUG - application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
2019-01-29 12:24:46,460 - isca - DEBUG -

AlexAudette commented

I obtained a similar error using the realistic-continents case: the run crashes with the message "regularize: Failure to converge".

sit23 commented Mar 10, 2020

Strange - @eocene did you end up getting a handle on this problem?

sit23 commented Mar 10, 2020

@AlexAudette - do you also find other test cases to be failing, or is it just the realistic continents one?

AlexAudette commented

@sit23 So far it is only the realistic continents case. I am able to run my simulation at T42 using the era_land_T42.nc land-mask file, but when I create my own at T85, I get the same error as eocene.

sit23 commented Mar 10, 2020

@AlexAudette OK - that's a slightly different problem, and one we have encountered ourselves. The background is that when you put data like topography into a spectral dynamical core, the spikiness of the data and the finite number of Fourier modes mean that Gibbs ripples form in the topography. To counter this, the model automatically smooths the incoming topography, which reduces the size of the ripples.

The degree of smoothing is controlled by the parameter ocean_topog_smoothing in spectral_dynamics_nml. The parameter is a measure of how smooth the topography must be, with higher values meaning smoother topography, and the smoothing method is applied recursively until the incoming topography is as smooth as the parameter dictates. When you change resolution, though, the smoothing algorithm may be unable to make the topography as smooth as the parameter demands, so it fails to converge - hence the error message you are seeing.

To sort this out, reduce the ocean_topog_smoothing parameter. You should then find that the regularisation converges, and the model will stop giving you that error message.
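For concreteness, here is a minimal sketch of where the parameter lives (a hypothetical fragment, following the nested-dict namelist convention the Python test-case scripts use before handing the namelists to the Experiment):

```python
# Minimal sketch, assuming the test-case convention of building the
# Fortran namelists as a nested Python dict.
namelist = {
    'spectral_dynamics_nml': {
        # Higher values demand smoother topography; the model smooths
        # recursively until this target is met, and aborts with
        # "regularize: Failure to converge" if it cannot get there.
        'ocean_topog_smoothing': 0.8,
    },
}

# To relax the smoothness target and make convergence easier:
namelist['spectral_dynamics_nml']['ocean_topog_smoothing'] = 0.5
```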

AlexAudette commented

@sit23 Thanks for your answer. I tried reducing the ocean_topog_smoothing parameter from 0.8 down to 0.05 in increments of 0.2, but with no success - I still get the same error:

2020-03-10 09:28:42,216 - isca - INFO - process running as 110162
2020-03-10 09:28:42,386 - isca - DEBUG - loadmodules for niagara machines
2020-03-10 09:28:42,470 - isca - DEBUG - The following modules were not unloaded:
2020-03-10 09:28:42,470 - isca - DEBUG - (Use "module --force purge" to unload all):
2020-03-10 09:28:42,470 - isca - DEBUG -
2020-03-10 09:28:42,470 - isca - DEBUG - 1) NiaEnv/2018a
2020-03-10 09:28:43,401 - isca - DEBUG - NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 32768.
2020-03-10 09:28:43,401 - isca - DEBUG - &MPP_IO_NML
2020-03-10 09:28:43,401 - isca - DEBUG - HEADER_BUFFER_VAL = 16384,
2020-03-10 09:28:43,401 - isca - DEBUG - GLOBAL_FIELD_ON_ROOT_PE = T,
2020-03-10 09:28:43,401 - isca - DEBUG - IO_CLOCKS_ON = F,
2020-03-10 09:28:43,401 - isca - DEBUG - SHUFFLE = 0,
2020-03-10 09:28:43,401 - isca - DEBUG - DEFLATE_LEVEL = -1
2020-03-10 09:28:43,401 - isca - DEBUG - /
2020-03-10 09:28:43,405 - isca - DEBUG - NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
2020-03-10 09:28:43,407 - isca - DEBUG - NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 600000.
2020-03-10 09:28:43,411 - isca - DEBUG - starting 1 OpenMP threads per MPI-task
2020-03-10 09:28:43,412 - isca - DEBUG - ATMOS MODEL DOMAIN DECOMPOSITION
2020-03-10 09:28:43,412 - isca - DEBUG - X-AXIS = 256
2020-03-10 09:28:43,412 - isca - DEBUG - Y-AXIS = 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
2020-03-10 09:28:43,910 - isca - DEBUG -
2020-03-10 09:28:43,911 - isca - DEBUG - FATAL from PE 15: regularize: Failure to converge
2020-03-10 09:28:43,911 - isca - DEBUG -
...
2020-03-10 09:28:43,912 - isca - DEBUG -
2020-03-10 09:28:43,912 - isca - DEBUG - FATAL from PE 0: regularize: Failure to converge
2020-03-10 09:28:43,912 - isca - DEBUG -
2020-03-10 09:28:43,912 - isca - DEBUG - --------------------------------------------------------------------------
2020-03-10 09:28:43,912 - isca - DEBUG - MPI_ABORT was invoked on rank 14 in communicator MPI_COMM_WORLD
2020-03-10 09:28:43,912 - isca - DEBUG - with errorcode 1.
2020-03-10 09:28:43,912 - isca - DEBUG -
2020-03-10 09:28:43,912 - isca - DEBUG - NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
2020-03-10 09:28:43,912 - isca - DEBUG - You may or may not see output from other processes, depending on
2020-03-10 09:28:43,913 - isca - DEBUG - exactly when Open MPI kills them.
2020-03-10 09:28:43,913 - isca - DEBUG - --------------------------------------------------------------------------

sit23 commented Mar 10, 2020

OK - could you try setting it to 0? That should turn off the regularisation, and we can see whether it runs then. You could also try increasing the parameter, just in case I've misremembered which way you need to go!

AlexAudette commented

So it runs now with the parameter set to 0 - thank you very much. I also tried increasing the parameter to 0.96, and it still crashed at the same place. I will keep an eye out for truncation effects. Thanks again!

sit23 commented Mar 10, 2020

OK - you will probably find that the Gibbs ripples are significant without any smoothing; you'll see them particularly in the vertical velocity and the precipitation. When we've run with topography at T85 we have managed to use the smoothing, but I can't quite lay my hands on the smoothing parameter we used - I'll let you know if I find it. We are also working on alternatives to this smoothing algorithm, which should be available soon.
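(If you want a quick way to eyeball the ripples - a purely illustrative snippet, with file and variable names that will depend on your diag_table - plotting a mid-level vertical velocity field works well:)

```python
# Purely illustrative: file and variable names depend on your diagnostics.
import xarray as xr
import matplotlib.pyplot as plt

ds = xr.open_dataset('atmos_monthly.nc')  # hypothetical output file
# Gibbs ripples show up as alternating positive/negative bands in the
# vertical velocity up- and downstream of steep topography.
ds['omega'].isel(time=0).sel(pfull=500., method='nearest').plot()
plt.show()
```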

sit23 commented Mar 10, 2020

Just found it - it looks like I tried 0.85 for the smoothing parameter, and it worked with T85 topography.

AlexAudette commented

Interesting - I just tried that same value and it still fails to regularize. Did you do anything special with your topography file?

sit23 commented Mar 10, 2020

Well, you're welcome to try the T85 topography file that I used and see if it works for you. You can find it here:
https://drive.google.com/file/d/1lsYsVE1pIDxOC_CV4SDJu8oxUxmgQ0za/view?usp=sharing

dennissergeev added the idealised, phys:sfc (Physics: topography, surface fluxes, bucket hydrology, vegetation), priority:low and infrastructure labels, and removed the phys:sfc label · May 6, 2020