Exception ("merge: can't read model file !") in mergemod.c #34

Closed
pplotn opened this issue May 19, 2021 · 8 comments

@pplotn

pplotn commented May 19, 2021

Occasionally, when running DENISE PSV, I get the following error in mergemod.c: "merge: can't read model file !".
What could be causing this?
I am using 12 nodes with 32 CPUs each, with NPROCX=4 and NPROCY=4.

**Message from mergemod (printed by PE 0):
PE 0 starts merge of 16 model files
writing merged model file to ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin
Opening model files: ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin.??? ... finished.
Copying... ... finished.
Use
ximage n1=384 < ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin label1=Y label2=X title=./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin
to visualize model.

PE 0 is writing model to
./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.0.0

**Message from mergemod (printed by PE 0):
PE 0 starts merge of 16 model files

writing merged model file to ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin
Opening model files: ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.??? Message from PE 0
R U N - T I M E E R R O R:
merge: can't read model file !
...now exiting to system.

-rw-r--r-- 1 plotnips k1404 0 May 19 22:17 modelTest_rho_stage_1_it_10.bin
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.3
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.3
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.3
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.3
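
The listing above shows all 16 pieces present and an empty merged file. One way to check whether every expected piece can actually be opened under the name a merge step would look for is a small standalone test like the one below. This is only a sketch: `count_readable_parts` is a hypothetical helper, not part of DENISE, and the `<base>.<i>.<j>` suffix pattern is taken from the file listing, not from the source code.

```c
#include <stdio.h>

/* Hypothetical diagnostic, not DENISE code: counts how many of the
   nprocx * nprocy sub-domain files "<base>.<i>.<j>" can be opened
   for reading. The suffix pattern is assumed from the file listing. */
int count_readable_parts(const char *base, int nprocx, int nprocy)
{
    char name[4096];
    int found = 0;

    for (int i = 0; i < nprocx; i++) {
        for (int j = 0; j < nprocy; j++) {
            int n = snprintf(name, sizeof name, "%s.%d.%d", base, i, j);
            if (n < 0 || (size_t)n >= sizeof name)
                continue;   /* name does not fit: treat the piece as missing */

            FILE *fp = fopen(name, "rb");
            if (fp != NULL) {
                found++;
                fclose(fp);
            }
        }
    }
    return found;
}
```

If the count equals NPROCX*NPROCY (16 here) while the merge still fails, the pieces themselves are fine and the problem is more likely in how the merge step builds or reads the file names.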

@daniel-koehn
Owner

Hi Pavel,

Assuming that you used 16 CPU cores for the domain decomposition, the remaining cores are used for shot parallelization. How many shots are you modelling in total? Is that number divisible by 24 without a remainder? Does the problem also occur when using fewer cores for the shot parallelization or, in the extreme case, only the domain decomposition?

Best regards,

Daniel

@pplotn
Author

pplotn commented May 21, 2021

Hello Daniel,
I am modelling 51 shots.
As I understand it, I use 4*4=16 cores per shot.
Overall, I have 12*32=384 cores.
That means I run 384/16=24 shots in parallel.
So I need 3 passes to go through all 51 shots.
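
The arithmetic above can be written as a minimal sketch. It assumes the simple model described in this thread, where each shot occupies one NPROCX x NPROCY block of cores. The function names are hypothetical and do not come from DENISE.

```c
/* Hypothetical helpers illustrating the core/shot arithmetic;
   none of these functions exist in DENISE. */

/* Shots that can run at once: each shot occupies one
   nprocx * nprocy block of cores (domain decomposition). */
int concurrent_shots(int total_cores, int nprocx, int nprocy)
{
    return total_cores / (nprocx * nprocy);
}

/* Passes needed to cover all shots (ceiling division). */
int shot_passes(int nshots, int concurrent)
{
    return (nshots + concurrent - 1) / concurrent;
}

/* Shots handled in the final pass; equals `concurrent`
   when nshots divides evenly. */
int shots_in_last_pass(int nshots, int concurrent)
{
    int rem = nshots % concurrent;
    return rem == 0 ? concurrent : rem;
}
```

For the numbers in this thread, `concurrent_shots(384, 4, 4)` gives 24, `shot_passes(51, 24)` gives 3, and `shots_in_last_pass(51, 24)` gives 3. Under this model, 51 is not divisible by 24, so 21 of the 24 shot groups have no shot to process in the final pass.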

This exception is very rare; I don't get it for other model sizes or numbers of shots.

20320209ws_fwi_3_strategy_51_Overthrust_true.err.txt
20320209ws_fwi_3_strategy_51_Overthrust_true.out.txt

@daniel-koehn
Owner

Hi Pavel,

I suspect that one problem when using shot parallelization might be that the non-merged model files are removed in PSV/model_it_out_PSV.c:

https://github.com/daniel-koehn/DENISE-Black-Edition/blob/master/src/PSV/model_it_out_PSV.c

Try commenting out or deleting all remove() calls in model_it_out_PSV.c and recompile the source code before running it again. If this is indeed the issue, similar problems will occur in gauss_filt.c and gauss_filt_var.c.

Best regards,

Daniel

@pplotn
Author

pplotn commented May 21, 2021

OK, thanks Daniel. I recompiled the code, and the problem still occurs with the same velocity model, though it does not happen with other models.

PE 0 is writing model to
./fwi/ws_fwi_3_strategy_55/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.0.0
**Message from mergemod (printed by PE 0):
PE 0 starts merge of 16 model files

writing merged model file to ./fwi/ws_fwi_3_strategy_55/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin
Opening model files: ./fwi/ws_fwi_3_strategy_55/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.??? Message from PE 0
R U N - T I M E E R R O R:
merge: can't read model file !
...now exiting to system.

@pplotn
Author

pplotn commented Jun 1, 2021

Hello, in my experience, setting NPROCX and NPROCY helps to get rid of this error.
It works with shot parallelization enabled.

@pplotn
Author

pplotn commented Jun 26, 2021

Increasing the STRINGSIZE value in fd.h helped.

@daniel-koehn
Owner

That makes sense. If the combined length of the model name and directory path exceeds the pre-defined maximum STRINGSIZE in fd.h, the domain-decomposition numbering might be missing from the file name extension of the model files. The mergemod function will then fail to merge the model files from the different sub-domains correctly. Thank you for finding this bug, Pavel.
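
As a rough illustration of this failure mode, the sketch below builds a sub-domain file name into a fixed-size buffer and reports when the name does not fit. It is not the DENISE code: `subdomain_name` is a hypothetical helper, and the `<base>.<i>.<j>` pattern is assumed from the file names shown earlier in this thread. It uses `snprintf` so that truncation can be detected; an unchecked write into a buffer that is too small would be undefined behavior in C.

```c
#include <stdio.h>

/* Hypothetical helper, not DENISE code: writes "<base>.<i>.<j>" into buf.
   Returns 0 if the full name fits, -1 if it was truncated (in which
   case part or all of the ".<i>.<j>" suffix is lost). */
int subdomain_name(char *buf, size_t bufsize, const char *base, int i, int j)
{
    int n = snprintf(buf, bufsize, "%s.%d.%d", base, i, j);

    /* snprintf returns the length the full string would have had,
       so n >= bufsize means the output was cut short. */
    return (n < 0 || (size_t)n >= bufsize) ? -1 : 0;
}
```

With the rho model path from this issue as `base`, a buffer of 64 bytes (an arbitrary illustrative size, not the value in fd.h) is too small and the call returns -1, while a buffer of 150 bytes holds the complete name including the `.0.0` suffix.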

@pplotn
Author

pplotn commented Jun 27, 2021

Yes, Daniel.
My folder paths are rather long, so I increased STRINGSIZE to 150.
