
OMP threading in CICE #114

Closed
apcraig opened this issue Mar 30, 2018 · 12 comments

Comments

@apcraig
Contributor

apcraig commented Mar 30, 2018

A few problematic OMP loops were unthreaded due to reproducibility problems found during testing; grep for TCXOMP to find them. These are in ice_dyn_eap, ice_dyn_evp, and ice_transport_remap. One issue may be thread safety in icepack_ice_strength, but it requires additional debugging.

More generally, we need to review and validate that threading is working properly in CICE and Icepack.
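
One common cause of this kind of reproducibility problem is a per-gridcell temporary left off the PRIVATE list of the parallel block loop, so answers depend on thread scheduling. A minimal standalone sketch of that pattern (hypothetical names and values, not the actual CICE or Icepack code):

```fortran
! Minimal standalone sketch (hypothetical names and values, not the actual
! CICE/Icepack code). If "work" is left off the PRIVATE list, all threads
! share one copy and the answers change with thread count and scheduling;
! with it listed, the result is reproducible.
program omp_private_sketch
   implicit none
   integer, parameter :: nx = 100, ny = 100, nblocks = 8
   real(kind=8) :: aice(nx,ny,nblocks), strength(nx,ny,nblocks), work
   integer :: i, j, iblk

   aice = 0.9d0

   !$OMP PARALLEL DO PRIVATE(iblk, i, j, work)
   do iblk = 1, nblocks
      do j = 1, ny
         do i = 1, nx
            work = 2.75d4 * aice(i,j,iblk)    ! per-thread temporary
            strength(i,j,iblk) = work
         enddo
      enddo
   enddo
   !$OMP END PARALLEL DO

   print *, 'sum(strength) =', sum(strength)
end program omp_private_sketch
```

Run with OMP_NUM_THREADS=1 and again with several threads; the printed sum should be bit-for-bit identical, which is essentially what the reproducibility testing checks.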

@apcraig apcraig self-assigned this Mar 30, 2018
@eclare108213 eclare108213 added this to To do in CICE v6 release via automation Apr 8, 2018
@eclare108213
Contributor

Let's check whether @mhrib addressed the ice_dyn_* loops in his refactoring.

@eclare108213
Contributor

see also #128

@eclare108213 eclare108213 removed this from To do in CICE v6 release Aug 9, 2018
@mhrib
Contributor

mhrib commented Dec 3, 2018

I did find issues with the same OMP loops (and a few more), but no solution other than commenting them out, as here. See also #252

@TillRasmussen
Contributor

In addition to the ones that Mads (MHRI) found, I found OMP issues in ice_history and ice_grid. I commented out all OMP directives in these two files, which saved the model from crashing when running with the Intel and GNU compilers. I have not found solutions, nor the specific locations of these bugs within the files.
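
For the record, the disable mechanism is just to make the directive's sentinel inert so the compiler treats it as an ordinary comment; the loops disabled earlier are tagged TCXOMP so they can be grepped for. A small compilable sketch with a hypothetical loop (not the actual ice_history or ice_grid code):

```fortran
! Compilable sketch of the comment-out workaround (hypothetical loop, not the
! actual ice_history/ice_grid code). Changing the !$OMP sentinel to an inert
! tag such as !TCXOMP makes the compiler ignore the directive, so the loop
! runs serially but the disabled site is still easy to grep for later.
program disable_omp_sketch
   implicit none
   integer, parameter :: n = 1000
   real(kind=8) :: a(n)
   integer :: i

!TCXOMP PARALLEL DO PRIVATE(i)          ! disabled: now just a comment
   do i = 1, n
      a(i) = real(i, kind=8)
   enddo
!TCXOMP END PARALLEL DO

   print *, 'sum(a) =', sum(a)
end program disable_omp_sketch
```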

@eclare108213
Contributor

I am uploading a set of slides here from a LANL training course on OpenMP profiling and debugging that I attended last week. Most of it is old news, but the profiling and debugging info at the end might be useful as we move forward with this task.
Workshop 6 Basic OpenMP and Profiling-2.pdf

@apcraig
Contributor Author

apcraig commented Dec 30, 2021

I have created a perf_suite that will be PR'ed soon. This runs a fixed suite of tests that attempt to assess CICE performance at different task and thread counts. It basically does three things.

  • It runs a few cases on 1 PE with no threading with different block sizes to assess the impact of block size on model performance.
  • It uses a fixed 16x16 block size and runs a series of scaling tests on 1 to 128 MPI tasks.
  • It uses the same 16x16 block size and runs a series of timing tests on 64 PEs with 64 to 4 MPI tasks and 1 to 16 threads (i.e. 64x1, 32x2, 16x4, 8x8, 4x16).

This is all done with the gx1 grid, roundrobin decomp, 2-day runs, and a basic out-of-the-box configuration. The idea is not to optimize the performance of CICE but to compare performance on different hardware, different compilers, and different task/thread counts for a very fixed problem. This is, in part, a starting point for further OMP tuning.
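
For reference, the settings being varied map onto the domain_nml decomposition namelist, roughly as below for one of the 64-PE cases (a sketch only; check the CICE user guide for the authoritative option names and values):

```fortran
! Sketch of the domain_nml decomposition settings being varied, shown for a
! 16-task x 4-thread case with the fixed 16x16 blocks. Option names should be
! checked against the CICE user guide; the thread count itself comes from the
! environment (e.g. OMP_NUM_THREADS=4), not the namelist.
&domain_nml
  nprocs            = 16             ! MPI tasks
  block_size_x      = 16             ! fixed 16x16 block size
  block_size_y      = 16
  max_blocks        = 32             ! enough blocks per task for gx1 at 16 tasks
  distribution_type = 'roundrobin'   ! round-robin block distribution
/
```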

I attach an Excel spreadsheet, CICE_OMP_perf.xlsx, that shows the results from testing on Narwhal with 4 compilers and Cheyenne with 3 compilers in table and graph form. This is for hash 9fb518e of CICE dated Dec 21, 2021, but also includes the Narwhal port and the perf_suite (which will be PR'ed soon).

There are lots of interesting insights. But with regard to OMP, we see that in this version of CICE (which has lots of OMP loops turned off that still need debugging), OMP is still doing something. In these tests, OMP is never faster than just using all MPI for the same total PE count. But for a given MPI task count, threads run faster than running the same MPI task count but single threaded (i.e. 16x4 vs 16x1), at least on Narwhal. Cheyenne shows less benefit from threading. This establishes a performance baseline and provides a starting point to improve OMP performance, probably using Narwhal gnu or cray to continue OMP tuning efforts.

@apcraig
Contributor Author

apcraig commented Dec 30, 2021

Note that CICE_OMP_perf.xlsx has an error: the 4x16 run is actually 8x16. I've fixed the error in perf_suite in my sandbox for future use. Ignore the 4x16 results for now.

@apcraig
Contributor Author

apcraig commented Dec 31, 2021

I attach an updated OMP results table and graphs, CICE_OMP_perf.xlsx. This also has a second sheet that shows all timing info for the threaded and unthreaded tests. If you look closely, you can see that Advection is just about the only section that threads reasonably. Column and Dynamics do not thread well, and maybe not at all. I'll try to understand this better.

@TillRasmussen
Contributor

TillRasmussen commented Jan 1, 2022

For the dynamics part, most of the OMP directives have been commented out, including the one in the subcycling iteration.
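
For context, as I understand it the directive in question sits on the block loop inside the EVP subcycling, so any thread-safety problem there is exercised ndte times per time step. A schematic, compilable sketch with hypothetical names (not the actual ice_dyn_evp code):

```fortran
! Schematic, compilable sketch with hypothetical names (not the actual
! ice_dyn_evp code): the threaded loop is the block loop *inside* the EVP
! subcycling, so a thread-safety bug in the per-block kernels is exercised
! ndte times every time step.
program evp_subcycle_sketch
   implicit none
   integer, parameter :: ndte = 120, nblocks = 8, n = 10000
   real(kind=8) :: u(n,nblocks)
   integer :: ksub, iblk, i

   u = 0.0d0
   do ksub = 1, ndte                     ! EVP subcycles (kept serial)
      !$OMP PARALLEL DO PRIVATE(iblk, i)
      do iblk = 1, nblocks               ! block loop: the disabled directive sits here
         do i = 1, n                     ! stand-in for the stress/stepu kernels
            u(i,iblk) = u(i,iblk) + 1.0d0 / real(ndte, kind=8)
         enddo
      enddo
      !$OMP END PARALLEL DO
   enddo

   print *, 'u(1,1) =', u(1,1)           ! ~1.0 after all subcycles
end program evp_subcycle_sketch
```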

@apcraig
Contributor Author

apcraig commented Jan 19, 2022

I believe #680 largely addresses this issue. I will close it when #680 is merged. We'll need to remain diligent with respect to OpenMP validation and performance.

@apcraig apcraig mentioned this issue Jan 19, 2022
@apcraig
Contributor Author

apcraig commented Mar 10, 2022

This has largely been addressed in #680 and apcraig#64. There are still some known issues in VP and 1d EVP.

@apcraig
Contributor Author

apcraig commented Mar 10, 2022

I will close this; VP and 1d EVP have their own issues. FYI, I added omp_suite and perf_suite to check OpenMP and evaluate performance.

@apcraig apcraig closed this as completed Mar 10, 2022
anton-seaice pushed a commit to ACCESS-NRI/CICE that referenced this issue Jan 22, 2024