Added OpenMP to the cubed-sphere interpolator and the matching cubed-sphere partitioner (#293)
Conversation
wdeconinck left a comment
Thanks @odlomax, this is a good idea. I have a few suggestions below.
Also, I'm wondering about the complexity of the variant and visitor. You can directly insert into an oversized triplet vector, as done in #292. The allocated memory is the same anyway, because the variant takes the size of its largest alternative. You then also avoid the extra visitor and the copy into a new vector of triplets. The eckit::SparseMatrix constructor skips over empty triplets during construction.
  // Loop over grid and set partitioning[].
- auto lonlatIt = grid.lonlat().begin();
- for (gidx_t i = 0; i < grid.size(); ++i) {
+ atlas_omp_parallel_for(gidx_t i = 0; i < grid.size(); ++i) {
This will incur a race condition on the lonlatIt iterator.
Probably what you want is this (not tested):
const size_t num_threads = atlas_omp_get_max_threads();
const size_t size = grid.size();
atlas_omp_parallel {
    // This is now executed concurrently in each available OpenMP thread.
    // All thread-private data here:
    const size_t thread_num = atlas_omp_get_thread_num();
    const size_t thread_begin = (thread_num * size) / num_threads;
    const size_t thread_end = ((thread_num + 1) * size) / num_threads;
    auto lonlatIt = grid.lonlat().begin() + thread_begin;  // Move iterator to thread_begin
    for (size_t i = thread_begin; i < thread_end; ++i, ++lonlatIt) {  // regular for, already within an OpenMP thread
        const auto& lonlat = *lonlatIt;
        partitioning[i] = finder.getCell(lonlat, listSize, edgeEpsilon, epsilon).isect ? mpi_rank : -1;
    }
}
The lon-lat iterator is random access, so const auto& lonLat = *(lonlatIt + i) should work, right? Feels a bit tidier if I can avoid direct omp library calls.
This won't work. A small test program shows this, with comments inside.
#include "atlas/library.h"
#include "atlas/runtime/Log.h"
#include "atlas/grid.h"

using namespace atlas;

int main(int argc, char* argv[]) {
    atlas::initialize(argc, argv);
    Grid grid{"CS-LFR-24"};

    // Following bounds can be chunked per thread.
    int ibegin = 10;
    int iend = 11;
    int isize = iend - ibegin;
    const auto begin = grid.xy().begin() + ibegin;

    // ------------------------------------------------------
    // 1) Dereference temporary iterator: Works... **NOT**
    //    AND possibly very expensive to do the random access
    //    (finding tile, finding row, finding columns, ...)
    //    versus "+1" which increments the already contained state.
    for (int i = 0; i < isize; ++i) {
        const PointXY& p = *(begin + i);  // creates a temporary iterator which goes out of scope
        // p is now dangling.
        ATLAS_DEBUG_VAR(p);
    }
    // OUTPUT: DEBUG( p : {0,0} )

    // ------------------------------------------------------
    // 2) Copy iterator: **WORKS** but possibly very expensive (see (1))
    //    PLUS ADDITIONAL copy construction of iterator in inner loop.
    for (int i = 0; i < isize; ++i) {
        auto it = begin + i;
        const PointXY& p = *it;
        ATLAS_DEBUG_VAR(p);
    }
    // OUTPUT: DEBUG( p : {39.375,-43.125} )

    // ------------------------------------------------------
    // 3) Standard iterator moved into start position: **WORKS**
    //    The iterator is only constructed once per OpenMP thread,
    //    when we apply the manual chunking (not one loop iteration per thread!)
    auto it = begin;
    for (int i = 0; i < isize; ++i, ++it) {
        const PointXY& p = *it;
        ATLAS_DEBUG_VAR(p);
    }
    // OUTPUT: DEBUG( p : {39.375,-43.125} )

    // ------------------------------------------------------
    atlas::finalize();
    return 0;
}
Yikes. Consider me educated.
More worrying, if we're having this debate, then the affected code is clearly untested...
I'd better sort that out!
                        std::to_string(i) + ".");
    }

    tileIndex.push_back(tijView(cell.idx, 0));
tileIndex is not used elsewhere? dead code?
It's used in our model interface, but I plan on removing that code before we advance our current Atlas tag.
Ah, I forgot about the triplet skipping! I'll do that then.
I've simplified the Triplets vector now. I'm getting sloppy.
wdeconinck left a comment
- A test is needed to illustrate the problem with the lonlat iterator in the MatchingMeshPartitionerCubedSphere.
- Once it fails, fix it by updating the lonlat iterator usage in the MatchingMeshPartitionerCubedSphere.
… Let's get rid of it.
        EXPECT_APPROX_EQ(lonLat.lat(), refLonLat.lat(), 1e-14);

        // Only now, *(begin + i), do you have my permission to die.
    }
So, funny thing, const lvalue refs to temporary object members can extend the lifetime of the host object...
https://quuxplusone.github.io/blog/2020/11/16/lifetime-extension-tidbit/
Not that this is okay! But it's a little tricky to show that technically valid C++ is broken.
This seems like the worst code practice and should not be relied on for sure :)
Indeed! It's a code stink!
I'll remove the test on the next push.
src/atlas/interpolation/method/cubedsphere/CubedSphereBilinear.cc
I've removed the redundant headers and the spooky C++ test. I went with the … I'm happy to go with the other solution, but it seems like it just moves the complexity from an odd dereferencing pattern to more explicit thread management.
The only issue is that this could be much more expensive than you think. Perhaps it's worth benchmarking the difference between:
Oh. I've just found the implementation of the … 🤦
I present option 4, because why not? 🙃
I'm still not vouching for this solution, but if you're happy with the performance, and it's an improvement over the non-threaded version, then it's all good for me.
There's time for it to go wrong yet! nvhpc hates lambdas with OpenMP!
So my understanding of this proposal (4) is that it essentially iterates sequentially, stores the points in a vector as an unstructured grid, and then uses OpenMP to compute the partitioning, using faster random access into the vector.
Yeah, that's the gist of it! I don't think we ever actually use this partitioner on other cubed spheres. I didn't realise (but should have expected) that the cubed-sphere iterator doesn't do random access properly. It's all such a hacky mess, but I think I've got a credible plan to safely dispose of it...
* release/0.43.0: (61 commits)
  - Update Changelog
  - Version 0.43.0
  - Change defaults of structured interpolation methods, originally modified with [ed0996d - Deprecate factory builders for structured interpolation methods]
  - Added OpenMP to the cubed-sphere interpolator and the matching cubed sphere partitioner. (#293)
  - Update ci-hpc-config.yml: turn off floating-point trapping due to openmpi problem
  - Update ci-hpc-config.yml
  - Fix warning on using uninitialized variable
  - Fix warnings and memory leak in test
  - Add tests that verifies running code referencing device_data on CPU
  - Fix stridesf when host_data == device_data, e.g. when running on CPU only
  - Only assert an arrays device is allocated when we have devices during make_device_view
  - Add atlas_acc_pragma
  - Disable ECKIT_GEO on for ci-hpc for now
  - pluto: Fortran API for 'integer pluto%devices()'
  - Turn off ATLAS_DEPRECATION_WARNINGS for the moment, to be enabled after upcoming release
  - Deprecate factory builders for structured interpolation methods
  - Add Factory deprecation mechanism
  - Apply interolation::NonLinear to arrays using another field's metadata
  - PointCloud: delay setup of halo_exchange and gather
  - Remove scheduling keywords from omp pragma
  - ...
This adds some OpenMP that was skipped in the initial implementation. Similar in intent to #292
I'd like to eventually retire this interpolator in favour of finite-element interpolation, but we're not quite ready yet.
I've also removed a nasty metadata code smell which was needed by the old cubed-sphere wind interpolation scheme. I've accordingly removed the test that used the metadata.