
Added OpenMP to the cubed-sphere interpolator and the matching cubed sphere partitioner. #293

Merged: wdeconinck merged 16 commits into ecmwf:develop from JCSDA-internal:feature/cubed_sphere_interp_omp on Jun 25, 2025
Conversation

@odlomax (Contributor) commented Jun 6, 2025

This adds some OpenMP that was skipped in the initial implementation. It's similar in intent to #292.

I'd like to eventually retire this interpolator in favour of finite-element interpolation, but we're not quite ready yet.

I've also removed a nasty metadata code smell which was needed by the old cubed-sphere wind interpolation scheme. I've accordingly removed the test that used the metadata.

@wdeconinck (Member) left a comment

Thanks @odlomax this is a good idea. I have a few suggestions below.

Also, I'm wondering why the complexity of the variant and visitor is needed. You can insert directly into an oversized triplet vector, as done in #292. The allocated memory is the same anyway, because the variant takes the size of its largest type. You then also avoid the extra visitor and the copy into a new vector of triplets. The eckit::SparseMatrix constructor skips over empty triplets during construction.

// Loop over grid and set partioning[].
auto lonlatIt = grid.lonlat().begin();
- for (gidx_t i = 0; i < grid.size(); ++i) {
+ atlas_omp_parallel_for(gidx_t i = 0; i < grid.size(); ++i) {
@wdeconinck (Member):

This will incur a race condition on the lonlatIt iterator.
Probably what you want is this (not tested):

const size_t num_threads = atlas_omp_get_max_threads();
const size_t size        = grid.size();
atlas_omp_parallel {
    // This is now executed concurrently in each available OpenMP threads
    // All thread-private data here:
    const size_t thread_num   = atlas_omp_get_thread_num();
    const size_t thread_begin = (thread_num * size) / num_threads;
    const size_t thread_end   = ((thread_num + 1) * size) / num_threads;
    auto lonlatIt = grid.lonlat().begin() + thread_begin; // Move iterator to thread_begin
    for (size_t i = thread_begin; i<thread_end; ++i, ++lonlatIt) { // regular for, already within an OpenMP thread
        const auto& lonlat = *lonlatIt;
        partitioning[i] = finder.getCell(lonlat, listSize, edgeEpsilon, epsilon).isect ? mpi_rank : -1;
    }
}

@odlomax (Contributor, author):

Whoops!

@odlomax (Contributor, author):

The lon-lat iterator is random access, so const auto& lonLat = *(lonlatIt + i) should work, right? Feels a bit tidier if I can avoid direct omp library calls.

@wdeconinck (Member):

This won't work. A small test program shows this, with comments inside.

#include "atlas/library.h"
#include "atlas/runtime/Log.h"
#include "atlas/grid.h"
using namespace atlas;

int main(int argc, char* argv[]) {
    atlas::initialize(argc,argv);

    Grid grid{"CS-LFR-24"};
    // Following bounds can be chunked per thread.
    int ibegin = 10;
    int iend   = 11;   
    int isize  = iend - ibegin;
    const auto begin = grid.xy().begin() + ibegin;

// ------------------------------------------------------

// 1) Dereference temporary iterator: Works... **NOT**
//      AND possibly very expensive to do the random access
//      (finding tile, finding row, finding columns, ...)
//      versus "+1" which increments the already contained state.
    for (int i=0; i<isize; ++i) {
        const PointXY& p = *(begin+i); // creates temporary iterator and goes out of scope
        // p is now dangling.
        ATLAS_DEBUG_VAR(p);
    }
// OUTPUT: DEBUG( p : {0,0} )

// ------------------------------------------------------

// 2) Copy iterator: **WORKS** but possibly very expensive (see (1))
//          PLUS ADDITIONAL copy construction of iterator in inner loop.
    for (int i=0; i<isize; ++i) {
        auto it = begin + i;
        const PointXY& p = *it;
        ATLAS_DEBUG_VAR(p);
    }
// OUTPUT: DEBUG( p : {39.375,-43.125} )

// ------------------------------------------------------

// 3) Standard iterator moved into start position: **WORKS**
//         The iterator is only constructed once per OpenMP thread,
//         when we apply the manual chunking (not one loop iteration per thread!)
    auto it = begin;
    for (int i=0; i<isize; ++i, ++it) {
        const PointXY& p = *(it);
        ATLAS_DEBUG_VAR(p);
    }
// OUTPUT: DEBUG( p : {39.375,-43.125} )

// ------------------------------------------------------
    atlas::finalize();
    return 0;
}

@odlomax (Contributor, author):

Yikes. Consider me educated.

@odlomax (Contributor, author):

More worrying, if we're having this debate, then the affected code is clearly untested...
I'd better sort that out!

@wdeconinck (Member):
I was thinking that too 👍

std::to_string(i) + ".");
}

tileIndex.push_back(tijView(cell.idx, 0));
@wdeconinck (Member):
Is tileIndex not used anywhere else? Dead code?

@odlomax (Contributor, author):

It's used in our model interface, but I plan on removing that code before we advance our current Atlas tag.

@odlomax (Contributor, author) commented Jun 6, 2025

> Thanks @odlomax this is a good idea. I have a few suggestions below.
>
> Also I'm wondering why the complexity of the variant and visitor. You can directly insert into an oversized triplet vector, like done in #292. The allocated memory is anyway the same because the variant takes the size of the largest type. Then you also avoid the extra visitor and copy into a new vector of triplets. The eckit::SparseMatrix constructor skips over empty triplets during construction.

Ah, I forgot about the triplet skipping! I'll do that then.

@odlomax (Contributor, author) commented Jun 9, 2025

> Thanks @odlomax this is a good idea. I have a few suggestions below.
>
> Also I'm wondering why the complexity of the variant and visitor. You can directly insert into an oversized triplet vector, like done in #292. The allocated memory is anyway the same because the variant takes the size of the largest type. Then you also avoid the extra visitor and copy into a new vector of triplets. The eckit::SparseMatrix constructor skips over empty triplets during construction.

I've simplified the Triplets vector now. I'm getting sloppy.

@wdeconinck (Member) left a comment

  • A test is needed to illustrate the problem with the lonlat iterator in the MatchingMeshPartitionerCubedSphere.
  • Once it fails, fix it by updating the lonlat iterator usage in the MatchingMeshPartitionerCubedSphere.

EXPECT_APPROX_EQ(lonLat.lat(), refLonLat.lat(), 1e-14);

// Only now, *(begin + i), do you have my permission to die.
}
@odlomax (Contributor, author) commented Jun 23, 2025:

@wdeconinck

So, funny thing, const lvalue refs to temporary object members can extend the lifetime of the host object...

https://quuxplusone.github.io/blog/2020/11/16/lifetime-extension-tidbit/

@odlomax (Contributor, author):

Not that this is okay! But it's a little tricky to show that technically valid C++ is broken.

@wdeconinck (Member):

This seems like the worst code practice and should not be relied on for sure :)

@odlomax (Contributor, author):

Indeed! It's a code stink!

I'll remove the test on the next push.

@odlomax (Contributor, author) commented Jun 24, 2025

I've removed the redundant headers and the spooky C++ test.

I went with the const auto lonLat = *(lonlatIt + i) assignment in the end. The presence of the + operator on the iterator expresses the intent that it should be random access. I think std::random_access_iterator_tag is actually set somewhere underneath the layers.

I'm happy to go with the other solution, but it seems like it just moves the complexity from an odd dereferencing pattern to more explicit thread management.

@wdeconinck (Member):

> I've removed the redundant headers and the spooky C++ test.
>
> I went with the const auto lonLat = *(lonlatIt + i) assignment in the end. The presence of the + operator on the iterator expresses intent that it should be random access. I think std::random_access_iterator_tag is actually set somewhere underneath the layers.
>
> I'm happy to go with the other solution, but it seems like it just moves the complexity from an odd dereferencing pattern to more explicit thread management.

The only issue is that this could be much more expensive than you think. Perhaps it's worth benchmarking the difference between:

  • not using any OpenMP, with the old formulation
  • using the *(lonlatIt + i) dereference
  • using explicit chunking

@odlomax (Contributor, author) commented Jun 24, 2025

>> I've removed the redundant headers and the spooky C++ test.
>> I went with the const auto lonLat = *(lonlatIt + i) assignment in the end. The presence of the + operator on the iterator expresses intent that it should be random access. I think std::random_access_iterator_tag is actually set somewhere underneath the layers.
>> I'm happy to go with the other solution, but it seems like it just moves the complexity from an odd dereferencing pattern to more explicit thread management.
>
> The only issue is that this could be much more expensive than you think. Perhaps it's worth benchmarking the difference between:
>
>   • not using any OpenMP, with the old formulation
>   • using the *(lonlatIt + i) dereference
>   • using explicit chunking

Oh.

I've just found the implementation of the += operator for the iterator in the CubedSphereGrid implementation...

🤦

@odlomax (Contributor, author) commented Jun 24, 2025

I present option 4, because why not? 🙃

@wdeconinck (Member):

> I present option 4, because why not? 🙃

I'm still not vouching for this solution, but if you're happy with the performance, and it's an improvement over the non-threaded version then all good for me.

@odlomax (Contributor, author) commented Jun 24, 2025

>> I present option 4, because why not? 🙃
>
> I'm still not vouching for this solution, but if you're happy with the performance, and it's an improvement over the non-threaded version then all good for me.

There's time for it to go wrong yet! nvhpc hates lambdas with OpenMP!

@wdeconinck (Member):

So my understanding with this proposal (4), is that essentially this iterates sequentially, stores it in a vector of points as unstructured grid, and then uses OpenMP to compute the partitioning, using faster random access in the vector.

@odlomax (Contributor, author) commented Jun 24, 2025

> So my understanding with this proposal (4), is that essentially this iterates sequentially, stores it in a vector of points as unstructured grid, and then uses OpenMP to compute the partitioning, using faster random access in the vector.

Yeah, that's the gist of it!

I don't think we ever actually use this partitioner on other cubed spheres. I didn't realise (but should have expected) that the cubed sphere iterator doesn't do random access properly.

It's all such a hacky mess, but I think I've got a credible plan to safely dispose of it...

@wdeconinck wdeconinck merged commit e4daac1 into ecmwf:develop Jun 25, 2025
171 of 187 checks passed
wdeconinck added a commit that referenced this pull request Jun 25, 2025
* release/0.43.0: (61 commits)
  Update Changelog
  Version 0.43.0
  Change defaults of structured interpolation methods, originally modified with [ed0996d - Deprecate factory builders for structured interpolation methods]
  Added OpenMP to the cubed-sphere interpolator and the matching cubed sphere partitioner. (#293)
  Update ci-hpc-config.yml: turn off floating-point trapping due to openmpi problem
  Update ci-hpc-config.yml
  Fix warning on using uninitialized variable
  Fix warnings and memory leak in test
  Add tests that verifies running code referencing device_data on CPU
  Fix stridesf when host_data == device_data, e.g. when running on CPU only
  Only assert an arrays device is allocated when we have devices during make_device_view
  Add atlas_acc_pragma
  Disable ECKIT_GEO on for ci-hpc for now
  pluto: Fortran API for 'integer pluto%devices()'
  Turn off ATLAS_DEPRECATION_WARNINGS for the moment, to be enabled after upcoming release
  Deprecate factory builders for structured interpolation methods
  Add Factory deprecation mechanism
  Apply interolation::NonLinear to arrays using another field's metadata
  PointCloud: delay setup of halo_exchange and gather
  Remove scheduling keywords from omp pragma
  ...
