Description
describe the bug
When running on 16 or more nodes on Aurora/SuperMUC-NG2 (Intel PVC nodes), entity crashes with the following error:
FATAL : Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 128 MiB (label="Kokkos::Random_XorShift1024::state").
FATAL : see the `*.err` file for more details
This happens even though there should be plenty of free memory available.
@haykh mentioned that this is caused by the Kokkos random number generator.
This issue is meant to track that problem. We should either figure this out with the Kokkos developers or replace the random number generator.
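For a rough sense of scale, the failed 128 MiB request can be converted into a number of generator states. The sketch below assumes each xorshift1024 state occupies exactly 1024 bits (16 words of 64 bits) with no padding or extra per-state data; Kokkos may lay the pool out differently, so the count is only an estimate. All names in the snippet are hypothetical and not part of Kokkos or entity.

```cpp
#include <cstdint>

// Assumption: one xorshift1024 generator holds 1024 bits of state,
// stored as 16 words of 64 bits each, with no padding.
constexpr std::uint64_t kWordsPerState = 16;
constexpr std::uint64_t kBytesPerState = kWordsPerState * sizeof(std::uint64_t);

// Convert MiB to bytes.
constexpr std::uint64_t mib_to_bytes(std::uint64_t mib) {
  return mib * 1024 * 1024;
}

// Number of whole generator states that fit into an allocation.
constexpr std::uint64_t states_in_allocation(std::uint64_t bytes) {
  return bytes / kBytesPerState;
}

// Size taken from the error message above (128 MiB).
constexpr std::uint64_t kFailedAllocBytes = mib_to_bytes(128);
constexpr std::uint64_t kImpliedStates = states_in_allocation(kFailedAllocBytes);

// Checked at compile time: 128 bytes per state, 2^20 states in 128 MiB.
static_assert(kBytesPerState == 128, "1024-bit state is 128 bytes");
static_assert(kImpliedStates == 1048576, "128 MiB / 128 B = 2^20 states");
```

Under that assumption, the allocation corresponds to about one million generator states per rank, which is modest compared with the memory of a PVC device and is consistent with the observation that free memory should not be the limiting factor.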
code version
1.3.0rc on hash: 4ab9bf3f450f374c67659527bab2ef97d62b73b5
compiler/library versions
Intel compiler with MPI: IntelLLVM 2025.2.0 (oneapi_2025.2.0/mpi/2021.16)
Kokkos: 4.6.02 and 4.7.00 (both show this problem)
cmake configuration command
On SuperMUC-NG2: cmake -B build -D pgen=streaming -D precision=single -D mpi=ON -D output=OFF -Dgpu_aware_mpi=OFF -DCMAKE_C_COMPILER=mpiicx -DCMAKE_CXX_COMPILER=mpiicpx