[BUG] Memory allocation error from Kokkos random number generator #132

@LudwigBoess

describe the bug
When running on 16 or more nodes on Aurora/SuperMUC-NG2 (Intel PVC nodes), entity crashes with the following error:

FATAL : Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 128 MiB (label="Kokkos::Random_XorShift1024::state").
FATAL : see the `*.err` file for more details

This happens even though there should be plenty of free memory available.
@haykh mentioned that this is caused by the Kokkos random number generator.
This issue is meant to track that problem. We should either figure this out with the Kokkos developers or replace the random number generator.
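
As a rough sanity check on the size reported in the error message: an XorShift1024 generator carries 1024 bits (128 bytes) of state. Assuming the failing allocation holds nothing but those per-generator states (an assumption; the pool may keep smaller auxiliary arrays under other labels), 128 MiB works out to 2^20 generator states. The sketch below only does that arithmetic; `states_in_allocation` is a hypothetical helper, not part of Kokkos or entity.

```cpp
#include <cstdint>

// One XorShift1024 generator state: 1024 bits = 16 x 64-bit words = 128 bytes.
constexpr std::uint64_t kBytesPerState = 1024 / 8;

// Number of generator states that fit into an allocation of `bytes` bytes,
// ASSUMING the buffer stores only the 1024-bit states (hypothetical helper).
constexpr std::uint64_t states_in_allocation(std::uint64_t bytes) {
  return bytes / kBytesPerState;
}

// Size taken from the error message: 128 MiB.
constexpr std::uint64_t kFailedBytes = 128ull * 1024ull * 1024ull;

// Compile-time check of the arithmetic: 128 MiB / 128 B = 2^20 states.
static_assert(states_in_allocation(kFailedBytes) == (1ull << 20),
              "128 MiB should correspond to 2^20 XorShift1024 states");
```

If that assumption holds, the request is for roughly one million generator states per rank, which is small compared to the memory of a PVC device and is consistent with the observation that free memory should not be the limiting factor.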

code version
1.3.0rc on hash: 4ab9bf3f450f374c67659527bab2ef97d62b73b5

compiler/library versions
Intel compiler with MPI: IntelLLVM 2025.2.0 (oneapi_2025.2.0/mpi/2021.16)
Kokkos: 4.6.02 and 4.7.00 (both show this problem).

cmake configuration command
On SuperMUC-NG2: cmake -B build -D pgen=streaming -D precision=single -D mpi=ON -D output=OFF -Dgpu_aware_mpi=OFF -DCMAKE_C_COMPILER=mpiicx -DCMAKE_CXX_COMPILER=mpiicpx
