Nx.Random.shuffle/3 fails for large tensors on cuda backend #1551

@VictorT42

Description

When running the following code on an EXLA backend with a CUDA GPU, through the livebook:0.14.5-cuda12 image:

key = Nx.Random.key(1)
input = Nx.iota({1_000_000})
{output, _new_key} = Nx.Random.shuffle(key, input)
output

the result looks like this:

#Nx.Tensor<
  s32[1000000]
  EXLA.Backend<cuda:0, 0.477854050.2664562754.2456>
  [163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, ...]
>

The bug also occurs for smaller tensors, but less frequently. The deciding factor is the size of the axis being shuffled; the other dimensions of the tensor do not seem to be relevant. It starts to appear at an axis size of around 100,000 and is guaranteed at 1,000,000 and above.

Moreover, about half the time the execution never completes and the Livebook runtime has to be restarted.

This has been observed on two different machines, one with an RTX 4090 graphics card and one with a GTX 1070 Ti. The bug did not occur when the same code was run on the CPU on the same machines.
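
To make the failure easier to confirm than by reading the truncated tensor output, here is a minimal sketch that checks whether the shuffled result is still a permutation of the input at a few axis sizes. It assumes Nx and EXLA are already installed and that EXLA is the default backend, as in the Livebook setup above. The module name ShuffleCheck and the chosen sizes are only illustrative and are not part of Nx.

```elixir
# Hypothetical helper, not part of Nx: verifies that a shuffle result
# contains exactly the same elements as its input.
defmodule ShuffleCheck do
  # Sorts both tensors and compares them element-wise.
  # Returns true only if every position matches.
  def permutation?(input, output) do
    Nx.sort(output)
    |> Nx.equal(Nx.sort(input))
    |> Nx.all()
    |> Nx.to_number()
    |> Kernel.==(1)
  end
end

key = Nx.Random.key(1)

# Axis sizes chosen around the range where the bug was reported to appear.
for size <- [10_000, 100_000, 1_000_000] do
  input = Nx.iota({size})
  {output, _new_key} = Nx.Random.shuffle(key, input)
  valid? = ShuffleCheck.permutation?(input, output)
  IO.puts("size=#{size} valid_permutation=#{valid?}")
end
```

A correct shuffle should report a valid permutation at every size. Given the hangs described above, the loop itself may never complete on an affected GPU setup.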
