`Nx.Random.shuffle/3` fails for large tensors on `cuda` backend

When running the following code on an EXLA backend with a CUDA GPU, through the `livebook:0.14.5-cuda12` image:
```elixir
key = Nx.Random.key(1)
input = Nx.iota({1_000_000})
{output, _new_key} = Nx.Random.shuffle(key, input)
output
```
the result looks like this:
```elixir
#Nx.Tensor<
  s32[1000000]
  EXLA.Backend<cuda:0, 0.477854050.2664562754.2456>
  [163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, 163918, ...]
>
```
The bug also occurs for smaller tensors, but less frequently. The deciding factor is the size of the axis being shuffled; the other dimensions of the tensor do not seem to be relevant. It seems to start happening at an axis size of around `100,000`, and is guaranteed once the axis size reaches `1,000,000`.
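To put rough numbers on the thresholds above, something like the following sketch could be run in the same Livebook session. It assumes EXLA with the CUDA client is the default backend, as in the report. The `valid_permutation?` helper, the list of axis sizes, and the 10 seeds per size are illustrative choices, not part of Nx. Since the check itself sorts on the GPU, its result should be treated as indicative only, and a run that hangs will still block the loop.

```elixir
# Hypothetical helper (not part of Nx): true when `output` contains
# every value of 0..n-1 exactly once, i.e. it is a permutation of Nx.iota({n}).
valid_permutation? = fn output, n ->
  output
  |> Nx.sort()
  |> Nx.equal(Nx.iota({n}))
  |> Nx.all()
  |> Nx.to_number()
  |> Kernel.==(1)
end

# Illustrative axis sizes around the thresholds described above.
for n <- [10_000, 100_000, 500_000, 1_000_000] do
  failures =
    Enum.count(1..10, fn seed ->
      {output, _new_key} = Nx.Random.shuffle(Nx.Random.key(seed), Nx.iota({n}))
      not valid_permutation?.(output, n)
    end)

  IO.puts("axis size #{n}: #{failures}/10 invalid shuffles")
end
```

On a backend where the shuffle behaves correctly (the CPU, per the observations in this report), every line should report `0/10`.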

Moreover, about half the time the execution never completes and the Livebook runtime has to be restarted.

This has been observed on two different machines, one with an RTX 4090 graphics card and one with a GTX 1070 Ti. The bug did not occur when running the same code on the CPU on the same machines.

`Nx.Random.shuffle/3` fails for large tensors on `cuda` backend #1551
