
CUDA runtime error: invalid device function sampling_topp_kernels.cu #14

Closed
SaigyoujiYuyuko233 opened this issue Aug 8, 2022 · 4 comments

Comments

SaigyoujiYuyuko233 commented Aug 8, 2022

Hi guys,

I'm using:

  • Model: codegen-2B-multi
  • GPU: GTX 1070 w/ 8 GB VRAM
  • OS: Fedora 36, kernel 5.18.16-200.fc36.x86_64
  • NVIDIA driver 515.57 w/ CUDA 11.7

I'm using Podman as the container runtime with the NVIDIA Container Toolkit:

Client:       Podman Engine
Version:      4.1.1
API Version:  4.1.1
Go Version:   go1.18.4
Built:        Fri Jul 22 15:05:59 2022
OS/Arch:      linux/amd64

cat /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ]
    },
    "when": {
        "always": true,
        "commands": [".*"]
    },
    "stages": ["prestart"]
}

The nvidia-smi command works fine inside the container.

[screenshot: nvidia-smi output inside the container]

Problem

The Triton server starts fine, but it crashes when I send a request using the OpenAI API demo from the README.
[screenshots: request and server crash output]

Is this a GPU compatibility issue? If so, which GPU models are supported?

Any help would be appreciated!

moyix (Collaborator) commented Aug 8, 2022

Hmm, FasterTransformer has only been tested on Compute Capability >= 7.0, and the 1070 is 6.0. So it's possible something it uses is limited to more recent cards. For now I'll add a note to the README but I'll leave this open to investigate further.

The line that's failing is:

            check_cuda_error(
                cub::DeviceSegmentedRadixSort::SortPairsDescending(nullptr,
                                                                   cub_temp_storage_size,
                                                                   log_probs,
                                                                   (T*)nullptr,
                                                                   id_vals,
                                                                   (int*)nullptr,
                                                                   vocab_size * batch_size,
                                                                   batch_size,
                                                                   begin_offset_buf,
                                                                   offset_buf + 1,
                                                                   0,              // begin_bit
                                                                   sizeof(T) * 8,  // end_bit = sizeof(KeyT) * 8
                                                                   stream));       // cudaStream_t

SaigyoujiYuyuko233 (Author) commented:

Thanks for the reply! I will get a better card later.

Frederisk (Contributor) commented Sep 22, 2022

After my testing, this now works fine on a 1060 (Compute Capability 6.1).

BTW, the Compute Capability of the 1070 should also be 6.1.
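
For anyone double-checking, a tiny sketch (not from this repo) that prints what the CUDA runtime reports for each visible GPU:

    // Print the compute capability reported for each visible device.
    // Both the GTX 1060 (GP106) and GTX 1070 (GP104) are Pascal parts and should report 6.1.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            std::printf("Device %d: %s, compute capability %d.%d\n",
                        i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }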

thakkarparth007 (Collaborator) commented:

Closing this as @Frederisk found it working. If it's still an issue, please reopen.
