
CUDA runtime error: invalid device function sampling_topp_kernels.cu #14

Closed
SaigyoujiYuyuko233 opened this issue Aug 8, 2022 · 4 comments

Comments

SaigyoujiYuyuko233 commented Aug 8, 2022

Hi guys,

I'm using:

  • Model: codegen-2B-multi
  • GPU: GTX 1070 w/ 8 GB VRAM
  • OS: Fedora 36, kernel 5.18.16-200.fc36.x86_64
  • NVIDIA driver 515.57 w/ CUDA 11.7

I'm using Podman as the container runtime with the NVIDIA Container Toolkit:

Client:       Podman Engine
Version:      4.1.1
API Version:  4.1.1
Go Version:   go1.18.4
Built:        Fri Jul 22 15:05:59 2022
OS/Arch:      linux/amd64

cat /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ]
    },
    "when": {
        "always": true,
        "commands": [".*"]
    },
    "stages": ["prestart"]
}

The nvidia-smi command works fine inside the container.

[screenshot: nvidia-smi output inside the container]

Problem

The Triton server starts fine, but it crashes when I send a request using the OpenAI API demo from the README.
[screenshots: request and server crash output]

Is this a GPU compatibility issue? If so, which GPU models are supported?

Any help would be appreciated!

moyix (Collaborator) commented Aug 8, 2022

Hmm, FasterTransformer has only been tested on Compute Capability >= 7.0, and the 1070 is 6.0. So it's possible something it uses is limited to more recent cards. For now I'll add a note to the README but I'll leave this open to investigate further.

The line that's failing is:

            check_cuda_error(
                cub::DeviceSegmentedRadixSort::SortPairsDescending(nullptr,
                                                                   cub_temp_storage_size,
                                                                   log_probs,
                                                                   (T*)nullptr,
                                                                   id_vals,
                                                                   (int*)nullptr,
                                                                   vocab_size * batch_size,
                                                                   batch_size,
                                                                   begin_offset_buf,
                                                                   offset_buf + 1,
                                                                   0,              // begin_bit
                                                                   sizeof(T) * 8,  // end_bit = sizeof(KeyT) * 8
                                                                   stream));       // cudaStream_t

SaigyoujiYuyuko233 (Author) commented:

Thanks for the reply! I will get a better card later.

Frederisk (Contributor) commented Sep 22, 2022

After my testing, this now works fine on a 1060 (Compute Capability 6.1).

BTW, the Compute Capability of the 1070 should also be 6.1.
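
For anyone double-checking, a tiny sketch (not from this repo) that prints what the CUDA runtime reports for each visible GPU:

    // Print the compute capability reported for each visible device.
    // Both the GTX 1060 (GP106) and GTX 1070 (GP104) are Pascal parts and should report 6.1.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            std::printf("Device %d: %s, compute capability %d.%d\n",
                        i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }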

thakkarparth007 (Collaborator) commented:

Closing this as @Frederisk found it working. If it's still an issue, please reopen.
