Ran into crashes when testing LLM.int8() from transformers #18

@changlan

Description

Hi, I was testing LLM.int8() on the LongT5 model (the crash also reproduces with google/t5-v1_1-large, used in the script below), but I consistently ran into the following error:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 110
CUDA SETUP: Loading binary /opt/conda/envs/python38/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so...

python3: /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu:375: int igemmlt(cublasLtHandle_t, int, int, int, const int8_t*, const int8_t*, void*, float*, int, int, int) [with int FORMATB = 3; int DTYPE_OUT = 32; int SCALE_ROWS = 0; cublasLtHandle_t = cublasLtContext*; int8_t = signed char]: Assertion `false' failed.
Aborted
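For context, the setup log above shows compute capability 7.0 and the *_nocublaslt binary being loaded. If I understand correctly, the cublasLt int8 kernels behind igemmlt need compute capability 7.5 (Turing) or newer, so a quick check like this (minimal sketch, not part of the original run) should flag whether the GPU qualifies:

import torch

# LLM.int8()'s fast path goes through cublasLt int8 kernels, which (as far
# as I can tell) require compute capability >= 7.5 (Turing or newer). The
# log above reports 7.0, so bitsandbytes loads the *_nocublaslt binary,
# whose igemmlt stub is what fires the assertion.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"compute capability: {major}.{minor}")
    print("cublasLt int8 path available:", (major, minor) >= (7, 5))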

Sample script to reproduce:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('google/t5-v1_1-large')
# load_in_8bit=True swaps nn.Linear layers for bitsandbytes Linear8bitLt;
# device_map="auto" lets accelerate place the weights on the available GPUs.
model_8bit = AutoModelForSeq2SeqLM.from_pretrained('google/t5-v1_1-large', device_map="auto", load_in_8bit=True)

sentences = ['hello world']

inputs = tokenizer(sentences, return_tensors="pt", padding=True)

# The assertion fires inside the int8 matmul during generation.
output_sequences = model_8bit.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=256
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
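In case it helps others hitting this, a possible interim workaround (an untested sketch on my side, not a confirmed fix) is to skip 8-bit loading on pre-Turing GPUs and fall back to fp16, which avoids the igemmlt kernels entirely:

import torch
from transformers import AutoModelForSeq2SeqLM

# Untested fallback sketch for GPUs below compute capability 7.5:
# load the model in fp16 instead of 8-bit.
model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(
    'google/t5-v1_1-large', device_map="auto", torch_dtype=torch.float16
)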
