Ran into crashes when testing LLM.int8() from transformers #18

@changlan

Description

Hi, I was testing LLM.int8() on the LongT5 model (the crash also reproduces with google/t5-v1_1-large, used in the script below), but I consistently ran into the following error:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 110
CUDA SETUP: Loading binary /opt/conda/envs/python38/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so...

python3: /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu:375: int igemmlt(cublasLtHandle_t, int, int, int, const int8_t*, const int8_t*, void*, float*, int, int, int) [with int FORMATB = 3; int DTYPE_OUT = 32; int SCALE_ROWS = 0; cublasLtHandle_t = cublasLtContext*; int8_t = signed char]: Assertion `false' failed.
Aborted
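For context, the setup log above shows compute capability 7.0 and the *_nocublaslt binary being loaded. If I understand correctly, the cublasLt int8 kernels behind igemmlt need compute capability 7.5 (Turing) or newer, so a quick check like this (minimal sketch, not part of the original run) should flag whether the GPU qualifies:

import torch

# LLM.int8()'s fast path goes through cublasLt int8 kernels, which (as far
# as I can tell) require compute capability >= 7.5 (Turing or newer). The
# log above reports 7.0, so bitsandbytes loads the *_nocublaslt binary,
# whose igemmlt stub is what fires the assertion.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"compute capability: {major}.{minor}")
    print("cublasLt int8 path available:", (major, minor) >= (7, 5))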

Sample script to reproduce:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('google/t5-v1_1-large')
# load_in_8bit=True swaps nn.Linear layers for bitsandbytes Linear8bitLt;
# device_map="auto" lets accelerate place the weights on the available GPUs.
model_8bit = AutoModelForSeq2SeqLM.from_pretrained('google/t5-v1_1-large', device_map="auto", load_in_8bit=True)

sentences = ['hello world']

inputs = tokenizer(sentences, return_tensors="pt", padding=True)

# The assertion fires inside the int8 matmul during generation.
output_sequences = model_8bit.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=256
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
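In case it helps others hitting this, a possible interim workaround (an untested sketch on my side, not a confirmed fix) is to skip 8-bit loading on pre-Turing GPUs and fall back to fp16, which avoids the igemmlt kernels entirely:

import torch
from transformers import AutoModelForSeq2SeqLM

# Untested fallback sketch for GPUs below compute capability 7.5:
# load the model in fp16 instead of 8-bit.
model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(
    'google/t5-v1_1-large', device_map="auto", torch_dtype=torch.float16
)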
