When I execute the Python script below with the DeepSpeed Inference API and a pre-trained BERT model (downloaded via the HuggingFace Transformers API), I get the following error. I see this error on at least NVIDIA V100 and P100 GPUs.
File "/home/miniconda3/lib/python3.8/site-packages/transformers/pipelines/fill_mask.py", line 193, in __call__
probs = logits.softmax(dim=-1)
RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'
The root cause is that PyTorch's softmax kernel on this code path is not implemented for FP16, and the transformers library (fill_mask.py) invokes it on half-precision logits. Modifying line 193 of fill_mask.py as follows mitigates the issue:
probs = logits.float().softmax(dim=-1)
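As a minimal sketch of the workaround (independent of the pipeline, using a randomly generated tensor to stand in for the model's logits), upcasting to FP32 before the softmax avoids the missing-Half kernel:

```python
import torch

# Stand-in for half-precision logits, as produced under FP16 inference
logits = torch.randn(1, 8, dtype=torch.half)

# Upcast to FP32 so the float softmax kernel is used;
# the probabilities can be cast back to half downstream if needed
probs = logits.float().softmax(dim=-1)
```

The cast costs one extra copy of the logits tensor, but only for the final softmax, so the rest of the model still runs in FP16.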
import torch
from transformers import pipeline
import deepspeed

p = pipeline('fill-mask', model='bert-base-cased', device=0)
p.model = deepspeed.init_inference(p.model, mp_size=1, dtype=torch.half)
result = p("Hello I'm a [MASK] model.", do_sample=True, min_length=50)