DeepSpeed Inference with FP16 (torch.half) -- RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half' #1313

@blueviggen

Description

When I execute the following Python script with the DeepSpeed Inference API and a pre-trained BERT model (downloaded via the HuggingFace Transformers API), I get the error below. I'm seeing it at least on NVIDIA V100 and P100 GPUs.

File "/home/miniconda3/lib/python3.8/site-packages/transformers/pipelines/fill_mask.py", line 193, in __call__
    probs = logits.softmax(dim=-1)
RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'

The root cause is that PyTorch's softmax kernel does not support FP16 (torch.half) inputs on CPU, and the transformers library calls it on the model's half-precision logits in fill_mask.py. Modifying line 193 of fill_mask.py as follows mitigates the issue:

probs = logits.float().softmax(dim=-1)
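The mitigation can be shown in isolation (a minimal sketch with a made-up logits tensor, no transformers needed): on affected PyTorch versions, calling softmax on a half-precision CPU tensor raises the RuntimeError above, so the logits are upcast to float32 first.

```python
import torch

# Hypothetical half-precision logits standing in for the model output.
logits = torch.randn(2, 5).half()

# Mitigation from fill_mask.py line 193: upcast to float32 before softmax.
probs = logits.float().softmax(dim=-1)

assert probs.dtype == torch.float32
# Each row of a softmax output sums to ~1.
assert torch.allclose(probs.sum(dim=-1), torch.ones(2), atol=1e-3)
```

The upcast costs one extra float32 copy of the logits, which is negligible for the fill-mask output tensor compared to leaving the rest of the model in FP16.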


Script to reproduce (note that `import torch` was missing from the original snippet even though `torch.half` is used):

from transformers import pipeline
import deepspeed
import torch

p = pipeline('fill-mask', model='bert-base-cased', device=0)

# Wrap the pipeline's model with the DeepSpeed inference engine in FP16.
p.model = deepspeed.init_inference(p.model, mp_size=1, dtype=torch.half)

result = p("Hello I'm a [MASK] model.", do_sample=True, min_length=50)
