When I execute the Python script below with the DeepSpeed Inference API and a pre-trained BERT model (downloaded via the HuggingFace Transformers API), I get the following error. I see this error on at least NVIDIA V100 and P100 GPUs.
File "/home/miniconda3/lib/python3.8/site-packages/transformers/pipelines/fill_mask.py", line 193, in __call__
probs = logits.softmax(dim=-1)
RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'
The root cause is that PyTorch's softmax kernel on this code path is not implemented for FP16, and the transformers library (fill_mask.py) invokes it on half-precision logits. Modifying line 193 of fill_mask.py as follows mitigates the issue:
probs = logits.float().softmax(dim=-1)
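As a minimal sketch of the workaround (independent of the pipeline, using a randomly generated tensor to stand in for the model's logits), upcasting to FP32 before the softmax avoids the missing-Half kernel:

```python
import torch

# Stand-in for half-precision logits, as produced under FP16 inference
logits = torch.randn(1, 8, dtype=torch.half)

# Upcast to FP32 so the float softmax kernel is used;
# the probabilities can be cast back to half downstream if needed
probs = logits.float().softmax(dim=-1)
```

The cast costs one extra copy of the logits tensor, but only for the final softmax, so the rest of the model still runs in FP16.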
import torch
from transformers import pipeline
import deepspeed

p = pipeline('fill-mask', model='bert-base-cased', device=0)
p.model = deepspeed.init_inference(p.model, mp_size=1, dtype=torch.half)
result = p("Hello I'm a [MASK] model.", do_sample=True, min_length=50)