I am trying to use the `DeepSpeedTransformerLayer` and I am wondering what format the attention mask should take for left-to-right language model training.

From https://github.com/microsoft/DeepSpeed/blob/44bd538b110ce0e8fc69626854631c3aee0dc094/tests/unit/test_cuda_forward.py#L181 it seems like `(bs, 1, seq_len, seq_len)` could be correct, but with `input_size = torch.Size([1, 501, 512])` and `input_mask.shape = [1, 501, 501]` this line raises an error:

```python
input_mask = torch.cat((input_mask,
                        torch.ones((inp_size[0], input_mask.shape[1], input_mask.shape[2],
                                    (16 - (inp_size[1] % 16))),
                                   device=input_mask.device, dtype=input_mask.dtype) * -10000), 3)
```

```
E   IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```
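For what it's worth, here is a minimal sketch of what seems to be going on (assuming a recent PyTorch; the shapes and the pad-to-multiple-of-16 logic are taken from the traceback above, not from any DeepSpeed documentation): the padding code concatenates along dim 3, which only exists when the mask is 4-D.

```python
import torch

seq_len = 501
pad = 16 - (seq_len % 16)  # DeepSpeed pads the sequence length up to a multiple of 16 -> 11

# 3-D mask as in my report: cat on dim 3 fails, since a 3-D tensor only has dims 0..2
mask_3d = torch.zeros(1, seq_len, seq_len)
try:
    torch.cat((mask_3d, torch.ones(1, seq_len, pad) * -10000), 3)
except IndexError as e:
    print(e)  # dimension out of range

# 4-D mask (bs, 1, seq_len, seq_len), as used in test_cuda_forward.py: padding works
mask_4d = torch.zeros(1, 1, seq_len, seq_len)
padded = torch.cat((mask_4d, torch.ones(1, 1, seq_len, pad) * -10000), 3)
print(padded.shape)  # torch.Size([1, 1, 501, 512])
```

So unsqueezing the mask to 4-D (e.g. `mask.unsqueeze(1)` on a `(bs, seq_len, seq_len)` mask) would at least get past this line, though I don't know if that is the intended format.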
There is no docstring so I figured I'd ask. Thanks!