I found that there is a constraint on the dimensionality when using the transformer CUDA kernel: https://github.com/microsoft/DeepSpeed/blob/d720fdb6857f4b71d922ca1e8efbe5271b5fb7c2/csrc/transformer/normalize_kernels.cu#L232-L250
What is the reason behind this restriction? Are there any plans to support arbitrary dimensionality? Alternatively, if I want to use hidden_dim=4096 or 8192, what would I need to change to make it work? Thanks.