Hello,
It seems that the Triton and PyTorch implementations used in the efficiency experiments do not include an attention bias in the attention operation.
Am I understanding this correctly? If so, I'm curious why the attention bias was left out of these efficiency experiments.
Is it considered unimportant for these measurements, or is there another reason?
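For reference, by "attention bias" I mean the additive term B in softmax(QK^T / sqrt(d) + B)V. A minimal PyTorch sketch of what I'd expect the benchmarked operation to include (shapes and the `bias` tensor are just illustrative, not taken from this repo):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim)
B, H, L, D = 2, 8, 128, 64
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)
bias = torch.randn(B, H, L, L)  # additive attention bias, e.g. a pairwise/relative-position term

# Attention with the additive bias: softmax(Q K^T / sqrt(D) + bias) V
scores = q @ k.transpose(-2, -1) / D**0.5 + bias
out_manual = scores.softmax(dim=-1) @ v

# PyTorch's fused kernel accepts the same thing via attn_mask
# (a float attn_mask is added to the attention scores)
out_fused = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)

assert torch.allclose(out_manual, out_fused, atol=1e-4)
```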
Thanks.