Masks are broken. #146

Closed
anj1 opened this issue Aug 7, 2023 · 5 comments

anj1 commented Aug 7, 2023

This problem could either be in this repo or NeuralAttentionlib, but I'm posting it here since the example using Transformers.jl is easier to run and more informative.

Basically, the implementation of all attention masks is broken, at least with the default setup. For example, if one creates two masks, one all true and the other all false, one gets exactly the same output! You can verify that the last two lines below produce the same output.

using Transformers, Transformers.Layers
using NeuralAttentionlib

m1 = NeuralAttentionlib.GenericAttenMask(ones(Bool, 4, 4))   # mask with all true
m0 = NeuralAttentionlib.GenericAttenMask(zeros(Bool, 4, 4))  # mask with all false

tb = TransformerBlock(1, 8, 8, 8)

x = randn(8, 4, 1)

tb.attention.layer.layer(x, m0)   # these two calls return exactly the same output
tb.attention.layer.layer(x, m1)

My hypothesis is that this is due to the default GenericMaskOp being +.
Either way, it makes the package unusable.

chengchingwen (Owner) commented
No, it is not that the masks are broken; there is just no such thing as a "mask with all false". Attention with softmax uses a really small value to remove the masked entries. In the case where everything is masked, every entry gets the same small attention score, so everything is attended to uniformly. The two lines give the same output because a randomly initialized Transformer usually attends uniformly anyway. If you really need that behavior, you can use the components in NeuralAttentionlib to create a new attention operator and call the Transformer constructor with that operator.
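For illustration, a minimal sketch of that behavior using NNlib's softmax directly (NNlib and the -1f9 fill value are just stand-ins here, not Transformers.jl internals):

using NNlib: softmax

scores     = randn(Float32, 4, 4)   # raw attention scores, 4 keys × 4 queries
all_masked = fill(-1f9, 4, 4)       # "everything masked": every score replaced by the same small value

softmax(scores; dims = 1)           # ordinary attention weights
softmax(all_masked; dims = 1)       # every column is uniform (all entries 0.25), not zero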

anj1 commented Aug 8, 2023

The mask with all zeros is just an illustrative example to show the issue with the mask application logic. You can try other kinds of masks and verify that the results are not as expected.

chengchingwen (Owner) commented
There are tests that make sure the masks work as expected. Many models won't work if the CausalMask isn't applied. Another possible explanation for your observation is uniform attention scores caused by the random initialization: you can try changing the distribution of the weights and rerunning the model to see if the values are still the same. If you are pretty sure the mask is broken, please give an MWE with non-trivial masks (at least not a mask with all zeros, as explained above).
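As a starting point, a minimal sketch of what such a non-trivial comparison could look like, reusing tb, x, and m1 from the report above (CausalMask is assumed to be NeuralAttentionlib's standard causal mask):

mc = NeuralAttentionlib.CausalMask()          # non-trivial mask: position i attends only to positions ≤ i

y_full   = tb.attention.layer.layer(x, m1)    # all-true mask, nothing removed
y_causal = tb.attention.layer.layer(x, mc)    # causal mask

# If the mask is applied, these two results should clearly differ.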

anj1 commented Aug 13, 2023

There are no tests, neither here nor in NeuralAttentionlib, that CausalMask actually works correctly during the transformer attention update. NeuralAttentionlib does have tests that CausalMask returns the correct values, but that is completely separate from determining whether CausalMask is actually applied as intended when attention is computed.

Note that it is quite possible for the CausalMask to fail to properly mask decoder tokens and yet for the huggingface examples to still work.
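A sketch of the kind of property test being asked for, at the attention-function level. This assumes NeuralAttentionlib.naive_qkv_attention accepts a trailing mask argument and arrays shaped (feature, length, batch); with a causal mask, outputs at earlier positions must not change when a later key/value position changes:

using NeuralAttentionlib

q = randn(Float32, 8, 4, 1)
k = randn(Float32, 8, 4, 1)
v = randn(Float32, 8, 4, 1)
mask = NeuralAttentionlib.CausalMask()

y1 = NeuralAttentionlib.naive_qkv_attention(q, k, v, mask)

k2, v2 = copy(k), copy(v)
k2[:, end, :] .= 0                            # perturb only the last (future) position
v2[:, end, :] .= 0
y2 = NeuralAttentionlib.naive_qkv_attention(q, k2, v2, mask)

y1[:, 1:end-1, :] ≈ y2[:, 1:end-1, :]         # should be true if causal masking is applied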

chengchingwen (Owner) commented
Note that it is quite possible for the CausalMask to fail to properly mask decoder tokens and yet for the huggingface examples to still work.

Again, you'll need to provide an example to make this discussion concrete.

The supported models are tested against huggingface transformers with the validation code in the example folder, and we do observe that missing masks have a huge impact on the numerical outputs.
