Is erase_and_mask working as intended?
```python
import numpy as np

def erase_and_mask(s, tokenizer, mask_len=5):
    """ Randomly replace a span in input s with "[MASK]". """
    if len(s) <= mask_len: return s
    if len(s) < 30: return s
    ind = np.random.randint(len(s) - mask_len)
    left, right = s.split(s[ind:ind + mask_len], 1)
    return " ".join([left, "[MASK]", right])
```
I am a bit confused because it's working on raw sentences rather than tokens.
So random_span_mask actually refers to a character range rather than a token range.
As a result it will often break words.
Is that intended?
And if so, what's the rationale behind that?
Is it some kind of BPE dropout?
It's also a bit strange because the tokenizer is not used at all in the function.
Some examples produced by running mirror_scripts/mirror_sentence_roberta_drophead.sh:
At Least 66 Killed in Bomb Blasts in Iraq
At Least 66 [MASK] ed in Bomb Blasts in Iraq
Three people on skis are standing behind a no skiing sign.
Three people [MASK] kis are standing behind a no skiing sign.
A young boy is playing a wind instrument.
A young [MASK] s playing a wind instrument.
A woman is holding a dancing baby up.
A woman is holding a dan [MASK] baby up.
Israel shoots down drone from Lebanon
Israel shoots [MASK] drone from Lebanon
Hey Thomas, yes, this is intentional. We also tried token-level dropout, and it turned out that simply masking a random span (consecutive characters of a certain length) without considering the tokeniser worked better.
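For contrast, a token-level variant would look roughly like the sketch below; this is just an illustration assuming a HuggingFace-style tokenizer (with tokenize, mask_token, and convert_tokens_to_string), not the exact dropout code we experimented with.

```python
import numpy as np

def token_level_mask(s, tokenizer, mask_len=2):
    """Sketch: mask a span of whole tokens instead of a character span."""
    tokens = tokenizer.tokenize(s)
    if len(tokens) <= mask_len:
        return s
    ind = np.random.randint(len(tokens) - mask_len)
    # replace `mask_len` consecutive tokens with a single mask token
    masked = tokens[:ind] + [tokenizer.mask_token] + tokens[ind + mask_len:]
    return tokenizer.convert_tokens_to_string(masked)
```

Unlike erase_and_mask, this never splits a word, but in our runs the character-level version worked better.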
There has been follow-up work along this line that hints at why this might be better. In this paper, Wu et al. argue that the model can exploit a length bias to decide whether two sentences are similar. Specifically, contrastive learning offers a shortcut: the model can simply check whether two sentences have the same number of tokens to decide whether they are a positive pair. They bypass this problem by repeating words and observe an improvement. Random span masking also (in most cases) forces the two sequences that form a positive pair to have different numbers of tokens. So my conjecture is that these are two sides of the same coin.
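If it helps, here is a rough sketch of the word-repetition idea from that paper as I understand it (my own illustration, not their released code; the duplication rate is arbitrary):

```python
import numpy as np

def word_repetition(s, dup_rate=0.2):
    """Sketch: duplicate a few random words so the augmented view differs in length."""
    words = s.split()
    if not words:
        return s
    n_dup = max(1, int(len(words) * dup_rate))
    dup_ids = set(np.random.choice(len(words), n_dup, replace=False))
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i in dup_ids:
            out.append(w)  # repeat this word once
    return " ".join(out)
```

Like random span masking, this makes the two views of a sentence differ in token count, which removes the equal-length shortcut.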