Is erase_and_mask working as intended?
```python
import numpy as np

def erase_and_mask(s, tokenizer, mask_len=5):
    """ Randomly replace a span in input s with "[MASK]". """
    if len(s) <= mask_len: return s
    if len(s) < 30: return s
    ind = np.random.randint(len(s) - mask_len)
    left, right = s.split(s[ind:ind + mask_len], 1)
    return " ".join([left, "[MASK]", right])
```
I am a bit confused because it's working on raw sentences rather than tokens.
So random_span_mask actually refers to a character range rather than a token range.
As a result it will often break words.
Is that intended?
And if so, what's the rationale behind that?
Is it some kind of BPE dropout?
It's also a bit strange because the tokenizer is not used at all in the function.
Some examples produced by running mirror_scripts/mirror_sentence_roberta_drophead.sh:
At Least 66 Killed in Bomb Blasts in Iraq
At Least 66 [MASK] ed in Bomb Blasts in Iraq
Three people on skis are standing behind a no skiing sign.
Three people [MASK] kis are standing behind a no skiing sign.
A young boy is playing a wind instrument.
A young [MASK] s playing a wind instrument.
A woman is holding a dancing baby up.
A woman is holding a dan [MASK] baby up.
Israel shoots down drone from Lebanon
Israel shoots [MASK] drone from Lebanon
Hey Thomas, yes, this is intentional. We also tried token-level dropout, and it turned out that simply masking a random span (consecutive characters of a certain length) without considering the tokeniser worked better.
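For contrast, a token-level variant would look roughly like the sketch below; this is just an illustration assuming a HuggingFace-style tokenizer (with tokenize, mask_token, and convert_tokens_to_string), not the exact dropout code we experimented with.

```python
import numpy as np

def token_level_mask(s, tokenizer, mask_len=2):
    """Sketch: mask a span of whole tokens instead of a character span."""
    tokens = tokenizer.tokenize(s)
    if len(tokens) <= mask_len:
        return s
    ind = np.random.randint(len(tokens) - mask_len)
    # replace `mask_len` consecutive tokens with a single mask token
    masked = tokens[:ind] + [tokenizer.mask_token] + tokens[ind + mask_len:]
    return tokenizer.convert_tokens_to_string(masked)
```

Unlike erase_and_mask, this never splits a word, but in our runs the character-level version worked better.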
There has been follow-up work along this line that hints at why this might be better. In this paper, Wu et al. argue that the model can exploit a length bias to decide whether two sentences are similar. Specifically, contrastive learning offers a shortcut: the model can simply check whether two sentences have the same number of tokens to decide whether they are a positive pair. They bypass this problem by repeating words and observe an improvement. Random span masking also (in most cases) forces the two sequences that form a positive pair to have different numbers of tokens. So my conjecture is that these are two sides of the same coin.
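If it helps, here is a rough sketch of the word-repetition idea from that paper as I understand it (my own illustration, not their released code; the duplication rate is arbitrary):

```python
import numpy as np

def word_repetition(s, dup_rate=0.2):
    """Sketch: duplicate a few random words so the augmented view differs in length."""
    words = s.split()
    if not words:
        return s
    n_dup = max(1, int(len(words) * dup_rate))
    dup_ids = set(np.random.choice(len(words), n_dup, replace=False))
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i in dup_ids:
            out.append(w)  # repeat this word once
    return " ".join(out)
```

Like random span masking, this makes the two views of a sentence differ in token count, which removes the equal-length shortcut.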