erase_and_mask masking sub-words #2

Closed · muelletm opened this issue Mar 22, 2022 · 3 comments

muelletm commented Mar 22, 2022

Is erase_and_mask working as intended?

def erase_and_mask(s, tokenizer, mask_len=5):
    """
    Randomly replace a span in input s with "[MASK]".
    """
    if len(s) <= mask_len: return s
    if len(s) < 30: return s
    ind = np.random.randint(len(s)-mask_len)
    left, right = s.split(s[ind:ind+mask_len], 1)
    return " ".join([left, "[MASK]", right]) 

I am a bit confused because it operates on raw sentences rather than tokens.
So random_span_mask actually refers to a character range rather than a token range.
As a result, it will often break words in the middle.
Is that intended?
And if so, what's the rationale behind that?
Is it some kind of BPE dropout?

It's also a bit strange that the tokenizer argument is not used at all in the function.

Here are some examples produced by running mirror_scripts/mirror_sentence_roberta_drophead.sh:

At Least 66 Killed in Bomb Blasts in Iraq
At Least 66 [MASK] ed in Bomb Blasts in Iraq

Three people on skis are standing behind a no skiing sign.
Three people [MASK] kis are standing behind a no skiing sign.

A young boy is playing a wind instrument.
A young  [MASK] s playing a wind instrument.

A woman is holding a dancing baby up.
A woman is holding a dan [MASK] baby up.

Israel shoots down drone from Lebanon
Israel shoots  [MASK] drone from Lebanon
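
For reference, the behaviour can be reproduced in isolation with the minimal sketch below (numpy only; the tokenizer argument is passed as None since the function never touches it, and the seed is arbitrary):

import numpy as np

def erase_and_mask(s, tokenizer, mask_len=5):
    """Randomly replace a span in input s with "[MASK]"."""
    if len(s) <= mask_len: return s
    if len(s) < 30: return s
    ind = np.random.randint(len(s)-mask_len)        # character offset, not a token index
    left, right = s.split(s[ind:ind+mask_len], 1)   # the cut can land anywhere, including mid-word
    return " ".join([left, "[MASK]", right])

np.random.seed(0)
for s in ["At Least 66 Killed in Bomb Blasts in Iraq",
          "A woman is holding a dancing baby up."]:
    print(erase_and_mask(s, tokenizer=None))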
@muelletm (Author)

Just had a look at the paper; the example there clearly shows that this is intentional.

So that just leaves the question of the rationale.

hardyqr (Collaborator) commented Mar 22, 2022

Hey Thomas, yes, this is intentional. We also tried token-level dropout, and it turned out that simply masking a random span (consecutive characters of a certain length) without considering the tokeniser worked better.

There has been follow-up work along this line that hints at why this might be better. In this paper, Wu et al. argue that the model could exploit a length bias to identify whether two sentences are similar. Specifically, in contrastive learning there is a shortcut: the model could simply check whether two sentences have the same number of tokens to decide whether they form a positive pair. They tried to bypass this problem by repeating words and observed an improvement. Random span masking has also (in most cases) forced the two sequences that form a positive pair to have different numbers of tokens. So, my conjecture is that these are two sides of the same coin.
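
To make the length point concrete, here is a rough sketch (not code from this repo; roberta-base and the whitespace-level variant are just assumptions for illustration) comparing the character-span masking above with a token-level alternative:

import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")   # assumption: matches the roberta script

def mask_char_span(s, mask_len=5):
    # character-level span masking, same logic as erase_and_mask above
    ind = np.random.randint(len(s) - mask_len)
    left, right = s.split(s[ind:ind + mask_len], 1)
    return " ".join([left, "[MASK]", right])

def mask_one_word(s):
    # hypothetical token-level variant: replace exactly one whitespace word
    words = s.split()
    words[np.random.randint(len(words))] = "[MASK]"
    return " ".join(words)

s = "Three people on skis are standing behind a no skiing sign."
for fn in (mask_one_word, mask_char_span):
    masked = fn(s)
    print(fn.__name__, len(tok.tokenize(s)), "->", len(tok.tokenize(masked)), masked)
# mask_one_word never breaks a word, while mask_char_span often cuts one apart
# (e.g. "skis" -> "kis"), so the two views of a positive pair usually end up
# with different subword lengths after tokenisation.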

@muelletm (Author)

Thanks!
