Dynamic token IDs for _mask_random_words function in BertStyleLMProcessor #804

felixvor · 2021-06-17T07:41:27Z

Token IDs were previously hardcoded, so only the default bert-base config was compatible. If a model's vocab.txt had the PAD, SEP, or CLS tokens at different positions, the processor would exclude the wrong tokens from whole-word masking. If MASK was at a different position, the processor would crash.

Token IDs are now read directly from the vocabulary and are no longer hardcoded. This should make any model compatible that has SEP, CLS, PAD, and MASK anywhere in its vocab.txt.
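A minimal sketch of the idea, not the actual FARM code: special-token IDs are looked up by token string in the vocabulary instead of being assumed at fixed positions. The helpers `build_vocab` and `special_token_ids` and the toy vocabulary are hypothetical and only there for illustration.

```python
from typing import Dict, List

SPECIAL_TOKENS = ("[PAD]", "[CLS]", "[SEP]", "[MASK]")


def build_vocab(lines: List[str]) -> Dict[str, int]:
    """Map each token to its line index, as in a BERT-style vocab.txt."""
    return {token.strip(): idx for idx, token in enumerate(lines)}


def special_token_ids(vocab: Dict[str, int]) -> Dict[str, int]:
    """Look up special-token IDs by name instead of hardcoding them."""
    missing = [t for t in SPECIAL_TOKENS if t not in vocab]
    if missing:
        raise ValueError(f"Vocabulary lacks special tokens: {missing}")
    return {t: vocab[t] for t in SPECIAL_TOKENS}


# Toy vocabulary (assumption) with the special tokens in non-default positions.
toy_vocab = build_vocab(["hello", "[MASK]", "world", "[PAD]", "[SEP]", "[CLS]"])
ids = special_token_ids(toy_vocab)
print(ids)
```

Raising an error for a missing special token is a design choice of this sketch; it surfaces an incompatible vocabulary early instead of failing later during masking.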

Related Issue: #800

julian-risch

Looks good to me! Clean and solid. 👍 One thing: There is still one line of code that mentions index 103. That should be changed to mask_token_id, shouldn't it?

FARM/farm/data_handler/processor.py

Line 1751 in 6f9fad1

tokens[index] = 103

Have you already tested the changes? If so, how? Maybe with this test case but the language model changed to bert-base-german-cased?

FARM/test/test_lm_finetuning.py

Line 18 in d200c8c

def test_lm_finetuning(caplog):

felixvor · 2021-06-17T10:05:19Z

Oh yes, thanks, I overlooked that!
For now I only started the training process for testing and saw that the processor returned valid data and did not crash. I wasn't able to finish a training run and check the results yet. Any other ideas for testing this that could make sense?

julian-risch · 2021-06-17T15:15:40Z

That's good! I did some local testing with FARM/test/test_lm_finetuning.py and changed the language model to bert-base-german-cased there. It works. I found out that the token "I" has index 103 in the bert-base-german-cased vocabulary, so I also tested with this token by adding it to FARM/test/samples/lm_finetuning/train-sample.txt. It works with your changes but not without them. Great! 👍 Ready to merge. Let us know how the finetuned model performs!
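The collision described above can be illustrated with a simplified, token-level sketch. It is not FARM's `_mask_random_words` and does not implement whole-word masking; the function `mask_tokens` and all IDs other than 103 are assumptions chosen for the example.

```python
import random
from typing import List, Set


def mask_tokens(token_ids: List[int], mask_id: int, skip_ids: Set[int],
                prob: float, rng: random.Random) -> List[int]:
    """Replace each token not in skip_ids with mask_id with probability prob."""
    return [
        mask_id if tid not in skip_ids and rng.random() < prob else tid
        for tid in token_ids
    ]


# Hypothetical vocabulary layout: an ordinary word sits at ID 103,
# while the special tokens live elsewhere.
ORDINARY_WORD_ID = 103
PAD_ID, CLS_ID, SEP_ID, MASK_ID = 0, 3, 4, 5
special_ids = {PAD_ID, CLS_ID, SEP_ID, MASK_ID}

sequence = [CLS_ID, ORDINARY_WORD_ID, 200, SEP_ID, PAD_ID]

# Dynamic mask ID: every non-special token becomes the real [MASK] ID.
masked = mask_tokens(sequence, MASK_ID, special_ids, 1.0, random.Random(0))

# Hardcoded 103: tokens are "masked" with the ID of an ordinary word.
masked_hardcoded = mask_tokens(sequence, 103, special_ids, 1.0, random.Random(0))

print(masked)
print(masked_hardcoded)
```

With a masking probability of 1.0 the result is deterministic, which makes the difference between the two variants easy to assert in a test.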

felixvor added 2 commits June 17, 2021 09:31

dynamic token IDs in _mask_random_words of BertStyleLMProcessor

b8a0ced

remove newlines

6f9fad1

julian-risch self-requested a review June 17, 2021 07:46

julian-risch suggested changes Jun 17, 2021

use dynamic mask token id for replacement

d211ba6

julian-risch approved these changes Jun 17, 2021

julian-risch merged commit 671b475 into deepset-ai:master Jun 17, 2021
