Skip to content
This repository was archived by the owner on Apr 8, 2025. It is now read-only.

Conversation

@felixvor
Copy link
Contributor

Tokens were previously hardcoded so only the default config of bert-base was compatible. If the vocab.txt of a model had PAD, SEP, CLS tokens in a different position, the processor would ignore the wrong tokens for whole-word-masking. If MASK was at the wrong position, the processor would crash.

Tokens IDs are now grabbed from the vocabulary directly and IDs are not hardcoded anymore. This should make any model compatible that has SEP, CLS, PAD and MASK anywhere in their vocab.txt.

Related Issue: #800

@julian-risch julian-risch self-requested a review June 17, 2021 07:46
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Clean and solid. 👍 One thing: There is still one line of code that mentions index 103. That should be changed to mask_token_id, shouldn't it?

tokens[index] = 103

Have you already tested the changes? If so, how? Maybe with this test case but the language model changed to bert-base-german-cased?
def test_lm_finetuning(caplog):

@felixvor
Copy link
Contributor Author

oh yes thanks i overlooked that!
for now i only started the training process for testing and saw that the processor was returning valid data and not crash. wasnt able to finish a training and check results yet. any other ideas for testing this that could make sense?

@julian-risch
Copy link
Member

That's good! I did some local testing withFARM/test/test_lm_finetuning.py and changed the language model to bert-base-german-cased there. Works. I found out that the token "I" has index 103 in bert-base-german-cased vocabulary. So I also tested with this token by adding it to FARM/test/samples/lm_finetuning/train-sample.txt. It works with your changes but it does not without them. Great! 👍 Ready to merge. Let us know how the finetuned model performs!

@julian-risch julian-risch merged commit 671b475 into deepset-ai:master Jun 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants