
Add option to use fast HF tokenizer. #482

Merged (32 commits, Sep 2, 2020)

Conversation

@PhilipMay (Contributor) commented Aug 1, 2020

This PR adds the option to use the fast HF tokenizer.

The reason why I need this is the following:
I plan to open-source a German Electra model that is lower-cased but does not strip accents. To do that, you have to pass strip_accents=False to the tokenizer, but this option is only available for the fast tokenizer. See also huggingface/transformers#6186 and google-research/electra#88.
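
For illustration, this is roughly how I want to load the tokenizer (a minimal sketch; the model id is a placeholder, and it assumes a transformers version where the fast Bert/Electra tokenizer accepts strip_accents):

```python
from transformers import AutoTokenizer

# Placeholder model id. strip_accents is only honored by the fast (Rust)
# tokenizer, which is why use_fast=True is needed here.
tokenizer = AutoTokenizer.from_pretrained(
    "my-org/german-electra-uncased",  # hypothetical model id
    use_fast=True,
    do_lower_case=True,
    strip_accents=False,  # lower-case, but keep umlauts/accents intact
)
print(tokenizer.tokenize("Äpfel"))  # accents preserved despite lower-casing
```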

Might also solve #157


@PhilipMay (Contributor, Author) commented Aug 1, 2020

I just saw that there is a very similar PR here: #205
My suggestion / offer: I can continue with this PR and try to merge in the changes from #205.

@tholor @Timoeller do you want fast to be set by default? HF does not enable it by default in AutoTokenizer.
My suggestion: do not use the fast tokenizer as the default, because that might be a breaking change.
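
To make the suggestion concrete, a sketch of the opt-in behavior (assuming the use_fast arg ends up on FARM's Tokenizer.load as in this PR):

```python
from farm.modeling.tokenization import Tokenizer

# The slow Python tokenizer stays the default, so existing code is unaffected:
tokenizer = Tokenizer.load("bert-base-german-cased")

# Users opt in to the fast Rust tokenizer explicitly:
fast_tokenizer = Tokenizer.load("bert-base-german-cased", use_fast=True)
```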

@tholor (Member) commented Aug 1, 2020

Hey @PhilipMay, it would be great if you could continue with the integration of the fast Rust tokenizers. We were actually blocked in #205 for quite some time due to missing multiprocessing support, but we should be good to go now.
We see two stages of integration:

  1. an arg to optionally enable the fast tokenizer
  2. hopefully switching to it completely soon and getting rid of FARM's tokenize_with_metadata(). Its main purpose was to create the offset mapping, which is crucial for aligning token-level predictions back to the original string space (e.g., for QA). This is now also available in the Rust tokenizers (see the sketch below). We should get some additional speed improvements from this full switch, as we avoid tokenizer calls at the word level.
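
For reference, a minimal sketch of the offset mapping that the fast tokenizers expose (per the transformers API at the time):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
encoded = tokenizer("Munich is nice", return_offsets_mapping=True)

# (char_start, char_end) per token in the original string (special tokens
# get (0, 0)) -- what tokenize_with_metadata() reconstructs in pure Python:
print(encoded.tokens())
print(encoded["offset_mapping"])
```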

For 2) we would need to understand in more detail whether there are any breaking changes (e.g., are all model types supported by now?) or tokenization differences (e.g., Roberta's prefix whitespace).

We really appreciate any support from you here, but we can also take over at any point (especially for 2).

BTW: Great that you'll open source a German Electra model! We have trained some variants there too and found them to be very effective. Will publish them soon together with a paper.

@PhilipMay (Contributor, Author) commented:

Hey @tholor - thanks for the answer. Is it ok for you to separate points 1 and 2 into different PRs? I think they can be handled in steps.

@tholor (Member) commented Aug 1, 2020

Yes, absolutely!
I believe 1) will help us to understand if 2) is feasible :)

@PhilipMay (Contributor, Author) commented:

> Yes, absolutely!
> I believe 1) will help us to understand if 2) is feasible :)

Great - so I will continue with this... :-)

@PhilipMay (Contributor, Author) commented:

@tholor Is it ok with you to set "slow" as the default, so that we have no breaking change?

Commit: set num_processes=0 for Inferencer
@PhilipMay (Contributor, Author) commented:

If the "slow" tokenizer as the default is ok with you, this PR can be merged from my point of view.

@tholor (Member) left a comment

Looking great! Thanks for working on this! Two small suggestions from my side...

Review threads: test/test_tokenization.py (outdated, resolved); farm/infer.py (resolved)
@PhilipMay (Contributor, Author) commented:

@tholor When doing "normal" text classification with a fast tokenizer, as in https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification.py, I have the following problem:

08/03/2020 20:42:02 - ERROR - farm.data_handler.processor -   Basket id: id_internal: 4879, id_external: None
08/03/2020 20:42:02 - ERROR - farm.data_handler.processor -   Error message: TextInputSequence must be str
08/03/2020 20:42:02 - ERROR - farm.data_handler.processor -   Could not convert this sample to features: 

This is coming from here: https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/processor.py#L299
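
A minimal sketch of what I think triggers it (assuming the processor hands the fast tokenizer a pre-tokenized word list instead of a string):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
words = ["some", "example", "words"]

# Passing a list where the Rust backend expects a plain string fails with
# "TextInputSequence must be str":
# tokenizer.encode_plus(words)

# The fast tokenizer has to be told that the input is pre-tokenized
# (the kwarg was called is_pretokenized in transformers 3.x):
encoded = tokenizer.encode_plus(words, is_pretokenized=True)
```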

Do you have an idea?

@tholor (Member) commented Aug 4, 2020

Hmm... what does the sample that causes trouble here look like? Is it printed after the above error message? If not, maybe set a breakpoint there.
Possibly the text is None?

Is that happening for all samples or only some in your dataset?

@tholor (Member) commented Aug 5, 2020

Sure. I am currently on vacation with only partial internet access, but I will try to have a look at it tomorrow.

@PhilipMay (Contributor, Author) commented:

> Sure. I am currently on vacation with only partial internet access, but I will try to have a look at it tomorrow.

That would be great! Thanks. ;-)

@tholor (Member) commented Aug 6, 2020

I investigated the problem with is_pretokenized. It seems to be an upstream issue with subword tokens for the slow tokenizers: huggingface/transformers#6046
I added a quick fix. However, CI is still not passing. Could you please check whether that's something you can tackle?
If not, I am happy to dig in deeper again, but it might take some more time, as I am out in nature with quite limited internet bandwidth, and many tests are a hassle due to the relatively large model downloads.
The next failing test (test_embeddings_extraction()) seems to be about slightly different vector values. If the diff is negligible, we could relax the assert there a bit.
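
To illustrate the upstream problem (a sketch; the exact subword split depends on the vocab):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
subwords = tokenizer.tokenize("unaffable")  # e.g. ['una', '##ffa', '##ble']

# Feeding subword tokens back in with is_pretokenized=True makes the slow
# tokenizer split each item again, mangling the '##' continuation markers:
encoded = tokenizer.encode_plus(subwords, is_pretokenized=True)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```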

@PhilipMay (Contributor, Author) commented:

There seem to be more bugs with the fast and slow tokenizers. To be honest, I do not feel like debugging all this stuff at the moment...
I will write a comment on the upstream issue, though.

@tholor (Member) commented Aug 8, 2020

Ok, sure. We will take over.
As mentioned above, this can take a while though.

@PhilipMay (Contributor, Author) commented:

Well, thanks! :-)

@bogdankostic (Contributor) commented:

Fast tokenizers seem to have the same problem with is_pretokenized, and there doesn't seem to be a quick fix as there was for the slow tokenizers. I have opened an issue in transformers.

@PhilipMay (Contributor, Author) commented:

Did you see this comment: huggingface/transformers#6046 (comment) ?

@bogdankostic (Contributor) commented:

Yes, we saw that. We are currently trying to come up with a solution for this.

@bogdankostic (Contributor) commented:

A quick update on what I have done so far, @PhilipMay.
I solved the problem of not being able to pass already-tokenized sequences to the fast tokenizer by passing the raw text instead. However, this comes with the downside of having to tokenize twice: once when initializing the samples to get the offsets, and once when extracting the input ids to featurize the samples. This might make using a fast tokenizer pointless.
I chose this approach to avoid introducing any major changes. Roughly, the flow now looks like the sketch below.
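
This is a simplified sketch, not the actual FARM code:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "some raw sample text"

# Pass 1 (sample creation): tokenize the raw text to get tokens + offsets.
encoded = tokenizer(text, return_offsets_mapping=True)
tokens, offsets = encoded.tokens(), encoded["offset_mapping"]

# Pass 2 (featurization): tokenize the raw text again for the input ids,
# instead of feeding the tokens from pass 1 back into the tokenizer.
features = tokenizer(text, max_length=128, padding="max_length", truncation=True)
```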

One question that came up concerns line 107 in tokenization.py: why is an error raised if the user wants to use a FastTokenizer with a Roberta model? According to the Huggingface documentation, fast tokenizers are available for Roberta models. Do they behave differently than the other tokenizers?

@Timoeller (Contributor) commented:

> Why is an error raised if the user wants to use a FastTokenizer with a Roberta model?

My 1.5 cents:
I am more wondering how an XLM-R model can be used with the Rust tokenizers, since its tokenizer is a sentencepiece model.
Maybe this use_fast check for RoBERTa is an artifact?

@bogdankostic (Contributor) commented:

This PR, as it is now, makes it possible to use fast tokenizers with Bert, Distilbert and Electra models. Although fast tokenizers exist for Roberta models, fast Roberta tokenizers are not supported yet. This is because, to get the input ids, for certain tasks we need to detokenize the input, and Roberta treats spaces as part of the tokens, which makes detokenizing more complex (see the illustration below).
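
For illustration (the token output is illustrative; 'Ġ' marks the leading space that is folded into the token):

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# Roberta's byte-level BPE encodes the leading space into the token itself,
# so tokens cannot be detokenized by simply joining and splitting on spaces:
print(tokenizer.tokenize("Munich is nice"))  # e.g. ['Mun', 'ich', 'Ġis', 'Ġnice']
```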

In a future PR we should implement fast tokenizers such that we get the samples' features while tokenizing and extracting the offsets. That way, we would save one tokenization step and could use fast tokenizers with Roberta models.

@tholor (Member) left a comment

Looking good

@PhilipMay (Contributor, Author) commented:

> One question that came up concerns line 107 in tokenization.py: why is an error raised if the user wants to use a FastTokenizer with a Roberta model?

> [...] Although fast tokenizers exist for Roberta models, fast Roberta tokenizers are not supported yet. This is because, to get the input ids, for certain tasks we need to detokenize the input, and Roberta treats spaces as part of the tokens, which makes detokenizing more complex.

I think that was the reason why I did that in my early PR commits.
