
Add option to use fast HF tokenizer. #482

Merged (32 commits, Sep 2, 2020)

Conversation

@PhilipMay (Contributor) commented Aug 1, 2020

This PR adds the option to use the fast HF tokenizer.

The reason why I need this is the following:
I plan to open-source a German Electra model that is lower-cased but does not strip accents. To do that, you have to pass strip_accents=False to the tokenizer, but this option is only available for the fast tokenizer. See also huggingface/transformers#6186 and google-research/electra#88.
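
For illustration, this is roughly how I want to load the tokenizer (a minimal sketch; the model id is a placeholder, and it assumes a transformers version where the fast Bert/Electra tokenizer accepts strip_accents):

```python
from transformers import AutoTokenizer

# Placeholder model id. strip_accents is only honored by the fast (Rust)
# tokenizer, which is why use_fast=True is needed here.
tokenizer = AutoTokenizer.from_pretrained(
    "my-org/german-electra-uncased",  # hypothetical model id
    use_fast=True,
    do_lower_case=True,
    strip_accents=False,  # lower-case, but keep umlauts/accents intact
)
print(tokenizer.tokenize("Äpfel"))  # accents preserved despite lower-casing
```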

Might also solve #157


@PhilipMay (Contributor, Author) commented Aug 1, 2020

I just saw that there is a very similar PR here: #205
My suggestion / offer: I can continue with this PR and try to merge in the changes from #205.

@tholor @Timoeller do you want fast to be set by default? HF does not enable it by default in AutoTokenizer.
My suggestion: do not use the fast tokenizer as the default, because that might be a breaking change.
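
To make the suggestion concrete, a sketch of the opt-in behavior (assuming the use_fast arg ends up on FARM's Tokenizer.load as in this PR):

```python
from farm.modeling.tokenization import Tokenizer

# The slow Python tokenizer stays the default, so existing code is unaffected:
tokenizer = Tokenizer.load("bert-base-german-cased")

# Users opt in to the fast Rust tokenizer explicitly:
fast_tokenizer = Tokenizer.load("bert-base-german-cased", use_fast=True)
```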

@tholor (Member) commented Aug 1, 2020

Hey @PhilipMay, it would be great if you could continue with the integration of the fast Rust tokenizers. We were actually blocked in #205 for quite some time due to missing multiprocessing support, but we should be good to go now.
We see two stages of integration:

  1. an arg to optionally enable the fast tokenizer
  2. hopefully switching to it completely soon and getting rid of FARM's tokenize_with_metadata(). Its main purpose was to create the offset mapping, which is crucial for aligning token-level predictions back to the original string space (e.g., for QA). This is now also available in the Rust tokenizers (see the sketch below). We should get some additional speed improvements from this full switch, as we avoid tokenizer calls at the word level.
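
For reference, a minimal sketch of the offset mapping that the fast tokenizers expose (per the transformers API at the time):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
encoded = tokenizer("Munich is nice", return_offsets_mapping=True)

# (char_start, char_end) per token in the original string (special tokens
# get (0, 0)) -- what tokenize_with_metadata() reconstructs in pure Python:
print(encoded.tokens())
print(encoded["offset_mapping"])
```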

For 2) we would need to understand in more detail whether there are any breaking changes (e.g., are all model types supported by now?) or tokenization differences (e.g., Roberta's prefix whitespace).

We really appreciate any support from you here, but we can also take over at any point (especially for 2).

BTW: Great that you'll open source a German Electra model! We have trained some variants there too and found them to be very effective. Will publish them soon together with a paper.

@PhilipMay (Contributor, Author) commented:

Hey @tholor - thanks for the answer. Is it ok for you to separate points 1 and 2 into different PRs? I think they can be handled in steps.

@tholor (Member) commented Aug 1, 2020

Yes, absolutely!
I believe 1) will help us to understand if 2) is feasible :)

@PhilipMay (Contributor, Author) commented:

> Yes, absolutely!
> I believe 1) will help us to understand if 2) is feasible :)

Great - so I will continue with this... :-)

@PhilipMay (Contributor, Author) commented:

@tholor Is it ok with you to set "slow" as the default, so that we have no breaking change?

Commit: set num_processes=0 for Inferencer
@PhilipMay (Contributor, Author) commented:

If the "slow" tokenizer as the default is ok with you, this PR can be merged from my point of view.

@tholor (Member) left a comment

Looking great! Thanks for working on this! Two small suggestions from my side...

Review threads: test/test_tokenization.py (outdated, resolved); farm/infer.py (resolved)
@PhilipMay (Contributor, Author) commented:

@tholor When doing "normal" text classification with a fast tokenizer, as in https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification.py, I have the following problem:

08/03/2020 20:42:02 - ERROR - farm.data_handler.processor -   Basket id: id_internal: 4879, id_external: None
08/03/2020 20:42:02 - ERROR - farm.data_handler.processor -   Error message: TextInputSequence must be str
08/03/2020 20:42:02 - ERROR - farm.data_handler.processor -   Could not convert this sample to features: 

This is coming from here: https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/processor.py#L299
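
A minimal sketch of what I think triggers it (assuming the processor hands the fast tokenizer a pre-tokenized word list instead of a string):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
words = ["some", "example", "words"]

# Passing a list where the Rust backend expects a plain string fails with
# "TextInputSequence must be str":
# tokenizer.encode_plus(words)

# The fast tokenizer has to be told that the input is pre-tokenized
# (the kwarg was called is_pretokenized in transformers 3.x):
encoded = tokenizer.encode_plus(words, is_pretokenized=True)
```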

Do you have an idea?

@tholor (Member) commented Aug 4, 2020

Hmm... what does the sample that causes trouble here look like? Is it printed after the above error message? If not, maybe set a breakpoint there.
Possibly the text is None?

Is that happening for all samples or only some in your dataset?

@tholor (Member) commented Aug 5, 2020

Sure. I am currently on vacation with only partial internet access, but I will try to have a look at it tomorrow.

@PhilipMay (Contributor, Author) commented:

> Sure. I am currently on vacation with only partial internet access, but I will try to have a look at it tomorrow.

That would be great! Thanks. ;-)

@tholor (Member) commented Aug 6, 2020

I investigated the problem with is_pretokenized. It seems to be an upstream issue with subword tokens for the slow tokenizers: huggingface/transformers#6046
I added a quick fix. However, CI is still not passing. Could you please check whether that's something you can tackle?
If not, I am happy to dig in deeper again, but it might take some more time, as I am out in nature with quite limited internet bandwidth, and many tests are a hassle due to the relatively large model downloads.
The next failing test (test_embeddings_extraction()) seems to be about slightly different vector values. If the diff is negligible, we could relax the assert there a bit.
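
To illustrate the upstream problem (a sketch; the exact subword split depends on the vocab):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
subwords = tokenizer.tokenize("unaffable")  # e.g. ['una', '##ffa', '##ble']

# Feeding subword tokens back in with is_pretokenized=True makes the slow
# tokenizer split each item again, mangling the '##' continuation markers:
encoded = tokenizer.encode_plus(subwords, is_pretokenized=True)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```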

@PhilipMay (Contributor, Author) commented:

There seem to be more bugs with the fast and slow tokenizers. To be honest, I do not feel like debugging all this stuff at the moment...
I will write a comment on the upstream issue, though.

@tholor (Member) commented Aug 8, 2020

Ok, sure. We will take over.
As mentioned above, this can take a while though.

@PhilipMay (Contributor, Author) commented:

Well, thanks! :-)

@bogdankostic (Contributor) commented:

Fast tokenizers seem to have the same problem with is_pretokenized, and there doesn't seem to be a quick fix as there was for the slow tokenizers. I have opened an issue in transformers.

@PhilipMay (Contributor, Author) commented:

Did you see this comment: huggingface/transformers#6046 (comment) ?

@bogdankostic (Contributor) commented:

Yes, we saw that. We are currently trying to come up with a solution for this.

@bogdankostic (Contributor) commented:

A quick update on what I have done so far, @PhilipMay.
I solved the problem of not being able to pass already-tokenized sequences to the fast tokenizer by passing the raw text instead. However, this comes with the downside of having to tokenize twice: once when initializing the samples to get the offsets, and once when extracting the input ids to featurize the samples. This might make using a fast tokenizer pointless.
I chose this approach to avoid introducing any major changes. Roughly, the flow now looks like the sketch below.
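
This is a simplified sketch, not the actual FARM code:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "some raw sample text"

# Pass 1 (sample creation): tokenize the raw text to get tokens + offsets.
encoded = tokenizer(text, return_offsets_mapping=True)
tokens, offsets = encoded.tokens(), encoded["offset_mapping"]

# Pass 2 (featurization): tokenize the raw text again for the input ids,
# instead of feeding the tokens from pass 1 back into the tokenizer.
features = tokenizer(text, max_length=128, padding="max_length", truncation=True)
```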

One question that came up concerns line 107 in tokenization.py: why is an error raised if the user wants to use a FastTokenizer with a Roberta model? According to the Huggingface documentation, fast tokenizers are available for Roberta models. Do they behave differently than the other tokenizers?

@Timoeller (Contributor) commented:

> Why is an error raised if the user wants to use a FastTokenizer with a Roberta model?

My 1.5 cents:
I am more wondering how an XLM-R model can be used with the Rust tokenizers, since its tokenizer is a sentencepiece model.
Maybe this use_fast check for RoBERTa is an artifact?

@bogdankostic (Contributor) commented:

This PR, as it is now, makes it possible to use fast tokenizers with Bert, Distilbert and Electra models. Although fast tokenizers exist for Roberta models, fast Roberta tokenizers are not supported yet. This is because, to get the input ids, for certain tasks we need to detokenize the input, and Roberta treats spaces as part of the tokens, which makes detokenizing more complex (see the illustration below).
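
For illustration (the token output is illustrative; 'Ġ' marks the leading space that is folded into the token):

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# Roberta's byte-level BPE encodes the leading space into the token itself,
# so tokens cannot be detokenized by simply joining and splitting on spaces:
print(tokenizer.tokenize("Munich is nice"))  # e.g. ['Mun', 'ich', 'Ġis', 'Ġnice']
```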

In a future PR we should implement fast tokenizers such that we get the samples' features while tokenizing and extracting the offsets. That way, we would save one tokenization step and could use fast tokenizers with Roberta models.

@tholor (Member) left a comment

Looking good

@PhilipMay (Contributor, Author) commented:

> One question that came up concerns line 107 in tokenization.py: why is an error raised if the user wants to use a FastTokenizer with a Roberta model?

> [...] Although fast tokenizers exist for Roberta models, fast Roberta tokenizers are not supported yet. This is because, to get the input ids, for certain tasks we need to detokenize the input, and Roberta treats spaces as part of the tokens, which makes detokenizing more complex.

I think that was the reason why I did that in my early PR commits.
