Slow Tokenizer if custom vocab was added #157

Closed
tholor opened this issue Nov 20, 2019 · 9 comments
Labels: bug (Something isn't working), stale

Comments


tholor commented Nov 20, 2019

Describe the bug
Tokenizer becomes very slow with large custom vocab.

Additional context
This was introduced after switching to the tokenizer implementations from the transformers repo.

There are related issues reported in the transformers repo.

To Reproduce

  • Add custom vocab to the tokenizer via tokenizer.add_tokens() (see the sketch below)
  • Load some data into the data silo, e.g. run examples/lm_finetuning.py

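A minimal sketch of the slowdown, assuming a plain BertTokenizer from transformers; the text, token strings, and counts are illustrative only:

import time
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Some example sentence about transfer learning with FARM. " * 50

# Baseline: tokenize repeatedly without any custom vocab
start = time.time()
for _ in range(100):
    tokenizer.tokenize(text)
print("without custom vocab: %.2f s" % (time.time() - start))

# Add a few hundred custom tokens, as done for domain-specific vocab
tokenizer.add_tokens(["customtoken%d" % i for i in range(400)])

# Same workload again; this is where the slowdown shows up
start = time.time()
for _ in range(100):
    tokenizer.tokenize(text)
print("with 400 custom tokens: %.2f s" % (time.time() - start))
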
System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Both
  • FARM version: master @ 484d26c
tholor added the bug label on Nov 20, 2019

tholor commented Jan 6, 2020

There are new fast tokenizers for BERT implemented in Rust: huggingface/transformers#2211

We should check whether they are compatible and whether they solve the custom vocab issue here.
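
For reference, a minimal sketch of loading the Rust-backed BERT tokenizer from that repo; the vocab path is a placeholder:

from tokenizers import BertWordPieceTokenizer

# Vocab path is a placeholder; any BERT WordPiece vocab file works
fast_tokenizer = BertWordPieceTokenizer("path/to/bert-base-uncased-vocab.txt")

encoding = fast_tokenizer.encode("FARM makes transfer learning with BERT simple.")
print(encoding.tokens)
print(encoding.ids)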

julien-c (Contributor) commented:

Yes, please check it out and let us know here.

Repo is at https://github.com/huggingface/tokenizers


tholor commented Jan 22, 2020

Hey @julien-c,

I finally found some time to test. Seems really promising! Great work!

✔️ Same tokenization behaviour as BertTokenizer (see test)
✔️ Speed: ~7.8x faster! (Tested by tokenizing the SQuAD train set with 42 million characters)
✅ Speed stays the same with a custom vocab < 300 tokens. Somehow it's about 4x slower with a custom vocab of 400 (using add_tokens())

The only blocker for us right now:
❌ The Tokenizer objects can't be pickled and are therefore not usable with Python's multiprocessing. As we make heavy use of multiprocessing during preprocessing, we can't really use them right now. It seems that others have a similar issue. Not sure how much work is needed to fix this, but for the XLM-R Python tokenizer it was a very easy fix.
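
To illustrate the blocker, a minimal sketch of the pickling problem as it behaved at the time; the vocab path is a placeholder:

import pickle
from tokenizers import BertWordPieceTokenizer

fast_tokenizer = BertWordPieceTokenizer("path/to/bert-base-uncased-vocab.txt")

try:
    # multiprocessing serializes worker arguments the same way
    pickle.dumps(fast_tokenizer)
    print("tokenizer pickled successfully")
except Exception as e:
    print("cannot pickle the Rust-backed tokenizer:", e)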

julien-c (Contributor) commented:

Hi @tholor, mind opening an issue on tokenizers too, cross-referencing this one? cc @n1t0

salmanmashayekh commented:

In the add_tokens method, why don't we simply integrate new_tokens into self.vocab? We are using the following CustomVocabBertTokenizer and it does not slow down when new tokens are added:

from transformers import BertTokenizer, WordpieceTokenizer
from collections import OrderedDict


class CustomVocabBertTokenizer(BertTokenizer):
    def add_tokens(self, new_tokens):
        # Keep only tokens that are not already in the vocab or among the special tokens
        new_tokens = [token for token in new_tokens if not (token in self.vocab or token in self.all_special_tokens)]

        self.vocab = OrderedDict([
            *self.vocab.items(),
            *[
                (token, i + len(self.vocab))
                for i, token in enumerate(new_tokens)
            ]
        ])

        # Rebuild the reverse mapping and the wordpiece tokenizer so they use the extended vocab
        self.ids_to_tokens = OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)

        return len(new_tokens)
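
For context, a hypothetical usage sketch of the workaround above; the model name and token strings are illustrative:

# Hypothetical usage; model name and tokens are illustrative
tokenizer = CustomVocabBertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["domaintoken%d" % i for i in range(400)])

print(len(tokenizer.vocab))  # original vocab size + 400
print(tokenizer.tokenize("a sentence containing domaintoken7"))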


tholor commented Mar 9, 2020

Hi @salmanmashayekh, sorry for the late reply. I hadn't seen your comment until now.
This seems like a simple, scalable workaround. However, I am not sure whether it has any unintended side effects (e.g. on saving/loading) in Transformers. Have you investigated the behavior in Transformers?
It could make sense to raise a PR there, since this is useful to everybody, not only FARM users.


stale bot commented Jun 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.

The stale bot added the stale label on Jun 6, 2020
The stale bot closed this as completed on Jun 20, 2020

davidnarganes commented Feb 2, 2021

Hi,
I've got the same issue. The tokenizer is super slow when adding new tokens even with the Fast class:

from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer

# Maybe this url for the files:
# https://huggingface.co/transformers/v3.1.0/_modules/transformers/tokenization_gpt2.html
paths = dict()
paths["tokenizer"] = "whatever/is/the/path/to/pretrained/vocab.json/merges.txt"

# They have to be sorted in reverse by length, otherwise the tokens aren't matched correctly
newtokens = range(0, 20000)
newtokens = list(newtokens)
newtokens.sort(reverse=True)
newtokens = ["new_" + str(x) for x in newtokens]

# loading tokenizer from the saved model path
tokenizers = dict()
tokenizers["fast"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["fast_custom"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["slow_custom"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])
tokenizers["slow"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])

# Add the special tokens to every tokenizer variant
for k in tokenizers:
    tokenizers[k].add_special_tokens({
        "eos_token": "</s>",
        "bos_token": "<s>",
        "unk_token": "<unk>",
        "pad_token": "<pad>",
        "mask_token": "<mask>"
    })

# Add new vocab
# https://huggingface.co/transformers/v2.11.0/main_classes/tokenizer.html
# https://github.com/deepset-ai/FARM/issues/157
for k in tokenizers:
    if "custom" in k:
        print(k)
        print("Vocab length before:", len(tokenizers[k].get_vocab()))
        tokenizers[k].add_tokens(newtokens)
        print("Vocab length after:", len(tokenizers[k].get_vocab()))

# creating the configuration from which the model can be made
# (use one of the custom tokenizers so the embedding size covers the added tokens)
config = GPT2Config(
  vocab_size=len(tokenizers["fast_custom"]),
  bos_token_id=tokenizers["fast_custom"].bos_token_id,
  eos_token_id=tokenizers["fast_custom"].eos_token_id
)

# creating the model
# https://huggingface.co/transformers/_modules/transformers/configuration_gpt2.html
model = TFGPT2LMHeadModel(config)

# Differences when tokenising the text...
text = "this is a sentence containing new_200"
for k,v in tokenizers.items():
    print(k, v.tokenize(text))

and then profiling the speed in jupyter:

for k in tokenizers:
    print(k)
    %timeit tokenizers[k].tokenize(text)

Timoeller (Contributor) commented:

Hey, I guess this is currently a huggingface transformers issue, since FARM does not support GPT2 as of now.

I also do not think it is related to FARM, since our tokenizer implementation just calls the underlying HF transformers implementations.
