Slow Tokenizer if custom vocab was added #157

Closed
tholor opened this issue Nov 20, 2019 · 9 comments
Labels: bug (Something isn't working), stale

Comments


tholor commented Nov 20, 2019

Describe the bug
Tokenizer becomes very slow with large custom vocab.

Additional context
This was introduced after switching to the tokenizer implementations from the transformers repo.

There are related issues reported in the transformers repo.

To Reproduce

  • Add custom vocab to the tokenizer via tokenizer.add_tokens() (see the sketch below)
  • Load some data into the data silo, e.g. run examples/lm_finetuning.py

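A minimal sketch of the slowdown, assuming a plain BertTokenizer from transformers; the text, token strings, and counts are illustrative only:

import time
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Some example sentence about transfer learning with FARM. " * 50

# Baseline: tokenize repeatedly without any custom vocab
start = time.time()
for _ in range(100):
    tokenizer.tokenize(text)
print("without custom vocab: %.2f s" % (time.time() - start))

# Add a few hundred custom tokens, as done for domain-specific vocab
tokenizer.add_tokens(["customtoken%d" % i for i in range(400)])

# Same workload again; this is where the slowdown shows up
start = time.time()
for _ in range(100):
    tokenizer.tokenize(text)
print("with 400 custom tokens: %.2f s" % (time.time() - start))
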
System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Both
  • FARM version: master @ 484d26c
tholor added the bug label on Nov 20, 2019

tholor commented Jan 6, 2020

There are new fast tokenizers for BERT implemented in Rust: huggingface/transformers#2211

We should check whether they are compatible and whether they solve the custom vocab issue here.
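
For reference, a minimal sketch of loading the Rust-backed BERT tokenizer from that repo; the vocab path is a placeholder:

from tokenizers import BertWordPieceTokenizer

# Vocab path is a placeholder; any BERT WordPiece vocab file works
fast_tokenizer = BertWordPieceTokenizer("path/to/bert-base-uncased-vocab.txt")

encoding = fast_tokenizer.encode("FARM makes transfer learning with BERT simple.")
print(encoding.tokens)
print(encoding.ids)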

julien-c (Contributor) commented:

Yes, please check it out and let us know here.

Repo is at https://github.com/huggingface/tokenizers


tholor commented Jan 22, 2020

Hey @julien-c,

I finally found some time to test. Seems really promising! Great work!

✔️ Same tokenization behaviour as BertTokenizer (see test)
✔️ Speed: ~7.8x faster! (Tested by tokenizing the SQuAD train set with 42 million characters)
✅ Speed stays the same with a custom vocab < 300 tokens. Somehow it's about 4x slower with a custom vocab of 400 (using add_tokens())

The only blocker for us right now:
❌ The Tokenizer objects can't be pickled and are therefore not usable with Python's multiprocessing. As we make heavy use of multiprocessing during preprocessing, we can't really use them right now. It seems that others have a similar issue. Not sure how much work is needed to fix this, but for the XLM-R Python tokenizer it was a very easy fix.
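
To illustrate the blocker, a minimal sketch of the pickling problem as it behaved at the time; the vocab path is a placeholder:

import pickle
from tokenizers import BertWordPieceTokenizer

fast_tokenizer = BertWordPieceTokenizer("path/to/bert-base-uncased-vocab.txt")

try:
    # multiprocessing serializes worker arguments the same way
    pickle.dumps(fast_tokenizer)
    print("tokenizer pickled successfully")
except Exception as e:
    print("cannot pickle the Rust-backed tokenizer:", e)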

julien-c (Contributor) commented:

Hi @tholor, mind opening an issue on tokenizers too, cross-referencing this one? cc @n1t0

salmanmashayekh commented:

In the add_tokens method, why don't we simply integrate new_tokens into self.vocab? We are using the following CustomVocabBertTokenizer and it does not slow down when new tokens are added:

from transformers import BertTokenizer, WordpieceTokenizer
from collections import OrderedDict


class CustomVocabBertTokenizer(BertTokenizer):
    def add_tokens(self, new_tokens):
        # Keep only tokens that are not already in the vocab or among the special tokens
        new_tokens = [token for token in new_tokens if not (token in self.vocab or token in self.all_special_tokens)]

        self.vocab = OrderedDict([
            *self.vocab.items(),
            *[
                (token, i + len(self.vocab))
                for i, token in enumerate(new_tokens)
            ]
        ])

        # Rebuild the reverse mapping and the wordpiece tokenizer so they use the extended vocab
        self.ids_to_tokens = OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)

        return len(new_tokens)
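
For context, a hypothetical usage sketch of the workaround above; the model name and token strings are illustrative:

# Hypothetical usage; model name and tokens are illustrative
tokenizer = CustomVocabBertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["domaintoken%d" % i for i in range(400)])

print(len(tokenizer.vocab))  # original vocab size + 400
print(tokenizer.tokenize("a sentence containing domaintoken7"))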


tholor commented Mar 9, 2020

Hi @salmanmashayekh, sorry for the late reply. I hadn't seen your comment until now.
This seems like a simple, scalable workaround. However, I am not sure whether it has any unintended side effects (e.g. on saving/loading) in Transformers. Have you investigated the behavior in Transformers?
It could make sense to raise a PR there, since this is useful to everybody, not only FARM users.


stale bot commented Jun 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.

The stale bot added the stale label on Jun 6, 2020
The stale bot closed this as completed on Jun 20, 2020

davidnarganes commented Feb 2, 2021

Hi,
I've got the same issue. The tokenizer is super slow when adding new tokens even with the Fast class:

from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer

# Maybe this url for the files:
# https://huggingface.co/transformers/v3.1.0/_modules/transformers/tokenization_gpt2.html
paths = dict()
paths["tokenizer"] = "whatever/is/the/path/to/pretrained/vocab.json/merges.txt"

# They have to be sorted in reverse by length, otherwise the tokens aren't matched correctly
newtokens = range(0, 20000)
newtokens = list(newtokens)
newtokens.sort(reverse=True)
newtokens = ["new_" + str(x) for x in newtokens]

# loading tokenizer from the saved model path
tokenizers = dict()
tokenizers["fast"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["fast_custom"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["slow_custom"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])
tokenizers["slow"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])

# Add the special tokens to every tokenizer variant
for k in tokenizers:
    tokenizers[k].add_special_tokens({
        "eos_token": "</s>",
        "bos_token": "<s>",
        "unk_token": "<unk>",
        "pad_token": "<pad>",
        "mask_token": "<mask>"
    })

# Add new vocab
# https://huggingface.co/transformers/v2.11.0/main_classes/tokenizer.html
# https://github.com/deepset-ai/FARM/issues/157
for k in tokenizers:
    if "custom" in k:
        print(k)
        print("Vocab length before:", len(tokenizers[k].get_vocab()))
        tokenizers[k].add_tokens(newtokens)
        print("Vocab length after:", len(tokenizers[k].get_vocab()))

# creating the configuration from which the model can be made
# (use one of the custom tokenizers so the embedding size covers the added tokens)
config = GPT2Config(
  vocab_size=len(tokenizers["fast_custom"]),
  bos_token_id=tokenizers["fast_custom"].bos_token_id,
  eos_token_id=tokenizers["fast_custom"].eos_token_id
)

# creating the model
# https://huggingface.co/transformers/_modules/transformers/configuration_gpt2.html
model = TFGPT2LMHeadModel(config)

# Differences when tokenising the text...
text = "this is a sentence containing new_200"
for k,v in tokenizers.items():
    print(k, v.tokenize(text))

and then profiling the speed in jupyter:

for k in tokenizers:
    print(k)
    %timeit tokenizers[k].tokenize(text)

Timoeller (Contributor) commented:

Hey, I guess this is currently a huggingface transformers issue, since FARM does not support GPT2 as of now.

I also do not think it is related to FARM, since our tokenizer implementation just calls the underlying HF transformers implementations.
