
Integration into 🤗 Transformers.js (+ generalization/minor improvements) #9

Closed
xenova opened this issue Aug 5, 2023 · 5 comments

@xenova

xenova commented Aug 5, 2023

First of all, I just wanted to thank you for this implementation! It is significantly faster than my original version of the Llama tokenizer (which would often just break for extremely large input texts). I've decided to generalize your implementation so that it can be applied to other BPE-based tokenizers (e.g., GPT-4). Here's my current PR for it to be added to Transformers.js.

Here's an example playground for some other tokenizers: demo.

I also made a few fixes/improvements to your version:

  1. Caching system (only noticeable for non-Llama tokenizers; up to 2x speedup for GPT-4).
  2. Instead of your x * 100000 + leftNode.origPos "trick", I add a fractional "positional bias" to the score, which allows tiebreaks to be resolved correctly. With the multiplication trick, if the input text is > 100k characters (so origPos can exceed 100000) and the difference in bpe_ranks happens to be small, the position term can outweigh the rank difference and result in incorrect behaviour (see the sketch after this list).
  3. I modified the algorithm to operate with tokens (instead of token ids), mainly to allow interoperability with the current implementation.
  4. Fixed top not being defined in
    this._heap[top] = value;
  5. Documentation improvements
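
Here's a minimal sketch of the tie-break idea from point 2 (illustrative only, not the actual PR code): because the positional bias stays strictly below 1, it can only break ties between equal ranks and can never flip the ordering of two different ranks.

    // Hypothetical sketch of the fractional "positional bias" tie-break (not the PR code).
    // Instead of rank * 100000 + position (which can misbehave once position exceeds 100000),
    // add a fraction that is always < 1, so position only decides ties between equal ranks.
    function mergeScore(rank, position, wordLength) {
        // position < wordLength, so the bias is in [0, 1) and never reaches the next rank
        return rank + position / wordLength;
    }

    // Equal ranks are ordered by position; a lower rank always wins regardless of position.
    console.log(mergeScore(5, 2, 10) < mergeScore(5, 7, 10)); // true (tie broken by position)
    console.log(mergeScore(4, 9, 10) < mergeScore(5, 0, 10)); // true (rank still dominates)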

Feel free to use any of this in your version too!

@belladoreai
Owner

belladoreai commented Aug 7, 2023

Thanks for the kind words and good job with the adaptation! I'm happy to hear that my code is (sort of) going into Transformers.js.

One major difference in your adaptation is that I'm running the byte-pair encoding only once for the entire input string, whereas you are running it multiple times (once per whitespace-separated word, if I'm not mistaken; please correct me if I'm wrong). I'm referring to this part of your PR, where you call this.bpe multiple times:

    /**
     * Encodes the input sequence of tokens using the BPE algorithm and returns the resulting subword tokens.
     * @param {string[]} tokens The input sequence of tokens to encode.
     * @returns {string[]} The resulting subword tokens after applying the BPE algorithm to the input sequence of tokens.
     */
    encode(tokens) {
        let outputTokens = [];

        for (let token of tokens) {
            let bpe_token_list = this.bpe(token);
            ...

Have you checked if this change has any performance implications? The performance impact here is probably negligible, but I would check anyway. Your version might be even faster.

I was also wondering if this is guaranteed to produce the same results? I believe it's guaranteed to produce the same results for all tokenization schemes which never have whitespace in the middle of a token (e.g. "_grabbed" has whitespace at the beginning, whereas "grab_bed" would have whitespace in the middle).

Instead of your x * 100000 + leftNode.origPos "trick", I add a fractional "positional bias" to the score, which allows tiebreaks to be resolved correctly. With the multiplication trick, if the input text is > 100k characters (so origPos can exceed 100000) and the difference in bpe_ranks happens to be small, the position term can outweigh the rank difference and result in incorrect behaviour.

Thanks for spotting this issue. When I wrote this code I must have thought "nobody ever needs to tokenize more than 100 000 characters at a time", but it only took a few weeks for all those superhot models to come out, and here we are. Note that your fractional divider is word length, which works great, because words aren't very large (at least not any real words in any real language)... but if I used your fix directly in my version, my divider would be the entire input string length. For example, if the input string was 1 000 000 characters long, then my divider would be 1 000 000. I'm not sure at what point we get numerical stability issues when we multiply / divide large numbers. I adapted your fix for the v1.1.3 release in a slightly hacky way, trying to avoid numerical instability issues like this:

const mergePrio = llamaTokenizer.merges.get(mergeIdentifierString) * 100000 + leftNode.origPos / 100000

This should support inputs up to 10 000 000 000 characters without numerical stability issues (and inputs of that length would be too large to process anyway, so I think this is good enough for all real inputs).
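
As a quick sanity check (a standalone snippet, not part of the library), the positional term origPos / 100000 stays below the 100000 gap between adjacent scaled ranks for any origPos under that limit:

    // Standalone check (assumes consecutive integer merge ranks, as in a BPE merge list).
    const rank = 31999;                 // e.g. a high Llama merge rank
    const maxOrigPos = 9_999_999_999;   // just under the 10 000 000 000 limit
    const prio = rank * 100000 + maxOrigPos / 100000;
    const nextRankPrio = (rank + 1) * 100000;
    console.log(prio < nextRankPrio);   // true: ordering by merge rank is preserved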

top not defined

Thanks for pointing this out, fixed in v1.1.3 now.

@belladoreai
Owner

By the way, thanks for crediting me in the code comments at llama tests and PriorityQueue! Could I ask you to add a similar code comment credit to the bpe function as well? I would love to get internet points for that :D

@xenova
Author

xenova commented Aug 8, 2023

One major difference in your adaptation is that I'm running the byte-pair encoding only once for the entire input string, whereas you are running it multiple times (once per whitespace-separated word, if I'm not mistaken; please correct me if I'm wrong).

So, this depends on the tokenizer: more specifically, the (1) pretokenization and (2) splitting on special characters. For example:

  1. GPT-4 (BPE-based) uses a ByteLevel pretokenizer (see tokenizer.json):

      "pre_tokenizer": {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": true
      },

    If no regex is specified (like above), it will use the default:
    /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu (which splits on whitespace and punctuation).

  2. Llama (also BPE-based) does not use a pretokenizer (see tokenizer.json):

    "pre_tokenizer": null,

And it is these "pretokenized" texts which each run through the bpe function.
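
For illustration (a standalone snippet, not the Transformers.js code itself), running that default regex over a short string produces the word-level chunks that are then passed to bpe one at a time:

    // Standalone illustration of the default ByteLevel pretokenization regex quoted above.
    const pattern = /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu;
    const text = "Let's tokenize this: 123 tokens!";
    console.log(text.match(pattern));
    // [ "Let", "'s", " tokenize", " this", ":", " 123", " tokens", "!" ]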

Have you checked if this change has any performance implications? The performance impact here is probably negligible, but I would check anyway. Your version might be even faster.

Yeah, I actually did do quite a bit of performance testing! Although the GPT-4 and Llama tokenizers have many differences (e.g., vocab size: GPT-4 = 100277, Llama 2 = 32000), GPT-4's tokenization is much faster than Llama's (only noticeable with longer pieces of text). An even more important benefit of pretokenization is that smaller words/texts are much more likely to be reused, meaning we can implement a very basic caching system. As an example, tokenizing 500k characters takes ~350ms with GPT-4 vs. ~800ms with Llama (even though GPT-4 has the >3x larger vocabulary/merge lists).

TLDR: pretokenization helps a lot for larger texts, but the difference isn't noticeable for smaller texts!
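
As a rough sketch (illustrative only, not the actual Transformers.js implementation), the caching amounts to memoizing bpe per pretokenized word:

    // Hypothetical sketch of the "very basic caching system" mentioned above.
    // Pretokenization yields many repeated small words, so memoizing bpe() per word
    // avoids redundant work across a long input.
    class CachedEncoder {
        constructor(bpe) {
            this.bpe = bpe;          // underlying bpe(token) -> string[] function
            this.cache = new Map();  // token -> subword tokens
        }
        encode(tokens) {
            const outputTokens = [];
            for (const token of tokens) {
                let subwords = this.cache.get(token);
                if (subwords === undefined) {
                    subwords = this.bpe(token);
                    this.cache.set(token, subwords);
                }
                outputTokens.push(...subwords);
            }
            return outputTokens;
        }
    }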

I was also wondering if this is guaranteed to produce the same results? I believe it's guaranteed to produce the same results for all tokenization schemes which never have whitespace in the middle of a token (e.g. "_grabbed" has whitespace at the beginning, whereas "grab_bed" would have whitespace in the middle).

I've written an extensive set of unit tests, which only pass if the output exactly matches the Python implementation, so I believe the implementation does operate correctly (and there is also no splitting on whitespace, as mentioned above).

When I wrote this code I must have thought "nobody ever needs to tokenize more than 100 000 characters at a time", but it only took a few weeks for all those superhot models to come out, and here we are.

Exactly! I doubt this problem will ever cause any issue in reality, but I just prefer it over a "magic number" approach. The "divide by word length" approach should hopefully future-proof it.

I'm not sure at what point we get numerical stability issues when we multiply / divide large numbers.

Good point to consider. Though, since the denominator is the same in all cases (and it's only the numerator which changes), I don't think there is any possible situation where this fails. 😅

Could I ask you to add a similar code comment credit to the bpe function as well? I would love to get internet points for that :D

Absolutely! 🤗 Sorry - I forgot to update the JSDoc description for it.

@belladoreai
Owner

Ah, very nice explanation.

So the GPT-4 tokenizer uses a pretokenizer with a regex that splits on whitespace and punctuation, leading to a situation where you end up calling bpe on individual words, which is more performant than calling bpe on the entire input string, and this also allows you to set up simple caching for additional performance gain. And the LLaMa tokenizer doesn't use a pretokenizer, so your implementation ends up calling bpe on the entire input string, same as in my implementation.

You're probably right that numerical stability issues won't lead to incorrect results when dividing both numbers by the same denominator. I simplified the mergePrio calculation now, using the same fraction as you did in your implementation.

@xenova
Author

xenova commented Aug 8, 2023

So the GPT-4 tokenizer uses a pretokenizer with a regex that splits on whitespace and punctuation, leading to a situation where you end up calling bpe on individual words, which is more performant than calling bpe on the entire input string, and this also allows you to set up simple caching for additional performance gain. And the LLaMa tokenizer doesn't use a pretokenizer, so your implementation ends up calling bpe on the entire input string, same as in my implementation.

Exactly! 🔥

You're probably right that numerical stability issues won't lead to incorrect results when dividing both numbers by the same denominator. I simplified the mergePrio calculation now, using the same fraction as you did in your implementation.

Looks good 👍

Thanks again for your amazing work! I'll close the issue since I think everything's resolved now 😇
