
Grammar sampler implementation causes non-trivial token speed degradation #3980

Closed
kalomaze opened this issue Nov 7, 2023 · 5 comments

@kalomaze
Contributor

kalomaze commented Nov 7, 2023

The copy function of the Grammar sampler specifically is O(n^2) in time complexity.

On a 13B Q4_K_M model, with all layers fully offloaded to my GPU (RTX 3060, 12GB VRAM), I normally get a token speed of ~16 T/s. This degrades to ~10 T/s with grammar sampling on, regardless of the complexity of the grammar being used.

I'm not sure whether the sampler code is currently threaded, or whether threading would help, but hopefully the grammar implementation can be refactored in some way to accommodate this.

I'm not sure whether it's running through the entire list of 32,000 logits. If this time complexity is inherently necessary, maybe it would be smart to run the grammar sampler only after the truncation samplers (Top K, Min P, ...)?

@jmikedupont2

I can confirm a huge slowdown

@ejones
Collaborator

ejones commented Nov 13, 2023

So overall, the initial grammar implementation was not particularly optimized, so there are certainly optimization opportunities. I also know of certain pathological grammars that seem to cause large slowdowns or hanging; I haven't looked into that yet. I will say though that:

  • llama_grammar_copy only ends up being used in the speculative example AFAIK, and is not part of the core grammar sampling in main or server
  • as a baseline, I generally see ~5ms / token sampling times for e.g. JSON grammars, on the M2 Max at least. This is much lower than the ~37ms implied by your numbers, if I did the math right there.

@kalomaze
Contributor Author

kalomaze commented Nov 16, 2023

> So overall, the initial grammar implementation was not optimized in any way in particular, so there are certainly optimization opportunities. Also I do know of certain pathological grammars that seem to cause large slowdowns or hanging, I haven't looked into that yet. I will say though that:
>
>   • llama_grammar_copy only ends up being used in the speculative example AFAIK, not part of the core grammar sampling in main or server
>   • as a baseline, I generally see ~5ms / token sampling times for e.g. JSON grammars, on the M2 Max at least. This is much lower than the ~37ms in your experience, if I did the math right there.

Assuming the sampler code runs on the CPU rather than the GPU, I imagine an M2 Mac does better than an i7 CPU from 2016, lol

@ExtReMLapin
Contributor

Something like this should help a bit, no?

struct llama_grammar * llama_grammar_copy(const struct llama_grammar * grammar) {
    llama_grammar * result = new llama_grammar{ grammar->rules, grammar->stacks, grammar->partial_utf8 };

    // map each old rule-element address to its counterpart in the copied
    // rules, so stack pointers can be remapped with an O(1) lookup instead
    // of a linear scan (requires #include <unordered_map>)
    std::unordered_map<const llama_grammar_element *, const llama_grammar_element *> pointer_map;

    // create a map of old pointers to new pointers
    for (size_t ir0 = 0; ir0 < grammar->rules.size(); ir0++) {
        for (size_t ir1 = 0; ir1 < grammar->rules[ir0].size(); ir1++) {
            pointer_map[&grammar->rules[ir0][ir1]] = &result->rules[ir0][ir1];
        }
    }

    // redirect elements in stacks to point to new rules
    for (size_t is = 0; is < result->stacks.size(); is++) {
        for (size_t ie = 0; ie < result->stacks[is].size(); ie++) {
            result->stacks[is][ie] = pointer_map[grammar->stacks[is][ie]];
        }
    }

    return result;
}


github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024