In the benchmark studies, how are the draft tokens generated? #9

Open
jivanph opened this issue Jan 24, 2024 · 9 comments

@jivanph

jivanph commented Jan 24, 2024

I read with great interest your paper 'Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy'.

In essence, the paper proposes a tree data structure to verify proposed draft tokens, and in this way speed up inference.

Unfortunately, it's not clear to me from the paper how these draft tokens were generated when establishing benchmark results for LookAhead-Parallel and LookAhead-Hierarchical.

I understand the focus of the paper is on how to handle a set of draft tokens (perhaps as a single branch, perhaps in parallel, or perhaps in a hierarchical manner). But the origin of the draft tokens in the benchmark results remains unclear to me.

@jivanph
Author

jivanph commented Jan 24, 2024

A related question regarding the benchmark studies: what sampling mechanism was used to accept tokens? Was it greedy sampling?

@zheyishine
Collaborator

To Q1: The draft tokens are generated from a cached trie tree (each node is a token id). Currently the trie tree is constructed from prompts and responses on-the-fly, so it is deployment-friendly (neither an additional assistant model nor head training is required) and works well in real-life scenarios. In addition, we have also probed Jacobi iteration to yield additional drafts and will integrate it into the repo soon (even though its speedup is marginal).
To Q2: Yes, we use the greedy strategy in our benchmarks.
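
As an illustration only (not the repository's actual API), here is a minimal sketch of the idea described above: a trie whose nodes are token ids is filled from prompts and responses, and draft branches are retrieved by matching the most recent tokens against it. All class and method names below are hypothetical.

```python
# Hypothetical sketch of a token-id trie used to produce draft branches.
# Not the repository's implementation; names are made up for illustration.

class TrieNode:
    def __init__(self):
        self.children = {}  # token_id -> TrieNode


class TokenTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids):
        """Insert every suffix of a token sequence so any position can be matched later."""
        for start in range(len(token_ids)):
            node = self.root
            for tok in token_ids[start:]:
                node = node.children.setdefault(tok, TrieNode())

    def retrieve(self, prefix, max_depth=8):
        """Walk the trie along `prefix` and return the continuations below it as draft branches."""
        node = self.root
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]

        branches = []

        def dfs(n, path):
            if not n.children or len(path) >= max_depth:
                if path:
                    branches.append(path)
                return
            for tok, child in n.children.items():
                dfs(child, path + [tok])

        dfs(node, [])
        return branches


# Usage: cache a tokenized prompt, then ask for drafts that continue the latest token(s).
trie = TokenTrie()
trie.insert([101, 7592, 2088, 102])   # e.g. a tokenized prompt
print(trie.retrieve([7592]))          # -> [[2088, 102]]
```

Inserting every suffix keeps the sketch short; it is not meant to reflect how the repository actually prunes, ranks, or caps branches.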

@jivanph
Author

jivanph commented Jan 26, 2024

Thank you for your responses. I understand the tokens are passed as a tree object. But my question is: how did you choose which tokens to use in the benchmark? And what do you mean by "responses on-the-fly"?

@zheyishine
Collaborator

We choose tokens not only from responses, but also from prompts.

@jivanph
Author

jivanph commented Jan 30, 2024

Could you explain a little bit further how you choose the tokens?

@zheyishine
Collaborator

zheyishine commented Feb 1, 2024

In the benchmark, we first generate responses for samples from the dev set and put the responses into a global trie tree. Then we evaluate each prompt in the test set (all of these samples are different from the ones in the dev set). For each query in the test set, we first put the query into the global trie tree, and the generated tokens are also put into the trie tree on-the-fly. The tokens in the global trie tree then have a chance to be chosen for the following queries.
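
To make the described flow concrete, here is a hypothetical sketch of the benchmark loop. `encode`, `generate_with_lookahead`, and `trie` are placeholders supplied by the caller, not names from the repository; the sketch only illustrates the ordering described above, where dev-set responses seed a global trie and each test query plus its generated tokens are added on-the-fly.

```python
# Hypothetical benchmark flow: dev-set warm-up, then test-set evaluation,
# with all prompt and response tokens cached in one global trie.

def run_benchmark(dev_prompts, test_prompts, encode, generate_with_lookahead, trie):
    # 1) Warm-up: generate responses for the dev set and cache prompt and
    #    response token ids in a single global trie.
    for prompt in dev_prompts:
        response_ids = generate_with_lookahead(prompt, trie)
        trie.insert(encode(prompt))
        trie.insert(response_ids)

    # 2) Evaluation: each test query is inserted into the same global trie
    #    first, and its generated tokens are added on-the-fly, so they have
    #    a chance to be chosen as drafts for the following queries.
    outputs = []
    for prompt in test_prompts:
        trie.insert(encode(prompt))
        response_ids = generate_with_lookahead(prompt, trie)
        trie.insert(response_ids)
        outputs.append(response_ids)
    return outputs
```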

@jivanph
Author

jivanph commented Feb 13, 2024

Thank you for your response. Could you please point me to the part of the code that is in charge of verifying the proposed trees?

My main question is: how can one verify that the output of regular decoding with sampling (instead of greedy decoding) coincides with PAIN decoding with sampling? How can we tell that responses produced by PAIN are the same as responses obtained from regular decoding (under sampling)?

@zheyishine
Collaborator

Lines from here to here are used for verification of tree drafts.

Our lookahead decoding cannot generate exactly the same response as the generation mode SAMPLE in transformers, due to random sampling (i.e., caused by torch.multinomial). We guarantee the same distribution by following the decoding steps of the generation mode SAMPLE. Our implementation is aligned with the generation mode ASSISTED_GENERATION in transformers.
Details can be found in the lines from here to here.
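
For readers following along, here is a minimal, hypothetical sketch of greedy draft verification in the spirit of assisted generation; it is not the repository's verification code. The model scores the context concatenated with one draft branch in a single forward pass, the longest prefix of the draft that matches the model's own argmax predictions is accepted, and one extra model-predicted token is appended. It assumes a Hugging Face-style causal LM that returns `.logits` and 1-D tensors of token ids.

```python
import torch

@torch.no_grad()
def verify_branch_greedy(model, context_ids, draft_ids):
    """Accept the longest prefix of `draft_ids` that matches the model's greedy choices."""
    # Score context + draft in one forward pass.
    input_ids = torch.cat([context_ids, draft_ids]).unsqueeze(0)
    logits = model(input_ids).logits[0]          # (seq_len, vocab_size)

    # Logits at position i predict the token at position i + 1, so the
    # predictions covering the draft start at the last context position.
    start = context_ids.shape[-1] - 1
    predicted = logits[start:start + draft_ids.shape[-1] + 1].argmax(dim=-1)

    accepted = []
    for i, draft_tok in enumerate(draft_ids.tolist()):
        if predicted[i].item() != draft_tok:
            break
        accepted.append(draft_tok)

    # The model's own prediction after the last accepted token is always kept,
    # so at least one new token is produced per verification step.
    accepted.append(predicted[len(accepted)].item())
    return accepted
```

Under sampling rather than greedy decoding, assisted generation replaces the argmax comparison with a probabilistic acceptance test on the token distributions, which preserves the sampling distribution even though individual responses differ.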

@fuerpy

fuerpy commented Mar 6, 2024

The draft tokens are generated from a cached trie tree (each node is a token id). Currently the trie tree is constructed from prompts and responses on-the-fly, so it is deployment-friendly (neither an additional assistant model nor head training is required) and works well in real-life scenarios. In addition, we have also probed Jacobi iteration to yield additional drafts and will integrate it into the repo soon (even though its speedup is marginal).

I'm also confused about how the draft tokens were created.
Do you mean that the draft tokens are generated from previous prompt records, and not by model sampling?
