
Conversation

@gante
Collaborator

@gante gante commented Jun 25, 2023

⚠️ do not merge!

This is an experimental PR that rewrites the generation code to mimic what we do in XLA JAX/TF.

It allows a fixed cache size and avoids dynamic slicing before torch.nn.functional.scaled_dot_product_attention, allowing compilation with fullgraph=True. The trick is to keep the prefill step with dynamic shapes (where dynamic slicing is not needed), while all subsequent steps append the latest KV values at the last position of the fixed-shape tensors. Between generation steps, the cached values are shifted into the right position, so we can keep appending new values at the end.
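
Roughly, the idea looks like the sketch below (illustrative only; the names, layout, and shift logic are a simplified reading of the approach, not the PR's actual code):

import torch

def prefill_cache(key_states, value_states, max_cache_len):
    # Prefill: dynamic shapes are fine here, no slicing into a fixed cache yet.
    bsz, num_heads, prompt_len, head_dim = key_states.shape
    k_cache = key_states.new_zeros((bsz, num_heads, max_cache_len, head_dim))
    v_cache = value_states.new_zeros((bsz, num_heads, max_cache_len, head_dim))
    # Place the prompt KV at the end of the fixed-shape cache.
    k_cache[:, :, -prompt_len:] = key_states
    v_cache[:, :, -prompt_len:] = value_states
    return k_cache, v_cache

def decode_step(k_cache, v_cache, new_key, new_value):
    # Shift everything one position to the left, then write the new KV entry
    # in the last position -- the tensor shapes never change between steps,
    # which is what makes fullgraph=True compilation possible.
    k_cache = torch.roll(k_cache, shifts=-1, dims=2)
    v_cache = torch.roll(v_cache, shifts=-1, dims=2)
    k_cache[:, :, -1:] = new_key
    v_cache[:, :, -1:] = new_value
    return k_cache, v_cache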

Learnings

PT version: torch==2.1.0.dev20230621+cu118

  • We can compile this version with fullgraph=True, keeping the correctness of the results;
  • It is slower than the existing main, especially when compiled: ~3% slower without compilation, ~10% slower with compilation;
  • Causes for the slowdown vs main, as seen in the profiler (comparing python scripts/run_llama.py --model huggingface/llama-7b --preallocate --profile runs; see the sketch after this list):
    • there is extra logic between generation steps, accounting for an extra >1 ms/token
    • scaled_dot_product_attention, because it now needs the attention mask, takes an extra >100 µs/layer, i.e. >3 ms/token
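
For reference, the compile + profile setup boils down to something like the following (illustrative only; the real flags and model loading live in scripts/run_llama.py):

import torch
from torch.profiler import profile, ProfilerActivity

@torch.compile(fullgraph=True)  # any graph break now raises instead of silently falling back
def toy_decode_step(q, k, v):
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 8, 128, 64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    toy_decode_step(q, k, v)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))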

@fxmarty
Owner

fxmarty commented Jun 26, 2023

scaled_dot_product_attention, because it now needs the attention mask, takes an extra >100 µs/layer, i.e. >3 ms/token

Maybe you already know, but AFAIK SDPA does not currently dispatch to flash attention / memory-efficient attention when an attention_mask is passed, see pytorch/pytorch#96099 (comment) & https://huggingface.slack.com/archives/C046RST834Y/p1679584564826879, so that may be the reason for the slowdown.
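
For what it's worth, this is easy to check in isolation. The sketch below assumes a CUDA device and the dispatch behavior of current nightlies (which may change): with only the flash backend enabled, the masked call errors out instead of dispatching.

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
mask = torch.ones(1, 1, 128, 128, device="cuda", dtype=torch.bool)

# Force the flash attention backend only: the call with an attn_mask should
# raise ("no available kernel"), while the causal call without a mask runs.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    F.scaled_dot_product_attention(q, k, v, is_causal=True)      # OK, flash attention
    F.scaled_dot_product_attention(q, k, v, attn_mask=mask)      # RuntimeError: no kernel supports the mask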

allowing compilation with fullgraph=True

No more issues with the VariableBuilder.__init__() error?

@gante
Collaborator Author

gante commented Jun 27, 2023

No more issues with the VariableBuilder.__init__() error?

Nope (and I didn't even change the PT version 👀 )

@fxmarty
Owner

fxmarty commented Jun 27, 2023

@gante (I am not talking about an actual error, but about the one that only shows up in logging.DEBUG mode)

@gante
Collaborator Author

gante commented Jun 27, 2023

@fxmarty I don't see it (but maybe I'm not looking in the right place)

@gante
Collaborator Author

gante commented Jun 27, 2023

Using the plotting facilities from #12 (and taking the plots in that PR as the reference for main's performance):

batch size sweep

[plot: llama_sweep_58bb48d_batch]

prompt length sweep

[plot: llama_sweep_58bb48d_length]

performance conclusions

  • For batch size = 1 it's roughly on par with main
  • For batch size > 1 it's slower than main
  • This PR has lower memory consumption, especially when compiled (with batch size = 4: 16223 MB vs 16895 MB on main)

@fxmarty
Owner

fxmarty commented Jun 28, 2023

Interesting, great work! I assume that for batch size > 1 we become more and more compute-bound, so running the actual compute over a huge, mostly empty KV cache is not that great?

@fxmarty
Owner

fxmarty commented Jun 28, 2023

@fxmarty I don't see it (but maybe I'm not looking in the right place)

I have the error log using

import logging
import torch

torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, inductor=logging.DEBUG)

and inspecting logs from python run_llama.py --model huggingface/llama-7b --preallocate --compile static &> compile.log

I have a TypeError: VariableBuilder.__init__() got an unexpected keyword argument 'guards' in the logs.

@gante
Collaborator Author

gante commented Jun 28, 2023

I assume for batch size > 1 we may get more and more compute bound and so having a huge empty KV cache used for actual compute is not that great?

My thoughts as well. Dynamic slicing (as in main) forces us out of fullgraph compilation, but it has its benefits. Perhaps the difference will diminish as PT keeps improving the compiler 🤔
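
To make the trade-off concrete, a main-style decode step looks roughly like this (hypothetical sketch, not the actual code in main):

import torch

def decode_step_dynamic(k_cache, v_cache, new_key, new_value):
    # main-style cache: concatenate the new entry, so the sequence dimension
    # grows every step and the compiled graph sees a new shape each time
    # (recompilations / graph breaks, but no wasted compute on padding).
    k_cache = torch.cat([k_cache, new_key], dim=2)
    v_cache = torch.cat([v_cache, new_value], dim=2)
    return k_cache, v_cache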
