
[Unity] Add an API to create multiple kv caches with single allocation #15064

Merged
tqchen merged 2 commits into apache:unity from yelite:kv-cache-batch-create
Jun 10, 2023
Conversation

@yelite (Contributor) commented Jun 8, 2023

This is useful when creating multiple kv caches with the same shape. On an A10G, compared to creating 64 kv caches separately for LLaMA in mlc-llm, a single allocation saves about 35 ms.

@tqchen @junrushao

@tvm-bot (Collaborator) commented Jun 8, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@yelite yelite changed the title [Unity] Add a batch API to create multiple kv caches with single allocation [Unity] Add an API to create multiple kv caches with single allocation Jun 8, 2023
```cpp
Array<AttentionKVCache> result;
for (int i = 0; i < num_caches; ++i) {
  // Use DLManagedTensor to prevent underlying memory from being freed
  DLManagedTensor* data_view = block_view.ToDLPack();
```
Member:
Likely we can reuse the memory allocator (storage interface) without having to go through DLPack.

Contributor (Author):

Thanks! I updated the code to use the storage interface and it looks cleaner. But now it may print a warning message if the requested allocator type mismatches the allocator created at VM initialization.

@tqchen (Member) commented Jun 9, 2023

cc @yzh119 @Hzfengsy

@yzh119 (Member) left a comment:

LGTM, and thanks for doing this. I just have a few minor comments.

```cpp
                                 int init_fill_count, int num_caches) {
  DLDataType dtype = init_data->dtype;

  int64_t cache_size = (dtype.bits * dtype.lanes + 7) / 8;
```
Member:

So currently, if the dtype is smaller than one byte, we pad it to one byte, is that correct?
FYI: FlexGen uses a 4-bit KV cache; we can support it later.

Member:

I think it is fine for now, since sub-byte values are usually packed manually (the dtype is i32).

Member:

Thanks for the clarification, makes sense to me.

@tqchen tqchen merged commit e9ddd47 into apache:unity Jun 10, 2023

4 participants