
Float8 cache usage #155

Closed

YLGH opened this issue Mar 5, 2024 · 2 comments

YLGH commented Mar 5, 2024

Hi! I'm playing with batch_decode_with_padded_kv_cache and wanted to test out the FP8 KV cache, but I couldn't find good instructions in the docs.
I've tried the following:

import torch
import flashinfer

num_qo_heads = 32
num_kv_heads = 32
batch_size = 16
head_dim = 128
padded_kv_len = 1024

# Query, key, and value all cast to fp8 e4m3.
q = torch.empty(
    batch_size,
    num_qo_heads,
    head_dim,
    device=torch.device("cuda"),
    dtype=torch.float8_e4m3fn,
)
k_padded = (
    torch.randn(batch_size, padded_kv_len, num_kv_heads, head_dim)
    .to("cuda:0")
    .to(torch.float8_e4m3fn)
)
v_padded = (
    torch.randn(batch_size, padded_kv_len, num_kv_heads, head_dim)
    .to("cuda:0")
    .to(torch.float8_e4m3fn)
)
o = flashinfer.batch_decode_with_padded_kv_cache(
    q, k_padded, v_padded, "NHD", "NONE"
)

But this fails with BatchDecodeWithPaddedKVCache kernel launch failed: supported data type.

How can I enable FP8 KV cache? Thanks in advance!


zhyncs commented Mar 5, 2024

Refer to #150.

Collaborator

yzh119 commented Mar 5, 2024

@YLGH done in #156.
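
For reference, a minimal sketch of the usage after #156, assuming the updated kernels accept a half-precision query alongside an fp8 KV cache. Before #156 the padded-decode kernel did not accept fp8 inputs at all, so the exact supported dtype combinations are defined in that PR; treat this as an illustration rather than the definitive API:

import torch
import flashinfer

num_qo_heads = 32
num_kv_heads = 32
batch_size = 16
head_dim = 128
padded_kv_len = 1024

# Assumption: the query stays in fp16; only the KV cache is stored in fp8 e4m3.
q = torch.randn(
    batch_size, num_qo_heads, head_dim, device="cuda", dtype=torch.float16
)
k_padded = torch.randn(
    batch_size, padded_kv_len, num_kv_heads, head_dim, device="cuda"
).to(torch.float8_e4m3fn)
v_padded = torch.randn(
    batch_size, padded_kv_len, num_kv_heads, head_dim, device="cuda"
).to(torch.float8_e4m3fn)

o = flashinfer.batch_decode_with_padded_kv_cache(
    q, k_padded, v_padded, "NHD", "NONE"
)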

yzh119 closed this as completed Mar 5, 2024