Does flashinfer support float datatype? #191

Open
ZSL98 opened this issue Mar 26, 2024 · 3 comments

Comments

ZSL98 commented Mar 26, 2024

The examples all use tensors of half() type. I wonder whether flashinfer supports the fp32 dtype?

@chenzhuofu

I got the same question. I am instantiating the SinglePrefillWithKVCacheDispatched function, but found that it has a static_assert(sizeof(DTypeIn) == 2); check. @yzh119 Is this due to some implementation consideration?
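For context, a simplified illustration of the kind of compile-time check being referenced (not the actual flashinfer source; prefill_entry below is only a stand-in):

#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Simplified sketch: the prefill path requires a 16-bit input dtype, so
// half/bfloat16 (sizeof == 2) compile while float (sizeof == 4) trips the
// static_assert at compile time.
template <typename DTypeIn>
void prefill_entry(/* ... */) {
  static_assert(sizeof(DTypeIn) == 2, "prefill kernels expect a 16-bit input dtype");
  // ... kernel dispatch would go here ...
}

// prefill_entry<__half>();         // OK: 2 bytes
// prefill_entry<__nv_bfloat16>();  // OK: 2 bytes
// prefill_entry<float>();          // fails to compile: sizeof(float) == 4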

yzh119 (Collaborator) commented Jun 5, 2024

The decode attention operators support fp32; we just need to add fp32 to this macro:

[&]() -> bool { \
  switch (pytorch_dtype) { \
    case at::ScalarType::Half: { \
      using c_type = nv_half; \
      return __VA_ARGS__(); \
    } \
    case at::ScalarType::BFloat16: { \
      using c_type = nv_bfloat16; \
      return __VA_ARGS__(); \
    } \
    default: \
      std::ostringstream oss; \
      oss << __PRETTY_FUNCTION__ << " failed to dispatch data type " << pytorch_dtype; \
      TORCH_CHECK(false, oss.str()); \
      return false; \
  } \
}()
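For reference, the additional case might look roughly like this (a sketch, untested; it assumes the decode kernels compile with plain float as the c_type):

case at::ScalarType::Float: { \
  using c_type = float; \
  return __VA_ARGS__(); \
} \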

For prefill/append attention it's a little bit tricky, because many instructions such as ldmatrix (https://docs.nvidia.com/cuda/parallel-thread-execution/#warp-level-matrix-instructions-ldmatrix) only support 16-bit operands, which makes it non-trivial to load fp32 tiles (especially the transposed loads) from shared memory into registers. One option is to convert the fp32 input to bf16 and use the bf16 prefill attention kernels; we could design an API in flashinfer that accepts bf16/fp16 input and returns fp32 output.
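To illustrate the proposed approach, here is a rough host-side sketch (hypothetical, not an existing flashinfer API; bf16_prefill is a placeholder for whichever bf16 prefill entry point is used):

#include <torch/torch.h>
#include <functional>

// Hypothetical wrapper sketch: downcast fp32 q/k/v to bf16, run an existing
// bf16 prefill kernel, and return the result upcast to fp32. The kernel's
// internal accumulation is typically already fp32; the proposal above is to
// expose that fp32 output directly instead of casting it back to 16-bit.
torch::Tensor prefill_fp32_via_bf16(
    const torch::Tensor& q, const torch::Tensor& k, const torch::Tensor& v,
    const std::function<torch::Tensor(const torch::Tensor&, const torch::Tensor&,
                                      const torch::Tensor&)>& bf16_prefill) {
  auto out_bf16 = bf16_prefill(q.to(torch::kBFloat16), k.to(torch::kBFloat16),
                               v.to(torch::kBFloat16));
  return out_bf16.to(torch::kFloat32);
}

With this kind of wrapper, the only precision loss comes from the one-time fp32-to-bf16 cast of the inputs.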


chenzhuofu commented Jun 5, 2024

Got it, my use case is the prefill/append kernel, and it does look tricky indeed. Thanks for your kind reply. I think fp32 output support sounds great and would be very helpful!
