Obtaining file:///home/lmzheng/flashinfer/python
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Installing collected packages: flashinfer
  Running setup.py develop for flashinfer
    error: subprocess-exited-with-error
    
    × python setup.py develop did not run successfully.
    │ exit code: 1
    ╰─> [992 lines of output]
        running develop
        /home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/develop.py:40: EasyInstallDeprecationWarning: easy_install command is deprecated.
        !!
        
                ********************************************************************************
                Please avoid running ``setup.py`` and ``easy_install``.
                Instead, use pypa/build, pypa/installer or other
                standards-based tools.
        
                See https://github.com/pypa/setuptools/issues/917 for details.
                ********************************************************************************
        
        !!
          easy_install.initialize_options(self)
        /home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
        !!
        
                ********************************************************************************
                Please avoid running ``setup.py`` directly.
                Instead, use pypa/build, pypa/installer or other
                standards-based tools.
        
                See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
                ********************************************************************************
        
        !!
          self.initialize_options()
        running egg_info
        writing flashinfer.egg-info/PKG-INFO
        writing dependency_links to flashinfer.egg-info/dependency_links.txt
        writing top-level names to flashinfer.egg-info/top_level.txt
        reading manifest file 'flashinfer.egg-info/SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        warning: no previously-included files found matching 'tests/'
        no previously-included directories found matching '*/__pycache__'
        warning: no previously-included files matching '*.so' found anywhere in distribution
        writing manifest file 'flashinfer.egg-info/SOURCES.txt'
        running build_ext
        /home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py:414: UserWarning: The detected CUDA version (12.0) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
          warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
        /home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no g++ version bounds defined for CUDA version 12.0
          warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
        building 'flashinfer._kernels' extension
        Emitting ninja build file /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/build.ninja...
        Compiling objects...
        Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
        [1/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
        [2/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
        [3/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
        [4/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
        [5/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
        [6/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
        [7/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
        [8/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
        [9/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
        [10/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
        [11/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
        [12/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
        [13/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
        [14/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
        [15/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
        [16/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
        [17/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
        [18/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
        [19/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
        [20/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
        [21/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
        [22/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
        [23/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
        [24/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
        [25/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
        [26/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
        [27/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
        [28/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
        [29/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
        [30/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
        [31/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
        [32/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
        [33/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
        [34/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
        [35/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
        [36/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
        [37/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
        [38/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
        [39/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
        [40/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
        [41/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
        [42/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
        [43/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
        [44/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
        [45/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
        [46/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
        [47/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
        [48/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
        [49/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
        [50/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
        [51/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
        [52/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
        [53/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
        [54/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
        [55/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
        [56/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
        [57/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
        [58/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
        [59/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
        [60/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
        [61/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
        [62/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
        [63/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
        [64/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
        [65/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
        [66/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
        [67/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
        [68/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
        [69/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
        [70/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
        [71/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
        [72/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
        [73/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
        [74/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
        [75/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
        [76/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
        [77/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
        [78/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
        [79/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
        [80/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
        [81/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
        [82/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
        [83/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
        [84/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
        [85/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
        [86/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
        [87/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
        [88/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
        [89/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
        [90/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
        [91/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
        [92/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
        [93/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
        [94/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
        [95/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
        [96/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
        [97/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
        
        1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
        [98/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/batch_prefill.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/batch_prefill.o
        /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/batch_prefill.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_bfloat16, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_bfloat16, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_bfloat16, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                    argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_bfloat16, int32_t>, c_type *, float *, float, float, cudaStream_t)
        
        Error limit reached.
        100 errors detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/batch_prefill.cu".
        Compilation terminated.
        ninja: build stopped: subcommand failed.
        Traceback (most recent call last):
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
            subprocess.run(
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/subprocess.py", line 528, in run
            raise CalledProcessError(retcode, process.args,
        subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
        
        The above exception was the direct cause of the following exception:
        
        Traceback (most recent call last):
          File "<string>", line 2, in <module>
          File "<pip-setuptools-caller>", line 34, in <module>
          File "/home/lmzheng/flashinfer/python/setup.py", line 210, in <module>
            setuptools.setup(
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/__init__.py", line 103, in setup
            return distutils.core.setup(**attrs)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 185, in setup
            return run_commands(dist)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
            dist.run_commands()
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
            self.run_command(cmd)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/dist.py", line 989, in run_command
            super().run_command(command)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
            cmd_obj.run()
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/develop.py", line 34, in run
            self.install_for_development()
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/develop.py", line 109, in install_for_development
            self.run_command('build_ext')
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
            self.distribution.run_command(command)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/dist.py", line 989, in run_command
            super().run_command(command)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
            cmd_obj.run()
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 88, in run
            _build_ext.run(self)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
            self.build_extensions()
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 873, in build_extensions
            build_ext.build_extensions(self)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
            self._build_extensions_serial()
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
            self.build_extension(ext)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 249, in build_extension
            _build_ext.build_extension(self, ext)
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
            objects = self.compiler.compile(
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 686, in unix_wrap_ninja_compile
            _write_ninja_file_and_compile_objects(
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
            _run_ninja_build(
          File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
            raise RuntimeError(message) from e
        RuntimeError: Error compiling objects for extension
        [end of output]
    
    note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× python setup.py develop did not run successfully.
│ exit code: 1
╰─> [992 lines of output]
    running develop
    /home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/develop.py:40: EasyInstallDeprecationWarning: easy_install command is deprecated.
    !!
    
            ********************************************************************************
            Please avoid running ``setup.py`` and ``easy_install``.
            Instead, use pypa/build, pypa/installer or other
            standards-based tools.
    
            See https://github.com/pypa/setuptools/issues/917 for details.
            ********************************************************************************
    
    !!
      easy_install.initialize_options(self)
    /home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
    !!
    
            ********************************************************************************
            Please avoid running ``setup.py`` directly.
            Instead, use pypa/build, pypa/installer or other
            standards-based tools.
    
            See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
            ********************************************************************************
    
    !!
      self.initialize_options()
    running egg_info
    writing flashinfer.egg-info/PKG-INFO
    writing dependency_links to flashinfer.egg-info/dependency_links.txt
    writing top-level names to flashinfer.egg-info/top_level.txt
    reading manifest file 'flashinfer.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no previously-included files found matching 'tests/'
    no previously-included directories found matching '*/__pycache__'
    warning: no previously-included files matching '*.so' found anywhere in distribution
    writing manifest file 'flashinfer.egg-info/SOURCES.txt'
    running build_ext
    /home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py:414: UserWarning: The detected CUDA version (12.0) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
      warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
    /home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no g++ version bounds defined for CUDA version 12.0
      warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
    building 'flashinfer._kernels' extension
    Emitting ninja build file /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/build.ninja...
    Compiling objects...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    [1/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
    [2/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
    [3/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
    [4/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
    [5/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
    [6/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
    [7/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
    [8/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
    [9/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
    [10/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
    [11/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
    [12/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
    [13/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
    [14/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
    [15/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
    [16/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
    [17/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
    [18/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
    [19/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
    [20/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
    [21/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
    [22/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
    [23/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
    [24/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
    [25/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
    [26/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
    [27/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
    [28/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
    [29/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
    [30/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
    [31/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
    [32/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
    [33/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
    [34/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
    [35/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
    [36/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
    [37/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
    [38/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
    [39/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
    [40/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
    [41/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
    [42/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
    [43/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
    [44/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
    [45/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
    [46/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
    [47/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
    [48/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
    [49/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
    [50/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
    [51/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
    [52/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
    [53/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
    [54/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
    [55/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
    [56/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
    [57/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
    [58/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
    [59/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
    [60/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
    [61/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_bf16.cu".
    [62/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
    [63/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
    [64/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
    [65/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
    [66/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
    [67/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
    [68/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
    [69/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
    [70/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
    [71/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_bf16.cu".
    [72/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
    [73/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
    [74/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
    [75/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
    [76/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
    [77/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_fp16.cu".
    [78/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_bf16.cu".
    [79/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
    [80/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_fp16.cu".
    [81/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
    [82/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutHND_rotaryNone_fp16.cu".
    [83/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
    [84/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head64_causalFalse_fp16qkFalse_layoutHND_rotaryLlama_bf16.cu".
    [85/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutHND_rotaryNone_bf16.cu".
    [86/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
    [87/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
    [88/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryLlama_fp16.cu".
    [89/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
    [90/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryLlama_bf16.cu".
    [91/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_fp16.cu".
    [92/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutHND_rotaryNone_fp16.cu".
    [93/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalFalse_fp16qkFalse_layoutNHD_rotaryNone_fp16.cu".
    [94/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalFalse_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
    [95/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kHND, GROUP_SIZE=4U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kLlama, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=false, DTypeIn=nv_half, DTypeOut=nv_half, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group4_head128_causalFalse_fp16qkTrue_layoutHND_rotaryLlama_fp16.cu".
    [96/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=128U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=false, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head128_causalTrue_fp16qkFalse_layoutNHD_rotaryNone_bf16.cu".
    [97/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu(7): error: function "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched<page_storage,kv_layout,GROUP_SIZE,HEAD_DIM,ROTARY_MODE,ALLOW_FP16_QK_REDUCTION,CAUSAL,DTypeIn,DTypeOut,IdType>(flashinfer::BatchPrefillHandler *, DTypeIn *, IdType *, flashinfer::paged_kv_t<page_storage, kv_layout, DTypeIn, IdType>, DTypeOut *, float *, float, float, cudaStream_t) [with page_storage=flashinfer::PageStorage::kIndices, kv_layout=flashinfer::QKVLayout::kNHD, GROUP_SIZE=1U, HEAD_DIM=64U, ROTARY_MODE=flashinfer::RotaryMode::kNone, ALLOW_FP16_QK_REDUCTION=true, CAUSAL=true, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" cannot be instantiated -- no template definition was supplied
    
    1 error detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/generated/paged_batch_prefill_group1_head64_causalTrue_fp16qkTrue_layoutNHD_rotaryNone_bf16.cu".
    [98/580] /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/batch_prefill.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    FAILED: /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/batch_prefill.o
    /usr/local/cuda/bin/nvcc  -I/home/lmzheng/flashinfer/python/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/TH -I/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/lmzheng/anaconda3/envs/fastchat/include/python3.9 -c -c /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu -o /home/lmzheng/flashinfer/python/build/temp.linux-x86_64-cpython-39/csrc/batch_prefill.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kHND, nv_half, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_bfloat16, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_bfloat16, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_bfloat16, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    /home/lmzheng/flashinfer/python/csrc/batch_prefill.cu(99): error: no instance of function template "flashinfer::BatchPrefillWithPagedKVCacheWrapperDispatched" matches the argument list
                argument types are: (flashinfer::BatchPrefillHandler *, c_type *, int32_t *, std::nullptr_t, flashinfer::paged_kv_t<flashinfer::PageStorage::kIndices, flashinfer::QKVLayout::kNHD, nv_bfloat16, int32_t>, c_type *, float *, float, float, cudaStream_t)
    
    Error limit reached.
    100 errors detected in the compilation of "/home/lmzheng/flashinfer/python/csrc/batch_prefill.cu".
    Compilation terminated.
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
        subprocess.run(
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/subprocess.py", line 528, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/home/lmzheng/flashinfer/python/setup.py", line 210, in <module>
        setuptools.setup(
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/__init__.py", line 103, in setup
        return distutils.core.setup(**attrs)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 185, in setup
        return run_commands(dist)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
        dist.run_commands()
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
        self.run_command(cmd)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/dist.py", line 989, in run_command
        super().run_command(command)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
        cmd_obj.run()
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/develop.py", line 34, in run
        self.install_for_development()
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/develop.py", line 109, in install_for_development
        self.run_command('build_ext')
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
        self.distribution.run_command(command)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/dist.py", line 989, in run_command
        super().run_command(command)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
        cmd_obj.run()
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 88, in run
        _build_ext.run(self)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
        self.build_extensions()
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 873, in build_extensions
        build_ext.build_extensions(self)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
        self._build_extensions_serial()
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
        self.build_extension(ext)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 249, in build_extension
        _build_ext.build_extension(self, ext)
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
        objects = self.compiler.compile(
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 686, in unix_wrap_ninja_compile
        _write_ninja_file_and_compile_objects(
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
        _run_ninja_build(
      File "/home/lmzheng/anaconda3/envs/fastchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension
    [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.