Description
llama.cpp "git pull" from 25/02/2024 (~b2254)
Windows 10 (latest, fully updated), i7-4770, 16 GB RAM
Radeon VII, 16 GB VRAM
I'm compiling and running from a Miniconda environment.
I can build and run the Vulkan version fine.
However, when I try hipBLAS, both main and server quit right after "ggml_backend_sched: failed to allocate graph, reserving".
Is this because ggml-cuda.cu doesn't support gfx906?
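For reference, a minimal HIP check along these lines (just a sketch; the file and binary names are examples, and it assumes ROCm 5.7's hipcc with --offload-arch=gfx906) should confirm what arch the runtime reports and whether a kernel built for gfx906 actually launches on this card:

```cpp
// check_gfx906.hip -- illustrative sketch: print the reported arch and try a
// trivial kernel launch. An error such as hipErrorNoBinaryForGpu on the launch
// would suggest the binary has no code objects for this GPU.
// Example build: hipcc --offload-arch=gfx906 check_gfx906.hip -o check_gfx906.exe
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void noop_kernel() {}

int main() {
    hipDeviceProp_t prop{};
    hipError_t err = hipGetDeviceProperties(&prop, 0);
    if (err != hipSuccess) {
        std::printf("hipGetDeviceProperties failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    std::printf("device 0: %s, gcnArchName: %s\n", prop.name, prop.gcnArchName);

    // Launch a do-nothing kernel and synchronize to surface any launch error.
    noop_kernel<<<dim3(1), dim3(1), 0, 0>>>();
    err = hipDeviceSynchronize();
    std::printf("kernel launch/sync: %s\n", hipGetErrorString(err));
    return err == hipSuccess ? 0 : 1;
}
```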
Building from latest Release
$env:path = "C:\SDK\LLM\ccache-4.9.1-windows-x86_64;C:\Program Files\AMD\ROCm\5.7\bin;" + $env:path
$env:HIP_PLATFORM="amd"
HipInfo
....
gcnArchName:                      gfx906:sramecc-:xnack-
.....
mkdir build
cd .\build
cmake -G Ninja -DAMDGPU_TARGETS=gfx906 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
{no errors}
cmake --build .
{a bunch of warnings related to [-Wgnu-zero-variadic-macro-arguments], [-Wlanguage-extension-token], etc.}
cd .\bin
.\main.exe -m H:\LLM\models\tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -ngl 40 -sm none -n -1 --color -r "User:" --in-prefix " " -i -e -p "User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:"
It appears to load up using ROCm.
Log start
main: build = 2254 (9e359a4)
main: built with  for x86_64-pc-windows-msvc
main: seed  = 1708864556
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon VII, compute capability 9.0, VMM: no
........
........
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   702.14 MiB
llm_load_tensors:        CPU buffer size =    42.97 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =    11.00 MiB
llama_new_context_with_model: KV self size  =   11.00 MiB, K (f16):    5.50 MiB, V (f16):    5.50 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =     6.01 MiB
ggml_gallocr_reserve_n: reallocating ROCm0 buffer from size 0.00 MiB to 66.50 MiB
ggml_gallocr_reserve_n: reallocating ROCm_Host buffer from size 0.00 MiB to 4.00 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =    66.50 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     4.00 MiB
llama_new_context_with_model: graph splits (measure): 3
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched: failed to allocate graph, reserving
(LLM) PS C:\SDK\LLM\llama.cpp\build\bin>
I tried different models (Llama 2, Mistral), different sizes (3B, 7B, 13B), and different quantizations (F16, Q8_0, Q4_0, etc.); all behave the same.
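For what it's worth, here is a purely illustrative sketch (hypothetical names and structure, not the actual ggml code) of the pattern those last three log lines describe: an allocator reserves buffers for a measured graph, and when a later graph has a different node count while the allocation is split across multiple buffers, it refuses to reallocate on its own and the scheduler has to reserve again:

```cpp
// Illustrative sketch only -- all names here are hypothetical, not ggml's API.
#include <cstdio>

struct GraphAllocator {
    int reserved_n_nodes = 0;  // node count of the graph used when reserving
    int n_buffers = 1;         // >1 when the graph is split across backends

    void reserve(int n_nodes) { reserved_n_nodes = n_nodes; }

    bool needs_realloc(int n_nodes) const {
        return n_nodes != reserved_n_nodes;  // "graph has different number of nodes"
    }

    bool alloc_graph(int n_nodes) {
        if (needs_realloc(n_nodes) && n_buffers > 1) {
            // "cannot reallocate multi buffer graph automatically, call reserve"
            return false;
        }
        return true;
    }
};

int main() {
    GraphAllocator galloc;
    galloc.n_buffers = 3;  // cf. "graph splits (measure): 3" in the log above
    galloc.reserve(100);   // sized for the measured graph

    if (!galloc.alloc_graph(120)) {
        // cf. "ggml_backend_sched: failed to allocate graph, reserving"
        std::printf("failed to allocate graph, reserving\n");
        galloc.reserve(120);
    }
    return 0;
}
```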