HIPBlas AMD Radeon VII under Windows: quits out after "ggml_backend_sched: failed to allocate graph, reserving" #5712

@moozoo64

Description

llama.cpp as of 25/02/2024 "git pull" ~[b2254]
Windows 10 (latest, fully updated), i7-4770, 16 GB RAM
Radeon VII, 16 GB VRAM
I'm compiling and running under miniconda.
I can build and run the Vulkan version fine.
However, when I try HIPBlas, both main and server quit out after "ggml_backend_sched: failed to allocate graph, reserving".
Is this because ggml-cuda.cu doesn't support gfx906?

Building from the latest release:
$env:path = "C:\SDK\LLM\ccache-4.9.1-windows-x86_64;C:\Program Files\AMD\ROCm\5.7\bin;" + $env:path
$env:HIP_PLATFORM="amd"
hipInfo
....
gcnArchName: gfx906:sramecc-:xnack-
.....
mkdir build
cd .\build
cmake -G Ninja -DAMDGPU_TARGETS=gfx906 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
{no errors}
cmake --build .
{a bunch of warnings related to [-Wgnu-zero-variadic-macro-arguments], [-Wlanguage-extension-token], etc.}
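
In case it helps reproduce, here is the whole clean rebuild as one script (a sketch of what I run; Remove-Item and -DCMAKE_BUILD_TYPE=Release are additions here, everything else matches the commands above):

# Clean rebuild sketch; assumes the $env:path / $env:HIP_PLATFORM setup above is already in place
Remove-Item -Recurse -Force .\build -ErrorAction SilentlyContinue
mkdir build
cd .\build
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx906 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
cmake --build .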
cd .\bin
.\main.exe -m H:\LLM\models\tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -ngl 40 -sm none -n -1 --color -r "User:" --in-prefix " " -i -e -p "User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:"

It appears to load up using ROCm.


Log start
main: build = 2254 (9e359a4)
main: built with for x86_64-pc-windows-msvc
main: seed = 1708864556
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon VII, compute capability 9.0, VMM: no
........
........

llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: ROCm0 buffer size = 702.14 MiB
llm_load_tensors: CPU buffer size = 42.97 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: ROCm_Host input buffer size = 6.01 MiB
ggml_gallocr_reserve_n: reallocating ROCm0 buffer from size 0.00 MiB to 66.50 MiB
ggml_gallocr_reserve_n: reallocating ROCm_Host buffer from size 0.00 MiB to 4.00 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 66.50 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 4.00 MiB
llama_new_context_with_model: graph splits (measure): 3
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched: failed to allocate graph, reserving
(LLM) PS C:\SDK\LLM\llama.cpp\build\bin>

I tried different models (llama2, mistral) in different sizes (3B, 7B, 13B) and different quantizations (F16, Q8_0, Q4_0, etc.); all fail the same way.
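
A couple of diagnostic runs that might help narrow this down (a sketch; -ngl 0 / -ngl 10 and -n 16 are standard main.exe options, and the model path is the same one used above):

# CPU-only run, to check whether the failure is specific to the ROCm backend
.\main.exe -m H:\LLM\models\tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -ngl 0 -n 16 -p "Hello"
# Partial offload, to see whether the failure tracks the number of offloaded layers
.\main.exe -m H:\LLM\models\tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -ngl 10 -n 16 -p "Hello"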
