
Fix: Propagate flash attn to model loader#1424

Merged
abetlen merged 1 commit into abetlen:main from dthuerck:propagate-flash-attn
May 3, 2024

Conversation

@dthuerck
Contributor

@dthuerck dthuerck commented May 3, 2024

I noticed that even after setting flash_attn to true in my model config file, llama.cpp kept reporting llama_new_context_with_model: flash_attn = 0. This super-small PR fixes that: it turns out the setting wasn't passed on to the model loader.
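For illustration only (the function and parameter names below are hypothetical, not the actual llama-cpp-python internals), this is the class of bug the PR fixes: a setting that is parsed from the config but never forwarded into the loader's keyword arguments, so the backend silently falls back to its default of flash_attn = 0.

```python
# Hypothetical sketch of the bug class this PR fixes: a config value that is
# parsed but never forwarded to the model loader. All names are illustrative.

def build_loader_kwargs(settings: dict, propagate_flash_attn: bool) -> dict:
    """Translate a model config dict into loader keyword arguments."""
    kwargs = {
        "model_path": settings["model"],
        "n_ctx": settings.get("n_ctx", 2048),
    }
    if propagate_flash_attn:
        # The fix: forward flash_attn instead of silently dropping it,
        # so the backend no longer falls back to flash_attn = 0.
        kwargs["flash_attn"] = settings.get("flash_attn", False)
    return kwargs

settings = {"model": "model.gguf", "flash_attn": True}
print("flash_attn" in build_loader_kwargs(settings, propagate_flash_attn=False))  # False: setting dropped
print(build_loader_kwargs(settings, propagate_flash_attn=True)["flash_attn"])     # True: setting forwarded
```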

@abetlen
Owner

abetlen commented May 3, 2024

@dthuerck thank you!

@abetlen abetlen merged commit 2138561 into abetlen:main May 3, 2024
@BadisG

BadisG commented May 8, 2024

I installed the latest version of llama_cpp_python (0.2.70) with this command:

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

But when using it through oobabooga's software (the llama_cpp_hf loader), I still have this flash_attn = 0 issue:

llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 8000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1728.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   400.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   596.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    32.02 MiB
llama_new_context_with_model: graph nodes  = 1208
llama_new_context_with_model: graph splits = 3
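A quick way to confirm whether flash attention actually took effect is to check the flash_attn line that llama.cpp prints when creating the context, as in the log above. A minimal log-checking sketch (the helper name is made up for illustration):

```python
import re

def flash_attn_enabled(log: str) -> bool:
    """Return True if a llama.cpp context log reports flash_attn = 1."""
    m = re.search(r"flash_attn\s*=\s*(\d)", log)
    return m is not None and m.group(1) == "1"

# The log in this comment reports flash_attn = 0, i.e. disabled:
print(flash_attn_enabled("llama_new_context_with_model: flash_attn = 0"))  # False
```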
