Memory leak in Mac Metal ggml_metal_graph_compute #5436

Closed
irbull opened this issue Feb 10, 2024 · 2 comments

Comments

irbull (Contributor) commented Feb 10, 2024

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

There appears to be a small memory leak in ggml_metal_graph_compute. After running continuous inference a few hundred times, I notice the memory usage on my M1 growing steadily.

I've been tracking this for a while, and it appears to come from the decode function, specifically from ggml_metal_graph_compute. I've removed the entire contents of the dispatch_apply block and the memory still seems to leak. There appear to be a few "known issues" around the MTLCommandBuffer leaking memory [1, 2]:

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

There is a suggestion to wrap MTLCommandBuffer usage in an @autoreleasepool. After adding this, I can confirm that the memory usage of llama.cpp stays stable even after 1,000 inference requests.
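
For anyone who wants to see the pattern in isolation, below is a minimal standalone sketch (not the actual llama.cpp patch; the file name, loop count, and empty command buffer are illustrative). It shows why draining an @autoreleasepool on every iteration matters when a long-running C/C++ caller repeatedly obtains an MTLCommandBuffer; the fix referenced below applies the same idea by wrapping the body of ggml_metal_graph_compute in ggml-metal.m.

```objc
// leak_sketch.m -- illustrative only; build on macOS with:
//   clang -fobjc-arc -framework Foundation -framework Metal leak_sketch.m -o leak_sketch
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

int main(void) {
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLCommandQueue> queue = [device newCommandQueue];

    for (int i = 0; i < 1000; ++i) {
        // Without this per-iteration @autoreleasepool, autoreleased Metal objects
        // (such as the MTLCommandBuffer returned by [queue commandBuffer]) pile up
        // until the enclosing pool drains, which in a long-running C/C++ caller may
        // be never. Draining the pool each iteration keeps memory usage flat.
        @autoreleasepool {
            id<MTLCommandBuffer> cb = [queue commandBuffer];
            // ... encode compute work here, roughly what ggml_metal_graph_compute does ...
            [cb commit];
            [cb waitUntilCompleted];
        }
    }
    return 0;
}
```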

irbull added a commit to irbull/llama.cpp that referenced this issue Feb 10, 2024
There appears to be a known memory leak when using the
`MTLCommandBuffer`. It is suggested in [1, 2] to use an
`@autoreleasepool`.

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change-set wraps `ggml_metal_graph_compute` in an
`@autoreleasepool` block.

This commit addresses ggerganov#5436
ggerganov pushed a commit that referenced this issue Feb 10, 2024
github-actions bot pushed a commit to KerfuffleV2/ggml-sys-bleedingedge that referenced this issue Feb 10, 2024
== Relevant log messages from source repo:

commit f026f8120f97090d34a52b3dc023c82e0ede3f7d
Author: Ian Bull <irbull@eclipsesource.com>
Date:   Sat Feb 10 02:53:28 2024 -0800

    metal : use autoreleasepool to avoid memory leaks (#5437)

ggerganov pushed a commit to ggerganov/ggml that referenced this issue Feb 12, 2024
ggerganov pushed a commit to ggerganov/whisper.cpp that referenced this issue Feb 12, 2024
Simon-Count added a commit to codesphere-cloud/llama2-chat that referenced this issue Feb 22, 2024
* sync : ggml

* metal : use autoreleasepool to avoid memory leaks (#5437)

There appears to be a known memory leak when using the
`MLTCommandBuffer`. It is suggested to use `@autoreleasepool` in
[1,2]

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change-set wraps the `ggml_metal_graph_compute` in a
`@autoreleasepool`.

This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436

* server : add llama2 chat template (#5425)

* server: add mistral chat template

* server: fix typo

* server: rename template mistral to llama2

* server: format_llama2: remove BOS

* server: validate "--chat-template" argument

* server: clean up using_chatml variable

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* lookup: add print for drafting performance (#5450)

* ggml : add mmla kernels for quantized GEMM (#4966)

* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info

* ggml : fix compile warnings (unused vars) (#4966)

* common : fix compile warning

* main : ctrl+C print timing in non-interactive mode (#3873)

* server : allow to specify tokens as strings in logit_bias (#5003)

* server: allow to specify tokens as strings in logit_bias

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : use enums for sampler types (#5418)

* common: use enums for sampler types

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* minor : spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* vulkan: only use M-sized matmul on Apple GPUs (#5412)

* vulkan: refactor guess_matmul_pipeline for vendor

Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.

Signed-off-by: Sergio Lopez <slp@redhat.com>

* vulkan: only use M-sized matmul on Apple GPUs

L-sized and S-sized matmuls are broken on Apple GPUs, force using
M-size with this vendor.

Signed-off-by: Sergio Lopez <slp@redhat.com>

---------

Signed-off-by: Sergio Lopez <slp@redhat.com>

* flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
  → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)

* Add support for BERT embedding models (#5423)

* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* sync : ggml (#5452)

* ggml-alloc : v3 (ggml/727)

* ggml-alloc v3

ggml-ci

* fix ci

ggml-ci

* whisper : check for backend buffer allocation failures

* whisper : avoid leaks when initialization fails

* cleanup

ggml-ci

* style fixes

ggml-ci

* sync : ggml

* update llama.cpp, clip.cpp, export-lora.cpp

* update finetune.cpp, train-text-from-scratch.cpp

ggml-ci

* ggml-backend : reduce alignment to 32 to match gguf and fix mmap

---------

Co-authored-by: slaren <slarengh@gmail.com>

* llava : remove prog parameter from ArgumentParser (#5457)

* llava: remove prog parameter from ArgumentParser

This commit removes the `prog` parameter from `ArgumentParser`
so that it uses the default value which is the name of the script.

The motivation for this change is that currently the usage output looks
like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert_hf_to_gguf.py [-h] ...
```
And with this change it will look like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert-image-encoder-to-gguf.py [-h] ...
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* ci: add W503 to flake8 ignore list

This commit adds W503 to the ignore list for flake8. This is done to
avoid the following error:
W503 line break before binary operator

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* ggml-sycl: Replace 3d ops with macro  (#5458)

* use macro

* use macro

* fix format

* py : fix persimmon `n_rot` conversion (#5460)

* convert : fix persimmon offical weight conversion to write correct n_rot.

* Update convert-persimmon-to-gguf.py

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* swift : package no longer use ggml dependency (#5465)

* Revert "swift : update Package.swift to use ggml as dependency (#4691)"

This reverts commit ece9a45e8ffb73ad461c792720c2fec28b0137bc.

* spm : add ggml headers

* llama : fix quantization when tensors are missing (#5423)

* ggml-quants : fix compiler warnings (shadow variable) (#5472)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* tests : disable moe test (#5473)

* bert : add tests + fix quantization (#5475)

* llama : do not quantize pos embd and token type tensors

* ci : add BERT tests

ggml-ci

* ci : do not do BERT tests on low-perf nodes

ggml-ci

* make: add error message for bad CUDA version (#5444)

* make: add error message for bad CUDA version

* Update Makefile

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : support batched embeddings (#5466)

* batched embedding: pool outputs by sequence id. updated embedding example

* bring back non-causal attention

* embd : minor improvements

* llama : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* tests : multi-thread the tokenizer tests (#5474)

* tests : multi-thread the tokenizer tests

ggml-ci

* unicode : fix data race for unidentified codepoints

ggml-ci

* unicode : minor style fixes

ggml-ci

* finetune : rename feed-forward tensors (w1/w2/w3) (#4839)

* finetune: rename feed-forward tensors (w1/w2/w3)

This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.

The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be inline with the names
used in the llama_layer struct in llama.cpp.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* train-text-from-scratch: rename ff tensors

This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.

The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be inline with the names
used in the llama_layer struct in llama.cpp

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* llama : make load error reporting more granular (#5477)

Makes it easier to pinpoint where e.g. `unordered_map::at: key not found` comes from.

* llama : allow raw byte in SPM vocabs; don't crash on nl 404 (#5478)

* common : don't crash if newline token is not found

* common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs

* llama : add support for Nomic Embed (#5468)

* gguf : add python reader example (#5216)

* Update CMakeLists.txt

* Create reader.py

* Update reader.py

* Update reader.py

another whitespace :|

* Update reader.py

* lintlintlint

* Early return for zero size calls to get_tensor. (#5482)

* Early return for zero size calls to get_tensor.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add an early return to the get/set tensor when the size is null.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

* Early return after the assertions.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

* Since we do the early return in the generic backend now no reason to do so here as well.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

---------

Signed-off-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llava : support v1.6 (#5267)

* Create llava-survery-v2.py

* Update convert-image-encoder-to-gguf.py

* Update convert-image-encoder-to-gguf.py

* Rename llava-survery-v2.py to llava-surgery-v2.py

* Update convert-image-encoder-to-gguf.py

will now search for projector

* Update convert-image-encoder-to-gguf.py

whoops

* Update llava-surgery-v2.py

* Clip: Bugfix for normalization (it did not loat the 3 std and mean values)
Clip: bicubic resize function
Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images
Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6)
Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints
Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported
llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final
convert-image-encoder: fixed image-grid flattening

* whitespace corrections

* ws

* Tensors are now properly permuted.
Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference.

* ws

* added verbose_prompt support into cli
added stopwords for llava-1.6 into cli

* moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed

* ws

* convert : skip unknown tensors (need for LLaVA)

* llava : update readme

* llava : fix compile warnings

* llava : style

* convert : add --skip-unknown CLI arg

* server : remove clip structs

* bugfix for non llava-1.6

It should now work with llava-1.5 as well

* clip : minor code rearrange

* llava : update readme a bit

---------

Co-authored-by: John <cmt-nct@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* cmake : ARM intrinsics detection for MSVC (#5401)

* llava : update README.md (#5489)

* Update README.md

* Update README.md

* Update examples/llava/README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* readme : fix typo (#5490)

executabhle -> executable

* vulkan: Find optimal memory type but with fallback (#5381)

* @0cc4m feedback

* More feedback @0cc4m

* llaba : hotfix for llava-1.6 image number (#5495)

Co-authored-by: John <cmt-nct@users.noreply.github.com>

* llava : fix memory management bug (#5491)

* Fix memory management in llava and server code

Fixes this error:

llama_new_context_with_model: graph splits (measure): 3
Available slots:
 -> Slot 0 - max context: 6000
{"timestamp":1707926446,"level":"INFO","function":"main","line":2623,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 - loaded image
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0 - encoding image [id: 1]
munmap_chunk(): invalid pointer
Aborted

* Make it cleaner by checking size in batch free wrapper

* fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false (#5487)

* fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false

* fix(gguf-py): added missing cls and mask token ids to the gguf metadata

* scripts : add hf.sh helper script (#5501)

* scripts : add hf.sh helper scripts

* hf : add error logs

* hf : add support for --repo and --file

* cuda : print message when initialization fails (#5512)

* cuda : print message when initialization fails

* use CUDA_NAME both times

* clip : fix wrong loop condition

* Use correct type of pooling for embedding models (#5500)

Use correct type of pooling for embedding models

* ci : fix BERT model download and convert

* llava : fix clip-model-is-vision flag in README.md (#5509)

* llava: fix clip-model-is-vision flag in README.md

This commit fixes the flag `--clip_model_is_vision` in README.md which
is does not match the actual flag:
```console
$ python convert-image-encoder-to-gguf.py --help
...
  --clip-model-is-vision
                        The clip model is a pure vision model
                        (ShareGPT4V vision extract for example)
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* llava: update link to vit config in README.md

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* ggml : add numa options (#5377)

* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverted Makefile

* Fixed include

* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables

* removed trailing whitespace

* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverting Makefile

* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet

* Removing MIRROR_MODE code for this PR

* Removing last bit of MIRROR_MODE code for this PR

* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static

* Fixed lingering init_llama_backend() bool calls in tests and examples

* Remote enum llama_numa_strategies

* Revert bad merge with dynatemp flags

* add missing enum ggml_numa_strategies declaration and revert sync problem with master

* add missing enum ggml_numa_strategies declaration

* fixed ggml_init_numa variable

* Update ggml.h

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges

* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples

* Fix up some boolean vs enum comparisons

* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype

* Update ggml.h

Align enum values

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml.c

Remove whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml.c

align paremeters

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/server.cpp

remove whitespace and align brace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/common.cpp

Remove whitespace and align brace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example

* Update ggml.c

simplified return for platforms without NUMA support

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* removed redundant else from cli argument processing of --numa

* whitespace

---------

Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>

* server : fix system prompt cli (#5516)

* server : add "samplers" param to control the samplers order (#5494)

* llama : minor fixed return int value (#5529)

* llava : removed excess free(NULL) operation (#5531)

* scripts : add helpers script for bench comparing commits (#5521)

* scripts : add helpers script for bench comparing commits

* scripts : detect CUDA

* set flags after checking the command line

* fix make flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* cmake : fix VULKAN and ROCm builds (#5525)

* cmake : fix VULKAN and ROCm builds

* cmake : fix (cont)

* vulkan : fix compile warnings

ggml-ci

* cmake : fix

ggml-ci

* cmake : minor

ggml-ci

* gitignore : update for CLion IDE (#5544)

* ci : add an option to fail on compile warning (#3952)

* feat(ci): add an option to fail on compile warning

* Update CMakeLists.txt

* minor : fix compile warnings

ggml-ci

* ggml : fix unreachable code warnings

ggml-ci

* ci : disable fatal warnings for windows, ios and tvos

* ggml : fix strncpy warning

* ci : disable fatal warnings for MPI build

* ci : add fatal warnings to ggml-ci

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* ggml : add ALiBi support for ggml_soft_max_ext (#5488)

* ggml : avoid recomputing alibi slopes (CPU)

* llama : reuse hparams.f_max_alibi_bias in all cases

ggml-ci

* ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal)

ggml-ci

* ggml : handle all SRCs (do not break on first null)

ggml-ci

* tests : do not use slope for large soft_max

accumulates too much error

ggml-ci

* ggml : alternative ALiBi without extra tensor

We compute the slopes in the kernel

ggml-ci

* cuda : add ALiBi support in ggml_soft_max_ext

ggml-ci

* ggml : deprecate ggml_alibi

* ggml : support multi-sequence ALiBi (Metal)

ggml-ci

* cuda : add multi-seq ALiBi + remote F16 soft_max

ggml-ci

* ggml : update deprecation message

* ggml : fix pos ptr when no ALiBi

ggml-ci

* cuda : fix performance (pow -> powf)

* cuda : precompute ALiBi constants

* metal : pre-compute ALiBi slopes

ggml-ci

* llama : init kq_pos only if needed

ggml-ci

* test-backend-ops : add null pos test to soft_max

test-backend-ops : replace soft_max tests

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>

* flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
  → 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)

* 1.5 bit quantization (#5453)

* iq1_s: WIP basics

* iq1_s: CUDA is working

* iq1_s: scalar CPU dot product

* iq1_s: WIP AVX2 dot product - something is not right

* Fix tests

* Fix shadow warnings

* Fix after merge with latest master

* iq1_s: AVX2 finally works

* iq1_s: ARM_NEON dot product. Works, but not very fast

* iq1_s: better grid

* iq1_s: use IQ2_XXS for attn_output

At a cost of 0.04 extra bpw this gives a big improvement in PPL.

* iq1_s: Metal basics

Dequantize works, but not dot product

* iq1_s: Metal works, but quite slow

As usual, Apple Silicon does not like the code I write.

* iq1_s: Tests

* iq1_s: slightly faster dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* llava : update surgery script to not remove tensors (#5536)

This commit updates the surgery script to not remove the tensors from the
model file. For this to work the `--skip-unknown` flag is added as an
argument to the convert.py script in README.md.

The motivation for this change is that the surgery script currently
removes the projector tensors from the model file. If the model was
checked out from a repository, the model file will have been updated
and have to be checked out again to reset this effect. If this can be
avoided I think it would be preferable.

I did not perform this change for BakLLaVA models as I am not sure
how that part works.

* ggml, common, examples, tests : fixed type arguments in printf (#5528)

* common : fix ub (#5530)

* server : graceful server shutdown (#5244)

This updates the server queue to support graceful shutdown of the server on signals.

* server : --n-predict option document and cap to max value (#5549)

* server: document --n-predict

* server: ensure client request cannot override n_predict if set

* server: fix print usage LF in new --n-predict option

* server : enhanced health endpoint (#5548)

* server: enrich health endpoint with available slots, return 503 if not slots are available

* server: document new status no slot available in the README.md

* cmake : fix GGML_USE_SYCL typo (#5555)

* sampling : do not set min_keep to n_probs (#5564)

* server : slots monitoring endpoint (#5550)

* common, server : surface min_keep as its own parameter (#5567)

* Feature - surface min_keep as its own parameter

* Updated README with min_keep param

* metal : fix unused warnings (#0)

* ci : fix wikitext url + compile warnings (#5569)

ggml-ci

* ggml : restore vec dot stride arg names (#5453)

* build : pass all warning flags to nvcc via -Xcompiler (#5570)

* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler

* ggml : android and old glibc NUMA incompatibility bugfixes (#5557)

* #ifdef out some code NUMA blocks for Android due to lack of support

* added in some __ANDROID__ if def gates around numa code and forced GLIBC prior to 2.29 to use a syscall for getcpu instead of the wrapper

* Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc

* harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways

---------

Co-authored-by: root <root@nenya.lothlorien.ca>

* readme : update (#5572)

Added 1.5-bit on README.md

* cuda, metal : fix nans in soft_max (#5574)

* cuda : fix nans in soft_max

* metal : fix nans in soft_max

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : add llama_chat_apply_template() (#5538)

* llama: add llama_chat_apply_template

* test-chat-template: remove dedundant vector

* chat_template: do not use std::string for buffer

* add clarification for llama_chat_apply_template

* llama_chat_apply_template: add zephyr template

* llama_chat_apply_template: correct docs

* llama_chat_apply_template: use term "chat" everywhere

* llama_chat_apply_template: change variable name to "tmpl"

* baby-llama : allocate graphs in ggml_context (#5573)

* Fixed the baby-llama issue (see issue #4830)

* minor : fix whitespaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llava : avoid changing the original BakLLaVA model (#5577)

This is a follup of Commit fc0c8d286a533363a9a663510b62af85ffad58b3
("llava : update surgery script to not remove tensors") but this time
the change is to the BakLLaVA specific part of the surgery script.

I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* minor : fix trailing whitespace (#5538)

* cmake : remove obsolete sycl compile flags (#5581)

* rm unwanted sycl compile options

* fix bug

* fix bug

* format fix

* readme : fix typo in README-sycl.md (#5353)

* make : fix CUDA build (#5580)

* ci : enable -Werror for CUDA builds (#5579)

* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci

* metal : option to embed MSL source into compiled binary (whisper/1842)

* ggml : embed Metal library source (ggml-metal.metal) into binary

enable by setting WHISPER_EMBED_METAL_LIBRARY

* rename the build option

* rename the preprocessor directive

* generate Metal library embedding assembly on-fly during build process

* ggml-alloc : apply ggml/731

* sync : ggml

ggml-ci

* llava : replace ggml_cpy with ggml_cont

* llava : remove extra cont (#5587)

* examples : support minItems/maxItems in JSON grammar converter (#5039)

* support minLength and maxLength in JSON schema grammar converter

* Update examples/json-schema-to-grammar.py

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598)

* cuda : ignore peer access already enabled errors (#5597)

* cuda : ignore peer access already enabled errors

* fix hip

* Allow for Vulkan build with Accelerate.

Closes #5304

* Resolve ErrorIncompatibleDriver with Vulkan on MacOS.

Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476

* Add preprocessor checks for Apple devices.

Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files

* Add check for VK_KHR_portability_enumeration for MoltenVK support

* Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init()

* Enable Vulkan MacOS CI

* nix: now that we can do so, allow MacOS to build Vulkan binaries

Author:    Philip Taron <philip.taron@gmail.com>
Date:      Tue Feb 13 20:28:02 2024 +0000

* Update ggml_sycl_op_mul_mat_vec_q (#5502)

* Update ggml_sycl_op_mul_mat_vec_q

* Apply suggestions from code review

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* revert suggestion on macro

* fix bug

* Add quant type GGML_TYPE_IQ1_S to unsupported

* fix format

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* server : health endpoint configurable failure on no slot (#5594)

* metal : add build system support for embedded metal library (#5604)

* add build support for embedded metal library

* Update Makefile

---------

Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* readme : update UI list (#5605)

* Add maid to ui list

* Specify licence

* Server: use llama_chat_apply_template (#5593)

* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llava : add explicit instructions for llava-1.6 (#5611)

This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.

The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* make : fix debug build with CUDA (#5616)

* server : support llava 1.6 (#5553)

* server: init working 1.6

* move clip_image to header

* remove commented code

* remove c++ style from header

* remove todo

* expose llava_image_embed_make_with_clip_img

* fix zig build

* IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)

* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* iq4_nl: Fix after merging with master

* iq4_nl: another fix after merging with master

* Use IQ4_NL instead of Q4_K when using k-quants is not possible

* Fix typo that makes several tests fail

* It was the ggml_vdotq call that was missed inside the brackets

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* [SYCL] context add name (#5624)

* [SYCL] context add name

* name should start with SYCL*

* llama : add `gemma` model (#5631)

There are a couple of things to note in this architecture:

1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.

More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
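
As a rough illustration of the first point (shared input and output embedding parameters), and not gemma's or llama.cpp's actual code, the sketch below reuses a single `[n_vocab x n_embd]` matrix both to look up input token vectors and to project the final hidden state back onto the vocabulary:

```c
#include <stddef.h>

// Input side: copy the embedding row for `token` out of the shared matrix.
static void embed_lookup(const float * tok_embd, size_t n_embd,
                         int token, float * dst) {
    for (size_t i = 0; i < n_embd; ++i) {
        dst[i] = tok_embd[(size_t) token * n_embd + i];
    }
}

// Output side: logits are dot products of the hidden state against the
// *same* rows, so no separate output-projection matrix is stored.
static void output_logits(const float * tok_embd, size_t n_vocab, size_t n_embd,
                          const float * hidden, float * logits) {
    for (size_t v = 0; v < n_vocab; ++v) {
        float acc = 0.0f;
        for (size_t i = 0; i < n_embd; ++i) {
            acc += tok_embd[v * n_embd + i] * hidden[i];
        }
        logits[v] = acc;
    }
}
```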

* llava : add --skip-unknown to 1.6 convert.py (#5632)

This commit adds the `--skip-unknown` option to the convert.py script
and removes the saving of the updated checkpoints to avoid updating
possibly checked-out files.

The motivation for this change is that this was done for 1.5
in Commit fc0c8d286a533363a9a663510b62af85ffad58b3 ("llava :
update surgery script to not remove tensors") and makes the examples
more consistent.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* readme : update hot topics

* sync : ggml (#5633)

* ggml : fix conv_2d batch mode (ggml/737)

Co-authored-by: bssrdf <bssrdf@gmail.com>

* ggml : compute forward no longer pass src tensors (ggml/729)

* sync : ggml

ggml-ci

---------

Co-authored-by: bssrdf <merlintiger@hotmail.com>
Co-authored-by: bssrdf <bssrdf@gmail.com>

* readme : add LocalAI to the available UIs (#5629)

* server: health: fix race condition on slots data using tasks queue (#5634)

* server: health: fix race condition on slots data using tasks queue

* server: health:
    * include_slots only if slots_endpoint
    * fix compile warning task.target_id not initialized.

* sync : ggml

* examples : do not assume BOS when shifting context (#5622)

* gemma : allow offloading the output tensor (#5646)

* llama : fix session save/load with quantized KV (#5649)

* Add docs for llama_chat_apply_template (#5645)

* add docs for llama_chat_apply_template

* fix typo
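
For context, a minimal usage sketch of `llama_chat_apply_template` from C follows. The prototype shown matches `llama.h` around this point in the history as far as I can tell, but treat the exact signature, struct layout, and buffer-size behaviour as assumptions and check them against the header you actually build with:

```c
#include <stdbool.h>
#include <stdio.h>
#include "llama.h"  // assumed to declare llama_chat_apply_template and llama_chat_message

// Sketch only: format a short conversation using the model's built-in chat template.
static int format_chat_example(const struct llama_model * model) {
    struct llama_chat_message chat[] = {
        { .role = "system", .content = "You are a helpful assistant." },
        { .role = "user",   .content = "Hello!" },
    };
    const size_t n_msg = sizeof(chat) / sizeof(chat[0]);

    char buf[4096];
    // tmpl == NULL: use the chat template stored in the model's GGUF metadata.
    // add_ass == true: append the prefix that starts the assistant's reply.
    const int32_t n = llama_chat_apply_template(model, NULL, chat, n_msg, true, buf, sizeof(buf));
    if (n < 0 || (size_t) n >= sizeof(buf)) {
        return -1; // unsupported template or buffer too small
    }
    printf("%.*s\n", (int) n, buf);
    return 0;
}
```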

* llama : fix loading models with shared tok_embd and output (#5651)

ggml-ci

* mpt : add optional bias tensors (#5638)

Update MPT with optional bias parameters so that it works with PhoGPT and SEA-LION models that were pre-trained with 'bias'.

* server : clarify some params in the docs (#5640)

* server : fallback to chatml, add AlphaMonarch chat template (#5628)

* server: fallback to chatml

* add new chat template

* server: add AlphaMonarch to test chat template

* server: only check model template if there is no custom tmpl

* remove TODO

* readme : update hot topics

* minor : fix trailing whitespace (#5638)

* workflows: nix: hardcode cachix ids, build unconditionally (#5663)

The fact that GitHub does not expose environment and repository variables to PRs coming from forks means that we have effectively been disabling the Nix CI actions for most PRs.

The `if:` condition also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing a cache for untrusted code.

* Add Gemma chat template (#5665)

* add gemma chat template

* gemma: only apply system_prompt on non-model message

* py : minor fixes (#5668)

* Add codesphere ci pipeline

* Update readme

* Fix CI pipeline

* CI stages use cuBLAS if installed

---------

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Signed-off-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Paul Tsochantaris <ptsochantaris@icloud.com>
Co-authored-by: Eve <139727413+netrunnereve@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Sang-Kil Park <sang.park@42dot.ai>
Co-authored-by: Wu Jian Ping <wujjpp@hotmail.com>
Co-authored-by: divinity76 <divinity76@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: niansa <anton-sa@web.de>
Co-authored-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Aaron Miller <apage43@ninjawhale.com>
Co-authored-by: ToKiNoBug <tokinobug@163.com>
Co-authored-by: Wu Jian Ping <wujp@greatld.com>
Co-authored-by: Romain Neutron <romain@neutron.io>
Co-authored-by: Vladimir Malyutin <first-leon@yandex.ru>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Peter Reid <peter@peterreid.net>
Co-authored-by: Jack Mousseau <jmousseau@users.noreply.github.com>
Co-authored-by: John Balis <phobossystems@gmail.com>
Co-authored-by: JohnnyB <jboero@users.noreply.github.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Yiming Cui <conandiy@vip.qq.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: JidongZhang-THU <1119708529@qq.com>
Co-authored-by: Guoteng <32697156+SolenoidWGT@users.noreply.github.com>
Co-authored-by: Ali Nehzat <ali.nehzat@thanks.dev>
Co-authored-by: Ian Bull <irbull@gmail.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
Co-authored-by: Mirror Azure <54669636+MirrorAzure@users.noreply.github.com>
Co-authored-by: kalomaze <66376113+kalomaze@users.noreply.github.com>
Co-authored-by: BADR <contact@pythops.com>
Co-authored-by: Michael Klimenko <mklimenko29@gmail.com>
Co-authored-by: Martin Schwaighofer <mschwaig@users.noreply.github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Welby Seely <welbyseely@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: chiranko <96988916+chiranko@users.noreply.github.com>
Co-authored-by: Нияз Гарифзянов <112617865+garrnizon@users.noreply.github.com>
Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: Alexey Parfenov <zxed@alkatrazstudio.net>
Co-authored-by: Dr. Tom Murphy VII Ph.D <499244+tom7@users.noreply.github.com>
Co-authored-by: Niall Coates <1349685+Niall-@users.noreply.github.com>
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
Co-authored-by: Justin Parker <jparkerweb@gmail.com>
Co-authored-by: BarfingLemurs <128182951+BarfingLemurs@users.noreply.github.com>
Co-authored-by: runfuture <runfuture@users.noreply.github.com>
Co-authored-by: Ben Williams <ben@719ben.com>
Co-authored-by: Xiao-Yong Jin <jinxiaoyong@gmail.com>
Co-authored-by: Kamil Tomšík <info@tomsik.cz>
Co-authored-by: Ebey Abraham <ebey97@gmail.com>
Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Marko Tasic <mtasic85@gmail.com>
Co-authored-by: Riley Stewart <ristew@users.noreply.github.com>
Co-authored-by: Neuman Vong <neuman.vong@gmail.com>
Co-authored-by: Ian Bull <irbull@eclipsesource.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: snadampal <87143774+snadampal@users.noreply.github.com>
Co-authored-by: Sergio López <slp@sinrega.org>
Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: Lee <44310445+lx200916@users.noreply.github.com>
Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: John <78893154+cmp-nct@users.noreply.github.com>
Co-authored-by: AT <manyoso@users.noreply.github.com>
Co-authored-by: John <cmt-nct@users.noreply.github.com>
Co-authored-by: Rune <43761327+Rune-AI@users.noreply.github.com>
Co-authored-by: Elbios <141279586+Elbios@users.noreply.github.com>
Co-authored-by: Michaël de Vries <vriesdemichael@gmail.com>
Co-authored-by: bmwl <brian.marshall@tolko.com>
Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Rőczey Barnabás <31726601+An0nie@users.noreply.github.com>
Co-authored-by: Herman Semenov <GermanAizek@yandex.ru>
Co-authored-by: clibdev <52199778+clibdev@users.noreply.github.com>
Co-authored-by: Ananta Bastola <anantarajbastola@gmail.com>
Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>
Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
Co-authored-by: Robey Holderith <robey@flaminglunchbox.net>
Co-authored-by: Mirko185 <mirkosig@gmail.com>
Co-authored-by: NawafAlansari <72708095+NawafAlansari@users.noreply.github.com>
Co-authored-by: valiray <133289098+valiray@users.noreply.github.com>
Co-authored-by: Didzis Gosko <didzis@users.noreply.github.com>
Co-authored-by: nopperl <54780682+nopperl@users.noreply.github.com>
Co-authored-by: Mathijs de Bruin <mathijs@mathijsfietst.nl>
Co-authored-by: Haoxiang Fei <tonyfettes@tonyfettes.com>
Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Dane Madsen <dane_madsen@hotmail.com>
Co-authored-by: CJ Pais <cj@cjpais.com>
Co-authored-by: postmasters <namnguyen@google.com>
Co-authored-by: bssrdf <merlintiger@hotmail.com>
Co-authored-by: bssrdf <bssrdf@gmail.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Dat Quoc Nguyen <2412555+datquocnguyen@users.noreply.github.com>
Co-authored-by: Someone <sergei.kozlukov@aalto.fi>
Co-authored-by: Katha <katha@codesphere.com>
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this issue Mar 13, 2024
There appears to be a known memory leak when using the
`MTLCommandBuffer`. It is suggested in [1,2] to use `@autoreleasepool`.

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change-set wraps `ggml_metal_graph_compute` in an
`@autoreleasepool`.

This commit addresses ggerganov#5436
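
As additional context for this fix, the sketch below shows the general pattern in standalone form. It is an illustration under assumptions, not the actual `ggml-metal.m` code: the per-iteration work that creates an `MTLCommandBuffer` is wrapped in an `@autoreleasepool`, so autoreleased Metal objects are drained after every pass instead of accumulating for the lifetime of the process.

```objc
// Minimal standalone sketch of the @autoreleasepool pattern (an assumption,
// not the actual ggml-metal.m code).
//
// Build on macOS: clang -fobjc-arc demo.m -framework Metal -framework Foundation
#import <Metal/Metal.h>

int main(void) {
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    if (device == nil) {
        return 1; // no Metal device available
    }
    id<MTLCommandQueue> queue = [device newCommandQueue];

    for (int i = 0; i < 1000; ++i) {
        @autoreleasepool {
            // The command buffer (and anything else autoreleased in this
            // scope) is released when the pool drains at the end of the pass.
            id<MTLCommandBuffer> cmd_buf = [queue commandBuffer];
            [cmd_buf commit];
            [cmd_buf waitUntilCompleted];
        }
    }
    return 0;
}
```

Without the per-iteration pool, those autoreleased objects are only drained by an enclosing pool, if one exists at all, which is consistent with the slow memory growth observed after many inference requests.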

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024
jiahansu pushed a commit to WiseSync/whisper.cpp that referenced this issue Apr 17, 2024
viktor-silakov pushed a commit to viktor-silakov/whisper_node_mic.cpp that referenced this issue May 11, 2024