
IMPORTANT: Introduce C-style API - Major Refactoring #370

Merged (8 commits, Mar 22, 2023)

Conversation

@ggerganov (Owner) commented Mar 21, 2023

This is a pretty big change, but it has to happen in order to allow building more examples by reusing the code properly. It will also allow other programming languages to interface with the code.

  • Moved ggml_quantize_q4_0() and ggml_quantize_q4_1() from utils to ggml
  • Added llama.h and llama.cpp that implement most of the model handling that was previously in main.cpp
  • Moved the tokenizer code into the new llama library
  • Model quantization is now also inside the new llama library, exposed through the API
  • Almost all of the changes are simply moving code around, no new stuff

Please test that everything is working. I haven't tested the perplexity parameter at all.
Sorry if this conflicts with your ongoing work!
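
A rough usage sketch of the new interface, based on the declarations in this PR's llama.h (llama_context_default_params, llama_init_from_file, llama_tokenize, llama_eval, llama_token_to_str, llama_free); this is illustration only, and the exact signatures may still change:

// sketch only - not part of the PR; follows the llama.h declarations as of this change
#include "llama.h"

#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.bin> \"<prompt>\"\n", argv[0]);
        return 1;
    }

    // create a context from a model file, using the default parameters
    auto lparams = llama_context_default_params();
    llama_context * ctx = llama_init_from_file(argv[1], lparams);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to load model '%s'\n", argv[1]);
        return 1;
    }

    // tokenize the prompt; add_bos = true prepends the beginning-of-sequence token
    std::vector<llama_token> tokens(lparams.n_ctx);
    const int n = llama_tokenize(ctx, argv[2], tokens.data(), (int) tokens.size(), true);

    // evaluate the prompt tokens in a single batch, starting at position 0
    if (n > 0 && llama_eval(ctx, tokens.data(), n, 0, /*n_threads =*/ 4) != 0) {
        fprintf(stderr, "failed to eval\n");
    }

    // print the tokenized prompt back as text
    for (int i = 0; i < n; i++) {
        printf("%s", llama_token_to_str(ctx, tokens[i]));
    }
    printf("\n");

    llama_free(ctx);
    return 0;
}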

@Green-Sky (Collaborator)

Our lord and saviour @ggerganov has answered our prayers! 😄

@anzz1 (Contributor) commented Mar 21, 2023

!!!! This is exactly what we were looking for.

While I was talking the talk in various PRs and discussions, you were already walking the walk. Kudos!

@Green-Sky Green-Sky added the high priority Very important issue label Mar 21, 2023
@gjmulder gjmulder added the enhancement New feature or request label Mar 21, 2023
// TODO: show sample usage
//

struct llama_context;

It would be really cool if the memory in the llama context and the actual model parameters were kept in separate structs, so you could initialize a model from a file, then create multiple contexts from it and run inference on them in parallel. This would be useful when you want multiple prompts going at the same time: they can all share the model weights while keeping their own k/v memory.
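
For concreteness, a purely hypothetical sketch of that split - none of these names exist in this PR, they only illustrate shared weights with per-context k/v memory:

// hypothetical API shape, illustration only - these are not proposed names
struct llama_model;    // weights + vocab, loaded once from a file, read-only
struct llama_context;  // per-sequence state: k/v memory, logits, timings

// llama_model   * model = llama_model_load("models/7B/ggml-model-q4_0.bin");
// llama_context * ctx_a = llama_context_create(model);
// llama_context * ctx_b = llama_context_create(model);
// ... evaluate two different prompts on ctx_a and ctx_b in parallel threads,
// ... both reading the same immutable weights through `model`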

Contributor

I think it would also be useful if the thread safety of each function was mentioned. For example, can I call llama_eval() and llama_tokenize() concurrently?

Owner Author

Yes, this will be added via llama_state similar to the way it is implemented in whisper.cpp.
I didn't include it in the first pass of the implementation in order to keep the change smaller. I will probably add it on Thursday with a second pass. It will not change the existing API - it will simply extend it with new per-state calls.


for (int i = 0; i < nb; i++) {
float min = FLT_MAX;
float max = -FLT_MAX;
Collaborator

This went from

float max = std::numeric_limits<float>::min();

to

float max = -FLT_MAX;

Can't really argue against it, but mentioning it because it's easy to lose these details...

Contributor

Exactly as it should be 👍

@niansa (Contributor) Mar 21, 2023

Is FLT_MAX standard C++? Edit: yes, it is standard, defined in <cfloat> (<float.h> in C).
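
A tiny standalone check (not from the PR) of why -FLT_MAX is the right initializer for a running maximum: std::numeric_limits<float>::min() is the smallest positive normalized float, not the most negative representable value.

#include <cfloat>   // FLT_MAX (also available via <float.h> in C)
#include <cstdio>
#include <limits>

int main() {
    std::printf("min()    = %g\n", std::numeric_limits<float>::min());    // ~1.17549e-38 (tiny, but positive)
    std::printf("lowest() = %g\n", std::numeric_limits<float>::lowest()); // ~-3.40282e+38
    std::printf("-FLT_MAX = %g\n", -FLT_MAX);                             // same value as lowest()
    return 0;
}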

Comment on lines +115 to +116
LLAMA_API llama_token llama_token_bos();
LLAMA_API llama_token llama_token_eos();
Collaborator

Suggestion: expand the BOS and EOS acronyms here

Contributor

The BOS/EOS abbreviations ("beginning of sequence" / "end of sequence") are used by SentencePiece itself (and in other tokenizers). I think the acronyms are probably fine in context here.

auto lparams = llama_context_default_params();

lparams.f16_kv = params.memory_f16;
lparams.logits_all = params.perplexity;
Collaborator

missing parameters:

        lparams.n_parts = params.n_parts;
        lparams.n_ctx = params.n_ctx;

Owner Author

They are initialized with defaults from the llama_context_default_params() call above

Collaborator

Not copying these values from params breaks their respective command line switches (--n_parts and -c).

Owner Author

Oops - missed that!
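
For reference, a consolidated sketch of the corrected block, limited to the fields that appear in this thread (everything else keeps its default from llama_context_default_params()):

auto lparams = llama_context_default_params();

lparams.n_ctx      = params.n_ctx;      // was missing: -c
lparams.n_parts    = params.n_parts;    // was missing: --n_parts
lparams.f16_kv     = params.memory_f16;
lparams.logits_all = params.perplexity;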

@Green-Sky (Collaborator)

alpaca 30B q4_0 opinion:

> this is the new c-api. tell me how you like it
I like this C API. It's easy to use and intuitive.

@Green-Sky (Collaborator) left a comment

trying to run the --perplexity option:

llama_tokenize: too many tokens
perplexity : calculating perplexity over 0 chunks
$ wc -m wikitext-2-raw/wiki.test.raw
1288556 wikitext-2-raw/wiki.test.raw

main.cpp Outdated

return s.c_str();
}

int main(int argc, char ** argv) {
ggml_time_init();
const int64_t t_main_start_us = ggml_time_us();
@Green-Sky (Collaborator) Mar 21, 2023

The t_main_start_us variable is now unused, and I think it needs to be replaced with a call to llama_reset_timings(ctx).

Collaborator

Actually, ggml_time_init() too. That one moved into llama_init_from_file(), so llama_reset_timings(ctx) should likely go there as well.

Collaborator

t_sample_us and t_predict_us in lines 229-230 are also unused now.

Owner Author

Should be resolved now

int n_tokens,
int n_past,
int n_threads) {
if (!llama_eval_internal(*ctx, tokens, n_tokens, n_past, n_threads)) {
@slaren (Collaborator) Mar 22, 2023

Evaluation timings are broken. ctx->n_eval should be increased here and this function call should be timed and added to ctx->t_eval_us.
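
A sketch of the suggested fix inside llama_eval(), assuming the llama_context members named above (t_eval_us, n_eval) and ggml_time_us() from ggml:

int llama_eval(
        struct llama_context * ctx,
           const llama_token * tokens,
                         int   n_tokens,
                         int   n_past,
                         int   n_threads) {
    const int64_t t_start_us = ggml_time_us();

    if (!llama_eval_internal(*ctx, tokens, n_tokens, n_past, n_threads)) {
        fprintf(stderr, "%s: failed to eval\n", __func__);
        return 1;
    }

    // record how long this call took and how many tokens it covered
    ctx->t_eval_us += ggml_time_us() - t_start_us;
    ctx->n_eval    += n_tokens;

    return 0;
}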

temp,
repeat_penalty);

ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
Collaborator

ctx->n_sample should also be increased here.

llama.cpp Outdated

if (n_max_tokens < (int) res.size()) {
fprintf(stderr, "%s: too many tokens\n", __func__);
return 1;
Collaborator

Returning 1 here is ambiguous: it is impossible to tell whether there was an error or whether only 1 token was returned. Consider returning a negative number to indicate failure (maybe the number of tokens, negated?).

Collaborator

According to the documentation in llama.h it should return -1 on failure. However, it may be necessary to rethink this or add another API call to support cases like the perplexity computation, where a large document has to be tokenized.

Owner Author

Good catch. Let's fix the perplexity computation on master to avoid blocking development for too long.
With the new API change that you suggested, we can now query for the number of needed tokens and allocate the necessary memory. Some further improvements might also be possible.
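
A sketch of the tokenization pattern this would enable, assuming llama_tokenize() adopts the proposed convention of returning the negated number of required tokens when n_max_tokens is too small (a proposal in this thread, not current behavior):

#include "llama.h"

#include <string>
#include <vector>

// tokenize a document of arbitrary size using the proposed negative-return convention
static std::vector<llama_token> tokenize_all(llama_context * ctx, const std::string & text, bool add_bos) {
    std::vector<llama_token> tokens(1024);

    int n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    if (n < 0) {
        // buffer too small: -n is the number of tokens actually needed
        tokens.resize(-n);
        n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    }

    tokens.resize(n > 0 ? n : 0);
    return tokens;
}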

@ggerganov (Owner, Author)

Thank you all for the detailed look!
I will merge this now - feel free to resolve any remaining issues and improve the documentation.
llama_state and support for parallel generations will likely be added on Thursday. The existing API should not change - it will only be extended.

TODO: The perplexity computation needs a small fix - see comment above by @slaren
