
IMPORTANT: Introduce C-style API - Major Refactoring #370

Merged (8 commits, Mar 22, 2023)

Conversation

@ggerganov (Owner) commented Mar 21, 2023

This is a pretty big change, but it has to happen in order to allow building more examples by reusing the code properly. It will also allow other programming languages to interface with the code.

  • Moved ggml_quantize_q4_0() and ggml_quantize_q4_1() from utils to ggml
  • Added llama.h and llama.cpp that implement most of the model handling that was previously in main.cpp
  • Moved the tokenizer code into the new llama library
  • Model quantization is now also inside the new llama library, exposed through the API
  • Almost all of the changes are simply moving code around, no new stuff

Please test that everything is working. I haven't tested the perplexity parameter at all.
Sorry if this conflicts with your ongoing work!
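
A rough usage sketch of the new interface, based on the declarations in this PR's llama.h (llama_context_default_params, llama_init_from_file, llama_tokenize, llama_eval, llama_token_to_str, llama_free); this is illustration only, and the exact signatures may still change:

// sketch only - not part of the PR; follows the llama.h declarations as of this change
#include "llama.h"

#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.bin> \"<prompt>\"\n", argv[0]);
        return 1;
    }

    // create a context from a model file, using the default parameters
    auto lparams = llama_context_default_params();
    llama_context * ctx = llama_init_from_file(argv[1], lparams);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to load model '%s'\n", argv[1]);
        return 1;
    }

    // tokenize the prompt; add_bos = true prepends the beginning-of-sequence token
    std::vector<llama_token> tokens(lparams.n_ctx);
    const int n = llama_tokenize(ctx, argv[2], tokens.data(), (int) tokens.size(), true);

    // evaluate the prompt tokens in a single batch, starting at position 0
    if (n > 0 && llama_eval(ctx, tokens.data(), n, 0, /*n_threads =*/ 4) != 0) {
        fprintf(stderr, "failed to eval\n");
    }

    // print the tokenized prompt back as text
    for (int i = 0; i < n; i++) {
        printf("%s", llama_token_to_str(ctx, tokens[i]));
    }
    printf("\n");

    llama_free(ctx);
    return 0;
}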

@Green-Sky (Collaborator)

Our lord and saviour @ggerganov has answered our prayers! 😄

@anzz1 (Contributor) commented Mar 21, 2023

!!!! This is exactly what we were looking for.

While I was talking the talk in various PRs and discussions, you were already walking the walk. Kudos!

@Green-Sky Green-Sky added the high priority Very important issue label Mar 21, 2023
@gjmulder gjmulder added the enhancement New feature or request label Mar 21, 2023
// TODO: show sample usage
//

struct llama_context;

It would be really cool if the memory in the llama context and the actual model parameters were kept in separate structs, so you could initialize a model from a file, then create multiple contexts from it and run inference on them in parallel. This would be useful when you want multiple prompts going at the same time: they can all share the model weights while keeping their own k/v memory.
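
For concreteness, a purely hypothetical sketch of that split - none of these names exist in this PR, they only illustrate shared weights with per-context k/v memory:

// hypothetical API shape, illustration only - these are not proposed names
struct llama_model;    // weights + vocab, loaded once from a file, read-only
struct llama_context;  // per-sequence state: k/v memory, logits, timings

// llama_model   * model = llama_model_load("models/7B/ggml-model-q4_0.bin");
// llama_context * ctx_a = llama_context_create(model);
// llama_context * ctx_b = llama_context_create(model);
// ... evaluate two different prompts on ctx_a and ctx_b in parallel threads,
// ... both reading the same immutable weights through `model`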

Contributor

I think it would also be useful if the thread safety of each function was mentioned. For example, can I call llama_eval() and llama_tokenize() concurrently?

Owner Author

Yes, this will be added via llama_state similar to the way it is implemented in whisper.cpp.
I didn't include it in the first pass of the implementation in order to keep the change smaller. I will probably add it on Thursday with a second pass. It will not change the existing API - it will simply extend it with new per-state calls.


for (int i = 0; i < nb; i++) {
float min = FLT_MAX;
float max = -FLT_MAX;
Collaborator

This went from

float max = std::numeric_limits<float>::min();

to

float max = -FLT_MAX;

Can't really argue against it, but mentioning it because it's easy to lose these details...

Contributor

Exactly as it should be 👍

@niansa (Contributor) Mar 21, 2023

Is FLT_MAX standard C++? Edit: yes, it is standard, defined in <cfloat> (<float.h> in C).
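
A tiny standalone check (not from the PR) of why -FLT_MAX is the right initializer for a running maximum: std::numeric_limits<float>::min() is the smallest positive normalized float, not the most negative representable value.

#include <cfloat>   // FLT_MAX (also available via <float.h> in C)
#include <cstdio>
#include <limits>

int main() {
    std::printf("min()    = %g\n", std::numeric_limits<float>::min());    // ~1.17549e-38 (tiny, but positive)
    std::printf("lowest() = %g\n", std::numeric_limits<float>::lowest()); // ~-3.40282e+38
    std::printf("-FLT_MAX = %g\n", -FLT_MAX);                             // same value as lowest()
    return 0;
}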

Comment on lines +115 to +116
LLAMA_API llama_token llama_token_bos();
LLAMA_API llama_token llama_token_eos();
Collaborator

Suggestion: expand the BOS and EOS acronyms here

Contributor

The BOS/EOS abbreviations ("beginning of sequence" / "end of sequence") are used by SentencePiece itself (and in other tokenizers). I think the acronyms are probably fine in context here.

auto lparams = llama_context_default_params();

lparams.f16_kv = params.memory_f16;
lparams.logits_all = params.perplexity;
Collaborator

missing parameters:

        lparams.n_parts = params.n_parts;
        lparams.n_ctx = params.n_ctx;

Owner Author

They are initialized with defaults from the llama_context_default_params() call above

Collaborator

Not copying these values from params breaks their respective command line switches (--n_parts and -c).

Owner Author

Oops - missed that!
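
For reference, a consolidated sketch of the corrected block, limited to the fields that appear in this thread (everything else keeps its default from llama_context_default_params()):

auto lparams = llama_context_default_params();

lparams.n_ctx      = params.n_ctx;      // was missing: -c
lparams.n_parts    = params.n_parts;    // was missing: --n_parts
lparams.f16_kv     = params.memory_f16;
lparams.logits_all = params.perplexity;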

@Green-Sky (Collaborator)

alpaca 30B q4_0 opinion:

> this is the new c-api. tell me how you like it
I like this C API. It's easy to use and intuitive.

@Green-Sky (Collaborator) left a comment

trying to run the --perplexity option:

llama_tokenize: too many tokens
perplexity : calculating perplexity over 0 chunks
$ wc -m wikitext-2-raw/wiki.test.raw
1288556 wikitext-2-raw/wiki.test.raw

main.cpp Outdated

return s.c_str();
}

int main(int argc, char ** argv) {
ggml_time_init();
const int64_t t_main_start_us = ggml_time_us();
@Green-Sky (Collaborator) Mar 21, 2023

The t_main_start_us variable is now unused, and I think it needs to be replaced with a call to llama_reset_timings(ctx).

Collaborator

Actually, ggml_time_init() too. That one moved into llama_init_from_file(), so llama_reset_timings(ctx) should likely go there as well.

Collaborator

t_sample_us and t_predict_us in lines 229-230 are also unused now.

Owner Author

Should be resolved now

int n_tokens,
int n_past,
int n_threads) {
if (!llama_eval_internal(*ctx, tokens, n_tokens, n_past, n_threads)) {
@slaren (Collaborator) Mar 22, 2023

Evaluation timings are broken. ctx->n_eval should be increased here and this function call should be timed and added to ctx->t_eval_us.
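
A sketch of the suggested fix inside llama_eval(), assuming the llama_context members named above (t_eval_us, n_eval) and ggml_time_us() from ggml:

int llama_eval(
        struct llama_context * ctx,
           const llama_token * tokens,
                         int   n_tokens,
                         int   n_past,
                         int   n_threads) {
    const int64_t t_start_us = ggml_time_us();

    if (!llama_eval_internal(*ctx, tokens, n_tokens, n_past, n_threads)) {
        fprintf(stderr, "%s: failed to eval\n", __func__);
        return 1;
    }

    // record how long this call took and how many tokens it covered
    ctx->t_eval_us += ggml_time_us() - t_start_us;
    ctx->n_eval    += n_tokens;

    return 0;
}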

temp,
repeat_penalty);

ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
Collaborator

ctx->n_sample should also be increased here.

llama.cpp Outdated

if (n_max_tokens < (int) res.size()) {
fprintf(stderr, "%s: too many tokens\n", __func__);
return 1;
Collaborator

Returning 1 here is ambiguous: it is impossible to tell whether there was an error or whether only 1 token was returned. Consider returning a negative number to indicate failure (maybe the number of tokens, negated?).

Collaborator

According to the documentation in llama.h it should return -1 on failure. However, it may be necessary to rethink this or add another API call to support cases like the perplexity computation, where a large document has to be tokenized.

Owner Author

Good catch. Let's fix the perplexity computation on master to avoid blocking development for too long.
With the new API change that you suggested, we can now query for the number of needed tokens and allocate the necessary memory. Some further improvements might also be possible.
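
A sketch of the tokenization pattern this would enable, assuming llama_tokenize() adopts the proposed convention of returning the negated number of required tokens when n_max_tokens is too small (a proposal in this thread, not current behavior):

#include "llama.h"

#include <string>
#include <vector>

// tokenize a document of arbitrary size using the proposed negative-return convention
static std::vector<llama_token> tokenize_all(llama_context * ctx, const std::string & text, bool add_bos) {
    std::vector<llama_token> tokens(1024);

    int n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    if (n < 0) {
        // buffer too small: -n is the number of tokens actually needed
        tokens.resize(-n);
        n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    }

    tokens.resize(n > 0 ? n : 0);
    return tokens;
}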

@ggerganov (Owner, Author)

Thank you all for the detailed look!
I will merge this now - feel free to resolve any remaining issues and improve the documentation.
llama_state and support for parallel generations will likely be added on Thursday. The existing API should not change - it will only be extended.

TODO: The perplexity computation needs a small fix - see comment above by @slaren
