Prompt eval time is counted twice #790

Closed
sw opened this issue Apr 5, 2023 · 4 comments
Labels: bug, stale

Comments

@sw
Collaborator

sw commented Apr 5, 2023

Creating a new issue so this doesn't get forgotten:

@KASR posted a CSV of processing times in #603 (comment)

But the times don't add up: If you take the total time, and subtract the partial times that are supposed to add up to it, the result is all over the place:

[image: table of the differences between the total time and the sum of the partial times]

The clue lies in the comment by @ggerganov:

@sw
After the mmap changes, the load time is incorrect:

llama.cpp/llama.cpp

Lines 1681 to 1685 in 6e7801d

// get a more accurate load time, upon first eval
if (!ctx->has_evaluated_once) {
ctx->t_load_us = ggml_time_us() - ctx->t_start_us;
ctx->has_evaluated_once = true;
}

Currently, the reported load time includes not only the page faults, but also the prompt eval time. So effectively, you get a negative number, since the prompt eval time has been counted twice.
We have to fix this.

Originally posted by @ggerganov in #603 (comment)
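
To make the double counting concrete, here is a minimal worked example with made-up numbers (not taken from the CSV above), sketching why the residual in the table comes out as roughly minus the prompt eval time:

// Illustrative numbers only, not measurements.
constexpr double true_load_ms   =  500.0;  // page faults / actual model loading
constexpr double prompt_eval_ms = 2000.0;  // first eval over the prompt
constexpr double eval_ms        = 3000.0;  // subsequent token evals
constexpr double sample_ms      =   50.0;

// What actually elapsed:
constexpr double total_ms = true_load_ms + prompt_eval_ms + eval_ms + sample_ms; // 5550

// What gets reported: the load timer is only stopped after the first eval,
// so the reported load time absorbs the prompt eval time on top of the page faults.
constexpr double reported_load_ms = true_load_ms + prompt_eval_ms; // 2500

// The consistency check behind the table above:
constexpr double residual_ms =
    total_ms - (reported_load_ms + prompt_eval_ms + eval_ms + sample_ms); // -2000

static_assert(residual_ms == -prompt_eval_ms, "residual is minus the prompt eval time");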

@goerch
Collaborator

goerch commented Jul 21, 2023

The timing computations look correct to me, tested with ggml (main branch). Can this issue be closed then?

@ggerganov
Owner

It's still broken in llama.cpp because we do the following:

llama.cpp/llama.cpp

Lines 3504 to 3525 in 0db14fe

int llama_eval(
struct llama_context * ctx,
const llama_token * tokens,
int n_tokens,
int n_past,
int n_threads) {
if (!llama_eval_internal(*ctx, tokens, nullptr, n_tokens, n_past, n_threads, nullptr)) {
fprintf(stderr, "%s: failed to eval\n", __func__);
return 1;
}
// get a more accurate load time, upon first eval
// TODO: fix this
if (!ctx->has_evaluated_once) {
ctx->t_load_us = ggml_time_us() - ctx->t_start_us;
ctx->has_evaluated_once = true;
}
return 0;
}

The reason we do it like this is that, when using mmap, the loading of the data happens as we need it, i.e. during the first llama_eval_internal() call. But this call also involves computations which should not be counted towards the "load" time.

You don't see this in ggml because it does not have mmap support, so this effect cannot be observed there.
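
For readers unfamiliar with the mmap behaviour described above, here is a minimal, self-contained POSIX sketch (not llama.cpp's actual loader) illustrating that mmap() itself is nearly free and the disk reads only happen when the mapped pages are first touched:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    const int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Cheap: this only sets up the mapping, no file data is read yet.
    const char * data = (const char *) mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Expensive: the page faults (actual disk reads) happen here, on first
    // access - in llama.cpp this corresponds to the first llama_eval_internal()
    // call, which is why the load timer is stopped there.
    size_t sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096) {
        sum += (unsigned char) data[i];
    }
    printf("touched %lld bytes (checksum %zu)\n", (long long) st.st_size, sum);

    munmap((void *) data, (size_t) st.st_size);
    close(fd);
    return 0;
}

Timing the mmap() call and the loop separately shows the split between "mapping" and "actually reading" that the current code cannot recover once the first eval has run.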

@goerch
Collaborator

goerch commented Jul 22, 2023

The reason we do it like this is that, when using mmap, the loading of the data happens as we need it.

I'm not sure this is the only problem. If I understand

std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(const gpt_params & params) {
    auto lparams = llama_context_params_from_gpt_params(params);

    llama_model * model  = llama_load_model_from_file(params.model.c_str(), lparams);
    if (model == NULL) {
        fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
        return std::make_tuple(nullptr, nullptr);
    }

    llama_context * lctx = llama_new_context_with_model(model, lparams);
    if (lctx == NULL) {
        fprintf(stderr, "%s: error: failed to create context with model '%s'\n", __func__, params.model.c_str());
        llama_free_model(model);
        return std::make_tuple(nullptr, nullptr);
    }

    ...

    return std::make_tuple(model, lctx);
}

correctly, model loading is decoupled from context creation here, and we can't access the timings from this function even when not using mmap?

A simple workaround could be to remove the offending code in llama_eval and compute the load time in llama_print_timings as total_time - eval_time - prompt_eval_time - sample_time?
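
A rough sketch of that workaround (names are illustrative and only loosely mirror the llama_context timing fields, not the actual implementation): derive the load time at print time instead of overwriting it on the first eval.

#include <cstdint>

struct timings_us {
    int64_t t_start_us;   // set when the context is created
    int64_t t_sample_us;  // accumulated sampling time
    int64_t t_p_eval_us;  // accumulated prompt eval time
    int64_t t_eval_us;    // accumulated eval time
};

// Whatever part of the total wall time was not spent sampling or evaluating
// is attributed to loading, which also covers the mmap page faults that
// occur during the first eval.
int64_t derived_load_time_us(const timings_us & t, int64_t t_now_us) {
    const int64_t total_us = t_now_us - t.t_start_us;
    return total_us - (t.t_sample_us + t.t_p_eval_us + t.t_eval_us);
}

One caveat of this derivation is that any idle time between calls would also be attributed to loading, so it is only a rough estimate.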

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023
@github-actions github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
