Prompt eval time is counted twice #790

Closed
sw opened this issue Apr 5, 2023 · 4 comments
Labels: bug, stale

Comments

@sw
Collaborator

sw commented Apr 5, 2023

Creating a new issue so this doesn't get forgotten:

@KASR posted a CSV of processing times in #603 (comment)

But the times don't add up: If you take the total time, and subtract the partial times that are supposed to add up to it, the result is all over the place:

[image: table of the differences between the total time and the sum of the partial times]

The clue lies in the comment by @ggerganov:

@sw
After the mmap changes, the load time is incorrect:

llama.cpp/llama.cpp

Lines 1681 to 1685 in 6e7801d

// get a more accurate load time, upon first eval
if (!ctx->has_evaluated_once) {
ctx->t_load_us = ggml_time_us() - ctx->t_start_us;
ctx->has_evaluated_once = true;
}

Currently, the reported load time includes not only the page faults, but also the prompt eval time. So effectively, you get a negative number, since the prompt eval time has been counted twice.
We have to fix this.

Originally posted by @ggerganov in #603 (comment)
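
To make the double counting concrete, here is a minimal worked example with made-up numbers (not taken from the CSV above), sketching why the residual in the table comes out as roughly minus the prompt eval time:

// Illustrative numbers only, not measurements.
constexpr double true_load_ms   =  500.0;  // page faults / actual model loading
constexpr double prompt_eval_ms = 2000.0;  // first eval over the prompt
constexpr double eval_ms        = 3000.0;  // subsequent token evals
constexpr double sample_ms      =   50.0;

// What actually elapsed:
constexpr double total_ms = true_load_ms + prompt_eval_ms + eval_ms + sample_ms; // 5550

// What gets reported: the load timer is only stopped after the first eval,
// so the reported load time absorbs the prompt eval time on top of the page faults.
constexpr double reported_load_ms = true_load_ms + prompt_eval_ms; // 2500

// The consistency check behind the table above:
constexpr double residual_ms =
    total_ms - (reported_load_ms + prompt_eval_ms + eval_ms + sample_ms); // -2000

static_assert(residual_ms == -prompt_eval_ms, "residual is minus the prompt eval time");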

@goerch
Collaborator

goerch commented Jul 21, 2023

The timing computations look correct to me, tested with ggml (main branch). Can this issue be closed then?

@ggerganov
Owner

It's still broken in llama.cpp because we do the following:

llama.cpp/llama.cpp

Lines 3504 to 3525 in 0db14fe

int llama_eval(
struct llama_context * ctx,
const llama_token * tokens,
int n_tokens,
int n_past,
int n_threads) {
if (!llama_eval_internal(*ctx, tokens, nullptr, n_tokens, n_past, n_threads, nullptr)) {
fprintf(stderr, "%s: failed to eval\n", __func__);
return 1;
}
// get a more accurate load time, upon first eval
// TODO: fix this
if (!ctx->has_evaluated_once) {
ctx->t_load_us = ggml_time_us() - ctx->t_start_us;
ctx->has_evaluated_once = true;
}
return 0;
}

The reason we do it like this is that, when using mmap, the loading of the data happens as we need it, i.e. during the first llama_eval_internal() call. But this call also involves computations which should not be counted towards the "load" time.

You don't see this in ggml because it does not have mmap support, so this effect cannot be observed there.
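
For readers unfamiliar with the mmap behaviour described above, here is a minimal, self-contained POSIX sketch (not llama.cpp's actual loader) illustrating that mmap() itself is nearly free and the disk reads only happen when the mapped pages are first touched:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    const int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Cheap: this only sets up the mapping, no file data is read yet.
    const char * data = (const char *) mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Expensive: the page faults (actual disk reads) happen here, on first
    // access - in llama.cpp this corresponds to the first llama_eval_internal()
    // call, which is why the load timer is stopped there.
    size_t sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096) {
        sum += (unsigned char) data[i];
    }
    printf("touched %lld bytes (checksum %zu)\n", (long long) st.st_size, sum);

    munmap((void *) data, (size_t) st.st_size);
    close(fd);
    return 0;
}

Timing the mmap() call and the loop separately shows the split between "mapping" and "actually reading" that the current code cannot recover once the first eval has run.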

@goerch
Collaborator

goerch commented Jul 22, 2023

The reason we do it like this is that, when using mmap, the loading of the data happens as we need it.

I'm not sure this is the only problem. If I understand

std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(const gpt_params & params) {
    auto lparams = llama_context_params_from_gpt_params(params);

    llama_model * model  = llama_load_model_from_file(params.model.c_str(), lparams);
    if (model == NULL) {
        fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
        return std::make_tuple(nullptr, nullptr);
    }

    llama_context * lctx = llama_new_context_with_model(model, lparams);
    if (lctx == NULL) {
        fprintf(stderr, "%s: error: failed to create context with model '%s'\n", __func__, params.model.c_str());
        llama_free_model(model);
        return std::make_tuple(nullptr, nullptr);
    }

    ...

    return std::make_tuple(model, lctx);
}

correctly, model loading is decoupled from context creation here, and we can't access the timings from this function even when not using mmap?

A simple workaround could be to remove the offending code in llama_eval and compute the load time in llama_print_timings as total_time - eval_time - prompt_eval_time - sample_time?
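
A rough sketch of that workaround (names are illustrative and only loosely mirror the llama_context timing fields, not the actual implementation): derive the load time at print time instead of overwriting it on the first eval.

#include <cstdint>

struct timings_us {
    int64_t t_start_us;   // set when the context is created
    int64_t t_sample_us;  // accumulated sampling time
    int64_t t_p_eval_us;  // accumulated prompt eval time
    int64_t t_eval_us;    // accumulated eval time
};

// Whatever part of the total wall time was not spent sampling or evaluating
// is attributed to loading, which also covers the mmap page faults that
// occur during the first eval.
int64_t derived_load_time_us(const timings_us & t, int64_t t_now_us) {
    const int64_t total_us = t_now_us - t.t_start_us;
    return total_us - (t.t_sample_us + t.t_p_eval_us + t.t_eval_us);
}

One caveat of this derivation is that any idle time between calls would also be attributed to loading, so it is only a rough estimate.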

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023
@github-actions github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
