
backend : add eval callback #4935

Merged: 9 commits merged into master from gg/sched-eval-callback-4931 on Jan 17, 2024

Conversation

@ggerganov (Owner) commented on Jan 14, 2024

ref: #4931

```sh
# Metal
make -j && ./simple ./models/llama-7b/ggml-model-q4_0.gguf "Hello, my name is" 1

# CUDA
LLAMA_CUBLAS=1 make -j && ./simple ./models/llama-7b/ggml-model-q4_0.gguf "Hello, my name is" 1
```

The callback currently observes the softmax results in the attention, but can be customized in any way:

```cpp
// a function that can be called for every computed node during graph evaluation
// the user can choose whether to observe the data of the node depending on the tensor parameters
static bool observe_compute(int node_index, struct ggml_tensor * t, bool ask, void * user_data) {
    GGML_UNUSED(user_data);

    // the scheduler is asking us if we want to observe this node
    if (ask) {
        // check if name contains soft_max
        return strstr(t->name, "soft_max") != 0;
    }

    // print the node data
    printf("%s: node_index = %5d, t->name = %32s, t->op = %12s, [%5d, %5d, %5d, %5d]\n",
           __func__, node_index, t->name, ggml_op_name(t->op), (int) t->ne[0], (int) t->ne[1], (int) t->ne[2], (int) t->ne[3]);

    std::vector<float> t_data(ggml_nelements(t));
    ggml_backend_tensor_get(t, t_data.data(), 0, ggml_nbytes(t));

    // print first row
    for (int i = 0; i < t->ne[0]; i++) {
        printf("%8.4f ", t_data[i]);
    }
    printf("\n");

    return true;
}
```

Skip the last CLI argument (or set it to 0) to disable the callback.
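
For reference, here is a minimal sketch of how such a callback could be registered from user code. It is not taken verbatim from the PR: it assumes the `cb_eval` / `cb_eval_user_data` fields that the commit "llama : fix callback placement in llama_context_params" (listed below) adds to `llama_context_params`, and it uses a 3-argument callback matching the scheduler calls quoted later in this thread; check `llama.h` in the merged revision for the exact signature.

```c
#include "llama.h"

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// hypothetical observer: only ask for the attention soft_max results
static bool observe_softmax(struct ggml_tensor * t, bool ask, void * user_data) {
    (void) user_data;

    if (ask) {
        // phase 1: tell the scheduler whether we want to see this node
        return strstr(t->name, "soft_max") != NULL;
    }

    // phase 2: the node has been computed, inspect it
    printf("observed %s, ne = [%d, %d]\n", t->name, (int) t->ne[0], (int) t->ne[1]);
    return true; // returning false aborts graph evaluation
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s MODEL\n", argv[0]);
        return 1;
    }

    struct llama_model * model = llama_load_model_from_file(argv[1], llama_model_default_params());
    if (model == NULL) {
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.cb_eval           = observe_softmax; // assumed field name
    cparams.cb_eval_user_data = NULL;            // assumed field name

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize and llama_decode as usual; the callback fires during evaluation ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```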

@ggerganov ggerganov marked this pull request as ready for review January 15, 2024 14:31
@ggerganov ggerganov requested a review from slaren January 15, 2024 14:31
ggml-backend.c (outdated), comment on lines 1356 to 1359:
```c
if (sched->callback_eval(t, true,  sched->callback_eval_user_data) && // ask
   !sched->callback_eval(t, false, sched->callback_eval_user_data)) { // eval
    break;
}
```
Collaborator:

Is the ask callback really necessary here?

Owner (author):

I've changed the implementation to ask only once per node in a split
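
For context, the sketch below illustrates the grouped evaluation this refers to: ask once per node, batch the nodes the user is not interested in into a single backend compute call, and only call the callback a second time (with `ask == false`) for the nodes the user asked about. This is illustrative only, not the merged ggml-backend.c code; the helper name and the use of `ggml_graph_view` are assumptions.

```c
#include "ggml.h"
#include "ggml-backend.h"

// illustrative sketch of evaluating one split with the two-phase ask/eval protocol
static void compute_split_with_callback(ggml_backend_t backend,
                                        struct ggml_cgraph * graph,
                                        ggml_backend_sched_eval_callback cb,
                                        void * user_data) {
    for (int j0 = 0; j0 < graph->n_nodes; j0++) {
        struct ggml_tensor * t = graph->nodes[j0];

        // ask once whether the user wants to observe this node
        bool need = cb(t, true, user_data);

        // group consecutive nodes the user does not care about into one compute call
        int j1 = j0;
        while (!need && j1 < graph->n_nodes - 1) {
            t    = graph->nodes[++j1];
            need = cb(t, true, user_data);
        }

        // compute nodes [j0, j1] with a single backend call
        struct ggml_cgraph gv = ggml_graph_view(graph, j0, j1 + 1);
        ggml_backend_graph_compute(backend, &gv);

        // let the user inspect the observed node; returning false aborts evaluation
        if (need && !cb(t, false, user_data)) {
            break;
        }

        j0 = j1;
    }
}
```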

ggml-backend.c (outdated), comment on lines 1387 to 1390:
```c
// TODO: should we clear the callbacks?
//sched->callback_eval = NULL;
//sched->callback_eval_user_data = NULL;
```

Collaborator:

I think this is fine; we don't need to clear the callbacks here. The reset function is meant to prepare the sched for the next graph evaluation, resetting the allocators and the backend assignments (similar to ggml_allocr_reset).
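
As a small illustration of that separation, the sketch below shows one scheduler reused across several graph evaluations: the eval callback stays registered while the reset only re-prepares the allocators and backend assignments. The setter name `ggml_backend_sched_set_eval_callback` is assumed from this PR's backend changes.

```c
#include "ggml.h"
#include "ggml-backend.h"

// sketch: run several graph evaluations with one scheduler; the eval callback is
// registered once and stays set, while ggml_backend_sched_reset only prepares the
// scheduler for the next graph (allocators, backend assignments)
static void run_evaluations(ggml_backend_sched_t sched,
                            struct ggml_cgraph ** graphs, int n_graphs,
                            ggml_backend_sched_eval_callback cb, void * user_data) {
    ggml_backend_sched_set_eval_callback(sched, cb, user_data); // assumed setter name

    for (int i = 0; i < n_graphs; i++) {
        ggml_backend_sched_reset(sched);                    // does not clear the callback
        ggml_backend_sched_graph_compute(sched, graphs[i]); // the callback fires here
    }
}
```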

@ggerganov ggerganov added the sync Requires sync with the ggml repo after merging label Jan 17, 2024
@ggerganov ggerganov merged commit 44a1a4a into master Jan 17, 2024
38 of 47 checks passed
@ggerganov ggerganov deleted the gg/sched-eval-callback-4931 branch January 17, 2024 16:39
brittlewis12 added a commit to brittlewis12/llama.cpp that referenced this pull request Jan 18, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when user don't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* simple : no need for ggml_is_contiguous + fix bool parse

* llama : fix callback placement in llama_context_params

* backend : avoid double-ask callback calls

* simple : restore examples, imatrix will serve as a demo
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024