./llama2_q4: No such file or directory #6
Using: `!./llama2_q4.cu llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"`
If you've compiled it using the instructions then it will be in […]. You can probably just do […].
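One way to chase down the "No such file or directory" error is to look for where the build actually put the executable and run it by that path (a sketch; the `build/` directory name is an assumption about an out-of-source CMake build, not something stated in this thread):

```shell
# Locate the built executable (assumes an out-of-source CMake build directory):
find . -maxdepth 2 -name llama2_q4 -type f

# Then invoke it by the path printed above, e.g. (hypothetical location):
# ./build/llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"
```

Note that `./llama2_q4` only works if the binary sits in the current directory; `llama2_q4.cu` is the CUDA source file, not something the shell can execute.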
@GilesBathgate @ankan-ban I get the following build output:

```
[ 25%] Building CXX object CMakeFiles/weight_packer.dir/weight_packer.cpp.o
/datasets/sisahu/llama_cu_awq/weight_packer.cpp: In function 'int main(int, char**)':
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:30: warning: '.input_layernorm.weight.bin' directive writing 27 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
  287 |   sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:16: note: 'sprintf' output between 28 and 539 bytes into a destination of size 512
  287 |   sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
      |          ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:30: warning: '.post_attention_layernorm.we...' directive writing 36 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
  290 |   sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:16: note: 'sprintf' output between 37 and 548 bytes into a destination of size 512
  290 |   sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
      |          ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 50%] Linking CXX executable weight_packer
[ 50%] Built target weight_packer
[ 75%] Building CUDA object CMakeFiles/llama2_q4.dir/llama2_q4.cu.o
/datasets/sisahu/llama_cu_awq/llama2_q4.cu(368): error: too few arguments in function call
1 error detected in the compilation of "/datasets/sisahu/llama_cu_awq/llama2_q4.cu".
make[2]: *** [CMakeFiles/llama2_q4.dir/build.make:76: CMakeFiles/llama2_q4.dir/llama2_q4.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:111: CMakeFiles/llama2_q4.dir/all] Error 2
make: *** [Makefile:91: all] Error 2
```

Do you know what could be the error?
@iamsiddhantsahu I think you have an old CUDA driver, or the wrong version. You could try to change the line […].
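For context, a plausible reading of the error (an assumption on my part, since compiler line 368 isn't shown here): CUDA 12 replaced the old five-parameter `cudaGraphInstantiate` with a three-parameter overload, so a three-argument call compiles on a CUDA 12 toolkit but is rejected as "too few arguments" on CUDA 11 and earlier. A sketch of the two forms, as a fragment of the graph-capture code quoted later in this thread:

```cuda
// CUDA 12 and newer -- three-parameter overload (flags as the last argument):
cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph, 0);

// CUDA 11 and older -- the same call needs the five-parameter signature:
cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph,
                     nullptr,  // pErrorNode: optional, reports the failing node
                     nullptr,  // pLogBuffer: optional diagnostic log
                     0);       // bufferSize: size of pLogBuffer
```

If that is indeed the offending line, matching the call to your installed toolkit's signature (or upgrading the toolkit) should clear the error.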
@GilesBathgate Many thanks for the answer -- yes indeed, you were right; setting […] solved it. Then I would assume this would not run this part of the code, which seems to be the CUDA Graph API -- does that mean my network would be less optimized?

```
int graphIndex;
int seq_len_bin = 128;
for (graphIndex = 0; graphIndex < MAX_GRAPHS - 1; seq_len_bin *= 2, graphIndex++)
    if (seq_len <= seq_len_bin) break;
if ((seq_len > seq_len_bin) || (graphIndex == MAX_GRAPHS - 1)) seq_len_bin = p->seq_len; // last bin holds max seq len
if (!graphCaptured[graphIndex])
{
    cudaGraph_t graph = {};
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    run_llama_network(s->pos, p, s, w, seq_len_bin);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph, 0);
    cudaGraphDestroy(graph);
    graphCaptured[graphIndex] = true;
}
cudaGraphLaunch(cudaGraphInstance[graphIndex], stream);
```

Another thing: I wanted to incorporate FlashAttention into this code base and see if it gives me better tokens per second. Any thoughts on that? @GilesBathgate and @ankan-ban
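On the "less optimized" question: the loop above only decides which pre-captured graph to replay, by rounding the sequence length up to a power-of-two bin starting at 128, with the last bin holding the maximum length. A standalone sketch of just that binning logic (`MAX_GRAPHS` and `MAX_SEQ_LEN` here are illustrative values, not necessarily the repo's constants):

```cpp
#include <cassert>

// Illustrative values (assumptions, not necessarily the repo's constants).
constexpr int MAX_GRAPHS  = 4;
constexpr int MAX_SEQ_LEN = 2048;

// Mirror of the bin-selection loop above: round seq_len up to the next
// power-of-two bin starting at 128; overflow falls into the last (max) bin.
int pickSeqLenBin(int seq_len) {
    int seq_len_bin = 128;
    int graphIndex;
    for (graphIndex = 0; graphIndex < MAX_GRAPHS - 1; seq_len_bin *= 2, graphIndex++)
        if (seq_len <= seq_len_bin) break;
    if ((seq_len > seq_len_bin) || (graphIndex == MAX_GRAPHS - 1))
        seq_len_bin = MAX_SEQ_LEN;  // last bin holds max seq len
    return seq_len_bin;
}
```

So disabling graph capture does not change the computed results; you would only lose the kernel-launch-overhead reduction that replaying a captured graph provides.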
@iamsiddhantsahu I don't know what improvement using cudaGraph is intended to achieve; I find the tok/s adequate without it. Regarding speed, the KV-cache already improves inference speed. FlashAttention might add some gains but could take considerable effort to implement from scratch.
@GilesBathgate I just tried it with […] on the […].
@iamsiddhantsahu I think […].
Hello, I'm trying to run this in a Google Colab notebook and I get `./llama2_q4: No such file or directory`
when running: `!./llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"`
Can this be used on Colab? Thanks.