
./llama2_q4: No such file or directory #6

Open

mauridev777 opened this issue Sep 16, 2023 · 8 comments

Comments
@mauridev777

Hello, I'm trying to run this in a Google Colab notebook and I get: "./llama2_q4: No such file or directory"
when running: !./llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"
Can this be used on Colab? Thanks!

@mauridev777
Author

Using !./llama2_q4.cu llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs" instead
gives the error: ./llama2_q4.cu: Permission denied

@GilesBathgate
Contributor

GilesBathgate commented Nov 25, 2023

If you've compiled it using the instructions, then the binary will be in build/llama2_q4.

You can probably just do !ln -s build/llama2_q4 and then it will work as instructed.
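
For reference, a complete Colab cell might look like the following -- a sketch assuming the repo is ankan-ban/llama_cu_awq and the standard CMake flow from the README; adjust paths to your setup:

    !git clone https://github.com/ankan-ban/llama_cu_awq
    %cd llama_cu_awq
    !mkdir -p build && cd build && cmake .. && cmake --build . --config Release
    !ln -s build/llama2_q4 llama2_q4
    !./llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"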

@iamsiddhantsahu

@GilesBathgate @ankan-ban cmake --build . --config Release is giving me an error:

[ 25%] Building CXX object CMakeFiles/weight_packer.dir/weight_packer.cpp.o
/datasets/sisahu/llama_cu_awq/weight_packer.cpp: In function ‘int main(int, char**)’:
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:30: warning: ‘.input_layernorm.weight.bin’ directive writing 27 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
  287 |         sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:16: note: ‘sprintf’ output between 28 and 539 bytes into a destination of size 512
  287 |         sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
      |         ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:30: warning: ‘.post_attention_layernorm.we...’ directive writing 36 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
  290 |         sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:16: note: ‘sprintf’ output between 37 and 548 bytes into a destination of size 512
  290 |         sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
      |         ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 50%] Linking CXX executable weight_packer
[ 50%] Built target weight_packer
[ 75%] Building CUDA object CMakeFiles/llama2_q4.dir/llama2_q4.cu.o
/datasets/sisahu/llama_cu_awq/llama2_q4.cu(368): error: too few arguments in function call

1 error detected in the compilation of "/datasets/sisahu/llama_cu_awq/llama2_q4.cu".
make[2]: *** [CMakeFiles/llama2_q4.dir/build.make:76: CMakeFiles/llama2_q4.dir/llama2_q4.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:111: CMakeFiles/llama2_q4.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Do you know what could be causing the error?

@GilesBathgate
Contributor

@iamsiddhantsahu I think you have an old CUDA toolkit or driver, or the wrong version -- the failing line, llama2_q4.cu(368), is most likely the three-argument cudaGraphInstantiate overload, which only exists in newer CUDA releases, hence "too few arguments" on older ones.

You could try changing the line #define USE_CUDA_GRAPHS 1 to #define USE_CUDA_GRAPHS 0 in /datasets/sisahu/llama_cu_awq/llama2_q4.cu
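
As an aside, the sprintf warnings from weight_packer.cpp are harmless here -- the compiler is just noting that a pathologically long fileNameBase could overflow the 512-byte filename buffer. If you want them gone, a bounded write would do it. A minimal sketch (not code from the repo), assuming <cstdio> and <cstdlib> are included:

    // hypothetical replacement for the sprintf calls: bound the write and
    // bail out instead of silently overflowing the buffer
    char filename[1024];
    int n = snprintf(filename, sizeof(filename), "%s.input_layernorm.weight.bin", fileNameBase);
    if (n < 0 || n >= (int)sizeof(filename)) { fprintf(stderr, "filename too long\n"); exit(1); }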

@iamsiddhantsahu

@GilesBathgate Many thanks for the answer -- yes indeed, you were right: setting #define USE_CUDA_GRAPHS 0 made it compile.

Then I assume this part of the code -- which uses the CUDA Graph API -- would not run. Does that mean my inference would be less optimized?

    int graphIndex;
    int seq_len_bin = 128;
    // pick the smallest power-of-two sequence-length bin that covers seq_len
    for (graphIndex = 0; graphIndex < MAX_GRAPHS - 1; seq_len_bin *= 2, graphIndex++)
        if (seq_len <= seq_len_bin) break;
    if ((seq_len > seq_len_bin) || (graphIndex == MAX_GRAPHS - 1)) seq_len_bin = p->seq_len;    // last bin holds max seq len

    // capture the network once per bin: record all kernel launches into a graph...
    if (!graphCaptured[graphIndex])
    {
        cudaGraph_t graph = {};
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        run_llama_network(s->pos, p, s, w, seq_len_bin);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph, 0);
        cudaGraphDestroy(graph);
        graphCaptured[graphIndex] = true;
    }
    // ...then replay the whole recorded network with a single launch
    cudaGraphLaunch(cudaGraphInstance[graphIndex], stream);
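
Presumably with USE_CUDA_GRAPHS set to 0, the loop just launches the kernels directly on every step -- something like this sketch (not the exact code in the repo):

    // no capture/replay: each generated token pays the per-kernel launch
    // overhead again, but the math -- and therefore the output -- is the same
    run_llama_network(s->pos, p, s, w, seq_len);

so the results should be identical, just with some extra launch overhead per token.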

Another thing -- I want to incorporate FlashAttention into this code base and see whether it gives better tokens per second. Any thoughts on that? @GilesBathgate and @ankan-ban

@GilesBathgate
Contributor

@iamsiddhantsahu I don't know how much improvement using cudaGraph is intended to achieve (it mainly cuts kernel-launch overhead); I find the tok/s adequate without it. Regarding speed, the KV-cache already does a lot for inference. FlashAttention might add some gains, but it could take considerable effort to implement from scratch.
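
For context on the KV-cache point: during generation, each new token's key/value vectors are appended to a cache and attention reads the cached history, so earlier K/V are never recomputed. A schematic single-head decode step -- a hypothetical helper, not code from llama2_q4.cu:

    // schematic KV-cache decode step; assumes <vector>, <cmath>, <cstring>
    void decode_step(const float* q, const float* k_new, const float* v_new,
                     float* Kcache, float* Vcache, float* out,
                     int pos, int head_dim)
    {
        // append this token's K/V; everything before pos is reused as-is
        memcpy(Kcache + pos * head_dim, k_new, head_dim * sizeof(float));
        memcpy(Vcache + pos * head_dim, v_new, head_dim * sizeof(float));

        // scaled dot-product scores over cached positions 0..pos
        std::vector<float> att(pos + 1);
        float maxv = -1e30f;
        for (int t = 0; t <= pos; t++) {
            float s = 0.0f;
            for (int d = 0; d < head_dim; d++) s += q[d] * Kcache[t * head_dim + d];
            att[t] = s / sqrtf((float)head_dim);
            if (att[t] > maxv) maxv = att[t];
        }
        // numerically stable softmax, then weighted sum of cached values
        float sum = 0.0f;
        for (int t = 0; t <= pos; t++) { att[t] = expf(att[t] - maxv); sum += att[t]; }
        for (int d = 0; d < head_dim; d++) {
            float o = 0.0f;
            for (int t = 0; t <= pos; t++) o += (att[t] / sum) * Vcache[t * head_dim + d];
            out[d] = o;
        }
    }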

@iamsiddhantsahu

@GilesBathgate I just tried it on an NVIDIA Jetson Orin (64 GB) -- I'm getting about 30 tokens/sec with this implementation. I also tried the GGML implementation (https://github.com/ggerganov/llama.cpp), which gives about 250-300 tokens/sec.

@GilesBathgate
Contributor

@iamsiddhantsahu I think Llama2_cu is simple and intended for educational purposes. See also https://github.com/karpathy/llm.c. GGML is better supported and has FlashAttention: ggerganov/llama.cpp#5021
