
./llama2_q4: No such file or directory #6

Open

mauridev777 opened this issue Sep 16, 2023 · 8 comments

Comments
@mauridev777

Hello, I'm trying to run this in a Google Colab notebook and I get: "./llama2_q4: No such file or directory"
when running: !./llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"
Can this be used on Colab? Thanks!

@mauridev777
Author

Using !./llama2_q4.cu llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs" instead
gives the error: ./llama2_q4.cu: Permission denied

@GilesBathgate
Contributor

GilesBathgate commented Nov 25, 2023

If you've compiled it using the instructions, then the binary will be in build/llama2_q4.

You can probably just do !ln -s build/llama2_q4 and then it will work as instructed.
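
For reference, a complete Colab cell might look like the following -- a sketch assuming the repo is ankan-ban/llama_cu_awq and the standard CMake flow from the README; adjust paths to your setup:

    !git clone https://github.com/ankan-ban/llama_cu_awq
    %cd llama_cu_awq
    !mkdir -p build && cd build && cmake .. && cmake --build . --config Release
    !ln -s build/llama2_q4 llama2_q4
    !./llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"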

@iamsiddhantsahu

@GilesBathgate @ankan-ban cmake --build . --config Release is giving me an error:

[ 25%] Building CXX object CMakeFiles/weight_packer.dir/weight_packer.cpp.o
/datasets/sisahu/llama_cu_awq/weight_packer.cpp: In function ‘int main(int, char**)’:
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:30: warning: ‘.input_layernorm.weight.bin’ directive writing 27 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
  287 |         sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:16: note: ‘sprintf’ output between 28 and 539 bytes into a destination of size 512
  287 |         sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
      |         ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:30: warning: ‘.post_attention_layernorm.we...’ directive writing 36 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
  290 |         sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:16: note: ‘sprintf’ output between 37 and 548 bytes into a destination of size 512
  290 |         sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
      |         ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 50%] Linking CXX executable weight_packer
[ 50%] Built target weight_packer
[ 75%] Building CUDA object CMakeFiles/llama2_q4.dir/llama2_q4.cu.o
/datasets/sisahu/llama_cu_awq/llama2_q4.cu(368): error: too few arguments in function call

1 error detected in the compilation of "/datasets/sisahu/llama_cu_awq/llama2_q4.cu".
make[2]: *** [CMakeFiles/llama2_q4.dir/build.make:76: CMakeFiles/llama2_q4.dir/llama2_q4.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:111: CMakeFiles/llama2_q4.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Do you know what could be causing the error?

@GilesBathgate
Contributor

@iamsiddhantsahu I think you have an old CUDA toolkit or driver, or the wrong version -- the failing line, llama2_q4.cu(368), is most likely the three-argument cudaGraphInstantiate overload, which only exists in newer CUDA releases, hence "too few arguments" on older ones.

You could try changing the line #define USE_CUDA_GRAPHS 1 to #define USE_CUDA_GRAPHS 0 in /datasets/sisahu/llama_cu_awq/llama2_q4.cu
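
As an aside, the sprintf warnings from weight_packer.cpp are harmless here -- the compiler is just noting that a pathologically long fileNameBase could overflow the 512-byte filename buffer. If you want them gone, a bounded write would do it. A minimal sketch (not code from the repo), assuming <cstdio> and <cstdlib> are included:

    // hypothetical replacement for the sprintf calls: bound the write and
    // bail out instead of silently overflowing the buffer
    char filename[1024];
    int n = snprintf(filename, sizeof(filename), "%s.input_layernorm.weight.bin", fileNameBase);
    if (n < 0 || n >= (int)sizeof(filename)) { fprintf(stderr, "filename too long\n"); exit(1); }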

@iamsiddhantsahu

@GilesBathgate Many thanks for the answer -- yes indeed, you were right: setting #define USE_CUDA_GRAPHS 0 made it compile.

Then I assume this part of the code -- which uses the CUDA Graph API -- would not run. Does that mean my inference would be less optimized?

    int graphIndex;
    int seq_len_bin = 128;
    // pick the smallest power-of-two sequence-length bin that covers seq_len
    for (graphIndex = 0; graphIndex < MAX_GRAPHS - 1; seq_len_bin *= 2, graphIndex++)
        if (seq_len <= seq_len_bin) break;
    if ((seq_len > seq_len_bin) || (graphIndex == MAX_GRAPHS - 1)) seq_len_bin = p->seq_len;    // last bin holds max seq len

    // capture the network once per bin: record all kernel launches into a graph...
    if (!graphCaptured[graphIndex])
    {
        cudaGraph_t graph = {};
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        run_llama_network(s->pos, p, s, w, seq_len_bin);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph, 0);
        cudaGraphDestroy(graph);
        graphCaptured[graphIndex] = true;
    }
    // ...then replay the whole recorded network with a single launch
    cudaGraphLaunch(cudaGraphInstance[graphIndex], stream);
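
Presumably with USE_CUDA_GRAPHS set to 0, the loop just launches the kernels directly on every step -- something like this sketch (not the exact code in the repo):

    // no capture/replay: each generated token pays the per-kernel launch
    // overhead again, but the math -- and therefore the output -- is the same
    run_llama_network(s->pos, p, s, w, seq_len);

so the results should be identical, just with some extra launch overhead per token.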

Another thing -- I want to incorporate FlashAttention into this code base and see whether it gives better tokens per second. Any thoughts on that? @GilesBathgate and @ankan-ban

@GilesBathgate
Contributor

@iamsiddhantsahu I don't know how much improvement using cudaGraph is intended to achieve (it mainly cuts kernel-launch overhead); I find the tok/s adequate without it. Regarding speed, the KV-cache already does a lot for inference. FlashAttention might add some gains, but it could take considerable effort to implement from scratch.
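
For context on the KV-cache point: during generation, each new token's key/value vectors are appended to a cache and attention reads the cached history, so earlier K/V are never recomputed. A schematic single-head decode step -- a hypothetical helper, not code from llama2_q4.cu:

    // schematic KV-cache decode step; assumes <vector>, <cmath>, <cstring>
    void decode_step(const float* q, const float* k_new, const float* v_new,
                     float* Kcache, float* Vcache, float* out,
                     int pos, int head_dim)
    {
        // append this token's K/V; everything before pos is reused as-is
        memcpy(Kcache + pos * head_dim, k_new, head_dim * sizeof(float));
        memcpy(Vcache + pos * head_dim, v_new, head_dim * sizeof(float));

        // scaled dot-product scores over cached positions 0..pos
        std::vector<float> att(pos + 1);
        float maxv = -1e30f;
        for (int t = 0; t <= pos; t++) {
            float s = 0.0f;
            for (int d = 0; d < head_dim; d++) s += q[d] * Kcache[t * head_dim + d];
            att[t] = s / sqrtf((float)head_dim);
            if (att[t] > maxv) maxv = att[t];
        }
        // numerically stable softmax, then weighted sum of cached values
        float sum = 0.0f;
        for (int t = 0; t <= pos; t++) { att[t] = expf(att[t] - maxv); sum += att[t]; }
        for (int d = 0; d < head_dim; d++) {
            float o = 0.0f;
            for (int t = 0; t <= pos; t++) o += (att[t] / sum) * Vcache[t * head_dim + d];
            out[d] = o;
        }
    }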

@iamsiddhantsahu

@GilesBathgate I just tried it on an NVIDIA Jetson Orin (64 GB) -- I'm getting about 30 tokens/sec with this implementation. I also tried the GGML implementation (https://github.com/ggerganov/llama.cpp), which gives about 250-300 tokens/sec.

@GilesBathgate
Contributor

@iamsiddhantsahu I think Llama2_cu is simple and intended for educational purposes. See also https://github.com/karpathy/llm.c. GGML is better supported and has FlashAttention: ggerganov/llama.cpp#5021
