Implement automatic NGL detection #6502

Draft · wants to merge 7 commits into master

Conversation

SleepyYui (Contributor)

This is not ready for merging; I still want to change/improve some stuff.

I implemented the option to pass "a" or "auto" to the -ngl parameter to automatically detect the maximum number of layers that fit into VRAM. Some things are still hard-coded or implemented oddly; I'll improve that in the next commit(s).

Feedback is most definitely appreciated.

SleepyYui (Contributor, Author) commented Apr 5, 2024

When loading a 13B model on a GPU with 6 GB of VRAM:

./main -m models/WV13B/Wizard-Vicuna-13B.Q5_K_M.gguf -ngl a
Log start
main: build = 2612 (1e66c3a7)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1712312099
...
llm_load_print_meta: model size       = 8.60 GiB (5.67 BPW) 
llm_load_print_meta: general.name     = ehartford_wizard-vicuna-13b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: automatically set n_gpu_layers = 22    // Chose 22 layers since 23 
                                                         // are already too much to handle
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloaded 22/41 layers to GPU
llm_load_tensors:        CPU buffer size =  8801.63 MiB
llm_load_tensors:      CUDA0 buffer size =  4711.31 MiB

SleepyYui marked this pull request as draft on April 5, 2024, 10:21
@@ -836,7 +836,12 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
             invalid_param = true;
             return true;
         }
-        params.n_gpu_layers = std::stoi(argv[i]);
+        std::string argValue = argv[i];
+        if (argValue == "auto" || argValue == "a") {
Collaborator:

I agree it can be a breaking change, but I would prefer to have this approach as the default, i.e. if -ngl is not passed, automatically offload the maximum possible number of layers to VRAM.

Contributor (Author):

Could be. If someone doesn't want that, they could simply pass -ngl 0 or just compile without GPU support.

Collaborator:

Yes, but this is just my personal point of view. @ggerganov or @slaren would have a better global view.
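For reference, a minimal, self-contained sketch of the kind of parsing the diff above introduces: map the "auto"/"a" spelling of -ngl to a sentinel that the loader later resolves. The sentinel value and function name here are illustrative, not the PR's exact code.

```cpp
// Illustrative only: parse the -ngl value, mapping "auto"/"a" to a sentinel
// that the model loader would later resolve to "as many layers as fit in VRAM".
#include <cstdio>
#include <string>

static const int NGL_AUTO = -2; // hypothetical sentinel, not a llama.cpp constant

static int parse_ngl(const std::string & value) {
    if (value == "auto" || value == "a") {
        return NGL_AUTO;
    }
    return std::stoi(value); // plain integer layer count, as before
}

int main() {
    std::printf("-ngl a  -> %d\n", parse_ngl("a"));
    std::printf("-ngl 22 -> %d\n", parse_ngl("22"));
    return 0;
}
```

If auto ever becomes the default, as suggested above, the same sentinel could simply be the initial value of n_gpu_layers.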

github-actions bot commented Apr 5, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 442 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10663.68ms p(95)=27346.17ms fails=, finish reason: stop=386 truncated=56
  • Prompt processing (pp): avg=116.61tk/s p(95)=566.61tk/s
  • Token generation (tg): avg=23.52tk/s p(95)=36.14tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=2549662cde1861755515b75bc5e2a1b83f62d63b

Benchmark charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 442 iterations).

slaren (Collaborator) commented Apr 5, 2024

This is an important feature, but I would be opposed to merging an implementation that just assumes a value for the KV and compute buffer sizes, or doesn't correctly calculate the size of the layers, or ignores the CUDA pool size, or ignores multiple GPUs, or any other factors that would need to be taken into account. I would rather have the user test different values manually, than to give them a wrong default that they will assume is correct. Realistically, this cannot be implemented correctly without making significant changes to the architecture of llama.cpp.
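For readers following along, the factors slaren lists can be pictured as one budget. A rough sketch, with the caveat that none of these names exist in llama.cpp and the real values depend on context size, batch size, cache type, and backend behaviour:

```cpp
// Illustrative budget of the VRAM consumers mentioned above; all names are hypothetical.
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct vram_budget {
    size_t weights_per_layer;  // repeating-layer weights
    size_t non_repeating;      // output/embedding tensors kept on the GPU
    size_t kv_cache_per_layer; // grows with n_ctx and the cache type
    size_t compute_buffer;     // depends on n_batch / n_ubatch
    size_t backend_pool;       // temporary allocations, e.g. the CUDA pool
};

static int32_t max_offload_layers(size_t free_vram, const vram_budget & b, int32_t n_layer) {
    const size_t fixed     = b.non_repeating + b.compute_buffer + b.backend_pool;
    const size_t per_layer = b.weights_per_layer + b.kv_cache_per_layer;
    if (free_vram <= fixed || per_layer == 0) {
        return 0;
    }
    return std::min((int32_t) ((free_vram - fixed) / per_layer), n_layer);
}
```

The arithmetic is the easy part; the objection is about obtaining these numbers correctly and keeping them in sync with the rest of the code.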

phymbert (Collaborator) commented Apr 5, 2024

Yes, vLLM does this with a GPU memory target ratio, but you often get CUDA malloc issues and need to decrease the target, so it is effectively equivalent to our current -ngl.

SleepyYui (Contributor, Author):

> This is an important feature, but I would be opposed to merging an implementation that just assumes a value for the KV and compute buffer sizes, or doesn't correctly calculate the size of the layers, or ignores the CUDA pool size, or ignores multiple GPUs, [...] Realistically, this cannot be implemented correctly without making significant changes to the architecture of llama.cpp.

Assuming a KV size was just to get a prototype done; calculating the layer sizes is nearly done on my side (the guesstimation was just for prototyping as well). I'm pretty sure it can be implemented for single-GPU use without any major changes.
I will update once I'm done with the layer calculations.

JohannesGaessler (Collaborator) left a comment

As slaren said, this would definitely be a welcome addition, but there are many edge cases that need to be taken into account.

common/common.cpp (outdated, resolved)
common/common.h (outdated, resolved)
llama.cpp (outdated)
static int llm_determine_max_ngl(const llama_model_loader & ml, const llama_model & model, const int main_gpu) {
    const auto & hparams = model.hparams;

    size_t available_gpu_memory = llama_get_available_device_memory(main_gpu);
Collaborator:

The program logic here is inconsistent with the --help text. The --help says "based on VRAM size", which I would interpret as total VRAM, but here you are using the amount of free VRAM.

llama.cpp (outdated)
    available_gpu_memory = available_gpu_memory - total_buf_size; // buffer size

    // Calculate the maximum number of layers that can fit into the GPU memory
    int32_t max_ngl = std::floor(static_cast<float>(available_gpu_memory) / memory_per_layer);
Collaborator:

You should leave a small amount of headroom when the number of layers is determined automatically; you want to avoid a scenario where an application OOMs because llama.cpp left only 5 MB of VRAM free.

Contributor (Author):

I thought about doing that; thanks for the reminder.
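As a minimal sketch of the headroom idea, assuming a fixed margin (the 512 MiB value is an arbitrary placeholder, not a number from this PR):

```cpp
// Illustrative: reserve a margin before dividing by the per-layer cost, so the
// automatic choice does not leave other applications only a few MB of VRAM.
#include <cstddef>
#include <cstdint>

static int32_t layers_that_fit(size_t available_gpu_memory, size_t memory_per_layer) {
    const size_t headroom = 512u * 1024u * 1024u; // placeholder margin
    if (memory_per_layer == 0 || available_gpu_memory <= headroom) {
        return 0;
    }
    return (int32_t) ((available_gpu_memory - headroom) / memory_per_layer);
}
```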

SleepyYui (Contributor, Author):

The KV cache size still needs to be calculated properly (will do soon). Otherwise, it should be fine now. Anything I missed?
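For reference, the usual back-of-the-envelope estimate for the KV cache size (standard transformer bookkeeping, not code from this PR):

```cpp
// KV cache bytes ~= 2 (K and V) * n_layer * n_ctx * n_head_kv * head_dim * bytes per element.
#include <cstddef>
#include <cstdint>

static size_t kv_cache_bytes(uint32_t n_layer, uint32_t n_ctx,
                             uint32_t n_head_kv, uint32_t head_dim,
                             size_t bytes_per_elem /* 2 for f16 */) {
    return (size_t) 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem;
}
// Example: a 13B model (40 layers, 40 KV heads, head_dim 128) at n_ctx = 4096 in f16
// works out to about 3.1 GiB, so it has to be part of the per-layer budget.
```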

JohannesGaessler (Collaborator):

> Anything I missed?

Quite frankly there are still many edge cases that are not yet being considered. Just off the top of my head:

  • VRAM for non-repeating tensors.
  • The multi-GPU code is simply incorrect. It is not enough to check the total VRAM use against the total VRAM available; instead, the auto option needs to ensure that each individual GPU does not OOM (see the sketch after this list). --split-mode row currently distributes weights, but not the compute buffer or the context, to other GPUs.
  • --tensor-split: if the user does not specify one, it should fill the GPUs in a way that maximizes VRAM use. If the user does specify a tensor split, the auto option should respect it.
  • --parallel parameter for multiple parallel sequences to decode.
  • Speculative decoding.
  • --no-kv-offload.
  • --cache-type-k, --cache-type-v.
  • VRAM needed for the buffer pool used for temporary buffers.
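To make the multi-GPU bullet concrete, here is a rough sketch of a per-device feasibility check; the names and the split logic are hypothetical, and a real check would also have to place the compute buffer and the context as described above:

```cpp
// Hypothetical per-device check: with a tensor split, each GPU must individually
// fit its share of the offloaded weights plus its own KV/compute/pool overhead;
// checking only the sum across all GPUs is not enough.
#include <cstddef>
#include <vector>

struct device_plan {
    size_t free_vram;      // free memory reported for this GPU
    float  split_fraction; // share of the repeating layers assigned to it
};

static bool offload_fits(const std::vector<device_plan> & devices,
                         size_t offloaded_weight_bytes,
                         size_t per_device_overhead) {
    for (const auto & d : devices) {
        const size_t need = (size_t) (offloaded_weight_bytes * d.split_fraction) + per_device_overhead;
        if (need > d.free_vram) {
            return false; // this GPU would OOM even if the total across GPUs fits
        }
    }
    return true;
}
```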

slaren (Collaborator) commented Apr 8, 2024

The design also needs to take into account possible changes to other parts of the code. It is not OK to just duplicate the calculations in this function. This will become outdated very quickly, and then it will become an unacceptable maintenance burden, or just plain wrong and misleading. And this cannot be done without very significant changes to the llama.cpp and backend interfaces. I already mentioned this: realistically, this cannot be done without significant changes to the architecture.

schmorp commented Apr 11, 2024

It would be extremely nice if this were not "fully automatic, one size fits all", but instead one could specify absolute limits per GPU, so one can e.g. leave enough space for other workloads (such as the OS).

WilliamTambellini (Contributor):

@slaren I understand this is a sensitive feature, but FYI we are very strongly interested in this one. Let us know if there is anything we could do. Best.

JohannesGaessler (Collaborator):

Are you saying you're interested in implementing the feature or using it?

mofosyne added the labels "review complexity: medium" (generally requires more time to grok but manageable by beginner to medium expertise level), "enhancement" (new feature or request), and "need feedback" (testing and feedback with results are needed) on May 10, 2024.
SleepyYui (Contributor, Author):

I'm willing to resume this in the future if I either find an easier way to handle it or there are compromises you are willing to accept. I haven't had much time recently, which is why this comes a little late...

AshD commented May 30, 2024

I am interested in this feature and can explain our use case. We have a client app for Windows, Fusion Quill. It uses llama.cpp and runs on end-user Windows 11/10 PCs.

There is no easy way to tell the user the optimal number of layers to offload to the GPU for their configuration. Having llama.cpp calculate it is the best solution for our use case.

It can be a new optional parameter so it does not interfere with current use cases. Multi-GPU support can be a future implementation. I have dual GPUs but most of my end users don't.

Also, if someone has ideas on how this could be done in code outside of llama.cpp, I would be interested in hearing them.
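On that last point, one very rough way to approximate this outside of llama.cpp is to query free VRAM from the driver and derive n_gpu_layers before loading the model. A sketch assuming a single CUDA GPU; the per-layer size is a caller-supplied estimate (e.g. model file size divided by the block count), the headroom is a placeholder, and error handling is minimal:

```cpp
// Rough host-side sketch: choose n_gpu_layers from free VRAM before loading.
// This is less accurate than doing the accounting inside llama.cpp itself.
#include <cuda_runtime.h>
#include "llama.h"
#include <algorithm>
#include <cstddef>
#include <cstdint>

static int32_t guess_ngl(size_t approx_bytes_per_layer, int32_t n_layer_total) {
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess || approx_bytes_per_layer == 0) {
        return 0; // cannot query VRAM: fall back to CPU-only
    }
    const size_t headroom = 1024u * 1024u * 1024u; // keep ~1 GiB free for KV, compute buffers, other apps
    if (free_b <= headroom) {
        return 0;
    }
    return std::min((int32_t) ((free_b - headroom) / approx_bytes_per_layer), n_layer_total);
}

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    // numbers roughly matching the 13B model from the log above: ~8.6 GiB, 40 blocks
    const size_t approx_layer = (size_t) (8.6 * 1024.0 * 1024.0 * 1024.0) / 40;
    mparams.n_gpu_layers = guess_ngl(approx_layer, 40);

    llama_model * model = llama_load_model_from_file("models/model.gguf", mparams);
    if (model != nullptr) {
        llama_free_model(model);
    }
    llama_backend_free();
    return 0;
}
```

As the thread points out, this ignores the KV cache, compute buffers, and multi-GPU setups, so it only works as a conservative first guess.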
