Implement automatic NGL detection #6502

Draft · wants to merge 7 commits into master

Conversation

SleepyYui (Contributor)

This is not ready for merging; I still want to change/improve some stuff.

I implemented the option to pass "a" or "auto" to the -ngl parameter to automatically detect the maximum number of layers that fit into VRAM. Some things are still hard-coded or implemented oddly; I'll improve that in the next commit(s).

Feedback is most definitely appreciated.

SleepyYui (Contributor, Author) commented Apr 5, 2024

When loading a 13B model on a GPU with 6 GB of VRAM:

./main -m models/WV13B/Wizard-Vicuna-13B.Q5_K_M.gguf -ngl a
Log start
main: build = 2612 (1e66c3a7)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1712312099
...
llm_load_print_meta: model size       = 8.60 GiB (5.67 BPW) 
llm_load_print_meta: general.name     = ehartford_wizard-vicuna-13b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: automatically set n_gpu_layers = 22    // Chose 22 layers since 23 
                                                         // are already too much to handle
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloaded 22/41 layers to GPU
llm_load_tensors:        CPU buffer size =  8801.63 MiB
llm_load_tensors:      CUDA0 buffer size =  4711.31 MiB

SleepyYui marked this pull request as draft on April 5, 2024, 10:21
@@ -836,7 +836,12 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
             invalid_param = true;
             return true;
         }
-        params.n_gpu_layers = std::stoi(argv[i]);
+        std::string argValue = argv[i];
+        if (argValue == "auto" || argValue == "a") {
Collaborator:

I agree it can be a breaking change, but I would prefer to have this approach as the default, i.e. if -ngl is not passed, automatically offload the maximum possible number of layers to VRAM.

Contributor (Author):

Could be. If someone doesn't want that, they could simply pass -ngl 0 or just compile without GPU support.

Collaborator:

Yes, but this is just my personal point of view. @ggerganov or @slaren would have a better global view.
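For reference, a minimal, self-contained sketch of the kind of parsing the diff above introduces: map the "auto"/"a" spelling of -ngl to a sentinel that the loader later resolves. The sentinel value and function name here are illustrative, not the PR's exact code.

```cpp
// Illustrative only: parse the -ngl value, mapping "auto"/"a" to a sentinel
// that the model loader would later resolve to "as many layers as fit in VRAM".
#include <cstdio>
#include <string>

static const int NGL_AUTO = -2; // hypothetical sentinel, not a llama.cpp constant

static int parse_ngl(const std::string & value) {
    if (value == "auto" || value == "a") {
        return NGL_AUTO;
    }
    return std::stoi(value); // plain integer layer count, as before
}

int main() {
    std::printf("-ngl a  -> %d\n", parse_ngl("a"));
    std::printf("-ngl 22 -> %d\n", parse_ngl("22"));
    return 0;
}
```

If auto ever becomes the default, as suggested above, the same sentinel could simply be the initial value of n_gpu_layers.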

github-actions bot commented Apr 5, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 442 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10663.68ms p(95)=27346.17ms fails=, finish reason: stop=386 truncated=56
  • Prompt processing (pp): avg=116.61tk/s p(95)=566.61tk/s
  • Token generation (tg): avg=23.52tk/s p(95)=36.14tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=2549662cde1861755515b75bc5e2a1b83f62d63b

Benchmark charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 442 iterations).

slaren (Collaborator) commented Apr 5, 2024

This is an important feature, but I would be opposed to merging an implementation that just assumes a value for the KV and compute buffer sizes, or doesn't correctly calculate the size of the layers, or ignores the CUDA pool size, or ignores multiple GPUs, or any other factors that would need to be taken into account. I would rather have the user test different values manually, than to give them a wrong default that they will assume is correct. Realistically, this cannot be implemented correctly without making significant changes to the architecture of llama.cpp.
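For readers following along, the factors slaren lists can be pictured as one budget. A rough sketch, with the caveat that none of these names exist in llama.cpp and the real values depend on context size, batch size, cache type, and backend behaviour:

```cpp
// Illustrative budget of the VRAM consumers mentioned above; all names are hypothetical.
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct vram_budget {
    size_t weights_per_layer;  // repeating-layer weights
    size_t non_repeating;      // output/embedding tensors kept on the GPU
    size_t kv_cache_per_layer; // grows with n_ctx and the cache type
    size_t compute_buffer;     // depends on n_batch / n_ubatch
    size_t backend_pool;       // temporary allocations, e.g. the CUDA pool
};

static int32_t max_offload_layers(size_t free_vram, const vram_budget & b, int32_t n_layer) {
    const size_t fixed     = b.non_repeating + b.compute_buffer + b.backend_pool;
    const size_t per_layer = b.weights_per_layer + b.kv_cache_per_layer;
    if (free_vram <= fixed || per_layer == 0) {
        return 0;
    }
    return std::min((int32_t) ((free_vram - fixed) / per_layer), n_layer);
}
```

The arithmetic is the easy part; the objection is about obtaining these numbers correctly and keeping them in sync with the rest of the code.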

phymbert (Collaborator) commented Apr 5, 2024

Yes, vLLM does this with a GPU memory target ratio, but you often get CUDA malloc issues and need to decrease the target, so it is effectively equivalent to our current -ngl.

SleepyYui (Contributor, Author):

> This is an important feature, but I would be opposed to merging an implementation that just assumes a value for the KV and compute buffer sizes, or doesn't correctly calculate the size of the layers, or ignores the CUDA pool size, or ignores multiple GPUs, [...] Realistically, this cannot be implemented correctly without making significant changes to the architecture of llama.cpp.

Assuming a KV size was just to get a prototype done; calculating the layer sizes is nearly done on my side (the guesstimation was just for prototyping as well). I'm pretty sure it can be implemented for single-GPU use without any major changes.
I will update once I'm done with the layer calculations.

JohannesGaessler (Collaborator) left a comment

As slaren said, this would definitely be a welcome addition, but there are many edge cases that need to be taken into account.

common/common.cpp (outdated, resolved)
common/common.h (outdated, resolved)
llama.cpp (outdated)
static int llm_determine_max_ngl(const llama_model_loader & ml, const llama_model & model, const int main_gpu) {
    const auto & hparams = model.hparams;

    size_t available_gpu_memory = llama_get_available_device_memory(main_gpu);
Collaborator:

The program logic here is inconsistent with the --help text. The --help says "based on VRAM size", which I would interpret as total VRAM, but here you are using the amount of free VRAM.

llama.cpp (outdated)
    available_gpu_memory = available_gpu_memory - total_buf_size; // buffer size

    // Calculate the maximum number of layers that can fit into the GPU memory
    int32_t max_ngl = std::floor(static_cast<float>(available_gpu_memory) / memory_per_layer);
Collaborator:

You should leave a small amount of headroom when the number of layers is determined automatically; you want to avoid a scenario where an application OOMs because llama.cpp left only 5 MB of VRAM free.

Contributor (Author):

I thought about doing that; thanks for the reminder.
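As a minimal sketch of the headroom idea, assuming a fixed margin (the 512 MiB value is an arbitrary placeholder, not a number from this PR):

```cpp
// Illustrative: reserve a margin before dividing by the per-layer cost, so the
// automatic choice does not leave other applications only a few MB of VRAM.
#include <cstddef>
#include <cstdint>

static int32_t layers_that_fit(size_t available_gpu_memory, size_t memory_per_layer) {
    const size_t headroom = 512u * 1024u * 1024u; // placeholder margin
    if (memory_per_layer == 0 || available_gpu_memory <= headroom) {
        return 0;
    }
    return (int32_t) ((available_gpu_memory - headroom) / memory_per_layer);
}
```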

SleepyYui (Contributor, Author):

The KV cache size still needs to be calculated properly (will do soon). Otherwise, it should be fine now. Anything I missed?
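For reference, the usual back-of-the-envelope estimate for the KV cache size (standard transformer bookkeeping, not code from this PR):

```cpp
// KV cache bytes ~= 2 (K and V) * n_layer * n_ctx * n_head_kv * head_dim * bytes per element.
#include <cstddef>
#include <cstdint>

static size_t kv_cache_bytes(uint32_t n_layer, uint32_t n_ctx,
                             uint32_t n_head_kv, uint32_t head_dim,
                             size_t bytes_per_elem /* 2 for f16 */) {
    return (size_t) 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem;
}
// Example: a 13B model (40 layers, 40 KV heads, head_dim 128) at n_ctx = 4096 in f16
// works out to about 3.1 GiB, so it has to be part of the per-layer budget.
```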

JohannesGaessler (Collaborator):

> Anything I missed?

Quite frankly there are still many edge cases that are not yet being considered. Just off the top of my head:

  • VRAM for non-repeating tensors.
  • The multi-GPU code is simply incorrect. It is not enough to check the total VRAM use against the total VRAM available; instead, the auto option needs to ensure that each individual GPU does not OOM (see the sketch after this list). --split-mode row currently distributes weights, but not the compute buffer or the context, to other GPUs.
  • --tensor-split: if the user does not specify one, it should fill the GPUs in a way that maximizes VRAM use. If the user does specify a tensor split, the auto option should respect it.
  • --parallel parameter for multiple parallel sequences to decode.
  • Speculative decoding.
  • --no-kv-offload.
  • --cache-type-k, --cache-type-v.
  • VRAM needed for the buffer pool used for temporary buffers.
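To make the multi-GPU bullet concrete, here is a rough sketch of a per-device feasibility check; the names and the split logic are hypothetical, and a real check would also have to place the compute buffer and the context as described above:

```cpp
// Hypothetical per-device check: with a tensor split, each GPU must individually
// fit its share of the offloaded weights plus its own KV/compute/pool overhead;
// checking only the sum across all GPUs is not enough.
#include <cstddef>
#include <vector>

struct device_plan {
    size_t free_vram;      // free memory reported for this GPU
    float  split_fraction; // share of the repeating layers assigned to it
};

static bool offload_fits(const std::vector<device_plan> & devices,
                         size_t offloaded_weight_bytes,
                         size_t per_device_overhead) {
    for (const auto & d : devices) {
        const size_t need = (size_t) (offloaded_weight_bytes * d.split_fraction) + per_device_overhead;
        if (need > d.free_vram) {
            return false; // this GPU would OOM even if the total across GPUs fits
        }
    }
    return true;
}
```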

slaren (Collaborator) commented Apr 8, 2024

The design also needs to take into account possible changes to other parts of the code. It is not OK to just duplicate the calculations in this function. This will become outdated very quickly, and then it will become an unacceptable maintenance burden, or just plain wrong and misleading. And this cannot be done without very significant changes to the llama.cpp and backend interfaces. I already mentioned this: realistically, this cannot be done without significant changes to the architecture.

schmorp commented Apr 11, 2024

It would be extremely nice if this were not "fully automatic, one size fits all", but instead one could specify absolute limits per GPU, so one can e.g. leave enough space for other workloads (such as the OS).

WilliamTambellini (Contributor):

@slaren I understand this is a sensitive feature, but FYI we are very strongly interested in this one. Let us know if there is anything we could do. Best.

JohannesGaessler (Collaborator):

Are you saying you're interested in implementing the feature or using it?

mofosyne added the labels "review complexity: medium" (generally requires more time to grok but manageable by beginner to medium expertise level), "enhancement" (new feature or request), and "need feedback" (testing and feedback with results are needed) on May 10, 2024.
SleepyYui (Contributor, Author):

I'm willing to resume this in the future if I either find an easier way to handle it or there are compromises you are willing to accept. I haven't had much time recently, which is why this comes a little late...

AshD commented May 30, 2024

I am interested in this feature and can explain our use case. We have a client app for Windows, Fusion Quill. It uses llama.cpp and runs on end-user Windows 11/10 PCs.

There is no easy way to tell the user the optimal number of layers to offload to the GPU for their configuration. Having llama.cpp calculate it is the best solution for our use case.

It can be a new optional parameter so it does not interfere with current use cases. Multi-GPU support can be a future implementation. I have dual GPUs but most of my end users don't.

Also, if someone has ideas on how this could be done in code outside of llama.cpp, I would be interested in hearing them.
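On that last point, one very rough way to approximate this outside of llama.cpp is to query free VRAM from the driver and derive n_gpu_layers before loading the model. A sketch assuming a single CUDA GPU; the per-layer size is a caller-supplied estimate (e.g. model file size divided by the block count), the headroom is a placeholder, and error handling is minimal:

```cpp
// Rough host-side sketch: choose n_gpu_layers from free VRAM before loading.
// This is less accurate than doing the accounting inside llama.cpp itself.
#include <cuda_runtime.h>
#include "llama.h"
#include <algorithm>
#include <cstddef>
#include <cstdint>

static int32_t guess_ngl(size_t approx_bytes_per_layer, int32_t n_layer_total) {
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess || approx_bytes_per_layer == 0) {
        return 0; // cannot query VRAM: fall back to CPU-only
    }
    const size_t headroom = 1024u * 1024u * 1024u; // keep ~1 GiB free for KV, compute buffers, other apps
    if (free_b <= headroom) {
        return 0;
    }
    return std::min((int32_t) ((free_b - headroom) / approx_bytes_per_layer), n_layer_total);
}

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    // numbers roughly matching the 13B model from the log above: ~8.6 GiB, 40 blocks
    const size_t approx_layer = (size_t) (8.6 * 1024.0 * 1024.0 * 1024.0) / 40;
    mparams.n_gpu_layers = guess_ngl(approx_layer, 40);

    llama_model * model = llama_load_model_from_file("models/model.gguf", mparams);
    if (model != nullptr) {
        llama_free_model(model);
    }
    llama_backend_free();
    return 0;
}
```

As the thread points out, this ignores the KV cache, compute buffers, and multi-GPU setups, so it only works as a conservative first guess.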
