
ggml-backend: refine backend subsystem for CPU&GPU / CPU&NPU mixed inference more easily for a specified GGML backend #7641

Closed

Conversation

zhouwg (Contributor) commented May 30, 2024

Purpose

This PR intends to refine the ggml backend subsystem so that mixed inference between CPU & GPU or CPU & NPU can be enabled more easily.

There is already a "Backend Scheduler" feature in the ggml backend subsystem, but the "Backend Scheduler" is too complex, not a straightforward approach, and some of its backend APIs do not make sense:

For example, ggml_backend_supports_op is only called/used in https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp#L406.

For example, ggml_backend_offload_op is not reasonable.

All in all, a specialized backend does not need to implement all GGML OPs; many of them can fall back to the default GGML backend (this has been a long-standing problem in the ggml backend subsystem). A minimal sketch of this fallback idea follows the list below:

  • The overall framework of the existing ggml backend subsystem is really excellent, but part of the subsystem seems too strict for a specialized backend;

  • GPU/NPU computing might be slower than CPU computing in some special scenarios once we consider data copy/data preparation between CPU and GPU or CPU and NPU, memory size, and KV cache size.
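
A minimal sketch of the fallback idea, assuming the specialized backend keeps its tensors in host-accessible memory (the whole-graph partitioning policy and the helper names here are illustrative, not this PR's actual code):

```c
// Illustrative sketch only: probe each graph node with ggml_backend_supports_op()
// and fall back to the default CPU backend when something is unsupported.
// Direct access to cgraph->n_nodes / cgraph->nodes assumes a ggml version where
// struct ggml_cgraph is still public; compute_with_fallback() is a hypothetical helper.
#include "ggml.h"
#include "ggml-backend.h"

static bool backend_supports_graph(ggml_backend_t backend, const struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; i++) {
        if (!ggml_backend_supports_op(backend, cgraph->nodes[i])) {
            return false; // this op would have to fall back to the CPU backend
        }
    }
    return true;
}

static void compute_with_fallback(ggml_backend_t special, ggml_backend_t cpu,
                                  struct ggml_cgraph * cgraph) {
    // coarse whole-graph fallback; a real implementation could split per op
    ggml_backend_t target = backend_supports_graph(special, cgraph) ? special : cpu;
    ggml_backend_graph_compute(target, cgraph);
}
```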

Pros

This PR is less than one hundred LoC on top of the existing ggml backend subsystem and has NO side effects on existing code.

This PR follows the existing OO principles in ggml.c and ggml-backend.c.

This PR works very well with whisper.cpp and llama.cpp using the QNN backend, as expected, in my local dev environment.

The GGML QNN backend and many other GGML backends might benefit greatly from this PR.

It is very simple, straightforward, and easy to understand.

Cons

A static function in ggml.c is changed to a global function and referenced in this PR. This is not ideal, but the cost might be acceptable. A workaround for this problem is to merge the entire ggml-backend.c into ggml.c and ggml-backend.h into ggml.h accordingly.

Todo

A more sophisticated algorithm for mixed inference between CPU/GPU or CPU/NPU; this PR is a simple, concise, and straightforward implementation that addresses a long-standing problem in the ggml backend subsystem.

slaren (Collaborator) commented May 30, 2024

This is not correct.

@slaren slaren closed this May 30, 2024
zhouwg (Contributor, Author) commented May 30, 2024

This is not correct.

It works fine with whisper.cpp and llama.cpp using the QNN backend and with various test cases in my local dev environments.

Could you help point out the reason? Thanks.

slaren (Collaborator) commented May 30, 2024

There are too many things wrong here to list. At the most basic level, this approach will not work because backends typically have a memory that is not accessible from other backends, and when switching to a different backend it is necessary to ensure that all the tensors required to evaluate the graph are available in the backend memory. This is the main job of ggml_backend_sched.

Please wait until #6210 is complete, then ggml_backend_sched will be able to automatically run operations not supported by the backend in the CPU backend.
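
For reference, a rough sketch of how ggml_backend_sched is typically set up so the CPU backend acts as the fallback (hypothetical helper; exact signatures vary across ggml versions, and this is not the #6210 code):

```c
#include "ggml.h"
#include "ggml-backend.h"

// Rough sketch: the scheduler splits the graph across backends, copies tensors
// between backend memories as needed, and runs each split on its assigned backend.
// run_mixed() is a hypothetical helper; the ggml_backend_sched_new() signature
// shown here matches the ggml of this era but may differ in other versions.
static void run_mixed(ggml_backend_t gpu, ggml_backend_t cpu, struct ggml_cgraph * graph) {
    ggml_backend_t backends[2] = { gpu, cpu }; // CPU last -> used as the fallback backend
    ggml_backend_sched_t sched =
        ggml_backend_sched_new(backends, NULL, 2, GGML_DEFAULT_GRAPH_SIZE, false);

    ggml_backend_sched_graph_compute(sched, graph); // allocates splits and computes
    ggml_backend_sched_free(sched);
}
```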

zhouwg (Contributor, Author) commented May 30, 2024

There are too many things wrong here to list. At the most basic level, this approach will not work because backends typically have a memory that is not accessible from other backends, and when switching to a different backend it is necessary to ensure that all the tensors required to evaluate the graph are available in the backend memory. This is the main job of ggml_backend_sched.

Please wait until #6210 is complete, then ggml_backend_sched will be able to automatically run operations not supported by the backend in the CPU backend.

This PR has no side effects on the existing code and works very well with whisper.cpp and llama.cpp using the QNN backend (I guess other new backends would also work fine with whisper.cpp and llama.cpp if they follow the style in this PR). I have considered your concern carefully: the other/existing backends still keep their original behavior.

In fact:

  • This PR is still based on the existing excellent ggml backend subsystem; it just adds an alternative to the complex/complicated "Backend Sched" feature in the ggml backend subsystem.

  • A backend only needs to use system memory if its ggml_backend_xxx_buffer_is_host returns true (for example, the QNN backend), so your concern does not quite apply here (see the sketch after this list).

  • Any new backend can follow this style if its ggml_backend_xxx_buffer_is_host returns true.

  • The existing backends still keep their behavior, and a new backend can follow this new way.

  • The "Backend Sched" you provided can still be used in other scenarios (for example, in complicated scenarios in llama.cpp).
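
A minimal sketch of the assumption the list above relies on: a backend whose buffer type reports is_host == true keeps its tensors in ordinary system memory, so the CPU backend can read and write them directly (the QNN-flavored function name below is hypothetical):

```c
#include "ggml-backend.h"

// Hypothetical is_host callback for a backend buffer type; returning true means
// the buffer is plain system memory, readable/writable by the CPU backend without
// any device-to-host copies (exact interface member names vary by ggml version).
static bool ggml_backend_qnn_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
    (void) buft; // unused
    return true;
}
```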

Could you help reopen this PR so that other programmers/developers can participate in the debate? Let community developers decide whether this PR could be accepted. Thanks so much.

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 30, 2024
github-actions bot commented
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 538 iterations 🚀

Details (performance-related PR only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8670.67ms p(95)=21417.79ms fails=, finish reason: stop=493 truncated=45
  • Prompt processing (pp): avg=103.42tk/s p(95)=463.56tk/s
  • Token generation (tg): avg=45.22tk/s p(95)=47.3tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=refine-ggml-backend-subsystem commit=5b36de7ec3a0b965ca998da4bd7616ea3efe73d3

[Benchmark charts: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 538 iterations.]

zhouwg (Contributor, Author) commented May 31, 2024

This PR was NOT closed by me.

There is a clearer PR (with more code comments explaining how to do mixed inference between Qualcomm's CPU & GPU / CPU & NPU):

#7679

I submitted that new PR because I cannot update this one (submit a new commit to this PR) in this thread (I don't know why).

zhouwg added a commit to zhouwg/kantv that referenced this pull request May 31, 2024
zhouwg added a commit to zhouwg/kantv that referenced this pull request May 31, 2024
zhouwg added a commit to zhouwg/kantv that referenced this pull request May 31, 2024
zhouwg referenced this pull request in zhouwg/kantv May 31, 2024