
[SYCL] refactor #6408

Draft · wants to merge 1 commit into base: master
Conversation

@airMeng (Collaborator) commented Mar 31, 2024

According to #5277 (reply in thread), this PR does the following:

  • separate the dpct-generated headers, for easier future maintenance
  • separate the GEMM-related operators, to prepare for introducing a template-based library, a.k.a. XeTLA (see the sketch below)
    - [ ] let the common backend handle H2D/D2H memcpy, to keep this PR as simple as possible
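
To make the GEMM split concrete: below is a minimal, header-only sketch of the kind of entry point a template-based library such as XeTLA (or any other GEMM backend) could later specialize. The file name, namespace, and signature are illustrative assumptions, not files from this PR.

```cpp
// gemm-sketch.hpp — hypothetical example, not a file from this PR.
// Keeping GEMM behind one small header-only entry point means a XeTLA-backed
// specialization can later replace the naive body without touching the rest
// of the backend.
#pragma once
#include <cstddef>

namespace sycl_refactor_sketch {

// Naive reference GEMM: C = A (m x k) * B (k x n), all row-major.
template <typename T>
void gemm(const T * A, const T * B, T * C,
          std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            T acc = T(0);
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

} // namespace sycl_refactor_sketch
```

The point is only the shape of the split: callers include one small header, and the body can later be swapped for a tuned implementation without touching the rest of the backend.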

@airMeng (Collaborator, Author) commented Mar 31, 2024

@slaren Since we can now put the SYCL-related code under a directory instead of a single file, I might introduce a headers-only library for performance optimization, which would also simplify our effort (my job during work time 😁)

@ggerganov @mingfeima for awareness

Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 504 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=9274.74ms p(90)=26479.05ms fails=0, finish reason: stop=504 truncated=0
  • Prompt processing (pp): avg=241.61tk/s p(90)=732.4tk/s total=200.65tk/s
  • Token generation (tg): avg=102.96tk/s p(90)=278.78tk/s total=129.75tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sycl-refactor commit=a2e77e60d6d1e208096aae27e24a23ff9821c58b
Time series

[Charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing — "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 504 iterations"]

@slaren (Collaborator) commented Mar 31, 2024

@slaren Since we can now put the SYCL-related code under a directory instead of a single file, I might introduce a headers-only library for performance optimization, which would also simplify our effort (my job during work time 😁)

I think that's good; I plan to start using CUTLASS in the CUDA backend as well.

@airMeng changed the title from "[SYCL refactor" to "[SYCL] refactor" on Apr 1, 2024
@NeoZhangJianyu (Collaborator)

It's great to see the new structure.
The current SYCL backend has bugs that affect the IQ3 models, and the UT pass rate has dropped as well.
I'm working on fixing them now.
Is it possible to wait for my fix?

@abhilash1910 (Collaborator)

This is a good refactoring and would be helpful for debugging. I would suggest waiting for some IQ quant PRs and then resuming work on this.

@airMeng (Collaborator, Author) commented Apr 1, 2024

This is a good refactoring and would be helpful for debugging. I would suggest waiting for some IQ quant PRs and then resuming work on this.

It's great to see the new structure. The current SYCL backend has bugs that affect the IQ3 models, and the UT pass rate has dropped as well. I'm working on fixing them now. Is it possible to wait for my fix?

Yes, drop a note when you're finished.

@NeoZhangJianyu
Copy link
Collaborator

@airMeng
All IQ types in this PR are supported/fixed by #6521.
You can continue your work now.

Thank you!

@airMeng marked this pull request as draft on April 25, 2024 14:04
@airMeng marked this pull request as ready for review on April 30, 2024 09:37
@airMeng (Collaborator, Author) commented May 5, 2024

ggml-sycl.cpp — review threads (outdated, resolved)
@NeoZhangJianyu (Collaborator)

  1. The build with FP16 fails; please check and fix it.
  2. Please run ci/run.sh to make sure quality is not reduced.

@NeoZhangJianyu (Collaborator)

Regarding the sub-folder dpct:
I suggest not using a folder for "dpct". Save them as two files in the ggml-sycl folder, like dpct-helper.cpp/hpp (a minimal sketch follows below).

  1. There won't be more files in the dpct part; there is no need to add a subfolder for just 2 files.
  2. The dpct files are updated manually for llama.cpp's requirements.
    Saving them in a dpct folder will make others think they were copied directly from dpct.
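
For reference, a minimal sketch (assumed names and contents, not this PR's actual dpct files) of the kind of hand-maintained helper a dpct-helper.hpp could hold: a thin wrapper around a default in-order SYCL queue, which is the role the dpct-generated device manager plays for the generated code.

```cpp
// dpct-helper.hpp — illustrative sketch only, not the contents of this PR.
#pragma once
#include <sycl/sycl.hpp>
#include <string>

namespace dpct_helper_sketch {

// Process-wide default in-order queue, similar in spirit to what the
// dpct-generated device manager provides to generated code.
inline sycl::queue & default_queue() {
    static sycl::queue q{sycl::default_selector_v,
                         sycl::property::queue::in_order{}};
    return q;
}

// Convenience: name of the device behind the default queue.
inline std::string default_device_name() {
    return default_queue().get_device().get_info<sycl::info::device::name>();
}

} // namespace dpct_helper_sketch
```

Keeping such helpers in one hand-edited header also makes it clear they are maintained for llama.cpp rather than copied verbatim from dpct.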

@mofosyne added labels: "review complexity : high" (generally requires in-depth knowledge of LLMs or GPUs), "enhancement" (new feature or request) — May 10, 2024
github-actions bot added labels: "build" (compilation issues), "ggml" (changes relating to the ggml tensor library for machine learning), "SYCL" (https://en.wikipedia.org/wiki/SYCL - GPU programming language) — May 22, 2024
github-actions bot (Contributor) commented May 22, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 545 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8562.08ms p(95)=21241.45ms fails=, finish reason: stop=483 truncated=62
  • Prompt processing (pp): avg=100.34tk/s p(95)=436.58tk/s
  • Token generation (tg): avg=34.62tk/s p(95)=48.49tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sycl-refactor commit=50dffa13d8f947a077a03478aaf26dc70bdc7ecd

[Charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing — "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 545 iterations"]

@mofosyne (Collaborator)

There were a lot of changes which caused conflicts, and I can't see how to resolve them easily. @airMeng, can you see if there is much that needs to be fixed?

@airMeng (Collaborator, Author) commented May 22, 2024

There were a lot of changes which caused conflicts, and I can't see how to resolve them easily. @airMeng, can you see if there is much that needs to be fixed?

Never mind, it just needs more time.

@mofosyne marked this pull request as draft on May 22, 2024 09:52
separate mmq, mmvq, dmmv from the main files

avoid g_sycl_gpu_mgr null

fix no new line

fix backend no new line

fix fp16 issues

no final newlines

backup

backup

backup

backup

backup
Labels
build (compilation issues) · enhancement (new feature or request) · ggml (changes relating to the ggml tensor library for machine learning) · review complexity : high (generally requires in-depth knowledge of LLMs or GPUs) · SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language)

Projects: none yet
Development: successfully merging this pull request may close these issues — none yet

5 participants