
gguf : add special tokens metadata for FIM/Infill #6689

Merged 1 commit into ggerganov:master on Apr 16, 2024

Conversation

danbev (Contributor) commented Apr 15, 2024

This commit adds special token metadata for Fill-In-the-Middle (FIM)/Infill to the GGUF model.

The motivation is that there is currently support for CodeLlama, but other models such as CodeGemma now exist, and the different models use different token ids for the special tokens. Recording these ids as metadata allows multiple models to be supported.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
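
For context, the following is a minimal sketch of how such per-model FIM token ids could be recorded as GGUF metadata using the gguf-py package. The tokenizer.ggml.*_token_id key names follow the pattern this PR appears to introduce, and the token ids are made-up placeholders rather than any real model's values:

# Hedged sketch: write FIM/Infill special-token ids into GGUF metadata with
# gguf-py. The key names and token ids below are illustrative assumptions.
from gguf import GGUFWriter

# Hypothetical per-model token ids; a real converter would read these from
# the model's tokenizer configuration instead of hard-coding them.
fim_token_ids = {
    "tokenizer.ggml.prefix_token_id": 3,
    "tokenizer.ggml.middle_token_id": 4,
    "tokenizer.ggml.suffix_token_id": 5,
}

writer = GGUFWriter("model-with-fim-metadata.gguf", "llama")
for key, token_id in fim_token_ids.items():
    writer.add_uint32(key, token_id)

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.close()

A loader can then look these ids up per model instead of hard-coding CodeLlama's or CodeGemma's token ids.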
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 459 iterations 🚀

Details:
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10270.35ms p(95)=25858.35ms fails=, finish reason: stop=407 truncated=52
  • Prompt processing (pp): avg=114.97tk/s p(95)=510.44tk/s
  • Token generation (tg): avg=26.17tk/s p(95)=36.67tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=infill-metadata commit=021baca34a9c6b3683b7f3ffaf6de0de0d09198d

[Benchmark charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 (duration=10m, 459 iterations).]
ggerganov merged commit 4fbd809 into ggerganov:master on Apr 16, 2024
63 checks passed
teleprint-me (Contributor) commented Apr 16, 2024

This commit breaks model compatibility. I've been experimenting with train-text-from-scratch and the last commit that operates as expected is commit 7593639c.

git log --pretty --oneline 132f5579..HEAD  
dbceec87 (HEAD -> master, origin/master, origin/HEAD) llama : add StableLM2 12B (#6635)
f4dea7da llama : add qwen2moe (#6074)
8a56075b gritlm : add --outdir option to hf.sh script (#6699)
58227ffd perplexity : require positive --ctx-size arg (#6695)
4fbd8098 (infill-metadata) gguf : add special tokens metadata for FIM/Infill (#6689)
7593639c (stable) `main`: add --json-schema / -j flag (#6659)

main ends up looking for general.name, which isn't available.

./main -m models/shakespeare/ggml-shakespeare-256x16-f32-LATEST.gguf --color -e -s 1337 -c 4096 -n 256 --n-gpu-layers 16 -p "When forty winters shall besiege thy brow,"
Log start
main: build = 2680 (4fbd8098)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1337
llama_model_loader: loaded meta data with 20 key-value pairs and 147 tensors from models/shakespeare/ggml-shakespeare-256x16-f32-LATEST.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                          general.file_type u32              = 0
llama_model_loader: - kv   2:                       llama.context_length u32              = 64
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 256
llama_model_loader: - kv   4:                  llama.feed_forward_length u32              = 768
llama_model_loader: - kv   5:                 llama.attention.head_count u32              = 8
llama_model_loader: - kv   6:                          llama.block_count u32              = 16
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 32
llama_model_loader: - kv   8:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  10:                    llama.rope.scale_linear f32              = 1.000000
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:          tokenizer.ggml.seperator_token_id u32              = 4294967295
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 4294967295
llama_model_loader: - type  f32:  147 tensors
llama_model_load: error loading model: error loading model vocabulary: key not found in model: general.name
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/shakespeare/ggml-shakespeare-256x16-f32-LATEST.gguf'
main: error: unable to load model

I think this is due to the way the vocabulary loading was modified; the llama architecture has always been supported.

I'm using the mistralai vocab I generated with convert.py.

python gguf-py/scripts/gguf-dump.py models/ggml-vocab-mistral.gguf
* Loading: models/ggml-vocab-mistral.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 25 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 0
      3: UINT64     |        1 | GGUF.kv_count = 22
      4: STRING     |        1 | general.architecture = 'llama'
      5: STRING     |        1 | general.name = 'mistralai'
      6: UINT32     |        1 | llama.vocab_size = 32000
      7: UINT32     |        1 | llama.context_length = 32768
      8: UINT32     |        1 | llama.embedding_length = 4096
      9: UINT32     |        1 | llama.block_count = 32
     10: UINT32     |        1 | llama.feed_forward_length = 14336
     11: UINT32     |        1 | llama.rope.dimension_count = 128
     12: UINT32     |        1 | llama.attention.head_count = 32
     13: UINT32     |        1 | llama.attention.head_count_kv = 8
     14: FLOAT32    |        1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
     15: FLOAT32    |        1 | llama.rope.freq_base = 1000000.0
     16: STRING     |        1 | tokenizer.ggml.model = 'llama'
     17: [STRING]   |    32000 | tokenizer.ggml.tokens
     18: [FLOAT32]  |    32000 | tokenizer.ggml.scores
     19: [INT32]    |    32000 | tokenizer.ggml.token_type
     20: UINT32     |        1 | tokenizer.ggml.bos_token_id = 1
     21: UINT32     |        1 | tokenizer.ggml.eos_token_id = 2
     22: UINT32     |        1 | tokenizer.ggml.unknown_token_id = 0
     23: BOOL       |        1 | tokenizer.ggml.add_bos_token = True
     24: BOOL       |        1 | tokenizer.ggml.add_eos_token = False
     25: STRING     |        1 | tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['"

* Dumping 0 tensor(s)

This commit changed the special vocabulary ids. I haven't dug in too deep yet; still looking into it.

// CodeGemma (LLM_ARCH_GEMMA). This can potentially be removed once
// new versions of these models have been published.
std::string gen_name;
ml.get_key(LLM_KV_GENERAL_NAME, gen_name);
Contributor:

Yeah, it's lines 4083 - 4106 that are causing the issue.

Contributor:

examples/train-text-from-scratch/train-text-from-scratch.cpp doesn't rely on or use LLM_KV_GENERAL_NAME, so that's why I'm able to train but not run inference. This most likely has other unintended side effects due to the implementation.

Owner:

Does #6709 fix the issue?

Contributor:

Yeah, I think so.

ml.get_key(LLM_KV_GENERAL_NAME, gen_name, false);

It seems like setting the required parameter to false did the trick.
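
A rough Python analogue of that required-flag behaviour (an illustration only, not the actual llama.cpp C++ helper):

# Illustration only: a get_key-style lookup where `required` decides whether a
# missing metadata key aborts the load or falls back to a default value.
def get_key(metadata: dict, key: str, default=None, required: bool = True):
    if key in metadata:
        return metadata[key]
    if required:
        raise KeyError(f"key not found in model: {key}")
    return default

# An older model file without general.name (e.g. one produced by
# train-text-from-scratch) still loads when the key is treated as optional:
meta = {"general.architecture": "llama"}
name = get_key(meta, "general.name", default="", required=False)  # "" instead of an error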

Contributor (PR author):

@teleprint-me Sorry about causing this and wasting your time. And thanks @ggerganov for fixing my mistake!

teleprint-me (Contributor) commented Apr 16, 2024

PR #6709 fixed it. I'm able to run the latest code with this change. I tested another custom model I've been tinkering with and it's working again. It might be a good idea to add "general.name" to train-text-from-scratch.cpp; I'll see if I can add it in another PR if that's alright.

tybalex pushed a commit to tybalex/function.cpp that referenced this pull request Apr 17, 2024
NightMachinery commented:
How does llama.cpp know the FIM prompt template for each model? Does it just assume the template FIM_START prefix FIM_SUFFIX suffix FIM_COMPLETE?
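
For reference, the layout the question describes would be assembled roughly as below. This is a hedged sketch of the common prefix/suffix/middle ("PSM") FIM prompt order; it illustrates the assumption in the question rather than confirming llama.cpp's behaviour, and the token ids are placeholders:

# Sketch of a typical FIM prompt layout: prefix token, prefix text tokens,
# suffix token, suffix text tokens, then the middle token where the model
# starts generating the infilled code. All ids here are made up.
def build_fim_prompt(prefix_id: int, suffix_id: int, middle_id: int,
                     prefix_tokens: list[int], suffix_tokens: list[int]) -> list[int]:
    return [prefix_id] + prefix_tokens + [suffix_id] + suffix_tokens + [middle_id]

tokens = build_fim_prompt(3, 5, 4, [101, 102, 103], [201, 202])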
