py : fix missing added_tokens_dict for SPM and BPE vocabs #4971

Merged: ggerganov merged 4 commits into master from gg/fix-spm-added-tokens-dict-4958 on Jan 17, 2024

ggerganov (Member) commented on Jan 16, 2024:

fix: #4958 #4925

ggerganov force-pushed the gg/fix-spm-added-tokens-dict-4958 branch from 9aefd14 to a137273 on January 16, 2024 12:08
ggerganov changed the title from "py : fix missing added_tokens_dict for SPM vocab" to "py : fix missing added_tokens_dict for SPM and BPE vocabs" on Jan 16, 2024
ggerganov added the "need feedback" label (Testing and feedback with results are needed) on Jan 16, 2024

TheBloke (Contributor) commented:

Confirming this now works, as per my comment: #4958 (comment)

Many thanks

ggerganov merged commit 4f4bf35 into master on Jan 17, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024

* py : fix missing added_tokens_dict for SPM vocab

* py : pad with unknown tokens when data is missing

ggml-ci

* py : fix BPE vocab conversion

ggml-ci

* py : fix padded dummy tokens (I hope)
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026

Development

Successfully merging this pull request may close these issues.

convert.py: --pad-vocab not working with SPM, 'SentencePieceVocab' object has no attribute 'added_tokens_dict'. Did you mean: 'added_tokens_list'?

2 participants