Minor improvements in GPT2 tokenizer #3567

Merged 2 commits from fix-3502 into ggerganov:master on Oct 10, 2023

Conversation

@goerch (Collaborator) commented Oct 10, 2023

Fixes our bugs in #3502.

@cmp-nct: could the small fixes in bpe_gpt2_preprocess be relevant for you?

@goerch goerch requested a review from ggerganov October 10, 2023 05:34
@goerch goerch changed the title from "Minor improvements for #3502" to "Minor improvements in GPT2 tokenizer" Oct 10, 2023
@ggerganov (Owner) left a comment

Does this make the tokenization of wikitext match the Python tokenizer, or is there still the 4-digit tokenization discrepancy?

@goerch (Collaborator, Author) commented Oct 10, 2023

Does this make the tokenization of wikitext match the Python tokenizer, or is there still the 4-digit tokenization discrepancy?

The 4-digit tokenization discrepancy could be a bug in HF fast tokenizers.
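
To check that hypothesis, here is a minimal sketch (not part of this PR) that compares Hugging Face's slow (pure-Python) and fast (Rust-backed) GPT-2 tokenizers on strings containing 4-digit numbers; a mismatch between the two would point at the fast tokenizer rather than at llama.cpp:

```python
# Minimal sketch: compare HF slow vs. fast GPT-2 tokenizers on 4-digit numbers.
# The sample strings are illustrative, not taken from wikitext.
from transformers import GPT2Tokenizer, GPT2TokenizerFast

slow = GPT2Tokenizer.from_pretrained("gpt2")      # pure-Python tokenizer
fast = GPT2TokenizerFast.from_pretrained("gpt2")  # Rust-backed tokenizer

samples = [" 1234", "the year 2023 ended", "In 1998 , it was"]
for text in samples:
    ids_slow = slow.encode(text)
    ids_fast = fast.encode(text)
    status = "OK" if ids_slow == ids_fast else "MISMATCH"
    print(f"{status}: {text!r} slow={ids_slow} fast={ids_fast}")
```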

@ggerganov (Owner) commented

Thank you for the thorough investigation. After @cebtenzzre's review we can merge this.

@apage43 (Contributor) commented Oct 10, 2023

The remaining discrepancy from HF is caused by different pretokenization: #3502 (comment)
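
For context, GPT-2's reference pretokenizer (the regex in OpenAI's encoder.py) splits the text into chunks before BPE merges are applied; if llama.cpp's bpe_gpt2_preprocess splits differently, the resulting token ids can diverge even with identical vocab and merges files. A minimal Python sketch of the reference split, using the third-party regex module:

```python
import regex  # third-party; the stdlib `re` module does not support \p{...} classes

# Reference GPT-2 pretokenization pattern from OpenAI's encoder.py.
GPT2_SPLIT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

text = "In 1998 , the value was 1234."
print(GPT2_SPLIT.findall(text))
# ['In', ' 1998', ' ,', ' the', ' value', ' was', ' 1234', '.']
```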

@cebtenzzre cebtenzzre removed their request for review October 10, 2023 16:56
@goerch goerch merged commit 233fc1c into ggerganov:master Oct 10, 2023
39 checks passed
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 12, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggerganov#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggerganov#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggerganov#3592)
  ci : check if there is enough VRAM (ggerganov#3596)
  server : add completion mode (no chat) (ggerganov#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggerganov#3588)
  main : fix session loading bug (ggerganov#3400)
  server : add parameter -tb N, --threads-batch N (ggerganov#3584)
  common : fix mirostat state when using multiple sequences (ggerganov#3543)
  batched : add bench tool (ggerganov#3545)
  examples : add batched.swift + improve CI for swift (ggerganov#3562)
  Add MPT model to supported models in README.md (ggerganov#3574)
  Minor improvements in GPT2 tokenizer (ggerganov#3567)
  readme : add bloom (ggerganov#3570)
  llm : add bloom models (ggerganov#3553)
  swift : improvements and fixes (ggerganov#3564)
  llm : add MPT support (ggerganov#3417)
  infill. : fix tokenization (ggerganov#3508)
  ...
@goerch goerch deleted the fix-3502 branch October 22, 2023 15:20