Add ByteBPEProcessor #21

Merged (8 commits) on Oct 14, 2022

Conversation

danieldk
Contributor

This type of processor applies byte-level BPE encoding. The processor aims for compatibility with RoBERTa/GPT-2 BPE vocabs.

Fixes #19.
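
For readers unfamiliar with the technique, byte-level BPE can be sketched roughly as follows. This is an illustration only and not the code added in this PR: the function names `byte_symbols` and `apply_merges` are hypothetical, and the byte-to-character step is simplified (GPT-2 and RoBERTa vocabularies use a specific table that maps each byte to a printable Unicode character, whereas this sketch just maps byte value `b` to `chr(b)`).

```python
from typing import Dict, List, Tuple


def byte_symbols(text: str) -> List[str]:
    """Split text into one symbol per UTF-8 byte.

    Assumption: chr(b) stands in for the printable byte table used by
    GPT-2 style vocabularies; real vocab files use a different mapping.
    """
    return [chr(b) for b in text.encode("utf-8")]


def apply_merges(symbols: List[str], ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Repeatedly merge the adjacent pair with the lowest rank."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # The pair that was learned earliest (lowest rank) is merged first.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # No known merge applies any more.
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge list; the rank of a merge is its position in the list.
toy_merges = [("l", "o"), ("lo", "w")]
toy_ranks = {pair: rank for rank, pair in enumerate(toy_merges)}
pieces = apply_merges(byte_symbols("lower"), toy_ranks)
```

With the toy merges above, `pieces` is `["low", "e", "r"]`. Because the input is split into bytes first, any string can be encoded without an unknown-token fallback.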

@danieldk danieldk requested a review from shadeMe October 13, 2022 07:34
cutlery/_bbpe.pyx
version = f.readline()
if not version.startswith("#version: 0.2"):
raise ValueError(f"Only version 0.2 of the merges format is supported, was: {version.strip()}")
merges = [tuple(line.strip().split(" ")) for line in f]
Collaborator

Perhaps a sanity check to confirm that the splits have a length of 2?
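
One possible shape for such a check, as a minimal sketch rather than the change that was actually committed: the helper name `read_merges` and the error wording are hypothetical, and treating blank lines as errors is an assumption that may be stricter than what real merges files require.

```python
import io
from typing import List, TextIO, Tuple


def read_merges(f: TextIO) -> List[Tuple[str, str]]:
    """Read a version 0.2 merges file, validating every merge line."""
    version = f.readline()
    if not version.startswith("#version: 0.2"):
        raise ValueError(
            f"Only version 0.2 of the merges format is supported, was: {version.strip()}"
        )

    merges = []
    # Line 1 is the version header, so merge lines start at line 2.
    for lineno, line in enumerate(f, start=2):
        parts = line.strip().split(" ")
        # Each merge must consist of exactly two space-separated pieces.
        if len(parts) != 2:
            raise ValueError(
                f"Merge on line {lineno} must have 2 pieces, got {len(parts)}: {line.strip()!r}"
            )
        merges.append((parts[0], parts[1]))
    return merges


example = read_merges(io.StringIO("#version: 0.2\nl o\nlo w\n"))
```

Here `example` is `[("l", "o"), ("lo", "w")]`, while a line such as `l o w` raises a `ValueError` that names the offending line instead of producing a 3-tuple.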

@shadeMe shadeMe merged commit dd30cfc into main Oct 14, 2022
@shadeMe shadeMe deleted the bpe branch October 14, 2022 09:25
Successfully merging this pull request may close these issues.

Add support for BPE
2 participants