Port cutlery to Rust and pyo3 #20
Conversation
Benefits:

* Safety of Rust implementations.
* Reuse of the existing `sentencepiece` and `wordpieces` crates, which are well-covered by tests and have already been used to process billions of tokens.
* Makes a future BPE implementation much easier.
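To make the pyo3 side concrete, a minimal binding might look like the sketch below. This is illustrative only: the class name `WordPieceProcessor`, the module name `_cutlery`, and the stored fields are assumptions, not the actual cutlery API.

```rust
use pyo3::prelude::*;

/// Illustrative wrapper; in the real port this would hold the
/// `wordpieces` crate's processor rather than a bare vocabulary.
#[pyclass]
struct WordPieceProcessor {
    vocab: Vec<String>,
}

#[pymethods]
impl WordPieceProcessor {
    #[new]
    fn new(vocab: Vec<String>) -> Self {
        WordPieceProcessor { vocab }
    }

    /// Return the vocabulary size.
    fn vocab_size(&self) -> usize {
        self.vocab.len()
    }
}

/// Module definition: exposes the Rust types to Python
/// (hypothetical module name `_cutlery`).
#[pymodule]
fn _cutlery(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<WordPieceProcessor>()?;
    Ok(())
}
```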
pyproject.toml (Outdated)
```toml
requires = ["maturin>=0.13,<0.14"]
build-backend = "maturin"
```
Unless there's a particular reason to use maturin, I think that setuptools-rust will be easier to use with wheelwright, and I've had problems in the past getting maturin to generate working sdists (but this was a while ago, and partially related to relative source paths, I think).
I have gone back and forth a few times, but I don't know why I ended up with maturin, so I'll try to convert it to setuptools-rust.
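For reference, a setuptools-rust conversion would swap the build backend in pyproject.toml for something along these lines (a sketch; version pins are illustrative), with the extension itself declared via `setuptools_rust.RustExtension` in setup.py:

```toml
[build-system]
requires = ["setuptools", "wheel", "setuptools-rust"]
build-backend = "setuptools.build_meta"
```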
We will have to modernize wheelwright a bit at some point, but for now it's still using `python setup.py sdist`.
But the main thing is the "working sdist" part.
But I think sdists should work. I have been testing with tox, and AFAIK tox attempts to build from an sdist.
The thing is that implementing things like BPE (which I am working on now) really gets terrible in Cython, because you need binary heaps, string views, maps with pairs, etc. to do it efficiently. So you end up with Cython/C++ template soup.
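For comparison, all of those data structures are in Rust's standard library, which is part of what makes the port attractive. Below is a minimal sketch of a greedy BPE merge loop, assuming a hypothetical `merge_ranks` table mapping adjacent symbol pairs to merge priorities:

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Greedy BPE sketch: repeatedly apply the best-ranked merge.
/// `merge_ranks` maps an adjacent symbol pair to its priority
/// (lower rank = merge earlier).
fn bpe(
    mut symbols: Vec<String>,
    merge_ranks: &HashMap<(String, String), usize>,
) -> Vec<String> {
    loop {
        // Min-heap of (rank, position) over all adjacent pairs
        // that have a merge rule.
        let mut heap = BinaryHeap::new();
        for i in 0..symbols.len().saturating_sub(1) {
            let pair = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = merge_ranks.get(&pair) {
                heap.push(Reverse((rank, i)));
            }
        }
        match heap.pop() {
            // Apply the best merge: join the pair into one symbol.
            Some(Reverse((_rank, i))) => {
                let merged = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols[i] = merged;
                symbols.remove(i + 1);
            }
            // No applicable merges left.
            None => return symbols,
        }
    }
}
```

A real implementation would update the heap incrementally rather than rebuilding it after every merge; the point here is only that `BinaryHeap`, tuple keys, and `HashMap` compose without any template machinery.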
The current version doesn't work on my end, at least.

Still working on things; it's a draft, not really ready to try out or review yet. I just wanted to do some CI runs.
All I tried out was building from the sdist. I wanted something that was easier to build than the existing sentencepiece; otherwise this seems like it's starting to have fewer advantages over tokenizers?
That's a fair point. I'll rethink a bit how to do BPE. I don't want to do it in Python, because it will be too slow. But I think template-heavy Cython is also messy. Maybe I'll do it in pure C++ and call that from Cython.
Pure C++ is fine with me. I'm more worried about adding more Rust dependencies. It's just so hard for users on the Python side.