Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port cutlery to Rust and pyo3 #20

Closed
wants to merge 2 commits into from
Closed

Port cutlery to Rust and pyo3 #20

wants to merge 2 commits into from

Conversation

danieldk
Copy link
Contributor

Benefits:

  • Safety of Rust implementations.
  • Reuse existing sentencepiece and wordpieces crates that are well-covered by tests and have been used to process billions of tokens already.
  • Makes future BPE implementation much easier.

Benefits:

* Safety of Rust implementations.
* Reuse existing `sentencepiece` and `wordpieces` crates that are
  well-covered by tests and have been used to process billions of
  tokens already.
* Makes future BPE implementation much easier.
pyproject.toml Outdated
requires = ["maturin>=0.13,<0.14"]
build-backend = "maturin"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there's a particular reason to use maturin, I think that setuptools-rust will be easier to use with wheelwright, and I've had problems in the past getting maturin to generate working sdists (but this was a while ago, and partially related to relative source paths, I think).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gone back and forth a few times, but I don't know why I ended up with maturin, so I'll try to convert it to setuptools-rust.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have to modernize wheelwright a bit at some point, but for now it's still using python setup.py sdist.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the main thing is "working sdist" part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think sdists should work. I have been testing with tox and AFAIK tox attempts to build from an sdist.

Copy link
Contributor Author

@danieldk danieldk Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that implementing things like BPE (which I am working on now) really gets terrible in Cython, because you need binary heaps, string views, maps with pairs, etc. to do it efficiently. So you end up with Cython/C++ template soup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current version doesn't work on my end at least.

Still working on things, it's a draft, not really ready yet to try out or review. Just wanted to do some CI runs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All I tried out was building from the sdist. I wanted something that was easier to build than the existing sentencepiece, otherwise this seems like it's starting to having fewer advantages over tokenizers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point. I'll rethink a bit how to do BPE. I don't want to do it in Python, because it will be too slow. But I think in template-heavy Cython is also messy. Maybe I'll do it in pure C++ and call that from Cython.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pure C++ is fine with me. I'm more worried about more rust dependencies. It's just so hard for users from the python side.

@danieldk danieldk closed this Oct 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants