
🥢 Curated Tokenizers

This Python library provides wordpiece and sentencepiece tokenizers. The following tokenizer types are currently supported:

| Tokenizer | Binding       | Example model |
| --------- | ------------- | ------------- |
| BPE       | sentencepiece |               |
| Byte BPE  | Native        | RoBERTa/GPT-2 |
| Unigram   | sentencepiece | XLM-RoBERTa   |
| Wordpiece | Native        | BERT          |
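To illustrate what a wordpiece tokenizer does, here is a minimal sketch of the greedy longest-match algorithm that WordPiece-style models use. This is a toy illustration only, not code from this library; the vocabulary below is invented, and real models ship vocabularies with tens of thousands of pieces:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching pieces from the vocabulary.

    Pieces that continue a word are conventionally prefixed with "##".
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Greedily try the longest remaining substring first.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Invented toy vocabulary for illustration.
VOCAB = {"un", "afford", "##afford", "##able"}
print(wordpiece("unaffordable", VOCAB))  # ['un', '##afford', '##able']
```

Byte BPE and unigram tokenizers differ in how pieces are chosen (learned merges and a probabilistic model, respectively), but all of them map text to a sequence of pieces from a fixed vocabulary in this fashion.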

⚠️ Warning: experimental package

This package is experimental and it is likely that the APIs will change in incompatible ways.

⏳ Install

Curated tokenizers is available through PyPI:

```bash
pip install curated_tokenizers
```

🚀 Quickstart

The easiest way to get started with curated tokenizers is through the [curated-transformers](https://github.com/explosion/curated-transformers) library, which also provides functionality to load tokenization models from the Hugging Face Hub.
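Regardless of how a model is loaded, encoding with a BPE-style tokenizer boils down to splitting text into symbols and repeatedly applying the highest-priority learned merge. A toy sketch of that loop, with an invented merge table (this is not curated-tokenizers' API, just an illustration of the underlying algorithm):

```python
def bpe(word, merges):
    """Encode a word by greedily applying learned BPE merges.

    `merges` is an ordered list of symbol pairs; earlier pairs have
    higher priority, mirroring how trained merge tables are applied.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        candidates = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        rank, i = min(candidates)
        if rank == float("inf"):
            break  # no learned merge applies anymore
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Invented toy merge table for illustration.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
print(bpe("lower", MERGES))   # ['lower']
print(bpe("lowest", MERGES))  # ['low', 'e', 's', 't']
```

A word covered by the merges collapses into a single piece, while an out-of-vocabulary word falls back to smaller pieces; byte-level BPE (as used by GPT-2 and RoBERTa) runs the same loop over bytes instead of characters, so no input can ever be unrepresentable.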
