> ❗ This is a read-only mirror of the CRAN R package repository. tok — Fast Text Tokenization. Homepage: https://github.com/mlverse/tok. Report bugs for this package: https://github.com/mlverse/tok/issues

# tok


tok provides bindings to the 🤗 tokenizers library. It uses the same Rust library that powers the Python implementation.

We don’t yet provide the full API of tokenizers. Please open an issue if there’s a feature you are missing.

## Installation

You can install tok from CRAN using:

``` r
install.packages("tok")
```

Installing tok from source requires a working Rust toolchain. We recommend using rustup.

On Windows, you’ll also have to add the i686-pc-windows-gnu and x86_64-pc-windows-gnu targets:

``` sh
rustup target add x86_64-pc-windows-gnu
rustup target add i686-pc-windows-gnu
```

Once Rust is working, you can install the development version of this package via:

``` r
remotes::install_github("dfalbel/tok")
```

## Features

We still don’t have complete support for the 🤗tokenizers API. Please open an issue if you need a feature that is currently not implemented.

### Loading tokenizers

tok can load and use tokenizers that were previously serialized. For example, HuggingFace model weights are usually accompanied by a ‘tokenizer.json’ file that can be loaded with this library.

To load a pre-trained tokenizer from a JSON file, use:

``` r
path <- testthat::test_path("assets/tokenizer.json")
tok <- tok::tokenizer$from_file(path)
```

Use the encode method to tokenize sentences, and decode to transform token ids back into text.

``` r
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
```
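The returned encoding carries more than just the ids. A minimal sketch, assuming a serialized `tokenizer.json` file is available locally and that the encoding fields mirror the 🤗 tokenizers encoding structure (`$ids`, `$tokens`, `$attention_mask`):

``` r
library(tok)

# Path to a serialized tokenizer; adjust to your local file.
tok <- tokenizer$from_file("tokenizer.json")

enc <- tok$encode("hello world")
enc$ids            # integer ids fed to the model
enc$tokens         # the string pieces each id corresponds to
enc$attention_mask # 1 for real tokens, 0 for padding
```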

### Using pre-trained tokenizers

You can also load any tokenizer available on the HuggingFace Hub using the from_pretrained static method. For example, let’s load the GPT-2 tokenizer with:

``` r
tok <- tok::tokenizer$from_pretrained("gpt2")
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
```
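Tokenizers loaded this way can also process several texts at once. A sketch using encode_batch, assuming the gpt2 tokenizer from above and that encode_batch mirrors the Python bindings’ batch API (it returns one encoding per input):

``` r
library(tok)

tok <- tokenizer$from_pretrained("gpt2")

# Encode several sentences in one call; returns a list of encodings.
encs <- tok$encode_batch(c("hello world", "goodbye world"))
tok$decode(encs[[1]]$ids)
```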
