Tokenizer is a Rust project inspired by OpenAI, aimed at providing a basic implementation of the Byte Pair Encoding (BPE) algorithm. This project serves as a learning opportunity for Rust enthusiasts, particularly those interested in the field of artificial intelligence.
Byte Pair Encoding (BPE) is a popular technique used in natural language processing (NLP) tasks, particularly in tokenization. It involves iteratively merging the most frequent pair of symbols in a corpus, effectively learning subword units that are useful for various NLP tasks.
Tokenizer implements the BPE algorithm in Rust, providing a foundation for further exploration and experimentation in tokenization and NLP.
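To illustrate the core idea, here is a minimal sketch of a single BPE training step in Rust, assuming byte-level input where ids 0..=255 are raw bytes and higher ids are learned merges. It is an illustration of the algorithm, not the project's actual API.

```rust
use std::collections::HashMap;

/// Count adjacent token pairs and return the most frequent one, if any.
fn most_frequent_pair(tokens: &[u32]) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|&(_, c)| c).map(|(p, _)| p)
}

/// Replace every occurrence of `pair` with the new token `id`.
fn merge(tokens: &[u32], pair: (u32, u32), id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(id);
            i += 2; // skip both halves of the merged pair
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Start from raw bytes; each byte is its own token (ids 0..=255).
    let mut tokens: Vec<u32> = "low lower lowest".bytes().map(u32::from).collect();
    let mut next_id = 256; // ids above 255 denote learned merges
    for _ in 0..3 {
        if let Some(pair) = most_frequent_pair(&tokens) {
            tokens = merge(&tokens, pair, next_id);
            println!("merged {:?} -> {}", pair, next_id);
            next_id += 1;
        }
    }
}
```

Repeating this loop grows a vocabulary of subword units; decoding simply reverses the recorded merges.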
- BPE Implementation: Provides a basic implementation of the Byte Pair Encoding algorithm in Rust.
- Extensible: Designed with modularity in mind, allowing for easy expansion with additional modules for different tokenization techniques, such as GPT-2 tokenization.
- MIT License: Released under the MIT License, enabling anyone to use, modify, and distribute the project freely.
- Clone the Repository:

  ```
  git clone https://github.com/usama3627/tokenizer.git
  ```

- Install Rust: Ensure that you have Rust installed on your system. You can install it using rustup.

- Build and Run:

  ```
  cd tokenizer
  cargo build
  cargo run
  ```

- Training Data: To train the tokenizer, download the training dataset from the provided link (a Hugging Face dataset), rename it to `myresponse.json`, and place it in the project directory. For testing, I am using 100 rows of tweets.
Ensure these dependencies are specified in your `Cargo.toml` file:

```toml
[dependencies]
serde = { version = "1.0.104", features = ["derive"] }
serde_json = "1.0.48"
```
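With those dependencies in place, loading the training file might look like the following minimal sketch. The top-level JSON shape (an array of objects) and the `text` field are assumptions about the dataset's schema, so adjust the struct to match the actual contents of `myresponse.json`; serde ignores unknown fields by default.

```rust
use serde::Deserialize;
use std::fs;

/// One row of the training data. The `text` field name is an assumption
/// about the dataset's schema, not a guarantee.
#[derive(Deserialize)]
struct Row {
    text: String, // assumed field holding the tweet body
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = fs::read_to_string("myresponse.json")?;
    let rows: Vec<Row> = serde_json::from_str(&raw)?;
    println!("loaded {} rows", rows.len());
    for row in rows.iter().take(3) {
        println!("{}", row.text);
    }
    Ok(())
}
```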
In future iterations of the project, the following enhancements can be considered:
- Modularization: Refactor the code into modules, separating concerns such as the BPE implementation and GPT-2 tokenization (a possible layout is sketched after this list).
- Optimizations: Explore optimizations to improve the performance of the tokenization process.
- Documentation: Enhance documentation to provide detailed explanations of the algorithms and codebase.
- Additional Tokenization Techniques: Integrate additional tokenization techniques to provide a comprehensive toolkit for NLP tasks.
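As a hypothetical illustration of the modularization item above, the following sketch uses inline modules so it compiles on its own; in an actual refactor these would live in separate files such as `src/bpe.rs` and `src/gpt2.rs`, and the names and signatures here are placeholders, not the project's current code.

```rust
mod bpe {
    /// Core BPE concerns: training merges, encoding, and decoding.
    pub fn encode(input: &str) -> Vec<u32> {
        // Placeholder: byte-level fallback until learned merges are applied.
        input.bytes().map(u32::from).collect()
    }
}

mod gpt2 {
    /// GPT-2 style tokenization layered on top of the BPE core.
    pub fn encode(input: &str) -> Vec<u32> {
        super::bpe::encode(input)
    }
}

fn main() {
    println!("{:?}", gpt2::encode("hello"));
}
```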
Tokenizer is released under the MIT License. Feel free to use, modify, and distribute the project according to the terms of the license.
Contributions to the project are welcome. Fork the repository, make your changes, and submit a pull request.
Tokenizer was inspired by the work of OpenAI and Andrej Karpathy and aims to contribute to the Rust and AI communities.