Tokenizer is a Rust project inspired by OpenAI, aimed at providing a basic implementation of the Byte Pair Encoding (BPE) algorithm. This project serves as a learning opportunity for Rust enthusiasts, particularly those interested in the field of artificial intelligence.
Byte Pair Encoding (BPE) is a popular technique used in natural language processing (NLP) tasks, particularly in tokenization. It involves iteratively merging the most frequent pair of symbols in a corpus, effectively learning subword units that are useful for various NLP tasks.
Tokenizer implements the BPE algorithm in Rust, providing a foundation for further exploration and experimentation in tokenization and NLP.
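To illustrate the core idea, here is a minimal sketch of a single BPE training step in Rust, assuming byte-level input where ids 0..=255 are raw bytes and higher ids are learned merges. It is an illustration of the algorithm, not the project's actual API.

```rust
use std::collections::HashMap;

/// Count adjacent token pairs and return the most frequent one, if any.
fn most_frequent_pair(tokens: &[u32]) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|&(_, c)| c).map(|(p, _)| p)
}

/// Replace every occurrence of `pair` with the new token `id`.
fn merge(tokens: &[u32], pair: (u32, u32), id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(id);
            i += 2; // skip both halves of the merged pair
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Start from raw bytes; each byte is its own token (ids 0..=255).
    let mut tokens: Vec<u32> = "low lower lowest".bytes().map(u32::from).collect();
    let mut next_id = 256; // ids above 255 denote learned merges
    for _ in 0..3 {
        if let Some(pair) = most_frequent_pair(&tokens) {
            tokens = merge(&tokens, pair, next_id);
            println!("merged {:?} -> {}", pair, next_id);
            next_id += 1;
        }
    }
}
```

Repeating this loop grows a vocabulary of subword units; decoding simply reverses the recorded merges.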
- BPE Implementation: Provides a basic implementation of the Byte Pair Encoding algorithm in Rust.
- Extensible: Designed with modularity in mind, allowing for easy expansion with additional modules for different tokenization techniques, such as GPT-2 tokenization.
- MIT License: Released under the MIT License, enabling anyone to use, modify, and distribute the project freely.
- Clone the Repository:

  ```
  git clone https://github.com/usama3627/tokenizer.git
  ```

- Install Rust: Ensure that you have Rust installed on your system. You can install it using rustup.

- Build and Run:

  ```
  cd tokenizer
  cargo build
  cargo run
  ```

- Training Data: To train the tokenizer, download the training dataset from the provided link (a Hugging Face dataset), rename it to `myresponse.json`, and place it in the project directory. For testing, I am using 100 rows of tweets.
Ensure these dependencies are specified in your `Cargo.toml` file:

```toml
[dependencies]
serde = { version = "1.0.104", features = ["derive"] }
serde_json = "1.0.48"
```
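With those dependencies in place, loading the training file might look like the following minimal sketch. The top-level JSON shape (an array of objects) and the `text` field are assumptions about the dataset's schema, so adjust the struct to match the actual contents of `myresponse.json`; serde ignores unknown fields by default.

```rust
use serde::Deserialize;
use std::fs;

/// One row of the training data. The `text` field name is an assumption
/// about the dataset's schema, not a guarantee.
#[derive(Deserialize)]
struct Row {
    text: String, // assumed field holding the tweet body
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = fs::read_to_string("myresponse.json")?;
    let rows: Vec<Row> = serde_json::from_str(&raw)?;
    println!("loaded {} rows", rows.len());
    for row in rows.iter().take(3) {
        println!("{}", row.text);
    }
    Ok(())
}
```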
In future iterations of the project, the following enhancements can be considered:
- Modularization: Refactor the code into modules, separating concerns such as the BPE implementation and GPT-2 tokenization (a possible layout is sketched after this list).
- Optimizations: Explore optimizations to improve the performance of the tokenization process.
- Documentation: Enhance documentation to provide detailed explanations of the algorithms and codebase.
- Additional Tokenization Techniques: Integrate additional tokenization techniques to provide a comprehensive toolkit for NLP tasks.
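As a hypothetical illustration of the modularization item above, the following sketch uses inline modules so it compiles on its own; in an actual refactor these would live in separate files such as `src/bpe.rs` and `src/gpt2.rs`, and the names and signatures here are placeholders, not the project's current code.

```rust
mod bpe {
    /// Core BPE concerns: training merges, encoding, and decoding.
    pub fn encode(input: &str) -> Vec<u32> {
        // Placeholder: byte-level fallback until learned merges are applied.
        input.bytes().map(u32::from).collect()
    }
}

mod gpt2 {
    /// GPT-2 style tokenization layered on top of the BPE core.
    pub fn encode(input: &str) -> Vec<u32> {
        super::bpe::encode(input)
    }
}

fn main() {
    println!("{:?}", gpt2::encode("hello"));
}
```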
Tokenizer is released under the MIT License. Feel free to use, modify, and distribute the project according to the terms of the license.
Contributions to the project are welcome. Fork the repository, make your changes, and submit a pull request.
Tokenizer was inspired by the work of OpenAI and Andrej Karpathy and aims to contribute to the Rust and AI communities.