This project provides a unified experimental framework for comparing three tokenizer strategies (byte-level, HuggingFace BPE, and a custom BPE implementation) across multiple languages (English and Chinese). It trains a small GPT-style model on streaming Wikipedia data and evaluates tokenization compression ratio, perplexity, bits per character (BPC), and inference speed.
- Byte-level tokenizer (256 fixed vocab)
- HuggingFace BPE tokenizer (trainable, with special tokens)
- Custom GPT-2–style BPE tokenizer (full merge-learning implementation)
- Unified GPT training loop for fair comparison
- Compression ratio metrics (tokens per character)
- Bits-per-character (BPC) evaluation
- Inference speed measurement across tokenizers
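
As noted above, training text is drawn from Wikipedia in streaming mode. Below is a minimal sketch of such a stream, assuming the HuggingFace `datasets` library and the `wikimedia/wikipedia` dataset; the helper name, snapshot string, and field access are illustrative, not necessarily what `experiment.py` does:

```python
from datasets import load_dataset

def wikipedia_stream(lang="en", snapshot="20231101"):
    # Stream articles without downloading the full dump; lang may be
    # "en" or "zh" for the English/Chinese experiments (illustrative only)
    ds = load_dataset("wikimedia/wikipedia", f"{snapshot}.{lang}",
                      split="train", streaming=True)
    for article in ds:
        yield article["text"]
```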
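
The byte-level and HuggingFace BPE strategies listed above can be sketched roughly as follows; the function names, vocabulary size, and special tokens are assumptions for illustration and may differ from the repository's actual interfaces:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def byte_encode(text):
    # Byte-level tokenization: each UTF-8 byte is its own token id (fixed vocab of 256)
    return list(text.encode("utf-8"))

def byte_decode(ids):
    return bytes(ids).decode("utf-8", errors="replace")

def train_hf_bpe(texts, vocab_size=8000):
    # Train a HuggingFace BPE tokenizer from an iterator of raw text
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```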
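
The custom GPT-2-style BPE tokenizer learns its merges directly from data. The core pair-counting and merge loop looks roughly like this (a simplified sketch; the actual implementation in this repository may differ in structure and optimizations):

```python
from collections import Counter

def learn_bpe_merges(sequences, num_merges):
    # sequences: iterable of byte-id lists (e.g. from text.encode("utf-8"))
    # Repeatedly merge the most frequent adjacent pair into a new token id
    corpus = [list(seq) for seq in sequences]
    merges, next_id = [], 256
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append((best, next_id))
        # Replace every occurrence of the best pair with the new id
        new_corpus = []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
        next_id += 1
    return merges
```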
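
The evaluation metrics (compression ratio, perplexity, and BPC) follow standard definitions; here is a minimal sketch with hypothetical helper names:

```python
import math

def compression_ratio(num_tokens, num_chars):
    # Tokens per character: lower means the tokenizer compresses the text better
    return num_tokens / num_chars

def perplexity(total_nll_nats, num_tokens):
    # Token-level perplexity: exp of the mean negative log-likelihood
    return math.exp(total_nll_nats / num_tokens)

def bits_per_char(total_nll_nats, num_chars):
    # Convert the summed NLL from nats to bits and normalize by character
    # count, so models with different tokenizers/vocab sizes stay comparable
    return total_nll_nats / (math.log(2) * num_chars)
```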
Run all experiments:

```bash
python experiment.py --mode all
```

Run a specific tokenizer mode:

```bash
python experiment.py --mode hf_bpe
```

Available modes: `all`, `byte`, `hf_bpe`, `custom_bpe`.

The repository includes:
- main_results.ipynb — visualizations, training/validation curves, BPC comparisons, tokenizer-efficiency plots, and inference-speed analysis generated from the experiment outputs.
Dependencies are listed in requirements.txt.