# Subword Tokenizers

This repo explores different subword tokenization algorithms.

## Subword tokenizers

| Algorithm | Base unit | Implementations | Paper |
| --- | --- | --- | --- |
| Byte-pair encoding (BPE) | Unicode code point | original implementation, fastBPE, SentencePiece repo | Neural Machine Translation of Rare Words with Subword Units |
| Byte-level BPE | byte | HuggingFace repo, GPT-2 repo | Language Models are Unsupervised Multitask Learners (GPT-2) |
| WordPiece | Unicode code point | BERT repo | Google's Neural Machine Translation System |
| Unigram Language Model | Unicode code point | SentencePiece repo | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates |
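To make the BPE row concrete, here is a minimal sketch of BPE vocabulary learning in the spirit of the Sennrich et al. paper (this is an illustration, not any of the implementations listed above): repeatedly count adjacent symbol pairs over a word-frequency corpus and merge the most frequent pair into a new symbol.

```python
import collections

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in words.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, words):
    """Merge every occurrence of `pair` into a single symbol.

    A naive string replace is enough for this toy example; real
    implementations match symbol boundaries explicitly.
    """
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in words.items()}

def learn_bpe(words, num_merges):
    """Greedily learn `num_merges` merge operations from a dict whose keys
    are space-separated symbol sequences and whose values are frequencies."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words

# Toy corpus similar to the one in the BPE paper (word -> frequency).
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges, segmented = learn_bpe(corpus, 3)
```

On this corpus the first learned merge is `('e', 's')` followed by `('es', 't')`, so frequent suffixes like `est` become single vocabulary symbols.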


## Large pretrained language models and their tokenizers

| Model | Repo | Tokenizer |
| --- | --- | --- |
| BERT (Google) | GitHub link | WordPiece |
| GPT-2 (OpenAI) | GitHub link | byte-level BPE |
| RoBERTa (Facebook) | GitHub link | byte-level BPE |
| Transformer-XL (CMU) | GitHub link | words |
| XLM (Facebook) | GitHub link | BPE |
| XLNet (CMU) | GitHub link | BPE (from SentencePiece) |
| CTRL (Salesforce) | GitHub link | BPE (from fastBPE) |
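The "base unit" distinction matters for the models above: byte-level BPE (GPT-2, RoBERTa) starts from a fixed 256-symbol byte alphabet, so any string is representable without an unknown token, while code-point-based tokenizers segment sequences of Unicode characters. A small self-contained illustration of the difference:

```python
text = "héllo"

# Code-point-based tokenizers (BPE, WordPiece, Unigram LM) see Unicode characters.
code_points = list(text)  # ['h', 'é', 'l', 'l', 'o'] -> 5 base units

# Byte-level BPE sees raw UTF-8 bytes; 'é' encodes to two bytes,
# so the same string yields 6 base units.
byte_units = list(text.encode("utf-8"))
```

The base alphabet is larger for code points (all of Unicode) but every individual byte fits in a 256-entry table, which is why byte-level models never need an `<unk>` fallback.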
