# Byte Pair Encoding (BPE) tokenizer

This tokenizer uses sub-words, which are learned by observing the occurrences of tokens within a corpus.

In [None]:
! pip install -q requests

In [None]:
from simple_tokenizers import BPETokenizer

tokenizer = BPETokenizer()
tokenizer.vocab

As mentioned, the tokenizer needs a corpus to learn its sub-words from. In this example, we will use the `tinyshakespeare` dataset. Downloading it requires the `requests` package (`pip install requests`).

In [None]:
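import requests

# Download the corpus and keep only the first 10,000 characters to keep the example small.
response = requests.get(
    "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt",
    timeout=30,
)
response.raise_for_status()  # stop here if the download failed
data = response.text[:10000]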

The Byte Pair Encoding algorithm is described in the paper Sennrich et al., 2016 (<https://arxiv.org/pdf/1508.07909>). The tokenizer starts off with Unicode characters and extends its vocabulary using the BPE algorithm. The algorithm repeatedly finds the most frequent bigram (a pair of adjacent tokens) in the corpus and merges it into a new token. The new tokens are added to the tokenizer's vocabulary until it reaches the requested size (`vocab_size`).
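To make the merge loop concrete, here is a minimal sketch of BPE training in plain Python. It is not the implementation inside `BPETokenizer`, whose internals are not shown here; the function name `learn_bpe_merges` and the `num_merges` parameter are made up for illustration. As a simplification, it treats the whole text as one character sequence, so merges may span spaces.

```python
from collections import Counter


def learn_bpe_merges(text, num_merges):
    """Toy BPE training (illustrative only): return learned merges and final tokens."""
    tokens = list(text)  # start from individual Unicode characters
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent tokens (bigrams).
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; on ties, max() returns the pair seen first.
        left, right = max(pair_counts, key=pair_counts.get)
        if pair_counts[(left, right)] < 2:
            break  # no pair repeats, so another merge gains nothing
        merges.append((left, right))
        # Replace each occurrence of the pair with a single merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = learn_bpe_merges("low lower lowest", num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), (' ', 'low')]
```

Each merge adds one token to the vocabulary, so in this sketch a target vocabulary size would correspond to the number of base characters plus the number of merges.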

In [None]:
tokenizer.fit(data, vocab_size=512)
tokenizer.vocab

In [None]:
encoding = tokenizer.encode("Github.")
encoding

Since the tokenizer's vocabulary also contains the individual Unicode characters, it can handle unseen words. These are simply split into more, smaller tokens.
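The fallback to smaller tokens can be sketched as follows. This is an assumption about how a BPE encoder typically works, not a description of `BPETokenizer.encode`: the word is split into characters and the learned merges are replayed in order. The helper `apply_merges` and the merge list are hypothetical.

```python
def apply_merges(word, merges):
    """Split a word into characters, then replay the merges in learned order."""
    tokens = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge list, as if learned from a corpus rich in "low" and "er".
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(apply_merges("lower", merges))   # ['low', 'er']
print(apply_merges("Github", merges))  # ['G', 'i', 't', 'h', 'u', 'b']
```

A word that matches none of the merges stays at the character level, which is why unseen words cost more tokens. How the tokenizer treats a character that never appeared in the training corpus is not covered by this sketch.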

In [None]:
tokenizer.decode(encoding)

Since fitting the BPE tokenizer can take some time, the tokenizer can also be saved to disk and loaded later.

In [None]:
import pickle

sd = tokenizer.get_state_dict()
with open("tokenizer.pkl", "wb") as file:
    pickle.dump(sd, file)

In [None]:
with open("tokenizer.pkl", "rb") as file:
    sd = pickle.load(file)

tokenizer = BPETokenizer()
tokenizer.load_state_dict(sd)
tokenizer.vocab