# Byte-Pair Encoding

We will train and use a Byte-Pair Encoding (BPE) tokenizer with Google's SentencePiece library. Start by installing the library:

`pip3 install sentencepiece`
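
Before handing training over to SentencePiece, it may help to see the core idea of BPE in isolation: start from single characters and repeatedly merge the most frequent adjacent pair of symbols. The following is a toy sketch only, not SentencePiece's implementation; `learn_bpe_merges` is a hypothetical helper, it works on a plain word list, and its tie-breaking rule is an arbitrary choice made here for determinism.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE: learn up to `num_merges` merges from a list of words."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the lexicographically
        # largest pair (an arbitrary rule, assumed for this sketch).
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite the corpus with every occurrence of `best` merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe_merges(['low', 'low', 'lower', 'lowest'], 2)
print(merges)
```

Each merge adds one new symbol to the vocabulary, which is why the vocabulary size is a training parameter in the cells that follow.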

In [1]:
import config
from sentencepiece import SentencePieceTrainer, SentencePieceProcessor

`Remember:`
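- Download the dataset and copy it to the path given by `config.DATA_PATH`, relative to the root directory.
- Create a 'models' directory inside the root directory (the trainer writes its output under `config.MODEL_PATH`).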
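
The directory can also be created from Python instead of by hand. This is a minimal sketch, assuming `config.MODEL_PATH` is `'models'` and that the root directory is one level above the notebook (`./..`), as the paths used in this notebook suggest; `ensure_model_dir` is a hypothetical helper, not part of the project.

```python
import os

# Assumed stand-in for config.MODEL_PATH; adjust to match your config.
MODEL_PATH = 'models'


def ensure_model_dir(root, model_path=MODEL_PATH):
    """Create <root>/<model_path> if it is missing and return its path."""
    target = os.path.join(root, model_path)
    # exist_ok=True makes the call safe to repeat on later runs.
    os.makedirs(target, exist_ok=True)
    return target
```

Calling `ensure_model_dir('./..')` before the training cell would then have the same effect as the manual step.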

## Training and Loading the Model

In [None]:
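# Build the trainer's command-line style flag string.
params = ' '.join([
    '--input=./../{}'.format(config.DATA_PATH),              # training corpus
    '--model_type=bpe',                                      # BPE instead of the default unigram model
    '--model_prefix=./../{}/bpe'.format(config.MODEL_PATH),  # writes bpe.model and bpe.vocab
    '--vocab_size={}'.format(config.VOCAB_SIZE),
    # Reserve fixed ids for the special tokens.
    '--pad_id=0',
    '--unk_id=1',
    '--bos_id=2',
    '--eos_id=3'
])
SentencePieceTrainer.train(params)

# Load the trained model for encoding and decoding.
sp = SentencePieceProcessor()
sp.load('./../{}/bpe.model'.format(config.MODEL_PATH))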

## Showcasing Subword Operations

In [3]:
text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks."

In [4]:
print('Number of Unique Tokens: {}'.format(sp.get_piece_size()))

Number of Unique Tokens: 1000


In [5]:
encoded_pieces = sp.encode_as_pieces(text)
print(encoded_pieces)

decoded_pieces = sp.decode_pieces(encoded_pieces)
print(decoded_pieces)

['▁Good', '▁mu', 'ff', 'ins', '▁c', 'ost', '▁', '$3', '.', '88', '.', '▁P', 'le', 'ase', '▁b', 'u', 'y', '▁me', '▁two', '▁of', '▁them', '.', '▁Than', 'ks', '.']
Good muffins cost $3.88. Please buy me two of them. Thanks.
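
The `▁` character in the pieces marks where a space preceded the piece, which is what makes the piece sequence reversible. As a rough illustration of what decoding from pieces amounts to, here is a pure-Python sketch; `detokenize` is a hypothetical helper that only approximates `sp.decode_pieces`, and the piece list is copied from the output above.

```python
pieces = ['▁Good', '▁mu', 'ff', 'ins', '▁c', 'ost', '▁', '$3', '.', '88', '.',
          '▁P', 'le', 'ase', '▁b', 'u', 'y', '▁me', '▁two', '▁of', '▁them',
          '.', '▁Than', 'ks', '.']


def detokenize(pieces):
    """Join pieces and turn the '▁' markers back into spaces."""
    text = ''.join(pieces).replace('▁', ' ')
    # Drop the single leading space introduced by the first '▁' marker.
    return text[1:] if text.startswith(' ') else text


print(detokenize(pieces))
```

Note that the blank line (`\n\n`) in the original `text` does not survive the round trip: it comes back as a single space.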


## Showcasing Subword Operations with Numeric Ids

In [6]:
encoded_ids = sp.encode_as_ids(text)
print(encoded_ids)

decoded_ids = sp.decode_ids(encoded_ids)
print(decoded_ids)

[739, 215, 403, 803, 26, 271, 932, 1, 948, 1, 948, 82, 65, 519, 16, 944, 946, 70, 661, 43, 240, 948, 776, 688, 948]
Good muffins cost  ⁇ . ⁇ . Please buy me two of them. Thanks.
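
The two `1`s in the id list are the `--unk_id` set during training: the pieces `'$3'` and `'88'` are not in the 1000-entry vocabulary, so they are mapped to the unknown id, and decoding from ids renders them as `⁇`. Decoding from pieces kept them only because the piece strings still carried the original characters. The lookup can be sketched as follows, using a fragment of the vocabulary read off the two outputs above; `pieces_to_ids` is a hypothetical helper, not a SentencePiece API.

```python
UNK_ID = 1  # matches --unk_id=1 in the training flags

# Piece -> id pairs read off the outputs above (a small fragment of the vocab).
toy_vocab = {'▁c': 26, 'ost': 271, '▁': 932, '.': 948}


def pieces_to_ids(pieces, vocab, unk_id=UNK_ID):
    """Map each piece to its id, falling back to unk_id when it is missing."""
    return [vocab.get(piece, unk_id) for piece in pieces]


fragment = ['▁c', 'ost', '▁', '$3', '.', '88', '.']
ids = pieces_to_ids(fragment, toy_vocab)

# Pieces that collapsed to the unknown id cannot be recovered from ids alone.
lost = [p for p, i in zip(fragment, ids) if i == UNK_ID]
print(ids)
print(lost)
```

Because distinct out-of-vocabulary pieces share one id, the id sequence is lossy wherever `1` appears, so it is worth checking for `unk_id` in encoded data before relying on a round trip.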


In [7]:
piece_id = sp.piece_to_id('▁The')
print(piece_id)

piece = sp.id_to_piece(piece_id)
print(piece)

101
▁The
