# SentencePiece Python module


This notebook walks through examples of the SentencePiece Python module.
Because the Python module calls the C++ API through SWIG, this document is also useful when developing a C++ client.

## Install and data preparation

We use a small training corpus (TIKVAH.txt) in this example.
([TIKVAH-ETHIOPIA](https://t.me/tikvahethiopia) is a Telegram channel with over 1.2M subscribers.)

## Basic end-to-end example



In [1]:
import sentencepiece as spm

# Train a SentencePiece model from `cleaned.txt`; this writes `m.model` and `m.vocab`.
# `m.vocab` is for reference only; it is not used in segmentation.
spm.SentencePieceTrainer.train('--input=cleaned.txt --model_prefix=m --vocab_size=2000')

# Create a segmenter instance and load the model file (m.model).
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => pieces / ids
print(sp.encode_as_pieces('በአዲስ አበባ የአሜሪካ ኤምባሲ'))
print(sp.encode_as_ids('በአዲስ አበባ የአሜሪካ ኤምባሲ'))

# decode: pieces => text
# Pieces must carry the '▁' (U+2581) whitespace marker, not an ASCII underscore.
print(sp.decode_pieces(['▁በአዲስ', '▁አበባ', '▁የአሜሪካ', '▁ኤምባሲ']))

['▁በአዲስ', '▁አበባ', '▁የአሜሪካ', '▁ኤምባሲ']
[688, 231, 1062, 1776]
በአዲስ አበባ የአሜሪካ ኤምባሲ


sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=cleaned.txt --model_prefix=m --vocab_size=2000
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: cleaned.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  p
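
The `▁` in the pieces above is the meta symbol U+2581 that SentencePiece uses to mark whitespace. As a minimal sketch of what decoding amounts to for ordinary in-vocabulary pieces, the helper below joins the pieces and turns the marker back into spaces. `detokenize` is a hypothetical name, not part of the library, and the sketch assumes the default behaviour in which a dummy whitespace prefix is added to the input.

```python
WHITESPACE_MARK = '\u2581'  # '▁' (LOWER ONE EIGHTH BLOCK), not ASCII '_'


def detokenize(pieces):
    """Join pieces and map the whitespace marker back to spaces."""
    text = ''.join(pieces).replace(WHITESPACE_MARK, ' ')
    # The leading marker comes from the dummy prefix; drop that single space.
    return text[1:] if text.startswith(' ') else text


pieces = ['▁በአዲስ', '▁አበባ', '▁የአሜሪካ', '▁ኤምባሲ']
print(detokenize(pieces))  # በአዲስ አበባ የአሜሪካ ኤምባሲ
```

In practice `sp.decode_pieces` should be preferred, since it also handles control and unknown symbols.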

In [2]:
print(sp.decode_ids([460, 133, 774, 1276]))


ሕኩታችን ለመስጠት


In [4]:
# returns vocab size
print(sp.get_piece_size())

# id <=> piece conversion
print(sp.id_to_piece(460))
print(sp.piece_to_id('▁በአዲስ'))

# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))

# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2).
# <s> and </s> are defined as 'control' symbols.
for i in range(3):
    print(sp.id_to_piece(i), sp.is_control(i))

2000
▁በአዲስ
460
0
<unk> False
<s> True
</s> True
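
The two calls above, `get_piece_size()` and `id_to_piece()`, are enough to dump the whole vocabulary. The following is a small sketch under that assumption; `build_vocab` and `TinyProcessor` are hypothetical names. `TinyProcessor` is only a stand-in so the snippet runs without a trained model, and a real `SentencePieceProcessor` can be passed instead.

```python
def build_vocab(sp):
    """Return a {piece: id} dict covering every entry in the model."""
    return {sp.id_to_piece(i): i for i in range(sp.get_piece_size())}


class TinyProcessor:
    """Hypothetical stand-in exposing the two methods the helper needs."""

    _pieces = ['<unk>', '<s>', '</s>', '▁a', 'b']

    def get_piece_size(self):
        return len(self._pieces)

    def id_to_piece(self, i):
        return self._pieces[i]


vocab = build_vocab(TinyProcessor())
print(vocab['<unk>'], vocab['▁a'])  # 0 3
```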


## Loading a model from a byte stream

A SentencePiece model file is just a serialized [protocol buffer](https://developers.google.com/protocol-buffers/). We can instantiate a SentencePiece processor from a bytes object with the **load_from_serialized_proto** method.

In [3]:
import tensorflow as tf

# Assumes that m.model is stored on a non-POSIX file system.
with tf.io.gfile.GFile('m.model', 'rb') as f:
    serialized_model_proto = f.read()

sp = spm.SentencePieceProcessor()
sp.load_from_serialized_proto(serialized_model_proto)

print(sp.encode_as_pieces('በአዲስ አበባ የአሜሪካ ኤምባሲ'))

['▁በአዲስ', '▁አበባ', '▁የአሜሪካ', '▁ኤምባሲ']
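
TensorFlow is not needed for this step; any source of bytes will do. As a sketch, the helper below reads the model with the built-in `open` and hands the bytes to `load_from_serialized_proto`. The function name `load_processor_from_bytes` is hypothetical, and the processor is passed in so the helper stays independent of how it was created.

```python
def load_processor_from_bytes(processor, path):
    """Read a serialized model from `path` and load it into `processor`."""
    with open(path, 'rb') as f:
        data = f.read()
    processor.load_from_serialized_proto(data)
    return processor


# Usage, assuming `m.model` exists in the working directory:
# sp = load_processor_from_bytes(spm.SentencePieceProcessor(), 'm.model')
```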


## User-defined and control symbols

We can define special tokens (symbols) to tweak the DNN behavior through the tokens. Typical examples are [BERT](https://arxiv.org/abs/1810.04805)'s special symbols, e.g., [SEP] and [CLS].

There are two types of special tokens:

- **user-defined symbols**: Always treated as one token in any context. These symbols can appear in the input sentence.
- **control symbols**: We only reserve ids for these tokens. Even if these tokens appear in the input text, they are not handled as one token. The user needs to insert the ids explicitly after encoding.

For experimentation, user-defined symbols are easier to use, since the behavior can be changed just by modifying the input text. In production, however, control symbols are preferable, because they prevent users from altering the behavior by feeding these special symbols in their input text.

In [5]:
# Example of user defined symbols
spm.SentencePieceTrainer.train('--input=cleaned.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')

sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')

# Ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# User-defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('በአዲስ አበባ የአሜሪካ<sep> ኤምባሲ<cls>'))
print(sp_user.piece_to_id('<sep>'))  # 3
print(sp_user.piece_to_id('<cls>'))  # 4
print('3=', sp_user.decode_ids([3]))  # decoded to <sep>
print('4=', sp_user.decode_ids([4]))  # decoded to <cls>

['▁በአዲስ', '▁አበባ', '▁የአሜሪካ', '<sep>', '▁ኤምባሲ', '<cls>']
3
4
3= <sep>
4= <cls>



In [14]:
# Example of control symbols
spm.SentencePieceTrainer.train('--input=cleaned.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')

sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')

# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('በአዲስ አበባ የአሜሪካ<sep>ኤምባሲ<cls>'))
print(sp_ctrl.piece_to_id('<sep>'))  # 3
print(sp_ctrl.piece_to_id('<cls>'))  # 4
print('3=', sp_ctrl.decode_ids([3]))  # decoded to empty
print('4=', sp_ctrl.decode_ids([4]))  # decoded to empty

['▁በአዲስ', '▁አበባ', '▁የአሜሪካ', '<', 's', 'e', 'p', '>', 'ኤ', 'ም', 'ባ', 'ሲ', '<', 'c', 'l', 's', '>']
3
4
3= 
4= 


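
Because control symbols are never produced from the input text, their ids have to be spliced in after encoding. A minimal sketch of that step follows, assuming a model trained with `--control_symbols=<sep>,<cls>` as above. The helper name `encode_pair` and the BERT-style `<cls> A <sep> B <sep>` layout are illustrative assumptions, not something the library prescribes.

```python
def encode_pair(sp, first, second):
    """Encode two texts as <cls> first <sep> second <sep>, as a list of ids."""
    cls_id = sp.piece_to_id('<cls>')
    sep_id = sp.piece_to_id('<sep>')
    return ([cls_id] + sp.encode_as_ids(first) + [sep_id]
            + sp.encode_as_ids(second) + [sep_id])


# Usage with the processor loaded from m_ctrl.model:
# ids = encode_pair(sp_ctrl, 'በአዲስ አበባ', 'የአሜሪካ ኤምባሲ')
```

Typing `<sep>` into the input text has no such effect with this model: it is segmented into ordinary pieces, as the output above shows.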

BOS/EOS (`<s>`, `</s>`) are defined as control symbols by default, but we can also define them as user-defined symbols.

In [15]:
spm.SentencePieceTrainer.train('--input=cleaned.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> በአዲስ</s>'))   # <s>,</s> are segmented. (default behavior)

sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> በአዲስ</s>'))   # <s>,</s> are handled as one token.

['▁', '<', 's', '>', '▁በአዲስ', '<', '/', 's', '>']
['▁', '<s>', '▁በአዲስ', '</s>']



## Manipulating BOS/EOS/UNK/PAD symbols

BOS, EOS, UNK, and PAD ids can be obtained with the **bos_id()**, **eos_id()**, **unk_id()**, and **pad_id()** methods. We can explicitly insert these ids as follows.

In [17]:
spm.SentencePieceTrainer.train('--input=cleaned.txt --model_prefix=m --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id())  # disabled by default


print(sp.encode_as_ids('በአዲስ አበባ'))

# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('በአዲስ አበባ') + [sp.eos_id()])

bos= 1
eos= 2
unk= 0
pad= -1
[434, 111]
[1, 434, 111, 2]


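
Since `pad_id()` is -1 by default, padding a batch needs a model trained with an explicit pad id (for example by adding `--pad_id=3` to the training flags). The sketch below shows the plain-Python side of that: wrapping each sequence in BOS/EOS and right-padding to a common width. The helper names and the value `pad_id=3` are assumptions for illustration.

```python
def add_bos_eos(ids, bos_id, eos_id):
    """Wrap a list of ids with the BOS and EOS ids."""
    return [bos_id] + list(ids) + [eos_id]


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    if pad_id < 0:
        raise ValueError('pad id is disabled (-1); train with a --pad_id flag')
    width = max((len(s) for s in sequences), default=0)
    return [list(s) + [pad_id] * (width - len(s)) for s in sequences]


# The ids 434 and 111 are taken from the output above; pad_id=3 is hypothetical.
batch = [add_bos_eos([434, 111], bos_id=1, eos_id=2),
         add_bos_eos([434], bos_id=1, eos_id=2)]
print(pad_batch(batch, pad_id=3))  # [[1, 434, 111, 2], [1, 434, 2, 3]]
```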

## Sampling and n-best segmentation for subword regularization

When **--model_type=unigram** (the default) is used, we can perform sampling and n-best segmentation for data augmentation. See the subword regularization paper [[kudo18]](https://arxiv.org/abs/1804.10959) for more detail.

In [18]:
spm.SentencePieceTrainer.train('--input=cleaned.txt --model_prefix=m --vocab_size=2000')

# Each call can return a different segmentation.
# Sampling has two hyperparameters: nbest_size and alpha (inverse temperature). See the paper [kudo18] for details.
for _ in range(10):
    print(sp.sample_encode_as_pieces('በአዲስ አበባ', -1, 0.1))

for _ in range(10):
    print(sp.sample_encode_as_ids('በአዲስ አበባ', -1, 0.1))

['▁በአዲስ', '▁አ', 'በ', 'ባ']
['▁በአዲስ', '▁አ', 'በ', 'ባ']
['▁በአዲስ', '▁አበባ']
['▁በ', 'አዲስ', '▁አበባ']
['▁በአዲስ', '▁አበባ']
['▁በአዲስ', '▁አበባ']
['▁በ', 'አ', 'ዲ', 'ስ', '▁አ', 'በ', 'ባ']
['▁', 'በ', 'አዲስ', '▁አ', 'በ', 'ባ']
['▁በአ', 'ዲ', 'ስ', '▁አ', 'በ', 'ባ']
['▁በአ', 'ዲ', 'ስ', '▁አበባ']
[434, 111]
[434, 111]
[8, 1066, 111]
[8, 1066, 111]
[3, 34, 1066, 111]
[8, 130, 110, 13, 16, 34, 41]
[434, 3, 130, 34, 41]
[434, 111]
[3, 34, 1066, 3, 130, 34, 41]
[8, 1066, 111]



In [19]:
# get 10 best
print(sp.nbest_encode_as_pieces('በአዲስ አበባ', 10))
print(sp.nbest_encode_as_ids('በአዲስ አበባ', 10))

[['▁በአዲስ', '▁አበባ'], ['▁በ', 'አዲስ', '▁አበባ'], ['▁በአዲስ', '▁አ', 'በ', 'ባ'], ['▁', 'በ', 'አዲስ', '▁አበባ'], ['▁በአ', 'ዲ', 'ስ', '▁አበባ'], ['▁በአዲስ', '▁', 'አ', 'በ', 'ባ'], ['▁በ', 'አ', 'ዲ', 'ስ', '▁አበባ'], ['▁በ', 'አዲስ', '▁አ', 'በ', 'ባ'], ['▁', 'በ', 'አ', 'ዲ', 'ስ', '▁አበባ'], ['▁', 'በ', 'አዲስ', '▁አ', 'በ', 'ባ']]
[[434, 111], [8, 1066, 111], [434, 16, 34, 41], [3, 34, 1066, 111], [188, 110, 13, 111], [434, 3, 130, 34, 41], [8, 130, 110, 13, 111], [8, 1066, 16, 34, 41], [3, 34, 130, 110, 13, 111], [3, 34, 1066, 16, 34, 41]]
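
To see how sampling relates to the n-best list, recall that [kudo18] samples a candidate with probability proportional to its model probability raised to the power alpha. The sketch below reproduces only that normalization step in plain Python. The three candidates are taken from the n-best output above, but the log-probabilities are made up for illustration, and the function names are hypothetical. With `nbest_size=-1`, as used earlier, the library samples from all hypotheses rather than from a fixed list.

```python
import math
import random


def sampling_distribution(log_probs, alpha):
    """Normalize P(x)**alpha over the candidates (log_probs are natural logs)."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_segmentation(candidates, log_probs, alpha, rng=random):
    """Draw one candidate according to the smoothed distribution."""
    probs = sampling_distribution(log_probs, alpha)
    return rng.choices(candidates, weights=probs, k=1)[0]


candidates = [['▁በአዲስ', '▁አበባ'], ['▁በ', 'አዲስ', '▁አበባ'], ['▁በአዲስ', '▁አ', 'በ', 'ባ']]
log_probs = [-10.0, -12.0, -13.0]  # hypothetical scores, not model output

for alpha in (0.0, 0.1, 1.0):
    print(alpha, [round(p, 3) for p in sampling_distribution(log_probs, alpha)])
print(sample_segmentation(candidates, log_probs, alpha=0.1, rng=random.Random(0)))
```

A small alpha flattens the distribution toward uniform, which is why `0.1` above yields many different segmentations; a large alpha concentrates on the best one.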


## BPE (byte-pair encoding) model

SentencePiece supports BPE (byte-pair encoding) for subword segmentation with the **--model_type=bpe** flag. We have not found empirical differences in translation quality between the BPE and unigram models, but the unigram model can perform sampling and n-best segmentation. See the subword regularization paper [[kudo18]](https://arxiv.org/abs/1804.10959) for more detail.

In [21]:
spm.SentencePieceTrainer.train('--input=cleaned.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')

print('*** BPE ***')
print(sp_bpe.encode_as_pieces('በአዲስአበባየአሜሪካኤምባሲ'))
print(sp_bpe.nbest_encode_as_pieces('በአዲስ አበባ', 5))  # returns an empty list.

*** BPE ***
['▁በአዲስ', 'አ', 'በ', 'ባ', 'የ', 'አ', 'ሜሪካ', 'ኤ', 'ምባ', 'ሲ']
[]



In [22]:
spm.SentencePieceTrainer.train('--input=TIKVAH.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')

print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('በአዲስአበባየአሜሪካኤምባሲ'))
print(sp_unigram.nbest_encode_as_pieces('በአዲስአበባየአሜሪካኤምባሲ', 5))

*** Unigram ***
['▁በአዲስ', 'አ', 'በ', 'ባ', 'የ', 'አሜሪካ', 'ኤ', 'ም', 'ባ', 'ሲ']
[['▁በአዲስ', 'አ', 'በ', 'ባ', 'የ', 'አሜሪካ', 'ኤ', 'ም', 'ባ', 'ሲ'], ['▁በ', 'አዲስ', 'አ', 'በ', 'ባ', 'የ', 'አሜሪካ', 'ኤ', 'ም', 'ባ', 'ሲ'], ['▁', 'በ', 'አዲስ', 'አ', 'በ', 'ባ', 'የ', 'አሜሪካ', 'ኤ', 'ም', 'ባ', 'ሲ'], ['▁በአ', 'ዲ', 'ስ', 'አ', 'በ', 'ባ', 'የ', 'አሜሪካ', 'ኤ', 'ም', 'ባ', 'ሲ'], ['▁በ', 'አ', 'ዲ', 'ስ', 'አ', 'በ', 'ባ', 'የ', 'አሜሪካ', 'ኤ', 'ም', 'ባ', 'ሲ']]


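
For intuition about what `--model_type=bpe` learns, here is a toy sketch of a single BPE training step: count adjacent symbol pairs weighted by word frequency, then merge the most frequent pair everywhere. It illustrates the general BPE idea only and is not SentencePiece's actual implementation. The function names and the tiny word-frequency table are made up.

```python
from collections import Counter


def most_frequent_pair(words):
    """words maps a tuple of symbols to its frequency in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical corpus statistics: symbol sequence -> frequency.
words = {('a', 'b', 'c'): 4, ('a', 'b'): 3, ('b', 'c'): 1}
best = most_frequent_pair(words)  # ('a', 'b') occurs 7 times, ('b', 'c') 5 times
print(best, merge_pair(words, best))
```

Repeating this step until the vocabulary reaches the requested size yields a deterministic merge table, which is consistent with the BPE model above returning no n-best alternatives.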