# SentencePiece Python module
This notebook walks through examples of using the SentencePiece Python module.

## Install and data preparation

The module can be installed with `pip install sentencepiece`. This example uses a small training corpus: the book [Stories of Great Inventors by Hattie E. Macomber](http://www.gutenberg.org/ebooks/19533), which is freely available from Project Gutenberg. You can download the text from `http://www.gutenberg.org/cache/epub/19533/pg19533.txt`. The dataset is already present as `data/pg19533.txt`.
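If the file is not already in place, the download step can be sketched as follows. This is a minimal sketch using only the standard library; `ensure_dataset` and its `fetch` parameter are hypothetical names introduced here for illustration, not part of SentencePiece.

```python
import os
import urllib.request

# URL quoted in the text above.
DATASET_URL = "http://www.gutenberg.org/cache/epub/19533/pg19533.txt"


def ensure_dataset(path, url=DATASET_URL, fetch=urllib.request.urlretrieve):
    """Return `path`, downloading `url` to it only when the file is missing.

    `fetch(url, path)` is injectable (hypothetical parameter) so the helper
    can be exercised without network access.
    """
    if not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent:
            # Create the target directory (for example `data/`) if needed.
            os.makedirs(parent, exist_ok=True)
        fetch(url, path)
    return path


# Example call (performs a real download if the file is absent):
# ensure_dataset("data/pg19533.txt")
```

Nothing is downloaded when the file already exists, so the cell is safe to re-run.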


## Basic end-to-end example



In [0]:
import sentencepiece as spm

# Train a SentencePiece model from `pg19533.txt`; this writes `m.model` and `m.vocab`.
# `m.vocab` is only for reference; it is not used in segmentation.
# Adjust --input if the file lives elsewhere (for example `data/pg19533.txt`).
spm.SentencePieceTrainer.train('--input=pg19533.txt --model_prefix=m --vocab_size=2000')

# Create a processor instance and load the model file (m.model).
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => pieces / ids
print("As Pieces : ", sp.encode_as_pieces('My name is Sunil, and I like to Learn.'))
print("As Ids : ", sp.encode_as_ids('My name is Sunil, and I like to Learn.'))

# decode: pieces / ids => text
print("Joining  Pieces : ", sp.decode_pieces(['▁M', 'y', '▁name', '▁is', '▁S', 'u', 'n', 'il', ',', '▁and', '▁I', '▁like', '▁to', '▁L', 'ear', 'n', '.']))
print("Joining  by Ids : ", sp.decode_ids([248, 20, 300, 38, 56, 106, 39, 591, 5, 14, 76, 149, 7, 472, 1526, 39, 3]))

As Pieces :  ['▁M', 'y', '▁name', '▁is', '▁S', 'u', 'n', 'il', ',', '▁and', '▁I', '▁like', '▁to', '▁L', 'ear', 'n', '.']
As Ids :  [248, 20, 300, 38, 56, 106, 39, 591, 5, 14, 76, 149, 7, 472, 1526, 39, 3]
Joining  Pieces :  My name is Sunil, and I like to Learn.
Joining  by Ids :  My name is Sunil, and I like to Learn.
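
In the pieces above, `▁` (U+2581) marks where a space stood in the original text, which is why decoding can restore the sentence exactly. The snippet below is a simplified pure-Python sketch of that idea, not the library's actual implementation; `join_pieces` is a hypothetical helper, and it assumes the default behaviour of prepending one marker to the start of the sentence.

```python
# U+2581 (LOWER ONE EIGHTH BLOCK) is the meta symbol used for a space.
WHITESPACE_MARK = "\u2581"


def join_pieces(pieces):
    """Hypothetical helper: concatenate pieces and turn markers back into spaces."""
    text = "".join(pieces).replace(WHITESPACE_MARK, " ")
    # Assumption: one marker was prepended to the sentence during encoding,
    # so a single leading space is dropped.
    return text[1:] if text.startswith(" ") else text


pieces = ['▁M', 'y', '▁name', '▁is', '▁S', 'u', 'n', 'il', ',',
          '▁and', '▁I', '▁like', '▁to', '▁L', 'ear', 'n', '.']
print(join_pieces(pieces))  # My name is Sunil, and I like to Learn.
```

For real use, `sp.decode_pieces` remains the right call; the sketch only shows why no separate detokenization rules are needed.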


In [0]:
# returns vocab size
print("Vocab Size : ", sp.get_piece_size())

# id <=> piece conversion
print("Getting Piece by id  : ", sp.id_to_piece(209))
# The recorded output below shows 0 here, i.e. '▁This' resolved to the <unk> id for this model.
print("Getting Id from Piece  : ", sp.piece_to_id('▁This'))

# returns 0 for unknown tokens (the id used for <unk> can be changed at training time)
print("Getting id for unknown word : ", sp.piece_to_id('__UNKNOWN__'))

# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2).
# <s> and </s> are defined as 'control' symbols.
for i in range(3):
    print(sp.id_to_piece(i), sp.is_control(i))

Vocab Size :  2000
Getting Piece by id  :  ▁took
Getting Id from Piece  :  0
Getting id for unknown word :  0
<unk> False
<s> True
</s> True
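
The comment above notes that the id for UNK can be changed. One way to do this is with the trainer flags `--unk_id`, `--bos_id`, `--eos_id` and `--pad_id`. The sketch below only assembles the flag string in the same style as the training call earlier; `build_train_args` and the `m_custom` prefix are hypothetical names chosen for illustration, and the id layout is just one possible choice.

```python
def build_train_args(input_path, model_prefix, vocab_size, **extra_flags):
    """Hypothetical helper: build a '--name=value' flag string for the trainer."""
    flags = {
        "input": input_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
    }
    flags.update(extra_flags)
    return " ".join(f"--{name}={value}" for name, value in flags.items())


# Illustrative layout: reserve id 0 for padding and shift the other special ids.
args = build_train_args("pg19533.txt", "m_custom", 2000,
                        pad_id=0, unk_id=1, bos_id=2, eos_id=3)
print(args)

# To train with these settings (needs the sentencepiece package and the input file):
# import sentencepiece as spm
# spm.SentencePieceTrainer.train(args)
```

With a model trained this way, `piece_to_id` would return the new `<unk>` id for unknown pieces instead of 0.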


## Sampling and nbest segmentation for subword regularization

When **--model_type=unigram** (the default) is used, we can perform sampling and n-best segmentation for data augmentation. See the subword regularization paper [[kudo18]](https://www.google.com/search?q=subword+regularization&rlz=1CAASUL_enJP841&oq=subword+regu&aqs=chrome.0.69i59j69i61j69i57j69i61l2j0.1571j0j7&sourceid=chrome&ie=UTF-8) for more details.

In [0]:
# Each call can return a different segmentation.
# There are two hyperparameters for sampling (nbest_size and inverse temperature). See the paper [kudo18] for details.
for _ in range(10):
    print(sp.sample_encode_as_pieces('Good Morning', -1, 0.1))

for _ in range(10):
    print(sp.sample_encode_as_ids('Good Morning', -1, 0.1))

['▁Good', '▁M', 'or', 'n', 'ing']
['▁Good', '▁', 'M', 'or', 'n', 'i', 'ng']
['▁', 'G', 'o', 'o', 'd', '▁M', 'or', 'n', 'ing']
['▁Good', '▁', 'M', 'or', 'n', 'ing']
['▁', 'G', 'o', 'o', 'd', '▁M', 'or', 'n', 'ing']
['▁Good', '▁M', 'o', 'r', 'n', 'ing']
['▁Good', '▁M', 'or', 'n', 'i', 'ng']
['▁Good', '▁M', 'o', 'r', 'n', 'i', 'ng']
['▁G', 'o', 'o', 'd', '▁', 'M', 'o', 'r', 'n', 'i', 'n', 'g']
['▁Good', '▁M', 'o', 'r', 'n', 'in', 'g']
[491, 38, 38, 20, 137, 105, 30, 50, 305]
[1732, 137, 38, 46, 30, 13]
[491, 38, 38, 20, 137, 105, 30, 13]
[1732, 12, 373, 105, 30, 50, 305]
[12, 655, 38, 38, 20, 12, 373, 105, 30, 13]
[1732, 137, 105, 30, 13]
[12, 655, 38, 38, 20, 12, 373, 105, 30, 50, 30, 62]
[1732, 137, 38, 46, 30, 79, 62]
[1732, 137, 105, 30, 50, 30, 62]
[1732, 12, 373, 38, 46, 30, 79, 62]
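
To give some intuition for the two sampling arguments used above: a negative `nbest_size` (here `-1`) means sampling over all segmentation hypotheses, and the last argument (here `0.1`) is the inverse temperature, often called alpha. Roughly, a candidate segmentation is drawn with probability proportional to its model probability raised to the power alpha. The following is a toy pure-Python sketch of that weighting over a short candidate list; the log probabilities are made up, `sampling_weights` and `sample_segmentation` are hypothetical helpers, and the library itself samples over the full lattice rather than an explicit list.

```python
import math
import random


def sampling_weights(log_probs, alpha):
    """Normalize P(x)^alpha over the candidates, working in log space."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_segmentation(candidates, alpha, rng=random):
    """Draw one segmentation from (pieces, log_prob) pairs."""
    segmentations = [pieces for pieces, _ in candidates]
    weights = sampling_weights([lp for _, lp in candidates], alpha)
    return rng.choices(segmentations, weights=weights, k=1)[0]


# Made-up log probabilities, for illustration only.
candidates = [
    (['▁Good', '▁M', 'or', 'n', 'ing'], -10.0),
    (['▁Good', '▁M', 'o', 'r', 'n', 'ing'], -13.0),
    (['▁', 'G', 'o', 'o', 'd', '▁M', 'or', 'n', 'ing'], -20.0),
]

for alpha in (0.0, 0.1, 1.0):
    print(alpha, [round(w, 3) for w in sampling_weights([lp for _, lp in candidates], alpha)])

print(sample_segmentation(candidates, alpha=0.1))
```

A small alpha flattens the distribution, which is consistent with the varied segmentations in the output above; a large alpha concentrates the mass on the single best segmentation.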
