# Tokenization

For background on byte pair encoding (BPE), see: https://en.wikipedia.org/wiki/Byte_pair_encoding
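
To make the idea behind BPE concrete, here is a minimal sketch of the training loop in plain Python: count adjacent symbol pairs, merge the most frequent pair, repeat. It is an illustration under simplifying assumptions (whitespace-split words, characters as the starting symbols, a deterministic tie-break); production tokenizers such as tiktoken work on bytes and use a pre-tokenization regex. The helper names `most_frequent_pair`, `merge_pair` and `train_bpe` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    if not pairs:
        return None
    # Tie-break: highest count first, then lexicographic order of the pair.
    return min(pairs, key=lambda p: (-pairs[p], p))


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)
```

After two merges the word "low" has become a single symbol, while "lower" and "lowest" are split into "low" plus the remaining characters.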

## tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models. It provides the following encodings:
- 'gpt2'
- 'r50k_base'
- 'p50k_base'
- 'p50k_edit'
- 'cl100k_base'
- 'o200k_base'

Further reading:
- see: https://github.com/openai/tiktoken
- see: https://pypi.org/project/tiktoken/
- see: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
- see: https://stackoverflow.com/questions/76106366/how-to-use-tiktoken-in-offline-mode-computer
  

In [1]:
!pip install tiktoken



In [2]:
prompt = "This is a text about Albert Einstein"

In [3]:
import tiktoken

In [12]:
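# Load the o200k_base encoding.
o200k_tokenizer = tiktoken.get_encoding("o200k_base")
# Encode the prompt into a list of token ids.
tokens = o200k_tokenizer.encode(prompt)
print(tokens)
# Decode the ids back to text to check the round trip.
prompt = o200k_tokenizer.decode(tokens)
print(prompt)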

[2500, 382, 261, 2201, 1078, 40833, 83400]
This is a text about Albert Einstein


In [14]:
o200k_tokenizer.special_tokens_set

{'<|endofprompt|>', '<|endoftext|>'}

In [13]:
# Same round trip with the older gpt2 encoding; the token ids differ.
gpt2_tokenizer = tiktoken.get_encoding("gpt2")
tokens = gpt2_tokenizer.encode(prompt)
print(tokens)
prompt = gpt2_tokenizer.decode(tokens)
print(prompt)

[1212, 318, 257, 2420, 546, 9966, 24572]
This is a text about Albert Einstein


In [15]:
gpt2_tokenizer.special_tokens_set

{'<|endoftext|>'}

In [19]:
gpt2_tokenizer.n_vocab

50257

In [25]:
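# Build an id -> text lookup by decoding every token id on its own.
# Tokens that are not valid UTF-8 by themselves decode to the replacement character.
vocabulary = {}
for idx in range(gpt2_tokenizer.n_vocab):
    vocabulary[idx] = gpt2_tokenizer.decode([idx])

# Show the first ten entries and ten entries from the middle of the vocabulary.
for idx in range(10):
    print(idx, "-->", vocabulary[idx])
for idx in range(1000, 1010):
    print(idx, "-->", vocabulary[idx])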

0 --> !
1 --> "
2 --> #
3 --> $
4 --> %
5 --> &
6 --> '
7 --> (
8 --> )
9 --> *
1000 --> ale
1001 -->  Se
1002 -->  If
1003 --> //
1004 -->  Le
1005 -->  ret
1006 -->  ref
1007 -->  trans
1008 --> ner
1009 --> ution
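
The listing starts with punctuation rather than with control characters. A plausible explanation, assuming the 'gpt2' encoding keeps the byte ordering of the original GPT-2 vocabulary, is that the 256 single-byte tokens are ranked with printable bytes first and all remaining bytes afterwards. The sketch below reconstructs that ordering in plain Python; `gpt2_byte_order` is a hypothetical helper for illustration, not part of tiktoken.

```python
def gpt2_byte_order():
    """Byte values in the order assumed for token ids 0..255 of the gpt2 encoding."""
    printable = (
        list(range(ord("!"), ord("~") + 1))  # ASCII punctuation, digits, letters
        + list(range(0xA1, 0xAC + 1))        # Latin-1 printable range, part 1
        + list(range(0xAE, 0xFF + 1))        # Latin-1 printable range, part 2
    )
    # Everything else (control bytes, space, and so on) comes after the printable bytes.
    rest = [b for b in range(256) if b not in printable]
    return printable + rest


order = gpt2_byte_order()
for idx in range(10):
    print(idx, "-->", chr(order[idx]))
print("space byte at id", order.index(ord(" ")))
```

Under this assumption the space byte is not token 0 but sits among the later single-byte ids, and entries such as ' Se' or ' If' are merged tokens that carry their leading space with them.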


## SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems where the vocabulary size is fixed before the neural model is trained. SentencePiece implements subword units (e.g., byte-pair encoding (BPE) and the unigram language model) with the extension of direct training from raw sentences. This makes it possible to build a purely end-to-end system that does not depend on language-specific pre- or postprocessing.

- see: https://github.com/google/sentencepiece
- see: https://pypi.org/project/sentencepiece/
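
As a rough illustration of two ideas from the description above, the sketch below escapes whitespace as the meta symbol "▁" (U+2581) so that detokenization is lossless, and picks the most probable segmentation under a unigram model with a Viterbi search. The piece scores in `PIECES` are made-up log-probabilities, and `segment` and `detokenize` are hypothetical helpers; this is not SentencePiece's actual implementation.

```python
import math

# Hypothetical piece log-probabilities; a trained unigram model would learn these.
PIECES = {
    "\u2581this": -2.0,
    "\u2581is": -2.0,
    "\u2581": -3.0,
    "th": -3.5,
    "is": -3.0,
    "t": -5.0,
    "h": -5.0,
    "i": -5.0,
    "s": -5.0,
}


def segment(text, pieces=PIECES):
    """Return the highest-scoring split of `text` into known pieces (Viterbi)."""
    # Escape spaces and add a dummy prefix so word-initial pieces look alike.
    escaped = "\u2581" + text.replace(" ", "\u2581")
    n = len(escaped)
    best = [-math.inf] * (n + 1)  # best[i]: best score for escaped[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = escaped[start:end]
            if piece in pieces and best[start] + pieces[piece] > best[end]:
                best[end] = best[start] + pieces[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    result, end = [], n
    while end > 0:
        result.append(escaped[back[end]:end])
        end = back[end]
    return result[::-1]


def detokenize(pieces):
    """Join pieces, turn the meta symbol back into spaces, drop the dummy prefix."""
    return "".join(pieces).replace("\u2581", " ")[1:]


pieces = segment("this is")
print(pieces)
print(detokenize(pieces))
```

Because whitespace is kept as an ordinary symbol inside the pieces, joining them and replacing the meta symbol restores the original text without any language-specific rules.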

In [26]:
!pip install sentencepiece

Collecting sentencepiece
  Using cached sentencepiece-0.2.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (7.7 kB)
Using cached sentencepiece-0.2.0-cp311-cp311-macosx_11_0_arm64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.2.0


In [39]:
import io
import sentencepiece as spm

# Train on the raw text file and write the model into an in-memory buffer.
model = io.BytesIO()
spm.SentencePieceTrainer.train(input='data/ALICE/alice.txt', model_writer=model, vocab_size=2000)


sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: data/ALICE/alice.txt
  input_format: 
  model_prefix: 
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  diffe

In [41]:
# Save the trained model to a file.
with open('data/ALICE/alice.model', 'wb') as f:
    f.write(model.getvalue())

# Load the tokenizer directly from the serialized model held in memory.
tokenizer = spm.SentencePieceProcessor(model_proto=model.getvalue())
tokens = tokenizer.encode('this is Alice')
print(tokens)
prompt = tokenizer.decode(tokens)
print(prompt)

[55, 60, 17]
this is Alice
