There are three tokenization variants:
- Word-based
- Character-based
- Subword-based

In [4]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
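
For comparison, character-based tokenization can be sketched the same way. This is a minimal illustration in plain Python; the variable names `text` and `char_tokens` are made up for this example. Every character, including the space, becomes its own token.

```python
# Character-based tokenization: each character (spaces included) is a token.
text = "Jim Henson"
char_tokens = list(text)
print(char_tokens)
# ['J', 'i', 'm', ' ', 'H', 'e', 'n', 's', 'o', 'n']
```

The vocabulary stays tiny, but the sequence is much longer than the word-based split and each token carries little meaning on its own.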


##### Subword Tokenization
- Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, while rare words should be decomposed into meaningful subwords.

### Subword Tokenization

- Split words only into subwords that carry semantic meaning
- Good coverage with a small vocabulary, and almost no unknown tokens

<div align="center">
<img src="https://codenamewei-medium.s3.ap-southeast-1.amazonaws.com/workflow2.png">
</div>

### Subword tokenization
- **let's** is kept as one word
- **tokenization** is split into `token` and `ization`

<div align="center">
<img src="../metadata/subword.png">
</div>

### Different subword tokenization algorithms
- WordPiece
- Unigram
- Byte-Pair Encoding

- **WordPiece** is the subword tokenization scheme adopted by BERT
    - It segments the input text with a longest-match-first strategy, known as Maximum Matching or MaxMatch
    
<div align="center">
<img src="../metadata/subword_tokenization.png">
</div>  
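
The longest-match-first idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by `transformers`: the function `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical, and the sketch handles a single lowercase word at a time.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first (MaxMatch) segmentation of one word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"using", "a", "transform", "##er", "networking", "token", "##ization"}

print(wordpiece_tokenize("transformer", toy_vocab))   # ['transform', '##er']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("puppeteer", toy_vocab))     # ['[UNK]']
```

At each position the longest vocabulary entry wins, so a word that is itself in the vocabulary is never split, while a rarer word falls back to smaller pieces.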


In [12]:
from transformers import BertTokenizer

checkpoint = "bert-base-uncased"

inputstr = "Using a transformer networking"


tokenizer = BertTokenizer.from_pretrained(checkpoint)

tokenizer(inputstr)


{'input_ids': [101, 2478, 1037, 10938, 2121, 14048, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [13]:
from transformers import AutoTokenizer

# AutoTokenizer picks the tokenizer class that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenizer(inputstr)

{'input_ids': [101, 2478, 1037, 10938, 2121, 14048, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

## Encoding has two steps
1. Tokenization
    - Split the text into words (or parts of words, punctuation symbols, etc.)
2. Conversion of tokens to input IDs

In [14]:
# 1. tokenization

tokens = tokenizer.tokenize(inputstr)

print(tokens)

['using', 'a', 'transform', '##er', 'networking']


In [18]:
# 2. token conversion to input IDs

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[2478, 1037, 10938, 2121, 14048]
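
These IDs lack the leading 101 and trailing 102 seen in the `tokenizer(inputstr)` output. In the `bert-base-uncased` vocabulary those are the `[CLS]` and `[SEP]` special tokens, which the full tokenizer call adds around the sequence. Here is a minimal sketch in plain Python, with the ID values copied from the outputs above. The names `CLS_ID` and `SEP_ID` are illustrative, not library constants.

```python
# Special-token IDs as they appear in the bert-base-uncased outputs above.
CLS_ID, SEP_ID = 101, 102

# IDs produced by convert_tokens_to_ids for the example sentence.
ids = [2478, 1037, 10938, 2121, 14048]

# Wrap the sequence with [CLS] ... [SEP], as the full tokenizer call does.
input_ids = [CLS_ID] + ids + [SEP_ID]

# Single-sentence input: every token belongs to segment 0 and is attended to.
token_type_ids = [0] * len(input_ids)
attention_mask = [1] * len(input_ids)

print(input_ids)
# [101, 2478, 1037, 10938, 2121, 14048, 102]
```

With the real tokenizer, `tokenizer.build_inputs_with_special_tokens(ids)` should give the same wrapped sequence.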


## Decoding

In [19]:
decoded_string = tokenizer.decode([2478, 1037, 10938, 2121, 14048])

print(decoded_string)

using a transformer networking
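
Decoding does more than map IDs back to tokens: it also merges `##` continuation pieces into the preceding token, which is why `transform` and `##er` come back as `transformer`. A rough sketch of that merging step follows. The helper `join_wordpieces` is hypothetical, not the library's implementation, and it ignores special tokens and punctuation cleanup.

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces into the preceding token, then join with spaces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word without the prefix.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


tokens = ['using', 'a', 'transform', '##er', 'networking']
print(join_wordpieces(tokens))
# using a transformer networking
```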
