### What is tokenization?


Starting from natural-language text (or code), we split it into words or subwords and map each one to an integer, so that it can be used to train and feed machine learning models such as neural networks. The process is not trivial, though: what should we do with punctuation, and with subtler cases such as special characters and contractions (in English)?

#### Approaches

- Word-based: Split a sentence on spaces, and also apply language-specific rules to try to separate parts of meaning even when there are no spaces (such as turning "don't" into "do n't"). Generally, punctuation marks are also split into separate tokens.

- Subword-based: Split words into smaller parts, based on the most commonly occurring substrings. For instance, "occasion" might be tokenized as "o c ca sion."

- Character-based: Split a sentence into its individual characters.
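To make the word-based and character-based approaches concrete, here is a minimal pure-Python sketch. The function names `word_tokenize` and `char_tokenize` are hypothetical, and the single contraction rule is only an illustration; real tokenizers such as spaCy apply many more language-specific rules.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Toy word-based tokenizer: splits on spaces, separates punctuation,
    and splits the English contraction "n't" off the preceding word."""
    # Insert a space before "n't" so that "don't" becomes "do n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Match, in order: the "n't" piece, runs of word characters, single punctuation marks
    return re.findall(r"n't|\w+|[^\w\s]", text)

def char_tokenize(text: str) -> list[str]:
    """Character-based tokenizer: every character is its own token."""
    return list(text)

print(word_tokenize("I don't know."))
print(char_tokenize("hi!"))
```

Even this toy version shows the trade-off: word-based tokens carry more meaning each but need language-specific rules, while character-based tokens need no rules but produce much longer sequences.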

In [12]:
from fastai.text.all import (
    coll_repr,
    defaults,
    first,
    get_text_files,
    L,
    Numericalize,
    Tokenizer,
    SubwordTokenizer,
    untar_data,
    URLs,
    WordTokenizer,
)

### Word Tokenization with FastAI

In [2]:
# we will use text from IMDB movie reviews

path = untar_data(URLs.IMDB)
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

In [3]:
print("Data was downloaded on", path)
# check a text from the first file containing reviews
example_txt = files[0].open().read()[:75]
example_txt

Data was downloaded on /home/david/.fastai/data/imdb


'Absolutely the worst piece of crap my brother and I have seen. The movie lo'

In [4]:
# We use an English word tokenizer
spacy = WordTokenizer()

tokens = first(spacy([example_txt]))

# coll_repr prints only the first max_n elements of a collection, prefixed by its full length
coll_repr(tokens, max_n=30)

"(#16) ['Absolutely','the','worst','piece','of','crap','my','brother','and','I','have','seen','.','The','movie','lo']"

In [5]:
# Add special tokens with fastai's Tokenizer class
tkn = Tokenizer(spacy)

coll_repr(tkn(example_txt), max_n=31)

"(#19) ['xxbos','xxmaj','absolutely','the','worst','piece','of','crap','my','brother','and','i','have','seen','.','xxmaj','the','movie','lo']"

Some of those special tokens are:

- **xxbos**: Indicates the beginning of a text (here, a review)
- **xxmaj**: Indicates the next word begins with a capital (since we lowercased everything)
- **xxunk**: Indicates the word is unknown

In [6]:
# we can inspect the rules that fastai applies during tokenization
defaults.text_proc_rules

[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]

Here is a brief summary of what each does:

- **fix_html**: Replaces special HTML characters with a readable version.
- **replace_rep**: Replaces any character repeated three times or more with a special token for repetition (xxrep), the number of repetitions, then the character
- **replace_wrep**: Replaces any word repeated three times or more with a special token for word repetition (xxwrep), the number of repetitions, then the word
- **spec_add_spaces**: Adds spaces around / and #
- **rm_useless_spaces**: Removes all repetitions of the space character
- **replace_all_caps**: Lowercases a word written in all caps and adds a special token for all caps (xxup) in front of it
- **replace_maj**: Lowercases a capitalized word and adds a special token for capitalized (xxmaj) in front of it
- **lowercase**: Lowercases all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)
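As an illustration of how such rules work, the sketch below re-implements simplified versions of two of them with regular expressions. These are not fastai's actual implementations: the names `sketch_replace_rep` and `sketch_replace_maj` are hypothetical, and edge cases (such as single-letter words) are handled differently from the library.

```python
import re

def sketch_replace_rep(text: str) -> str:
    """Simplified replace_rep: a character repeated 3+ times becomes
    'xxrep <count> <char>'."""
    def _sub(match: re.Match) -> str:
        char = match.group(1)
        count = len(match.group(0))
        return f" xxrep {count} {char} "
    # (\S) captures a non-space character, \1{2,} requires at least two more copies
    return re.sub(r"(\S)\1{2,}", _sub, text)

def sketch_replace_maj(text: str) -> str:
    """Simplified replace_maj: a capitalized word becomes 'xxmaj <lowercased word>'."""
    out = []
    for word in text.split():
        # Only words like "Movie": first letter uppercase, the rest lowercase
        if len(word) > 1 and word[0].isupper() and word[1:].islower():
            out.extend(["xxmaj", word.lower()])
        else:
            out.append(word)
    return " ".join(out)

print(sketch_replace_rep("soooo good").split())
print(sketch_replace_maj("The movie was Great"))
```

The point of these special tokens is that information such as capitalization or repetition is kept as a separate token instead of multiplying the number of distinct words in the vocab.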

### Subword tokenization

In some languages the concept of a space between words is not so clear, or does not exist at all (for example, Chinese). Subword tokenization handles these cases.
The approach proceeds in two steps:
1) Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.
2) Tokenize the corpus using this vocab of subword units.
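The two steps above can be sketched in plain Python. This is a toy illustration only, assuming a simple "most frequent character n-grams" vocab and greedy longest-match segmentation; it is not the algorithm used by SentencePiece or by fastai's `SubwordTokenizer`, and the names `build_subword_vocab`, `subword_tokenize`, and `max_len` are hypothetical.

```python
from collections import Counter

def build_subword_vocab(corpus, vocab_size, max_len=4):
    """Step 1 (toy): keep the most frequent character n-grams,
    plus every single character so that any word can be segmented."""
    counts = Counter()
    for text in corpus:
        for word in text.split():
            for n in range(2, max_len + 1):
                for i in range(len(word) - n + 1):
                    counts[word[i:i + n]] += 1
    chars = {ch for text in corpus for ch in text if not ch.isspace()}
    frequent = {gram for gram, _ in counts.most_common(vocab_size)}
    return chars | frequent

def subword_tokenize(word, vocab, max_len=4):
    """Step 2 (toy): greedily take the longest piece found in the vocab."""
    pieces, i = [], 0
    while i < len(word):
        for n in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + n]
            # Fall back to a single character if nothing longer matches
            if piece in vocab or n == 1:
                pieces.append(piece)
                i += n
                break
    return pieces

toy_corpus = ["the movie the movie the film"]
toy_vocab = build_subword_vocab(toy_corpus, vocab_size=3)
print(subword_tokenize("them", toy_vocab))
```

A small `vocab_size` leaves mostly short pieces, so words are cut into many tokens; a larger one lets frequent words survive as single tokens, at the cost of a bigger vocab.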

In [7]:
# to demonstrate this method we will select a corpus of 2000 movie reviews
txts = L(o.open().read() for o in files[:2000])

In [8]:
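def subword(vocab_size: int):
    # train the subword vocab on the corpus
    sp = SubwordTokenizer(vocab_sz=vocab_size)
    sp.setup(txts)

    # tokenize a single review (not the whole `txts` collection)
    # and return its first 40 tokens as a demonstration
    return " ".join(first(sp([txts[0]]))[:40])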

In [None]:
subword(150) # this crashes the kernel :(

### Numericalization with fastai

Numericalization is the process of mapping tokens to integers.

In [9]:
tokens = tkn(example_txt)

In [10]:
coll_repr(tkn(example_txt), max_n=31)

"(#19) ['xxbos','xxmaj','absolutely','the','worst','piece','of','crap','my','brother','and','i','have','seen','.','xxmaj','the','movie','lo']"

In [13]:
# Just like SubwordTokenizer, we need to call setup on Numericalize,
# so we first tokenize a small sample of 200 reviews
toks200 = txts[:200].map(tkn)
toks200[0]

(#167) ['xxbos','xxmaj','absolutely','the','worst','piece','of','crap','my','brother'...]

In [14]:
numericalizer = Numericalize()

numericalizer.setup(toks200)
coll_repr(numericalizer.vocab, 20)

"(#2136) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','it','in','this'...]"

The special rule tokens appear first, and then every word appears once, in frequency order. The defaults for `Numericalize` are `min_freq=3` and `max_vocab=60000`. With `min_freq=3`, any word that appears fewer than three times is replaced with the special unknown-word token, xxunk. With `max_vocab=60000`, fastai replaces all words other than the 60,000 most common with xxunk as well. This avoids an overly large embedding matrix, which can slow down training and use up too much memory, and it also reflects the fact that there may not be enough data to train useful representations for rare words.
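The effect of `min_freq` and `max_vocab` can be sketched in plain Python. This is a simplified illustration, not fastai's implementation: the helper names `build_vocab` and `numericalize` are hypothetical, and only the list of special tokens is taken from the vocab output shown above.

```python
from collections import Counter

# Special tokens, in the order they appear in the vocab printed above
SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
            "xxrep", "xxwrep", "xxup", "xxmaj"]

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    """Specials first, then words in frequency order, dropping rare words."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = [tok for tok, c in counts.most_common(max_vocab)
             if c >= min_freq and tok not in SPECIALS]
    return SPECIALS + words

def numericalize(tokens, vocab):
    """Map each token to its index; anything not in the vocab maps to 0 (xxunk)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

toy_tokens = [["xxbos", "the", "movie", "the"], ["xxbos", "the", "film"]]
toy_vocab = build_vocab(toy_tokens, min_freq=2)
print(toy_vocab)
print(numericalize(["xxbos", "the", "film"], toy_vocab))
```

In this toy run, "movie" and "film" each occur only once, so they fall below `min_freq=2` and are numericalized as 0, the index of xxunk.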

In [15]:
# we can use our numericalizer to get the integers from text tokens

numericalizer(tokens)[:20]

TensorText([  2,   8, 495,   9, 237, 452,  14, 879,  89, 666,  12,  20,  43,
            155,  11,   8,   9,  27,   0])