# Tokenization

![](../figs/intro_nlp/tokenization/entelecheia_puzzle_pieces.png)

## Overview

NLP systems have three main components that help machines understand natural language:

- **Tokenization**: Splitting a string into a list of tokens.
- **Embedding**: Mapping tokens to vectors.
- **Model**: A neural network that takes token vectors as input and outputs predictions.

Tokenization is the first step in the NLP pipeline. 

- Tokenization is the process of splitting a string into a list of tokens. 
- For example, the sentence "I like to eat apples" can be tokenized into the list of tokens `["I", "like", "to", "eat", "apples"]`. 
- The tokens can be words, characters, or subwords.

## What is Tokenization?

- Tokenization is the process of representing a text in smaller units called tokens.
- In the simplest case, we can map every word in the text to a numerical index.
- For example, the sentence "I like to eat apples" can be tokenized into the list of tokens:
  > `["I", "like", "to", "eat", "apples"]`. 
- Then, each token can be mapped to a unique index, such as:
  > `{"I": 0, "like": 1, "to": 2, "eat": 3, "apples": 4}`.
- There are more linguistic features to consider when tokenizing a text, such as punctuation, capitalization, and so on.
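The word-to-index mapping above can be sketched in a few lines of Python. This is a minimal illustration, assuming plain whitespace splitting and indices assigned in order of first appearance; the variable names are chosen here for illustration only.

```python
# Minimal sketch: whitespace tokenization followed by a token-to-index mapping.
sentence = "I like to eat apples"

# Split on whitespace; this ignores punctuation and capitalization entirely.
tokens = sentence.split()

# Assign each distinct token the next free index, in order of first appearance.
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

# Encode the sentence as a list of indices.
ids = [vocab[token] for token in tokens]

print(tokens)  # ['I', 'like', 'to', 'eat', 'apples']
print(vocab)   # {'I': 0, 'like': 1, 'to': 2, 'eat': 3, 'apples': 4}
print(ids)     # [0, 1, 2, 3, 4]
```

A real tokenizer would also have to decide what to do with punctuation, capitalization, and words that are not in the vocabulary.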

## Why do we need tokenization?

- "How can we make a machine read a sentence?"
- Machines don’t know any language, nor do they understand sounds or phonetics.
- They need to be taught from scratch.
- The first step is to break down the sentence into smaller units that the machine can process.
- Tokenization determines how the input is represented to the model.
- This decision has a huge impact on the performance of the model.

## How do we identify words in text?

For a language like English, this seems like a simple task. We can simply split the text by spaces. 

```
A word is any sequence of alphabetical characters between whitespaces that’s not a punctuation mark?
```


However, this definition breaks down in many cases, and simple splitting rules (on whitespace, on punctuation, or both) give the wrong result:

- What about contractions?
  - "I'm" is written as one word, but splitting on punctuation breaks it into pieces, and it arguably contains two words ("I" + "am").
- What about abbreviations?
  - "U.S." is a single word, but splitting on punctuation breaks it apart.
- What about hyphenated words?
  - "self-driving" and "R2-D2" are single words, but splitting on punctuation breaks each of them into two tokens.
- What about multiword names?
  - "New York" is a single name, but splitting on whitespace yields two tokens.
- What about languages like Chinese that have no spaces between words?
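To see these failure cases concretely, the sketch below compares two naive rules on one sentence: splitting on whitespace only, and a regular expression that separates every punctuation mark. Both rules and the example sentence are illustrative assumptions, not a recommended tokenizer.

```python
import re

text = "I'm in the U.S. testing a self-driving car in New York."

# Rule 1: split on whitespace only.
# Punctuation stays glued to words ("York.") and "New York" becomes two tokens.
whitespace_tokens = text.split()

# Rule 2: runs of word characters, or single punctuation marks.
# This detaches the final period but also shreds "I'm", "U.S." and "self-driving".
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Neither rule matches our intuition about what a word is, which is why practical tokenizers rely on more elaborate rules or on learned segmentation.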

### Words aren’t just defined by blanks

Problem 1: Compounding

```
“ice cream”, “website”, “web site”, “New York-based”
```

Problem 2: Other writing systems have no blanks

```
Chinese: 我开始写小说 = 我 开始 写 小说 (I start(ed) writing novel(s))
```

Problem 3: Contractions and Clitics

```
English: “doesn’t”, “I’m”
Italian: “dirglielo” = dir + gli(e) + lo (tell + him + it)
```


### Tokenization Standards

Any actual NLP system will assume a particular tokenization standard.

- NLP systems are usually trained on particular corpora (text datasets) that everybody uses.
- These corpora often define a de facto standard.

Penn Treebank 3 standard:

- Input:
  > `"The San Francisco-based restaurant," they said, "doesn’t charge $10".`
- Output:
  > `" _ The _ San _ Francisco-based _ restaurant _ , _ " _ they _ said _ , _ " _ does _ n’t _ charge _ $ _ 10 _ " _ .`
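The splitting behavior in this example can be approximated with a single regular expression. The sketch below is a simplified imitation written for this one sentence, not the actual Penn Treebank tokenizer: it only handles the `n't` contraction, keeps hyphenated words together, and separates every other punctuation mark. It also assumes straight ASCII quotes and apostrophes.

```python
import re

text = '''"The San Francisco-based restaurant," they said, "doesn't charge $10".'''

# Alternatives are tried left to right at each position:
#   \w+(?=n't)      a word stem directly followed by "n't" (e.g. "does")
#   n't             the contraction itself, as a separate token
#   \w+(?:-\w+)*    a word, possibly hyphenated ("Francisco-based")
#   [^\w\s]         any single punctuation or symbol character
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+(?:-\w+)*|[^\w\s]")

tokens = TOKEN_PATTERN.findall(text)
print(" _ ".join(tokens))
```

A production tokenizer covers many more cases (other clitics such as `'s` and `'ll`, abbreviations, numbers with decimal points, and so on).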


### What about sentence boundaries?

How can we identify that this is two sentences?

```
Mr. Smith went to D.C. Ms. Xu went to Chicago instead.
```

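- A period often marks the end of a sentence.
- However, not every period is a sentence boundary.
- Abbreviations such as "Mr.", "D.C.", "Ms.", "U.S.", and "etc." contain periods that do not necessarily end a sentence.
- In the example, the period in "D.C." both belongs to the abbreviation and ends the first sentence.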
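The sketch below contrasts a naive "split after every period" rule with a slightly better heuristic on the example above. The heuristic is an assumption made for illustration: it uses a tiny, hypothetical list of titles that almost never end a sentence, and it only splits when the next word is capitalized. Note that "D.C." is deliberately not on the list, since here it does end the sentence; a list-based rule cannot resolve that ambiguity in general.

```python
import re

text = "Mr. Smith went to D.C. Ms. Xu went to Chicago instead."

# Naive rule: split at whitespace that follows a period.
naive = re.split(r"(?<=\.)\s+", text)

# Hypothetical list of abbreviations that are (almost) never sentence-final.
NON_FINAL_ABBREVIATIONS = {"Mr.", "Ms.", "Mrs.", "Dr."}

def split_sentences(text):
    """Heuristic splitter: end a sentence after '.', '!' or '?' unless the
    word is a known non-final abbreviation or the next word is lowercase."""
    sentences, current = [], []
    words = text.split()
    for i, word in enumerate(words):
        current.append(word)
        next_word = words[i + 1] if i + 1 < len(words) else None
        ends_with_stop = word.endswith((".", "!", "?"))
        next_is_capitalized = next_word is None or next_word[0].isupper()
        if ends_with_stop and word not in NON_FINAL_ABBREVIATIONS and next_is_capitalized:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(naive)
# ['Mr.', 'Smith went to D.C.', 'Ms.', 'Xu went to Chicago instead.']
print(split_sentences(text))
# ['Mr. Smith went to D.C.', 'Ms. Xu went to Chicago instead.']
```

The heuristic still fails on inputs such as "He lives in the U.S. Army base", where an abbreviation is followed by a capitalized word inside the same sentence.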


How many sentences are in this text?

```
"The San Francisco-based restaurant," they said, "doesn’t charge $10".
```

Answer: just one, because the comma is not a sentence boundary.

Similarly, we typically treat this also just as one sentence:

```
They said: "The San Francisco-based restaurant doesn’t charge $10".
```


### Spelling variants, typos, etc.

The same word can be written in different ways:

- with different `capitalizations`:
  - lowercase “cat” (in standard running text)
  - capitalized “Cat” (as first word in a sentence, or in titles/headlines),
  - all-caps “CAT” (e.g. in headlines)
- with different abbreviation or hyphenation styles:
  - US-based, US based, U.S.-based, U.S. based
  - US-EU relations, U.S./E.U. relations, …
- with spelling variants (e.g. regional variants of English):
  - labor vs. labour, materialize vs. materialise
- with typos (e.g. "teh" for "the")


## How many different words are there in English?

### Counting words: tokens vs. types

When counting words in text, we distinguish between word types and word tokens:

- The vocabulary of a language is the set of (unique) word types:
  > V = {a, aardvark, …, zyzzyva}
- The tokens in a document include all occurrences of the word types in that document or corpus.
- The frequency of a word (type) in a document  
  = the number of occurrences (tokens) of that type
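The distinction between types and tokens is easy to check on a toy corpus. The sketch below assumes whitespace tokenization and an invented example sentence; `collections.Counter` does the counting.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat on the cat"

# Every occurrence is a token.
tokens = text.split()

# The distinct words are the types (the vocabulary of this tiny corpus).
types = set(tokens)

# Frequency of a type = number of tokens of that type.
counts = Counter(tokens)

print(len(tokens))    # 13 tokens
print(len(types))     # 7 types
print(counts["the"])  # 4
```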


How large is the vocabulary of English (or any other language)?

- Vocabulary size = the number of distinct word types
  > Google N-gram corpus: 1 trillion tokens, 13 million word types that appear 40+ times

If you count words in text, you will find that ...

- a few words (mostly closed-class) are very frequent (the, be, to, of, and, a, in, that,…)
- most words (all open class) are very rare.
- even if you’ve read a lot of text, you will keep finding words you haven’t seen before. 
  


### Zipf’s law: the long tail

In a natural language:

- A small number of events (e.g. words) occur with high frequency
- A large number of events occur with very low frequency

![](../figs/intro_nlp/tokenization/1.png)
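Zipf's law is commonly summarized as "the frequency of a word is roughly inversely proportional to its frequency rank", so the product rank × frequency should stay in the same ballpark across ranks. The sketch below computes that table for any list of tokens; the function name `rank_frequency` is introduced here for illustration, and the toy input only demonstrates the computation. The pattern itself only becomes visible on a large corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Toy input; replace with the tokens of a real corpus to observe the long tail.
toy_tokens = "a a a a b b c".split()
for row in rank_frequency(toy_tokens):
    print(row)
# (1, 'a', 4, 4)
# (2, 'b', 2, 4)
# (3, 'c', 1, 3)
```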

#### Implications of Zipf’s Law for NLP

The good:

```
Any text will contain a number of words that are very common.
We have seen these words often enough that we know (almost) everything about them.
These words will help us get at the structure (and possibly meaning) of this text.
```

The bad:

```
Any text will contain a number of words that are rare.
We know something about these words, but haven’t seen them often enough to know everything about them.
They may occur with a meaning or a part of speech we haven’t seen before.
```

The ugly:

```
Any text will contain a number of words that are unknown to us.
We have never seen them before, but we still need to get at the structure (and meaning) of these texts.
```

#### Dealing with the bad and the ugly

NLP systems need to be able to generalize from the known to the unknown.

There are two main strategies:

- Linguistic knowledge
  - a finite set of grammatical rules is enough to generate an infinite number of sentences
- Machine learning or statistical methods
  - learn representations of words from large amounts of data that often work well for unseen words

## How do we represent words?

Option 1: Words are atomic symbols

- Each (surface) word is a unique symbol
- Add some generalization rules to map different surface forms to the same symbol
  - `Normalization`: map all variants of the same word (form) to the same canonical variant
    > (e.g. lowercase everything, normalize spellings, perhaps spell-check)
  - `Lemmatization`: map each word to its lemma (esp. in English, the lemma is still a word in the language, but lemmatized text is no longer grammatical)
  - `Stemming`: remove endings that differ among word forms (no guarantee that the resulting symbol is an actual word)
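The three generalization rules can be contrasted with a small sketch. Everything here is a toy assumption: the spelling table, the lemma table, and the suffix list are hypothetical and hand-written, whereas real systems use much larger resources or dedicated algorithms.

```python
# Hypothetical lookup tables, kept tiny for illustration.
SPELLING_VARIANTS = {"labour": "labor", "materialise": "materialize", "teh": "the"}
LEMMAS = {"went": "go", "studies": "study", "apples": "apple"}
SUFFIXES = ("ing", "ed", "es", "s")

def normalize(word):
    """Normalization: lowercase, then map known variants to one canonical spelling."""
    word = word.lower()
    return SPELLING_VARIANTS.get(word, word)

def lemmatize(word):
    """Lemmatization: look up the dictionary form; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

def crude_stem(word):
    """Stemming: strip the first matching suffix, keeping at least three characters.
    The result is not guaranteed to be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(normalize("Labour"))    # labor
print(lemmatize("studies"))   # study
print(crude_stem("studies"))  # studi
print(crude_stem("walking"))  # walk
```

The pair `study` vs. `studi` shows the difference between the last two rules: the lemma is a word of the language, the stem need not be.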


Option 2: Represent the structure of each word

```
"books" => "book N pl" (or "book V 3rd sg")
```

- This requires a morphological analyzer
- The output is often a lemma (e.g. "book") and morphological features (e.g. "N pl" for noun plural, "V 3rd sg" for verb 3rd person singular)
- This is particularly useful for languages with rich morphology (e.g. Turkish, Finnish, Hungarian, etc.)
- Less useful for languages with little inflectional morphology (e.g. English or Mandarin Chinese)
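As a rough illustration of what a morphological analyzer returns, the sketch below handles only the English `-s` suffix against a tiny hand-written lexicon. The lexicon, the function name `analyze`, and the tags `N sg` and `V base` for bare forms are assumptions made for this example; real analyzers are typically built from finite-state transducers or learned models.

```python
# Hypothetical lexicon: lemma -> set of parts of speech it can have.
LEXICON = {
    "book": {"N", "V"},
    "walk": {"N", "V"},
    "cat": {"N"},
}

def analyze(word):
    """Return all analyses of `word` as 'lemma POS features' strings."""
    analyses = []
    # Bare form: the word itself is a lemma in the lexicon.
    if word in LEXICON:
        if "N" in LEXICON[word]:
            analyses.append(f"{word} N sg")
        if "V" in LEXICON[word]:
            analyses.append(f"{word} V base")
    # Suffixed form: strip "-s" and check whether the remainder is a known lemma.
    if word.endswith("s"):
        lemma = word[:-1]
        tags = LEXICON.get(lemma, set())
        if "N" in tags:
            analyses.append(f"{lemma} N pl")
        if "V" in tags:
            analyses.append(f"{lemma} V 3rd sg")
    return analyses

print(analyze("books"))  # ['book N pl', 'book V 3rd sg']
print(analyze("cats"))   # ['cat N pl']
print(analyze("xyz"))    # []
```

The ambiguity of "books" (noun plural or verb form) is returned as two analyses; choosing between them requires context.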