<img src="data/images/div/lecture-notebook-header.png" />

# Tokenization

Tokenization is a fundamental process in Natural Language Processing (NLP) that involves breaking down a text or sentence into smaller units called tokens. A token is typically a word, but it can also be a character, subword, or any other meaningful unit of text. Tokenization plays a crucial role in NLP tasks because many algorithms and models operate at the token level, treating each token as a discrete unit for further analysis or processing. By splitting text into tokens, we gain a structured representation that can be leveraged for various NLP tasks, such as text classification, named entity recognition, machine translation, sentiment analysis, and more.

The tokenization process can vary depending on the specific requirements of the task or the language being processed. Here are some common approaches to tokenization:

* **Character Tokenization:** In character tokenization, each character becomes a separate token. For example, the word "tokenization" would be tokenized into ["t", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"].

* **Word Tokenization:** This method splits text into individual words. For example, the sentence "Tokenization is important for NLP tasks" would be tokenized into the tokens ["Tokenization", "is", "important", "for", "NLP", "tasks"].

* **Subword Tokenization:** Subword tokenization splits text into smaller units, such as subword pieces or morphemes. This approach is useful for languages with complex morphology or when dealing with out-of-vocabulary (OOV) words. Examples of subword tokenization algorithms include Byte-Pair Encoding (BPE) and WordPiece.
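
The three approaches above can be sketched in plain Python. This is only an illustration: the word tokenizer is a naive whitespace split, and the subword part uses a greedy longest-match over a tiny, made-up vocabulary (`TOY_VOCAB` and `greedy_subwords` are hypothetical names). Real BPE or WordPiece vocabularies are learned from data.

```python
text = "Tokenization is important for NLP tasks"

# Character tokenization: every character becomes a token
char_tokens = list("tokenization")

# Word tokenization (naive): split on runs of whitespace
word_tokens = text.split()

# Subword tokenization (toy): greedy longest-match against a tiny,
# hand-picked vocabulary (an assumption made for this example only)
TOY_VOCAB = {"token", "ization", "iz", "ation"}

def greedy_subwords(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = greedy_subwords("tokenization", TOY_VOCAB)

print(char_tokens)
print(word_tokens)
print(subword_tokens)  # ['token', 'ization']
```

Note how the subword split can still represent a word that is not in the vocabulary as a whole, which is the main appeal for out-of-vocabulary words.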

Tokenization is not always a straightforward process, as certain challenges can arise. For instance, handling contractions ("can't" split into "can" and "'t") or punctuation ("Mr. Smith" treated as ["Mr", ".", "Smith"]) requires careful consideration. Tokenization is typically one of the initial steps in NLP pipelines, serving as a foundation for subsequent tasks such as text preprocessing, feature extraction, and model training. The resulting tokens can be further processed, encoded, or represented in numerical form for computational analysis. Various NLP libraries and frameworks provide built-in tokenization functionality, making it easier to tokenize text in different languages and handle various tokenization requirements.

In this notebook, we focus on **word tokenization**, as it is the most fundamental approach for most NLP tasks.

## Setting up the Notebook


### Import all Required Packages

We use NLTK and spaCy, two very popular and mature Python packages for language processing.


In [None]:
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer

from nltk import word_tokenize # Convenience function built on an improved Treebank-style tokenizer


import spacy

# Load English language model (if missing, check out: https://spacy.io/models/en)
nlp = spacy.load('en_core_web_md')  

NLTK provides more tokenizers: http://www.nltk.org/api/nltk.tokenize.html

### Define an Example Document

We first create a list of sentences that will form our example document.

In [None]:
sentences = ["Text processing with Python is great.", 
             "It isn't (very) complicated to get started.",
             "However,careful to...you know....avoid mistakes.",
             "Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg.",
             "This is so cooool #nltkrocks :))) :-P <3."]

To form the document, we can use the in-built `join()` method to concatenate all sentences using a whitespace as separator.

In [None]:
document = ' '.join(sentences)

# Print the document to see if everything looks alright
print (document)

---

## Tokenization with NLTK

### Document tokenization into sentences

Sometimes, you just want to split a document into sentences and not individual tokens.

In [None]:
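# Created without training text, so no pre-trained abbreviation list is used;
# nltk.sent_tokenize() loads NLTK's pre-trained English Punkt model instead
sentence_tokenizer = PunktSentenceTokenizer()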

# The tokenize() method returns a list containing the sentences
sentences_alt = sentence_tokenizer.tokenize(document)

# Loop over all sentences and print each sentence
for s in sentences_alt:
    print (s)
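
To see why a dedicated sentence tokenizer is useful, consider a naive baseline that splits after every `.`, `!`, or `?` followed by whitespace. The regular expression and the example text below are assumptions made for illustration only; they are not what Punkt does internally.

```python
import re

text = "Mr. Smith likes Python. It is great!"

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r'(?<=[.!?])\s+', text)

print(naive_sentences)
# ['Mr.', 'Smith likes Python.', 'It is great!']
```

The abbreviation "Mr." is wrongly treated as the end of a sentence. Abbreviation-aware approaches, such as a trained Punkt model, aim to avoid this kind of error.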

### Document tokenization into tokens

In the following, we tokenize each sentence individually. This makes the presentation a bit more convenient. In practice, you can tokenize the whole document at once.

#### Naive tokenization

Python provides a built-in string method `split()` that splits a string with respect to a user-defined separator. If no separator is given, the string is split on runs of whitespace.

In [None]:
print ('\nOutput of split() method:')
for s in sentences:
    print (s.split(' '))
    #print(s.split()) # Similar, but without an argument any run of whitespace acts as one separator

The limitation of this approach is obvious, since many tokens are not separated by a whitespace. Most commonly this is the case for punctuation marks.
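
As a small aside, `split(' ')` and `split()` are not quite identical. The made-up sentence below contains two consecutive spaces to show the difference, and it also shows punctuation staying attached to the neighboring word in both cases.

```python
text = "It isn't  (very) complicated."  # note the two spaces after "isn't"

# Explicit separator: consecutive spaces produce an empty string
spaces_split = text.split(' ')

# No argument: any run of whitespace counts as a single separator
whitespace_split = text.split()

print(spaces_split)      # ['It', "isn't", '', '(very)', 'complicated.']
print(whitespace_split)  # ['It', "isn't", '(very)', 'complicated.']
```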

#### TreebankWordTokenizer

The `TreebankWordTokenizer` is a tokenizer available in the Natural Language Toolkit (NLTK) library for Python. It is specifically designed to tokenize text according to the conventions of the Penn Treebank. The Penn Treebank is a widely used corpus of annotated English text that has been extensively used in natural language processing research.

The `TreebankWordTokenizer` tokenizes text by following the rules and conventions defined in the Penn Treebank. It splits text into words and punctuation marks while considering specific cases such as contractions, hyphenated words, and punctuation attached to words. NLTK's `word_tokenize()` function builds on a Treebank-style tokenizer, so this style of tokenization is effectively NLTK's default. This tokenizer is commonly used for tasks that rely on the Penn Treebank tokenization conventions, such as training and evaluating language models, part-of-speech tagging, syntactic parsing, and other NLP tasks that benefit from consistent tokenization based on the Penn Treebank guidelines.

In [None]:
treebank_tokenizer = TreebankWordTokenizer()

print ('\nOutput of TreebankWordTokenizer:')
for s in sentences:
    print (treebank_tokenizer.tokenize(s))

print ('\nOutput of the word_tokenize() method:')
for s in sentences:
    print (word_tokenize(s))   

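For these sentences, both outputs should be largely the same, since `word_tokenize()` builds on an improved Treebank-style tokenizer to simplify the coding. It additionally splits the input into sentences with Punkt before tokenizing the words.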

See how this tokenizer also splits common contractions such as *isn't*, *hasn't*, *haven't*. Other tokenizers (see below) treat such contractions as one token. Being aware of how this is handled matters, for example, in sentiment analysis, where handling negations correctly is essential to get the sentiment right.

Also, notice how the tokenizer handles the ellipsis (`...`) correctly in the first case but fails in the second case, since an ellipsis is by definition composed of exactly three dots. Sequences of more or fewer than three dots are not handled properly.


#### TweetTokenizer

The `TweetTokenizer` is an NLTK tokenizer designed for tweets and other social media text. It is tailored to the conventions often found in tweets: it recognizes and tokenizes hashtags, user mentions (starting with "@"), URLs, emoticons, and other common patterns. This allows for a more accurate tokenization of social media text than general-purpose tokenizers provide.

In [None]:
tweet_tokenizer = TweetTokenizer()

print ('Output of TweetTokenizer:')
for s in sentences:
    print (tweet_tokenizer.tokenize(s))

Here, both ellipses are recognized, with the second one even "corrected" to three dots.

Note how the tokenizer fails with `:)))`. The problem is that it is not the "official version" of the emoticon -- which is `:)` or `:-)` -- but uses multiple "mouths" to emphasize the expressed sentiment. If the subsequent analysis does not really depend on it, a few extra `)` are no big deal in many cases.

The three basic alternatives to properly address this issue:
- Clean your text before tokenizing
- Remove all "odd" tokens from the list before further processing
- Write your own sophisticated tokenizer :-)
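
The first alternative, cleaning the text before tokenizing, could look roughly like the following sketch. The function name `clean_text` and both regular expressions are assumptions chosen for this example; which normalizations are appropriate depends on the task.

```python
import re

def clean_text(text):
    # Collapse runs of four or more dots into a standard three-dot ellipsis
    text = re.sub(r'\.{4,}', '...', text)
    # Collapse repeated emoticon "mouths", e.g. ":)))" -> ":)" or ":-((" -> ":-("
    text = re.sub(r'([:;]-?)([()])\2+', r'\1\2', text)
    return text

print(clean_text("you know....avoid mistakes"))  # you know...avoid mistakes
print(clean_text("so cooool :))) :-P"))          # so cooool :) :-P
```

Keep in mind that such cleaning discards information: the repeated "mouths" may themselves signal a stronger sentiment.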


#### RegexpTokenizer

The `RegexpTokenizer` is a tokenizer available in the Natural Language Toolkit (NLTK) library for Python. It is a customizable tokenizer that uses regular expressions to split text into tokens based on specified patterns. It allows you to define a regular expression pattern that matches the desired tokens. It tokenizes text by identifying substrings that match the specified pattern and returning them as individual tokens.

You can customize the regular expression pattern according to your specific tokenization requirements. For example, if you want to tokenize based on specific characters, you can modify the pattern accordingly. Additionally, you can use character classes, quantifiers, and other regular expression constructs to define more complex tokenization patterns.

The `RegexpTokenizer` provides flexibility and fine-grained control over the tokenization process. It allows you to adapt the tokenizer to the specific needs of your text data and the requirements of your NLP task. By defining appropriate regular expression patterns, you can tokenize text in a way that suits your specific use case, such as handling specialized domains, custom abbreviations, or other text patterns.

In [None]:
# Each assignment overwrites the previous one; only the last pattern is used below
pattern = r'\w+' # letters, digits, and underscores
pattern = r'[a-zA-Z]+' # ASCII letters only (no digits)
pattern = r"[a-zA-Z']+" # ASCII letters and apostrophes (keeps contractions together)
regexp_tokenizer = RegexpTokenizer(pattern)

print ('\nOutput of RegexpTokenizer for pattern {}:'.format(pattern))
for s in sentences:
    print (regexp_tokenizer.tokenize(s))
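
To compare the effect of the three patterns side by side without re-running the cell, the following sketch uses Python's standard `re.findall()`, which behaves much like `RegexpTokenizer` in its default mode of returning the substrings that match the pattern. The example sentence and the dictionary keys are made up for illustration.

```python
import re

text = "It isn't (very) complicated: 2 steps."

patterns = {
    "word characters": r"\w+",
    "letters only": r"[a-zA-Z]+",
    "letters and apostrophes": r"[a-zA-Z']+",
}

# Apply every pattern to the same text and collect the matches
results = {name: re.findall(p, text) for name, p in patterns.items()}

for name, tokens in results.items():
    print(name, "->", tokens)
# word characters -> ['It', 'isn', 't', 'very', 'complicated', '2', 'steps']
# letters only -> ['It', 'isn', 't', 'very', 'complicated', 'steps']
# letters and apostrophes -> ['It', "isn't", 'very', 'complicated', 'steps']
```

All three patterns silently drop punctuation, and only the last one keeps the contraction in one piece.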

---

## Tokenization with spaCy

spaCy's tokenizer is responsible for breaking down text into individual linguistic units, such as words or punctuation marks. It follows a set of rules and heuristics to determine where to split the text. Here's a general overview of how the tokenizer in spaCy works:

* **Whitespace Splitting:** The tokenizer initially splits the text on whitespace characters (e.g., spaces, tabs, newlines) to create token candidates.

* **Prefix Rules:** The tokenizer applies a set of prefix rules to identify and split off leading punctuation marks, such as opening quotation marks or brackets. For example, the tokenizer would split "(Hello" into two tokens, "(" and "Hello".

* **Suffix Rules:** Similarly, the tokenizer applies suffix rules to identify and split off trailing punctuation marks, such as periods or closing quotation marks. For example, it would split "example." into two tokens, "example" and ".".

* **Infixes:** The tokenizer then looks for infixes, which are sequences of characters that appear within a word, such as hyphens or slashes. It uses rules to determine where to split at these infixes, typically when they indicate word boundaries. For instance, the tokenizer would split "well-known" into "well", "-", and "known".

* **Special Cases:** spaCy's tokenizer handles special cases, such as contractions, abbreviations, emoticons, or currency symbols, where the standard rules might not apply. It uses language-specific knowledge and a customizable list of special case rules to tokenize these instances correctly. For example, "can't" is split into "ca" and "n't" by such a special-case rule.

* **Tokenization Exceptions:** spaCy provides a mechanism to define exceptions to the tokenization rules using custom tokenization patterns. This allows users to override the default behavior and handle specific cases according to their needs.

* **Post-processing:** spaCy's tokenization is non-destructive: whitespace following a token is stored with the token rather than discarded, so the original text can be reconstructed from the tokens. Normalized forms, such as the lowercase form or the lemma, are exposed through token attributes and later pipeline components rather than by altering the token text.

The figure below illustrates the tokenization process by applying the different rules and heuristics. This image is taken directly from the [spaCy website](https://spacy.io/usage/spacy-101):

<img src='data/images/screenshots/spacy-tokenizer-example.png' />
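
The interplay of whitespace splitting, special cases, prefixes, and suffixes described above can be approximated with a toy loop in plain Python. This is a heavily simplified sketch and not spaCy's implementation: the rule sets `SPECIAL_CASES`, `PREFIXES`, and `SUFFIXES` and the function `toy_tokenize` are hypothetical, and infixes are not handled at all.

```python
# Hypothetical, tiny rule sets (spaCy's real rules are far richer)
SPECIAL_CASES = {"can't": ["ca", "n't"], "U.K.": ["U.K."]}
PREFIXES = ('(', '"')
SUFFIXES = (')', '"', '.', ',', '!', '?')

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():               # 1. split on whitespace
        trailing = []                        # suffixes, collected right to left
        while chunk:
            if chunk in SPECIAL_CASES:       # 2. special cases win over other rules
                tokens.extend(SPECIAL_CASES[chunk])
                chunk = ""
            elif chunk.startswith(PREFIXES): # 3. split off one prefix character
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif chunk.endswith(SUFFIXES):   # 4. split off one suffix character
                trailing.append(chunk[-1])
                chunk = chunk[:-1]
            else:                            # 5. nothing left to split
                tokens.append(chunk)
                chunk = ""
        tokens.extend(reversed(trailing))    # restore the original suffix order
    return tokens

print(toy_tokenize("(I can't go to the U.K.!)"))
# ['(', 'I', 'ca', "n't", 'go', 'to', 'the', 'U.K.', '!', ')']
```

Checking the special cases on every iteration is what keeps "U.K." intact: once the trailing ")" and "!" have been split off, the remaining chunk matches a special case before the suffix rule can remove its final period.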


spaCy's tokenizer is designed to be highly customizable and can be adjusted to accommodate specific domain requirements or languages. It forms the foundation for many subsequent natural language processing tasks, such as part-of-speech tagging, named entity recognition, and dependency parsing.

### Process Document

In contrast to NLTK, spaCy is typically used by passing a string through a processing pipeline, which performs not only tokenization but also further steps (see later tutorial). Here, we only look at the tokens.

Again, we process each sentence individually to simplify the output.

In [None]:
print ('\nOutput of spaCy tokenizer:')
for s in sentences:
    doc = nlp(s) # doc is an object, not just a simple list
    # Let's create a list so the output matches the previous ones
    token_list = []
    for token in doc:
        token_list.append(token.text) # token is also an object, not a string
    print (token_list)

spaCy does a bit better with the uncommon emoticon, but it splits the hashtag. However, spaCy allows you to customize the rules and heuristics to better suit your specific requirements, which are typically determined by your task or application.

---

## Summary

Tokenization is a fundamental step in Natural Language Processing (NLP) that involves breaking down text into smaller units called tokens. These tokens can be individual words, subwords, characters, or even linguistic units like sentences or paragraphs. Tokenization serves as the first step in processing raw text data to enable further analysis and modeling. The primary goal of tokenization is to establish a standardized representation of textual data that can be easily understood and processed by NLP algorithms. By breaking down text into tokens, NLP models can effectively capture the structure, semantics, and statistical properties of language.

Tokenization techniques vary depending on the specific requirements and characteristics of the task at hand. The most common form of tokenization is word tokenization, where text is segmented into individual words. This approach assumes that words are the basic building blocks of meaning in language. However, in certain cases, subword tokenization can be used to handle out-of-vocabulary words, rare or misspelled words, or languages with complex morphology.

Tokenization is not limited to word boundaries; it can also involve the identification of sentence boundaries, paragraphs, or even smaller linguistic units like named entities or syntactic constituents. Each token can carry important linguistic information, such as its part of speech, lemma, or position, which is typically added by later processing steps and can be used for subsequent analysis. In addition to the basic segmentation of text, tokenization often involves other tasks, such as normalizing text by removing punctuation, converting text to lowercase, or handling special cases like contractions or abbreviations. These additional steps help to improve the quality and consistency of the tokenized output.

Overall, tokenization plays a crucial role in NLP by transforming raw text into structured data that can be readily processed, analyzed, and modeled by various NLP algorithms and techniques. It serves as a foundation for a wide range of NLP tasks, including sentiment analysis, machine translation, named entity recognition, text classification, and more.
