# **Tokenization in Natural Language Processing**

## **Tokenization**

Tokenization is the process of splitting a large chunk of text, such as a sentence, paragraph, or document, into smaller units (single words or combinations of words). These smaller units are known as tokens.

>  <img src="https://curator-production.s3.us.cloud-object-storage.appdomain.cloud/uploads/course-v1:IBMSkillsNetwork+GPXX0A7BEN+v1.jpg" height="200" width="400" align="centre">

**Why do we need tokenization?**

Tokenization is essential because machines can only process numerical data. To enable machines to comprehend raw text, we must first divide it into individual words. These words are then typically encoded into numeric formats, such as using the Bag of Words model or the TF-IDF method. In essence, tokenization is a crucial step in transforming raw text into a format that machines can understand.

In the NLP pipeline, tokenization is the initial step of data processing. It facilitates the subsequent analysis of textual data by extracting useful features from the text.
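
To make the step from tokens to numbers concrete, here is a minimal sketch of a Bag of Words encoding using only the standard library. The function name `bag_of_words` and the two sample sentences are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors) for a list of raw strings."""
    # Whitespace tokenization plus lowercasing, purely for illustration.
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives every token a fixed column index.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One count per vocabulary entry; missing tokens count as 0.
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each document becomes a vector of token counts, which is the kind of numeric input a machine learning model can consume.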

## **Types of Tokenizers in NLP**

Different types of tokenizers are used depending on the specific scenario. For instance, if we are developing an NLP-based phishing email detector, we first need to break down the email content into words using a word tokenizer. Similarly, if we want to analyze a paragraph sentence by sentence, we would use a sentence tokenizer.

The NLTK library in Python provides several types of tokenizers, including:


*   Word Tokenizer
*   Sentence Tokenizer
*   Tweet Tokenizer
*   Regex Tokenizer







### **Word Tokenizer**

A word tokenizer splits text into individual words. In Python, the simplest approach is the split() method, which divides the text on whitespace by default. This method, known as whitespace tokenization, often falls short: it leaves punctuation attached to words (for example, "home.") and does not split contractions such as "can't," "hasn't," and "wouldn't" into their parts. The NLTK word tokenizer is more effective because it separates punctuation and splits contractions, while keeping words like "o'clock," which contain an apostrophe but should not be split, intact.


In [4]:
document = '''At five o'clock in the morning I went to railway station near by my home.
              I'll never go to that railway station again.
              '''
print(document)

At five o'clock in the morning I went to railway station near by my home.
              I'll never go to that railway station again.
              


Whitespace tokenization

In [5]:
print(document.split())

['At', 'five', "o'clock", 'in', 'the', 'morning', 'I', 'went', 'to', 'railway', 'station', 'near', 'by', 'my', 'home.', "I'll", 'never', 'go', 'to', 'that', 'railway', 'station', 'again.']


NLTK word tokenizer

In [6]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
words = word_tokenize(document)
print(words)

['At', 'five', "o'clock", 'in', 'the', 'morning', 'I', 'went', 'to', 'railway', 'station', 'near', 'by', 'my', 'home', '.', 'I', "'ll", 'never', 'go', 'to', 'that', 'railway', 'station', 'again', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


As the output above shows, whitespace tokenization leaves the contraction "I'll" as a single token and keeps the period attached to the words "home" and "again". NLTK's word tokenizer, on the other hand, separates the punctuation and splits "I'll" into "I" and "'ll", while leaving "o'clock" intact as a single token.
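
As a rough illustration of how punctuation can be separated from words without NLTK, the sketch below uses a single regular expression. The pattern and the name `simple_word_tokenize` are assumptions for this example, not NLTK's actual rules; in particular, this pattern keeps "I'll" whole instead of splitting it into "I" and "'ll".

```python
import re

# Hypothetical pattern: a word with an optional apostrophe suffix,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into words and standalone punctuation marks."""
    return TOKEN_PATTERN.findall(text)

sample = "At five o'clock I went home. I'll never go again."
tokens = simple_word_tokenize(sample)
print(tokens)
# ['At', 'five', "o'clock", 'I', 'went', 'home', '.', "I'll", 'never', 'go', 'again', '.']
```

A pattern like this fixes the attached-period problem, but deciding which apostrophes mark contractions needs the extra rules that a dedicated tokenizer provides.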

### **Sentence Tokenizer**

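Sentence tokenization splits text into individual sentences. Splitting on the period ('.') alone is a common first idea, but periods also appear in abbreviations and numbers, so NLTK's sentence tokenizer relies on the pre-trained Punkt model (downloaded earlier with nltk.download('punkt')) to decide where sentences end. Let's have a look at the NLTK sentence tokenizer in the code below.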

In [5]:
document = '''At five o'clock in the morning I went to railway station near by my home.
              I'll never go to that railway station again.
              '''
print(document)

At five o'clock in the morning I went to railway station near by my home.
              I'll never go to that railway station again.
              


NLTK sentence tokenizer

In [6]:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(document)
print(sentences)

["At five o'clock in the morning I went to railway station near by my home.", "I'll never go to that railway station again."]
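
To see why sentence splitting needs more than a rule about periods, here is a small sketch of a naive splitter built on the standard library. The function `naive_sent_split` and the second sample sentence are made up for illustration.

```python
import re

def naive_sent_split(text):
    """Split on whitespace that directly follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sent_split("I went home. I'll never go again."))
# ['I went home.', "I'll never go again."]

# Abbreviations end with a period too, so the naive rule over-splits.
print(naive_sent_split("Dr. Smith arrived at 5 p.m. He left early."))
# ['Dr.', 'Smith arrived at 5 p.m.', 'He left early.']
```

The naive rule works on the simple document used above but breaks "Dr." away from its sentence, which is the kind of case a trained sentence tokenizer is designed to handle.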


### **Tweet Tokenizer**

A problem with the word tokenizer is that it does not handle emojis, emoticons, and social media constructs such as hashtags well. These are very common in tweets and similar text.

In [7]:
msg="WIN with PRIMI to celebrate #NationalPizzaDay! RT this tweet, tell us what your fav PRIMI PIZZA is & stand to WIN a R250 voucher. This is gr8 <3  🥳🍕"
print(msg)

WIN with PRIMI to celebrate #NationalPizzaDay! RT this tweet, tell us what your fav PRIMI PIZZA is & stand to WIN a R250 voucher. This is gr8 <3  🥳🍕


In [8]:
from nltk.tokenize import word_tokenize
print(word_tokenize(msg))

['WIN', 'with', 'PRIMI', 'to', 'celebrate', '#', 'NationalPizzaDay', '!', 'RT', 'this', 'tweet', ',', 'tell', 'us', 'what', 'your', 'fav', 'PRIMI', 'PIZZA', 'is', '&', 'stand', 'to', 'WIN', 'a', 'R250', 'voucher', '.', 'This', 'is', 'gr8', '<', '3', '🥳🍕']


In [9]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

tknzr.tokenize(msg)

['WIN',
 'with',
 'PRIMI',
 'to',
 'celebrate',
 '#NationalPizzaDay',
 '!',
 'RT',
 'this',
 'tweet',
 ',',
 'tell',
 'us',
 'what',
 'your',
 'fav',
 'PRIMI',
 'PIZZA',
 'is',
 '&',
 'stand',
 'to',
 'WIN',
 'a',
 'R250',
 'voucher',
 '.',
 'This',
 'is',
 'gr8',
 '<3',
 '🥳',
 '🍕']

As the output above shows, the word tokenizer breaks the emoticon '<3' into '<' and '3' and leaves the two emojis joined as a single token '🥳🍕', which is undesirable when processing tweets.

Emojis and emoticons carry significance in areas like sentiment analysis, where a happy or sad face alone can be a strong predictor of sentiment. Similarly, the word tokenizer splits the hashtag '#NationalPizzaDay' into two tokens, '#' and 'NationalPizzaDay'. Hashtags are used to search for specific topics or photos on social media apps such as Instagram and Facebook, so we want to keep each hashtag as it is.

The tweet tokenizer, by contrast, keeps '<3' and '#NationalPizzaDay' intact and separates the two emojis into individual tokens.
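
The idea behind a tweet-aware tokenizer can be sketched with an ordered regular expression in which the more specific alternatives (hashtags, emoticons) are tried before the generic ones. The pattern below is a simplified stand-in written for this example and is not the pattern NLTK's TweetTokenizer uses; `tweet_like_tokenize` is a hypothetical name.

```python
import re

TWEET_PATTERN = re.compile(
    r"""
    [#@]\w+          # hashtags and @mentions kept whole
    | <3             # heart emoticon
    | [:;]-?[()DP]   # a few common smileys
    | \w+            # ordinary words and numbers
    | [^\w\s]        # any other single symbol, including each emoji
    """,
    re.VERBOSE,
)

def tweet_like_tokenize(text):
    """Tokenize text, trying the alternatives above from top to bottom."""
    return TWEET_PATTERN.findall(text)

msg = "This is gr8 <3  🥳🍕 #NationalPizzaDay!"
print(tweet_like_tokenize(msg))
# ['This', 'is', 'gr8', '<3', '🥳', '🍕', '#NationalPizzaDay', '!']
```

Because the hashtag and emoticon alternatives come first, '#' and '<' are never consumed as standalone symbols when they start a larger unit.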

### **Regex Tokenizer**

The regex tokenizer takes a regular expression and returns the tokens that match the pattern it specifies.

Suppose we want to extract only the hashtags from the tweet message. The code below shows how to use the regex tokenizer for this.

In [10]:
from nltk.tokenize import regexp_tokenize
message = "WIN with PRIMI to celebrate #NationalPizzaDay! RT this tweet, tell us what your fav PRIMI PIZZA is & stand to WIN a R250 voucher. This is gr8 <3  🥳🍕"
pattern = r"#\w+"  # raw string: '#' followed by one or more word characters

In [11]:
regexp_tokenize(message, pattern)

['#NationalPizzaDay']

As the output above shows, the regex tokenizer returns only the text that matches the pattern, in this case the hashtag '#NationalPizzaDay'.
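
For a simple pattern like this one, the same two ways of using a regular expression (matching the tokens themselves, or matching the gaps between them) can be sketched with the standard library's `re` module. The sample message here is shortened and made up for illustration; NLTK's regexp_tokenize exposes a similar choice through its `gaps` parameter.

```python
import re

message = "Loving #NationalPizzaDay with @PRIMI! #pizza"

# Pattern describes the tokens: keep every hashtag.
hashtags = re.findall(r"#\w+", message)
print(hashtags)  # ['#NationalPizzaDay', '#pizza']

# Pattern describes the separators: split on runs of non-word characters
# and drop any empty strings left at the edges.
pieces = [p for p in re.split(r"\W+", message) if p]
print(pieces)  # ['Loving', 'NationalPizzaDay', 'with', 'PRIMI', 'pizza']
```

Which form is more convenient depends on whether the tokens or the separators are easier to describe.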

### **Byte Pair Encoding (BPE) Algorithm**

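BPE was originally a data compression algorithm that represents data more compactly by repeatedly replacing the most common pair of adjacent bytes with a single new symbol. In NLP, the same idea is applied to characters: frequent character sequences are merged into subword tokens, so that text can be represented with a relatively small number of tokens.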

Working process:

* Append an identifier (`</w>`) to the end of each word to mark the word boundaries, then calculate the frequency of each word in the text.
* Split the words into individual characters and compute the frequency of each character.
* Identify the most frequent pair of consecutive symbols among the current tokens and merge it into a single new token.
* Repeat the merge step until the predefined iteration limit or the token limit is reached.
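
Before the full implementation, the steps above can be traced for a single iteration on a toy example. The word frequencies below are made up for illustration, and words are stored as tuples of symbols here, which differs from the space-separated strings used in the rest of this notebook.

```python
from collections import Counter

# Toy word frequencies (made up for illustration).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Split each word into characters and append the end-of-word marker.
vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}

# Count adjacent symbol pairs, weighted by how often each word occurs.
pair_counts = Counter()
for symbols, freq in vocab.items():
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[(left, right)] += freq

# Ties are broken by whichever pair was seen first.
best_pair = max(pair_counts, key=pair_counts.get)
print(best_pair, pair_counts[best_pair])  # ('e', 's') 9

def merge(symbols, pair):
    """Replace every adjacent occurrence of pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if symbols[i:i + 2] == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

vocab = {merge(symbols, best_pair): freq for symbols, freq in vocab.items()}
print(list(vocab))
```

After this one merge, "newest" and "widest" both contain the new symbol 'es', and the next iteration would count pairs over the updated symbols.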

Import Libraries

re: Provides regular expression matching operations.

defaultdict: A subclass of the dictionary that calls a factory function to supply missing values.

In [28]:
import re
from collections import defaultdict

Computes the frequency of consecutive character pairs in the vocabulary.

In [29]:
def get_stats(vocab):
    """
    Given a vocabulary (dictionary mapping words to frequency counts), returns a
    dictionary of tuples representing the frequency count of pairs of characters
    in the vocabulary.
    """
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs

Merges the most frequent pair of characters in the vocabulary.

In [30]:
def merge_vocab(pair, v_in):
    """
    Given a pair of characters and a vocabulary, returns a new vocabulary with the
    pair of characters merged together wherever they appear.
    """
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

Creates a vocabulary with the frequency of each word, where words are split into characters and end with `</w>`.

In [31]:
def get_vocab(data):
    """
    Given a list of strings, returns a dictionary of words mapping to their frequency
    count in the data.
    """
    vocab = defaultdict(int)
    for line in data:
        for word in line.split():
            vocab[' '.join(list(word)) + ' </w>'] += 1
    return vocab

Performs the Byte Pair Encoding algorithm for n iterations.

In [32]:
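def byte_pair_encoding(data, n):
    """
    Given a list of strings and an integer n, performs up to n merges of the
    most frequent pair of adjacent symbols and returns the resulting vocabulary
    (dictionary mapping space-separated symbols to word frequency counts).
    """
    vocab = get_vocab(data)
    for i in range(n):
        pairs = get_stats(vocab)
        if not pairs:
            # Every word is already a single symbol, so nothing is left to merge.
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
    return vocab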


In [33]:
# Example usage:
corpus = '''Tokenization is the process of breaking down
a sequence of text into smaller units called tokens,
which can be words, phrases, or even individual characters.
Tokenization is often the first step in natural languages processing tasks
such as text classification, named entity recognition, and sentiment analysis.
The resulting tokens are typically used as input to further processing steps,
such as vectorization, where the tokens are converted
into numerical representations for machine learning models to use.'''
data = corpus.split('.')

n = 230
bpe_pairs = byte_pair_encoding(data, n)
bpe_pairs

{'Tokenization</w>': 2,
 'is</w>': 2,
 'the</w>': 3,
 'process</w>': 1,
 'of</w>': 2,
 'breaking</w>': 1,
 'down</w>': 1,
 'a</w>': 1,
 'sequence</w>': 1,
 'text</w>': 2,
 'into</w>': 2,
 'smaller</w>': 1,
 'units</w>': 1,
 'called</w>': 1,
 'tokens,</w>': 1,
 'which</w>': 1,
 'can</w>': 1,
 'be</w>': 1,
 'words,</w>': 1,
 'phrases,</w>': 1,
 'or</w>': 1,
 'even</w>': 1,
 'individual</w>': 1,
 'characters</w>': 1,
 'often</w>': 1,
 'first</w>': 1,
 'step</w>': 1,
 'in</w>': 1,
 'natural</w>': 1,
 'languages</w>': 1,
 'processing</w>': 2,
 'tasks</w>': 1,
 'such</w>': 2,
 'as</w>': 3,
 'classification,</w>': 1,
 'named</w>': 1,
 'entity</w>': 1,
 'recognition,</w>': 1,
 'and</w>': 1,
 'sentiment</w>': 1,
 'analysis</w>': 1,
 'The</w>': 1,
 'resulting</w>': 1,
 'tokens</w>': 2,
 'are</w>': 2,
 'typically</w>': 1,
 'used</w>': 1,
 'input</w>': 1,
 'to</w>': 2,
 'further</w>': 1,
 'steps,</w>': 1,
 'vectorization,</w>': 1,
 'where</w>': 1,
 'converted</w>': 1,
 'numerical</w>': 1,
 'repres

**Result:**
The output shows the vocabulary after running the Byte Pair Encoding (BPE) algorithm for 230 iterations on the provided corpus. Each key in the resulting dictionary is a word written as its current sequence of merged symbols, and each value is the number of times that word occurs in the corpus. Because 230 merges is large relative to this small corpus, the entries shown have each been merged back into a single symbol ending in `</w>`. The algorithm repeatedly merges the most frequently occurring pair of adjacent symbols until the designated number of iterations is completed.
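
In practice, the ordered list of merges is usually what gets reused to tokenize new text. The sketch below is one possible variant, not part of the notebook's code above: `learn_merges` and `segment` are hypothetical helpers that record each merged pair and then replay the merges on an unseen word. The toy word list is made up for illustration.

```python
import re
from collections import defaultdict

def learn_merges(words, num_merges):
    """Return the ordered list of merged pairs learned from a list of words."""
    vocab = defaultdict(int)
    for word in words:
        vocab[" ".join(word) + " </w>"] += 1
    merges = []
    for _ in range(num_merges):
        pairs = defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[a, b] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        # Match the pair only as whole, space-delimited symbols.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
        merges.append(best)
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to a single new word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_merges(words, 3)
print(merges)                     # [('e', 's'), ('es', 't'), ('est', '</w>')]
print(segment("lowest", merges))  # ['l', 'o', 'w', 'est</w>']
```

The word "lowest" never appears in the toy data, yet it is still split into known pieces, which is the main reason subword tokenization is used for handling unseen words.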
