<h1>Step 1: Tokenization with Basic Split</h1>

In [1]:
# Tokenization with basic split by whitespace
text = """There are multiple ways we can perform tokenization on given text data. 
We can choose any method based on language, library, and purpose of modeling."""
tokens = text.split()
print("Tokens using split():", tokens)


Tokens using split(): ['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language,', 'library,', 'and', 'purpose', 'of', 'modeling.']


The split() method splits on whitespace only, so punctuation stays attached to the neighbouring word: the output above contains 'data.', 'language,', 'library,' and 'modeling.' as tokens.
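One simple way to work around this limitation, while staying with split(), is to trim punctuation from the ends of each token afterwards. The snippet below is a minimal sketch using only the standard library; the shortened sample sentence is chosen for illustration.

```python
import string

text = "We can choose any method based on language, library, and purpose of modeling."

# Split on whitespace, then trim punctuation from both ends of each token.
tokens = [tok.strip(string.punctuation) for tok in text.split()]

# Tokens made up only of punctuation become empty strings; drop them.
tokens = [tok for tok in tokens if tok]

print("Tokens using split() + strip():", tokens)
```

This only removes leading and trailing punctuation, so characters inside a token (for example the apostrophe in "won't") are left untouched.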

<h1>Step 2: Tokenization using Regular Expressions</h1>

In [2]:
import re

text = """Characters like periods, exclamation points, and newline characters are used to separate sentences. 
But one drawback with split() is that we can only use one separator at a time! So sentence tokenization won't be foolproof with split()."""

# Tokenizing words, ignoring punctuation
tokens = re.findall(r"\w+", text)
print("Tokens using regex:", tokens)

# Tokenizing sentences using regex
sentences = re.compile(r'[.!?]\s').split(text)
print("Sentence tokens:", sentences)


Tokens using regex: ['Characters', 'like', 'periods', 'exclamation', 'points', 'and', 'newline', 'characters', 'are', 'used', 'to', 'separate', 'sentences', 'But', 'one', 'drawback', 'with', 'split', 'is', 'that', 'we', 'can', 'only', 'use', 'one', 'separator', 'at', 'a', 'time', 'So', 'sentence', 'tokenization', 'won', 't', 'be', 'foolproof', 'with', 'split']
Sentence tokens: ['Characters like periods, exclamation points, and newline characters are used to separate sentences', '\nBut one drawback with split() is that we can only use one separator at a time', "So sentence tokenization won't be foolproof with split()."]


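re.findall(r"\w+", text) returns runs of word characters, so punctuation is dropped. As a side effect, "won't" is broken into 'won' and 't', and the parentheses in "split()" disappear. For sentence tokenization, the text is split wherever a period, exclamation mark, or question mark is followed by a whitespace character. The matched delimiter is consumed, so the first two sentences lose their closing punctuation, and the second sentence starts with '\n' because only one whitespace character is matched.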
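Both patterns can be tightened a little. The sketch below is one possible variation, not the only option: the word pattern allows a single internal apostrophe, and the sentence pattern uses a lookbehind so the closing punctuation stays with its sentence. The sample text is made up for illustration.

```python
import re

text = "One separator at a time! So sentence tokenization won't be foolproof. Is it enough?"

# Word pattern that allows one internal apostrophe, so "won't" stays whole.
words = re.findall(r"\w+(?:'\w+)?", text)
print("Word tokens:", words)

# Split on the whitespace that follows ., ! or ?.
# The lookbehind only checks for the punctuation, so it is not consumed.
sentences = re.split(r"(?<=[.!?])\s+", text)
print("Sentence tokens:", sentences)
```

Abbreviations such as "Dr." or "e.g." would still be split incorrectly by this pattern, which is one reason to prefer a dedicated sentence tokenizer.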

<h1>Step 3: Tokenization using NLTK</h1>

In [4]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

# Word tokenization using NLTK
text = """There are multiple ways we can perform tokenization on given text data. 
We can choose any method based on language, library, and purpose of modeling."""
tokens = word_tokenize(text)
print("Word tokens with NLTK:", tokens)

# Sentence tokenization using NLTK
sentences = sent_tokenize(text)
print("Sentence tokens with NLTK:", sentences)


Word tokens with NLTK: ['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', ',', 'library', ',', 'and', 'purpose', 'of', 'modeling', '.']
Sentence tokens with NLTK: ['There are multiple ways we can perform tokenization on given text data.', 'We can choose any method based on language, library, and purpose of modeling.']




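NLTK's word_tokenize separates punctuation into its own tokens ('data', '.') instead of leaving it attached to a word or dropping it, which makes the output more useful for NLP tasks. sent_tokenize splits the text into sentences and keeps each sentence's closing punctuation.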
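To see what treating punctuation as separate tokens looks like without installing NLTK, the rough standard-library approximation below may help. It is an illustrative regex only and is not how word_tokenize is implemented; NLTK applies additional rules, for example for contractions and quotes.

```python
import re

text = "We can choose any method based on language, library, and purpose of modeling."

# Match either a run of word characters or a single character that is
# neither a word character nor whitespace (i.e. a punctuation mark).
tokens = re.findall(r"\w+|[^\w\s]", text)
print("Word and punctuation tokens:", tokens)
```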

<h1>Step 4: Tokenization using SpaCy (English and Multilingual)</h1>

In [None]:
import spacy

# English Tokenization with SpaCy
nlp_en = spacy.blank("en")
text_en = """There are multiple ways we can perform tokenization on given text data."""
doc_en = nlp_en(text_en)
tokens_en = [token.text for token in doc_en]
print("Tokens with SpaCy (English):", tokens_en)

# Hindi Tokenization with SpaCy
nlp_hi = spacy.blank("hi")
text_hi = """ऐसे कई तरीके हैं जिनसे हम दिए गए टेक्स्ट डेटा पर टोकनाइजेशन कर सकते हैं।"""
doc_hi = nlp_hi(text_hi)
tokens_hi = [token.text for token in doc_hi]
print("Tokens with SpaCy (Hindi):", tokens_hi)

# Gujarati Tokenization with SpaCy
nlp_gu = spacy.blank("gu")
text_gu = """આપેલ ટેક્સ્ટ ડેટા પર આપણે ટોકનાઇઝેશન કરી શકીએ તે ઘણી રીતો છે."""
doc_gu = nlp_gu(text_gu)
tokens_gu = [token.text for token in doc_gu]
print("Tokens with SpaCy (Gujarati):", tokens_gu)


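SpaCy supports tokenization for many languages. spacy.blank() creates a blank pipeline that contains only the tokenizer for the given language code, so no trained model has to be downloaded. Here it is used to tokenize English, Hindi, and Gujarati text.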
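Language-specific rules matter because punctuation differs between scripts. The Hindi sentence above ends with a danda (।) rather than a period. The sketch below uses only the standard library, not SpaCy, to show that plain whitespace splitting leaves the danda attached to the last word, and how a single hand-written rule can separate it. The rule is an assumption made for illustration and is far simpler than what a real tokenizer does.

```python
text_hi = "ऐसे कई तरीके हैं जिनसे हम दिए गए टेक्स्ट डेटा पर टोकनाइजेशन कर सकते हैं।"

DANDA = "।"  # Devanagari danda (U+0964), used as the Hindi full stop

# Whitespace splitting leaves the danda glued to the final word.
naive = text_hi.split()
print("Last token ends with danda:", naive[-1].endswith(DANDA))

# Hand-written rule: move a trailing danda into its own token.
tokens = []
for tok in naive:
    if len(tok) > 1 and tok.endswith(DANDA):
        tokens.extend([tok[:-1], DANDA])
    else:
        tokens.append(tok)

print("Tokens with danda separated:", tokens)
```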

<h1>Step 5: Fast Tokenization with Hugging Face Tokenizers</h1>

In [17]:
from tokenizers import Tokenizer

# Load a pre-trained tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

text = """There are multiple ways we can perform tokenization on given text data."""
output = tokenizer.encode(text)

print("Ultra-fast tokenization (Hugging Face):", output.tokens)


Ultra-fast tokenization (Hugging Face): ['[CLS]', 'there', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'token', '##ization', 'on', 'given', 'text', 'data', '.', '[SEP]']




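Ultra-fast tokenization here refers to libraries that are optimized for tokenizing text at scale. Hugging Face's tokenizers library, which is implemented in Rust, is one such library. The pre-trained bert-base-uncased tokenizer lowercases the text, adds the special tokens [CLS] and [SEP], and breaks words that are not in its vocabulary into subword pieces, such as 'token' and '##ization'.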
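The 'token', '##ization' split in the output comes from WordPiece, the subword scheme used by BERT, which matches the longest vocabulary entry first and marks word-internal pieces with "##". The toy function below is a simplified sketch of that greedy matching in plain Python. The function name wordpiece and the four-entry vocabulary are hypothetical and chosen for illustration; the real library uses a much larger vocabulary and a far faster implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "text", "data"}

print(wordpiece("tokenization", toy_vocab))
print(wordpiece("text", toy_vocab))
print(wordpiece("tokens", toy_vocab))
```

With this toy vocabulary, "tokens" falls back to the unknown token because "##s" is not present, which shows why real vocabularies include many short suffix pieces.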