## **Tokenization Techniques**

The choice of tokenization method depends on the nature of the text data and the specific requirements of the NLP task.


### **Whitespace Tokenization:**

Method: The simplest form of tokenization involves splitting a text into tokens based on whitespace (spaces, tabs, and line breaks).


In [None]:
sentence = "Learning Tokenization with ML Archive"
tokens = sentence.split()
print(f"Sentence: {sentence} \nTokens: {tokens}")

Sentence: Learning Tokenization with ML Archive 
Tokens: ['Learning', 'Tokenization', 'with', 'ML', 'Archive']
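
One caveat is easier to see in code than in prose: `split()` with no arguments treats any run of spaces, tabs, or line breaks as a single separator, but punctuation stays attached to the neighbouring word. A minimal sketch, using a sample string that is only an illustration:

```python
# Illustrative sample text (not from the original notebook)
text = "Hello, world!\tTabs and\nnewlines   are separators too."

# split() with no arguments splits on any run of whitespace
tokens = text.split()
print(tokens)
# ['Hello,', 'world!', 'Tabs', 'and', 'newlines', 'are', 'separators', 'too.']
```

Note that `'Hello,'` and `'world!'` keep their punctuation, which is why whitespace tokenization is usually only a first step.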


### **Word Tokenization:**
Method: Breaking down a text into individual words is a common tokenization approach. Punctuation marks are usually treated as separate tokens.

In [None]:
import re

# Sample sentence
sentence = "Word tokenization, where Punctuation marks are treated as separate tokens."

# Tokenize the sentence into words based on spaces and punctuation
tokens = re.findall(r'\b\w+\b|[.,;!?]', sentence)

# Print the result
print("Original Sentence:", sentence)
print("Word Tokens:", tokens)


Original Sentence: Word tokenization, where Punctuation marks are treated as separate tokens.
Word Tokens: ['Word', 'tokenization', ',', 'where', 'Punctuation', 'marks', 'are', 'treated', 'as', 'separate', 'tokens', '.']
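
The regular expression above works for plain words, but it breaks contractions apart because the apostrophe matches neither alternative. The sketch below compares it with one possible variant that keeps apostrophes inside words; the second pattern is an assumption chosen for illustration, not a standard.

```python
import re

sentence = "Don't split contractions, please."

# Pattern from the cell above: "Don't" becomes two tokens and the apostrophe is lost
basic_tokens = re.findall(r"\b\w+\b|[.,;!?]", sentence)
print(basic_tokens)
# ['Don', 't', 'split', 'contractions', ',', 'please', '.']

# Assumed variant: allow apostrophes inside a word, and emit any other
# non-space, non-word character as its own token
contraction_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)
print(contraction_tokens)
# ["Don't", 'split', 'contractions', ',', 'please', '.']
```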


## Tokenization tools

Tokenization can be done with a variety of libraries; which one to use depends on the specific requirements of the natural language processing (NLP) task at hand.

### NLTK Word Tokenize
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.


In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

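nltk.download('punkt', quiet=True)
# Newer NLTK releases look for the 'punkt_tab' resource instead
nltk.download('punkt_tab', quiet=True)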

sentence = "NLTK is a powerful library for natural language processing."
tokens_nltk = word_tokenize(sentence)
print("NLTK Tokens:", tokens_nltk)

NLTK Tokens: ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', '.']


Sentence Tokenization:

- Method: Dividing a text into sentences. This is particularly useful for tasks that need sentence-level context.

In [None]:
# Sample text with multiple sentences
text = "Sentence tokenization is crucial. It helps in breaking text into sentences. This is an example text for demonstration purposes."


# Tokenize the text into sentences using sent_tokenize
sentences = sent_tokenize(text)

# Print the result
print("Original Text:")
print(text)
print("\nTokenized Sentences:")
for i, sentence in enumerate(sentences, start=1):
    print(f"Sentence {i}: {sentence}")


Original Text:
Sentence tokenization is crucial. It helps in breaking text into sentences. This is an example text for demonstration purposes.

Tokenized Sentences:
Sentence 1: Sentence tokenization is crucial.
Sentence 2: It helps in breaking text into sentences.
Sentence 3: This is an example text for demonstration purposes.
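
To see why a dedicated sentence tokenizer is worth using, it helps to try the obvious rule "split after `.`, `!`, or `?`". The sketch below uses only the standard library; the sample text and the regular expression are assumptions made for illustration.

```python
import re

text = "Dr. Smith arrived late. The meeting had started."

# Naive rule: split on whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)
# ['Dr.', 'Smith arrived late.', 'The meeting had started.']
```

The abbreviation "Dr." is wrongly treated as a sentence end. Trained tokenizers such as the Punkt model behind `sent_tokenize` are designed to handle many cases of this kind.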


### spaCy
spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy provides advanced tokenization capabilities.")
tokens_spacy = [token.text for token in doc]
print("spaCy Tokens:", tokens_spacy)


spaCy Tokens: ['spaCy', 'provides', 'advanced', 'tokenization', 'capabilities', '.']


### Keras Tokenizer
Keras is a widely used open-source deep learning library. For tokenization it provides the `text_to_word_sequence` function in the `keras.preprocessing.text` module. By default the function converts the text to lowercase and strips punctuation before splitting it into words, which can save a preprocessing step.

In [None]:
from keras.preprocessing.text import text_to_word_sequence

sentence = "Keras is a high-level neural networks API."
tokens_keras = text_to_word_sequence(sentence)
print("Keras Tokens:", tokens_keras)

Keras Tokens: ['keras', 'is', 'a', 'high', 'level', 'neural', 'networks', 'api']
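
The output above shows two side effects of the defaults: everything is lowercased, and "high-level" is split because the hyphen is filtered out. The following pure-Python sketch approximates that behaviour. The helper name `simple_word_sequence` is hypothetical, and the filter set (ASCII punctuation without the apostrophe, plus tab and newline) is an assumption modelled on the Keras defaults rather than the library's code.

```python
import string

# Assumed filter set: ASCII punctuation except the apostrophe, plus tab and newline
FILTERS = string.punctuation.replace("'", "") + "\t\n"


def simple_word_sequence(text, filters=FILTERS, lower=True):
    """Rough stand-in for text_to_word_sequence (hypothetical helper)."""
    if lower:
        text = text.lower()
    # Replace every filtered character with a space, then split on whitespace
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()


print(simple_word_sequence("Keras is a high-level neural networks API."))
# ['keras', 'is', 'a', 'high', 'level', 'neural', 'networks', 'api']
```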


### Hugging Face Tokenizer

The Hugging Face Hub is an online platform with over 350k models, 75k datasets, and 150k demo apps (Spaces), all open source and publicly available, where people can easily collaborate and build ML together. The `transformers` library loads the tokenizer that belongs to a given model, so text is split the way that model expects.

In [None]:
from transformers import AutoTokenizer

# Replace 'bert-base-uncased' with the desired model name
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example sentence
sentence = "Hugging Face Transformers simplify NLP workflows."

# Tokenize the sentence
tokens_hugging_face = tokenizer(sentence)
print("Hugging Face Tokens:", tokenizer.convert_ids_to_tokens(tokens_hugging_face['input_ids']))


Hugging Face Tokens: ['[CLS]', 'hugging', 'face', 'transformers', 'sim', '##plify', 'nl', '##p', 'work', '##flow', '##s', '.', '[SEP]']
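
The `##` pieces in the output above come from WordPiece, the subword scheme used by BERT models. The sketch below illustrates the greedy longest-match-first idea behind it. The function name `wordpiece_split` and the tiny vocabulary are made up for illustration; this is not the library's implementation, and the real vocabulary has roughly thirty thousand entries.

```python
# Toy vocabulary, invented for this example
TOY_VOCAB = {"sim", "##plify", "work", "##flow", "##s", "nl", "##p"}


def wordpiece_split(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_split("simplify"))   # ['sim', '##plify']
print(wordpiece_split("workflows"))  # ['work', '##flow', '##s']
print(wordpiece_split("xyz"))        # ['[UNK]']
```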


## Text processing

In [None]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Sample sentence
sentence = "Text processing after tokenization involves various tasks."

# Tokenization
tokens = word_tokenize(sentence)


1.   Lowercasing:

In [None]:
# Lowercasing
tokens_lower = [token.lower() for token in tokens]

2.   Stopword Removal:


In [None]:
# Stopword removal (applied to the lowercased tokens, because the NLTK list is lowercase)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens_lower if token not in stop_words]
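
The order of the first two steps matters: a stop word list in lowercase will not match capitalized tokens. The sketch below shows this with a tiny hand-picked stop word set, which is an assumption for illustration only (NLTK's English list is much longer).

```python
# Tiny stop word set, for illustration only
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "after", "in"}

tokens = ["The", "model", "is", "trained", "after", "tokenization", "."]

# Without lowercasing, "The" slips through the filter
without_lowercasing = [t for t in tokens if t not in STOP_WORDS]
print(without_lowercasing)
# ['The', 'model', 'trained', 'tokenization', '.']

# Lowercase first, then filter
with_lowercasing = [t for t in (tok.lower() for tok in tokens) if t not in STOP_WORDS]
print(with_lowercasing)
# ['model', 'trained', 'tokenization', '.']
```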

3. Lemmatization or Stemming:



In [None]:
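# Stemming: strips suffixes by rule, so the result may not be a dictionary word
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

# Lemmatization: maps each word to its dictionary form (treated as a noun by default)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]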

4. Removing Punctuation and Special Characters:


In [None]:
# Removing punctuation
cleaned_tokens = [token for token in lemmatized_tokens if token not in string.punctuation]
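
One subtlety in the check above: `token not in string.punctuation` is a substring test against one long string, so multi-character tokens such as `...` are kept. A possible stricter check, sketched here with a hypothetical helper `is_punctuation`, tests every character of the token instead.

```python
import string

tokens = ["wait", "...", "what", "?!", "--", "."]

# Substring test: only tokens that appear verbatim inside string.punctuation are dropped
substring_check = [t for t in tokens if t not in string.punctuation]
print(substring_check)
# ['wait', '...', 'what', '?!', '--']

PUNCT = set(string.punctuation)


def is_punctuation(token):
    """True if the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)


char_check = [t for t in tokens if not is_punctuation(t)]
print(char_check)
# ['wait', 'what']
```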

5. Handling Numeric Tokens:

In [None]:
# Removing Numeric Tokens
cleaned_tokens = [token for token in cleaned_tokens if not token.isdigit()]
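
`str.isdigit()` only recognizes tokens made entirely of digits, so decimals, numbers with thousands separators, and signed numbers are not removed. The sketch below contrasts it with a regular expression; the pattern and the sample tokens are assumptions for illustration and would need adjusting for other number formats.

```python
import re

tokens = ["In", "2024", "prices", "rose", "3.14", "percent", "to", "1,000", "units", "-5"]

# isdigit() removes only the plain integer
isdigit_filtered = [t for t in tokens if not t.isdigit()]
print(isdigit_filtered)
# ['In', 'prices', 'rose', '3.14', 'percent', 'to', '1,000', 'units', '-5']

# Assumed pattern: optional sign, digits with optional commas, optional decimal part
NUMBER = re.compile(r"[+-]?\d[\d,]*(?:\.\d+)?")

regex_filtered = [t for t in tokens if not NUMBER.fullmatch(t)]
print(regex_filtered)
# ['In', 'prices', 'rose', 'percent', 'to', 'units']
```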

6. Removing HTML Tags or Markup:

In [None]:
from bs4 import BeautifulSoup

def remove_html_tags(text):
    # Use BeautifulSoup to remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    cleaned_text = soup.get_text(separator=' ')
    return cleaned_text

# Sample HTML text
html_text = "<p>This is <b>bold</b> and <i>italic</i> text.</p>"

# Remove HTML tags
cleaned_text = remove_html_tags(html_text)

print("Original HTML Text:", html_text)
print("Cleaned Text (without HTML tags):", cleaned_text)


Original HTML Text: <p>This is <b>bold</b> and <i>italic</i> text.</p>
Cleaned Text (without HTML tags): This is  bold  and  italic  text.
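
The cleaned text above contains doubled spaces, because the separator is inserted between text fragments that already end or start with a space. One simple way to tidy this up is to collapse all whitespace runs afterwards. The sketch below works on the printed output copied as a string literal, so it runs without BeautifulSoup.

```python
# Output of the cell above, copied as a literal
cleaned_text = "This is  bold  and  italic  text."

# split() drops every run of whitespace; join() puts single spaces back
normalized_text = " ".join(cleaned_text.split())
print(normalized_text)
# This is bold and italic text.
```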
