<a href="https://colab.research.google.com/github/gauravthombare/gauravthombare/blob/main/tokenize_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Standard way

In [None]:
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

tokens = text.split()

print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge,', 'library', 'and', 'purpose', 'of', 'modeling.']


Python's split() method does not treat punctuation as a separate token, so it stays attached to the adjacent word (for example 'data.' and 'langauge,' in the output above).
[reference](https://www.kaggle.com/code/satishgunjal/tokenization-in-nlp)
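
If the attached punctuation is unwanted, one option is to strip it from each token after splitting. The following is a minimal sketch using only the standard library; the names `sample`, `raw_tokens` and `clean_tokens` are illustrative and not part of the original notebook.

```python
import string

sample = "We can choose any method based on language, library and purpose of modeling."

# split() with no arguments splits on runs of whitespace,
# so punctuation stays attached to the neighbouring word.
raw_tokens = sample.split()

# Strip leading/trailing punctuation from each token, then drop tokens that became empty.
clean_tokens = [t.strip(string.punctuation) for t in raw_tokens]
clean_tokens = [t for t in clean_tokens if t]

print(raw_tokens[-1])    # modeling.
print(clean_tokens[-1])  # modeling
```

Note that this only removes punctuation at the edges of a token; punctuation inside a token (as in "don't") is left untouched.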

# Using Regex

In [2]:
import re

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
tokens = re.findall(r"[\w]+", text)  # raw string so that \w is passed to the regex engine unchanged
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', 'library', 'and', 'purpose', 'of', 'modeling']


[] :    A set of characters.
\w :    Matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _ character).
+  :    One or more occurrences of the preceding element.
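
Because \w covers only letters, digits and the underscore, every other character acts as a separator. The short sketch below (with a made-up `sample` string) shows the consequences: apostrophes split a word in two, while underscores and digits stay inside a token.

```python
import re

sample = "don't split snake_case or 3D, please"

# The apostrophe and the comma are not word characters, so they separate tokens;
# the underscore and the digit are word characters, so they are kept.
tokens = re.findall(r"\w+", sample)
print(tokens)
# ['don', 't', 'split', 'snake_case', 'or', '3D', 'please']
```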

# Tokenization using NLTK
Natural Language Toolkit (NLTK)

The nltk.tokenize module provides:
word_tokenize() - word tokenization
sent_tokenize() - sentence tokenization
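
To illustrate what sentence tokenization means, here is a naive standard-library sketch. It is not how NLTK works: sent_tokenize() relies on the pre-trained Punkt model, which is designed to cope with cases such as abbreviations. The helper `naive_sent_tokenize` is a hypothetical name introduced only for this example.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

sentences = naive_sent_tokenize(text)
for sentence in sentences:
    print(sentence)

# The naive rule breaks on abbreviations, which is why a trained model is preferable.
print(naive_sent_tokenize("Dr. Smith arrived."))
```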

In [3]:
from nltk.tokenize import word_tokenize

In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
tokens = word_tokenize(text)
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


# Tokenization using spaCy

spaCy is an open-source library for advanced natural language processing, written in Python and Cython.

In [7]:
from spacy.lang.en import English

In [8]:
# Load the English tokenizer.
# The nlp object is used to create a Doc object. English() is a blank pipeline, so only the tokenizer runs here;
# components such as the tagger, parser and NER require a trained pipeline.
nlp = English()

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

# Process the text with the nlp object, which creates a Doc with linguistic annotations and other NLP properties
my_doc = nlp(text)

# The text is now tokenized, but it is stored as a Doc object, so loop over it to build a list of token strings
token_list = []
for token in my_doc:
    token_list.append(token.text)

print(token_list)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


# Tokenization using Keras
Keras is an open-source neural network library written in Python. It is easy to use and has been able to run on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML.
To perform word tokenization we use the text_to_word_sequence() function from the keras.preprocessing.text module.
By default, this function does three things:
Splits words by space (split=" ").
Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n').
Converts text to lowercase (lower=True).
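
The three default steps can be approximated in plain Python. This is a sketch of the described behaviour, not Keras's own code; `simple_word_sequence` and `KERAS_DEFAULT_FILTERS` are hypothetical names introduced for illustration.

```python
# Same characters as the default `filters` argument listed above.
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=KERAS_DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with the separator, then split."""
    if lower:
        text = text.lower()
    # Map every filtered character to the split character.
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Split on the separator and drop the empty strings left by repeated separators.
    return [tok for tok in text.split(split) if tok]

print(simple_word_sequence("Hello, World! Don't panic."))
# ['hello', 'world', "don't", 'panic']
```

The apostrophe is not in the default filter string, so contractions such as "don't" stay in one piece.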

In [9]:
from keras.preprocessing.text import text_to_word_sequence

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

tokens = text_to_word_sequence(text)
print(tokens)

['there', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'we', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', 'library', 'and', 'purpose', 'of', 'modeling']


# Tokenization using Gensim
Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.
We are going to use tokenize() from the gensim.utils module for word tokenization.
For sentence tokenization, Gensim has a separate function, split_sentences(), in the gensim.summarization.textcleaner module (the gensim.summarization package was removed in Gensim 4.0, so this applies to earlier versions only).

In [10]:
from gensim.utils import tokenize

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

tokens = list(tokenize(text))
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', 'library', 'and', 'purpose', 'of', 'modeling']
