# Tokenization using NLTK Library in Python
### What is Tokenization in NLP?
Tokenization is the process of breaking up the original raw text into component pieces which are known as tokens. Tokenization is usually the initial step for further NLP operations like stemming, lemmatization, text mining, text classification, sentiment analysis, language translation, chatbot creation, etc.

In almost every NLP project, you will have to tokenize the data.


### What are Tokens?
Tokens are the pieces of the original text produced by tokenization. They are the basic building blocks of text: everything that helps us understand the meaning of the text is derived from the tokens and their relationships to one another. For example, a character is a token in a word, a word is a token in a sentence, and a sentence is a token in a paragraph.

### NLTK Tokenize Package
nltk.tokenize is the package provided by NLTK for tokenization.

To install NLTK and download its data, follow these steps.

- Run pip install nltk
- Then, enter the Python shell in your terminal by typing python
- Type import nltk
- Run nltk.download('all')

### 1. Character Tokenization in Python
Character tokenization is the process of breaking text into a list of characters. This can be achieved quite easily in Python without the need for the NLTK library.

Let us understand this with an example. In the example below, we use a list comprehension to convert the text into a list of characters.

In [17]:
# Example
text="Hello world"
lst=[x for x in text]
print(lst)

['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
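
Because a Python string is already a sequence of characters, the built-in list() constructor should give the same result as the comprehension above. A minimal sketch; the variable names are illustrative only.

```python
# Character tokenization with the built-in list() constructor.
text = "Hello world"

chars_comprehension = [ch for ch in text]  # same approach as the example above
chars_list = list(text)                    # equivalent, and slightly shorter

print(chars_list)
print(chars_comprehension == chars_list)
```

Either form works; list(text) is simply the shorter way to write it.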


### 2. Word Tokenization with NLTK word_tokenize()
Word tokenization is the process of breaking a string into a list of words, also known as tokens. NLTK provides the function word_tokenize() to perform word tokenization.

Let us understand this function with the help of some examples.

In the examples below, we pass a string to word_tokenize(), which tokenizes it into a list of words and punctuation marks.

In [18]:
# example 1
from nltk.tokenize import word_tokenize
text="Hello there! Welcome to the programming world."
print(word_tokenize(text))

['Hello', 'there', '!', 'Welcome', 'to', 'the', 'programming', 'world', '.']


In [19]:
# example 2
from nltk.tokenize import word_tokenize
text="We are learning Natural Language Processing."
print(word_tokenize(text))

['We', 'are', 'learning', 'Natural', 'Language', 'Processing', '.']


In [3]:
# example 3
import nltk
nltk.word_tokenize('My very beautiful girlfriend is pregnant')

['My', 'very', 'beautiful', 'girlfriend', 'is', 'pregnant']

In [20]:
# example 4
nltk.word_tokenize("my own world. Is it real")

['my', 'own', 'world', '.', 'Is', 'it', 'real']

In [21]:
# example 5
nltk.word_tokenize("Last week, the University of Cambridge shared its own research that shows if everyone wears a mask outside home,dreaded ‘second wave’ of the pandemic can be avoided.")

['Last',
 'week',
 ',',
 'the',
 'University',
 'of',
 'Cambridge',
 'shared',
 'its',
 'own',
 'research',
 'that',
 'shows',
 'if',
 'everyone',
 'wears',
 'a',
 'mask',
 'outside',
 'home',
 ',',
 'dreaded',
 '‘',
 'second',
 'wave',
 '’',
 'of',
 'the',
 'pandemic',
 'can',
 'be',
 'avoided',
 '.']
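
To get a feel for what word-level tokenization does, here is a rough standard-library approximation: a regular expression that keeps runs of word characters together and emits each punctuation mark as its own token. This is only a sketch and not how word_tokenize() is implemented; NLTK applies additional rules (for example, it splits a contraction such as "don't" into "do" and "n't"), so results can differ on real text. The helper name simple_word_tokenize is hypothetical.

```python
import re

def simple_word_tokenize(text):
    """Rough approximation: words stay whole, each punctuation mark is its own token."""
    # \w+      -> a run of letters, digits, or underscores
    # [^\w\s]  -> any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Hello there! Welcome to the programming world.")
print(tokens)
```

For this simple sentence the result matches the word_tokenize() output shown in example 1. For text with contractions, abbreviations, or currency amounts, the NLTK function is the better choice.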

### 3. Sentence Tokenization with NLTK sent_tokenize()
Sentence tokenization is the process of breaking a paragraph, or any string containing sentences, into a list of sentences. In NLTK, sentence tokenization can be done using sent_tokenize().

In the examples below, we pass text containing one or more sentences to sent_tokenize(), which tokenizes it into a list of sentences.

In [22]:
# example 1
from nltk.tokenize import sent_tokenize
text="It’s easy to point out someone else’s mistake. Harder to recognize your own."
print(sent_tokenize(text))

['It’s easy to point out someone else’s mistake.', 'Harder to recognize your own.']


In [23]:
# example 2
from nltk.tokenize import sent_tokenize
text="Laughing at our mistakes can lengthen our own life. Laughing at someone else's can shorten it."
print(sent_tokenize(text))

['Laughing at our mistakes can lengthen our own life.', "Laughing at someone else's can shorten it."]


In [24]:
# example 3
nltk.sent_tokenize('The year was 1991. It was a dangerous time. The Country was agitating for political pluralism. Many leaders were arrested and charged with sedition. Among them was Raila Odinga - A son of former vice president')

['The year was 1991.',
 'It was a dangerous time.',
 'The Country was agitating for political pluralism.',
 'Many leaders were arrested and charged with sedition.',
 'Among them was Raila Odinga - A son of former vice president']

In [25]:
# example 4
nltk.sent_tokenize("Last week, the University of Cambridge shared its own research that shows if everyone wears a mask outside home,dreaded ‘second wave’ of the pandemic can be avoided.")

['Last week, the University of Cambridge shared its own research that shows if everyone wears a mask outside home,dreaded ‘second wave’ of the pandemic can be avoided.']

In [26]:
# example 5
nltk.sent_tokenize("""The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """)

['The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives.',
 'Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others.',
 'Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels?',
 'Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves.',
 'This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g.',
 'radio, movies, television, the internet,
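
The output above shows a sentence being split right after "(e.g.", because a period followed by a space looks like a sentence boundary. The sketch below reproduces that failure mode with a deliberately naive rule built on the standard library. It is not the algorithm behind sent_tokenize(), which relies on a trained Punkt model; the helper name naive_sent_tokenize and the sample sentences are assumptions made for illustration.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' whenever whitespace follows (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Plain sentences are split as expected.
simple = naive_sent_tokenize("The year was 1991. It was a dangerous time.")
print(simple)

# An abbreviation ending in a period triggers a false sentence boundary.
with_abbrev = naive_sent_tokenize("Media keep changing (e.g. radio, movies). Research follows.")
print(with_abbrev)
```

The second call returns three pieces instead of two, which is why abbreviation handling is the hard part of sentence tokenization.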

### 4. Whitespace Tokenization with NLTK WhitespaceTokenizer()
The WhitespaceTokenizer() class of NLTK tokenizes a string on whitespace (space, tab, newline). It is an alternative to split().

In the example below, we have passed a sentence to WhitespaceTokenizer() which then tokenizes it based on the whitespace.

In [29]:
# example
from nltk.tokenize import WhitespaceTokenizer
s="Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
Tokenizer=WhitespaceTokenizer()
print(Tokenizer.tokenize(s))

['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
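
The comparison with split() can be checked using only the standard library. The sketch below reuses the sample string from the example above and shows that calling split() with no argument yields the same tokens for this input, while an explicit single-space separator behaves differently.

```python
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# With no argument, split() treats any run of spaces, tabs, or newlines as one separator.
tokens = s.split()
print(tokens)

# With an explicit single-space separator, newlines stay inside tokens and the
# double space after "York." produces an empty string.
tokens_single_space = s.split(" ")
print(tokens_single_space)
```

So split() without arguments is the form that corresponds to whitespace tokenization; split(" ") is not.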


### 5. Word Punctuation Tokenization with NLTK WordPunctTokenizer()
The WordPunctTokenizer() class of NLTK splits a string into runs of word characters and runs of punctuation.
In the example below, we tokenize the string by passing it to the tokenize() method of WordPunctTokenizer().

In [28]:
# example
from nltk.tokenize import WordPunctTokenizer
text="We're moving to L.A.!"
Tokenizer=WordPunctTokenizer()
print(Tokenizer.tokenize(text))

['We', "'", 're', 'moving', 'to', 'L', '.', 'A', '.!']
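
NLTK's documentation describes WordPunctTokenizer as splitting text into alphabetic and non-alphabetic sequences with the regular expression \w+|[^\w\s]+. Assuming that pattern, the tokens above can be reproduced with re.findall() from the standard library, which makes the splitting rule easier to see.

```python
import re

text = "We're moving to L.A.!"

# \w+       -> a run of word characters
# [^\w\s]+  -> a run of punctuation (anything that is neither a word character nor whitespace)
tokens = re.findall(r"\w+|[^\w\s]+", text)
print(tokens)
```

Note that the final '.!' stays together as one token because consecutive punctuation characters are grouped, while the apostrophe in "We're" splits the contraction into three tokens.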


### 6. Removing Punctuation with NLTK RegexpTokenizer()
We can remove punctuation from a sentence by using the RegexpTokenizer() class of NLTK.
In the example below, we use RegexpTokenizer() with the pattern \w+ to convert the text into a list of words that contain no punctuation, and then join them with the .join() method.

In [30]:
# example
from nltk.tokenize import RegexpTokenizer
text="The children - Pierre, Laura, and Ashley - went to the store."
tokenizer = RegexpTokenizer(r"\w+")
lst=tokenizer.tokenize(text)
print(' '.join(lst))

The children Pierre Laura and Ashley went to the store
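
The choice of pattern matters: \w+ also cuts words at apostrophes, so a contraction is broken in two. The sketch below illustrates this with re.findall() from the standard library. The modified sentence and the second pattern are illustrative assumptions, not part of the original example.

```python
import re

text = "The children - Pierre, Laura, and Ashley - didn't go to the store."

# \w+ alone breaks the contraction at the apostrophe.
plain = re.findall(r"\w+", text)

# Allowing an optional apostrophe-plus-letters suffix keeps "didn't" whole.
with_contractions = re.findall(r"\w+(?:'\w+)?", text)

print(" ".join(plain))
print(" ".join(with_contractions))
```

The group is written as non-capturing, (?:...), so that findall() returns whole matches rather than only the captured part.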


### 7. Tokenizing DataFrame Columns using NLTK
Quite often you will need to tokenize the data in a column of a pandas DataFrame. This can be achieved easily by combining the DataFrame's apply() method and a lambda function with the NLTK tokenization functions.

Let us understand this better with the help of an example below.

In [31]:
# example
import pandas as pd
from nltk.tokenize import word_tokenize

df = pd.DataFrame({'Phrases': ['The greatest glory in living lies not in never falling, but in rising every time we fall.',
                              'The way to get started is to quit talking and begin doing.',
                              'If life were predictable it would cease to be life, and be without flavor.',
                              "If you set your goals ridiculously high and it's a failure, you will fail above everyone else's success."]})
# Tokenize each row's phrase and store the resulting token list in a new column.
df['tokenized'] = df.apply(lambda row: word_tokenize(row['Phrases']), axis=1)
df.head()

| | Phrases | tokenized |
|---|---|---|
| 0 | The greatest glory in living lies not in never... | [The, greatest, glory, in, living, lies, not, ... |
| 1 | The way to get started is to quit talking and ... | [The, way, to, get, started, is, to, quit, tal... |
| 2 | If life were predictable it would cease to be ... | [If, life, were, predictable, it, would, cease... |
| 3 | If you set your goals ridiculously high and it... | [If, you, set, your, goals, ridiculously, high... |


### NLTK Tokenize vs Split
The split() method separates a string on a delimiter that you specify. For example, you can use str.split('\t') on a tab-separated file, str.split('\n') when a text file has one sentence per line, or any other specific character. The NLTK tokenize functions do not take an arbitrary delimiter like this. Instead, they apply their own rules, such as separating punctuation from words.

In [32]:
# example
from nltk.tokenize import word_tokenize
text="This is the first line of text.\nThis is the second line of text."
print(text)
print(text.split('\n')) # Splitting the text by '\n'.
print(text.split('\t')) # Splitting the text by '\t' (no tabs here, so nothing is split).
print(text.split('s'))  # Splitting by the character 's'.
print(text.split())  # Splitting the text on any whitespace.
print(word_tokenize(text))    # Tokenizing with word_tokenize().

This is the first line of text.
This is the second line of text.
['This is the first line of text.', 'This is the second line of text.']
['This is the first line of text.\nThis is the second line of text.']
['Thi', ' i', ' the fir', 't line of text.\nThi', ' i', ' the ', 'econd line of text.']
['This', 'is', 'the', 'first', 'line', 'of', 'text.', 'This', 'is', 'the', 'second', 'line', 'of', 'text.']
['This', 'is', 'the', 'first', 'line', 'of', 'text', '.', 'This', 'is', 'the', 'second', 'line', 'of', 'text', '.']
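
The last two outputs differ only in how punctuation is handled: split() leaves the period attached ('text.'), while word_tokenize() returns it as a separate token. If only split() is available, one possible workaround is to strip punctuation from each piece, as in the sketch below. Note that this discards the punctuation instead of keeping it as tokens, so it is not equivalent to word_tokenize().

```python
import string

text = "This is the first line of text.\nThis is the second line of text."

# Split on whitespace, then strip leading and trailing punctuation from every piece.
words = [piece.strip(string.punctuation) for piece in text.split()]
print(words)
```

strip() only removes characters at the ends of each piece, so punctuation inside a word, such as the apostrophe in a contraction, is left untouched.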
