# Text tokenization

In [1]:
! pip install nltk



In [2]:
import nltk
# download the necessary NLTK data

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ariji\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 1. Word tokenization

Word tokenization is a text transformation technique that divides a sentence or a document into individual words or phrases called tokens. The 
simplest way of tokenizing words is splitting text on whitespace characters such as spaces and tabs. Word tokenization is a preprocessing step for 
many NLP tasks, such as text classification, sentiment analysis, and machine translation. For example, in English, we use word tokenization to split 
a sentence like “I love eating pizza” into the individual tokens “I,” “love,” “eating,” and “pizza.” Similarly, in languages such as French or German,
word tokenization can help identify compound words and other language-specific features.
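
To see why whitespace splitting alone is often not enough, the sketch below compares `str.split()` with a small regular expression that also separates punctuation. It uses only the standard library. The pattern and the second sample sentence are illustrative assumptions, not what `nltk.word_tokenize` does internally.

```python
import re

sentence = "I love eating pizza"
# Whitespace splitting is enough when there is no punctuation.
simple_tokens = sentence.split()
print(simple_tokens)

text = "I love eating pizza, don't you?"
# Whitespace splitting leaves punctuation glued to the words.
whitespace_tokens = text.split()
print(whitespace_tokens)

# Illustrative pattern: a word (optionally followed by an apostrophe part),
# or any single character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(regex_tokens)
```

This pattern keeps “don't” as a single token, whereas `word_tokenize` splits contractions into separate tokens (for example “I” and “'m”), so the two approaches are not interchangeable.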


In [3]:
# example 1 : Tokenize a text

text = "This is a sample sentence to be tokenized."
tokens = nltk.word_tokenize(text, language='english')
print(tokens)

['This', 'is', 'a', 'sample', 'sentence', 'to', 'be', 'tokenized', '.']


In [4]:
# example 2 : Tokenize a text

text = """It fills my heart with joy unspeakable to rise in response to the warm and cordial welcome which you have given us. I thank you in the name 
of the most ancient order of monks in the world, I thank you in the name of the mother of religions, and I thank you in the name of millions and 
millions of Hindu people of all classes and sects."""

tokens = nltk.word_tokenize(text, language='english')
print(tokens)

['It', 'fills', 'my', 'heart', 'with', 'joy', 'unspeakable', 'to', 'rise', 'in', 'response', 'to', 'the', 'warm', 'and', 'cordial', 'welcome', 'which', 'you', 'have', 'given', 'us', '.', 'I', 'thank', 'you', 'in', 'the', 'name', 'of', 'the', 'most', 'ancient', 'order', 'of', 'monks', 'in', 'the', 'world', ',', 'I', 'thank', 'you', 'in', 'the', 'name', 'of', 'the', 'mother', 'of', 'religions', ',', 'and', 'I', 'thank', 'you', 'in', 'the', 'name', 'of', 'millions', 'and', 'millions', 'of', 'Hindu', 'people', 'of', 'all', 'classes', 'and', 'sects', '.']


In [5]:
# example 3

import pandas as pd
from nltk.tokenize import word_tokenize
df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
print(df)


   review_id                                               text
0     txt145  The software had a steep learning curve at fir...
1     txt327  I'm really impressed with the user interface o...
2     txt209  The latest update to the software fixed severa...
3     txt825  I encountered a few glitches while using the s...
4     txt878  I was skeptical about trying the software init...
5     txt933  The analytics features have provided us with v...
6     txt718  I appreciate the regular updates that the soft...
7     txt316  I attended a training session for the software...
8     txt247  The software documentation could be more compr...
9     txt515  I've recommended the software to colleagues du...
10    txt913  The software integration with third-party plug...
11    txt341  I'm looking forward to the upcoming release of...
12    txt943  The user community is active and supportive, m...
13    txt688  I've been using the software for a while now, ...
14    txt136  The user interface could u

In [6]:
df['word_tokens'] = df['text'].apply(word_tokenize) 
print(df['word_tokens'])

0     [The, software, had, a, steep, learning, curve...
1     [I, 'm, really, impressed, with, the, user, in...
2     [The, latest, update, to, the, software, fixed...
3     [I, encountered, a, few, glitches, while, usin...
4     [I, was, skeptical, about, trying, the, softwa...
5     [The, analytics, features, have, provided, us,...
6     [I, appreciate, the, regular, updates, that, t...
7     [I, attended, a, training, session, for, the, ...
8     [The, software, documentation, could, be, more...
9     [I, 've, recommended, the, software, to, colle...
10    [The, software, integration, with, third-party...
11    [I, 'm, looking, forward, to, the, upcoming, r...
12    [The, user, community, is, active, and, suppor...
13    [I, 've, been, using, the, software, for, a, w...
14    [The, user, interface, could, use, some, moder...
15    [I, went, for, a, run, and, the, software, did...
Name: word_tokens, dtype: object


### 2. Sentence tokenization

Sentence tokenization is a text transformation technique that involves breaking down a larger text into smaller, individual sentences. We use this
technique in NLP applications, such as sentiment analysis and machine translation, to help better understand the meaning and structure of text. For 
example, a paragraph of text such as “The cat sat on the mat. The dog barked loudly. The sun was shining brightly.” can be divided into
three individual sentences:

“The cat sat on the mat.”

“The dog barked loudly.”

“The sun was shining brightly.”
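
As a rough illustration of the idea, the sketch below splits the paragraph above with a single regular expression from the standard library. The helper name `naive_sent_split` and the second test string are assumptions made for this example. This is not how NLTK's Punkt tokenizer works; it only shows the basic rule and one of its limits.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

paragraph = "The cat sat on the mat. The dog barked loudly. The sun was shining brightly."
sentences = naive_sent_split(paragraph)
print(sentences)

# Abbreviations break the naive rule: "Dr." is treated as a sentence end.
tricky = naive_sent_split("Dr. Smith arrived. He sat down.")
print(tricky)
```

The second call returns three pieces instead of two, which is the kind of case a trained sentence tokenizer such as Punkt is designed to handle better.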


In [5]:
# example 1 : Tokenize a text

# Download the Punkt tokenizer data
nltk.download('punkt')

# Tokenize a text into sentences
text = "This is a sentence. This is another sentence."
sentences = nltk.sent_tokenize(text)

print(sentences)

['This is a sentence.', 'This is another sentence.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ariji\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
# example 2 : Tokenize a text

text = """It fills my heart with joy unspeakable to rise in response to the warm and cordial welcome which you have given us. I thank you in the name 
of the most ancient order of monks in the world, I thank you in the name of the mother of religions, and I thank you in the name of millions and 
millions of Hindu people of all classes and sects."""

sentences = nltk.sent_tokenize(text, language='english')
print(sentences)

['It fills my heart with joy unspeakable to rise in response to the warm and cordial welcome which you have given us.', 'I thank you in the name \nof the most ancient order of monks in the world, I thank you in the name of the mother of religions, and I thank you in the name of millions and \nmillions of Hindu people of all classes and sects.']


In [7]:
# example 3

import pandas as pd
from nltk.tokenize import sent_tokenize

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
print(df)


   review_id                                               text
0     txt145  The software had a steep learning curve at fir...
1     txt327  I'm really impressed with the user interface o...
2     txt209  The latest update to the software fixed severa...
3     txt825  I encountered a few glitches while using the s...
4     txt878  I was skeptical about trying the software init...
5     txt933  The analytics features have provided us with v...
6     txt718  I appreciate the regular updates that the soft...
7     txt316  I attended a training session for the software...
8     txt247  The software documentation could be more compr...
9     txt515  I've recommended the software to colleagues du...
10    txt913  The software integration with third-party plug...
11    txt341  I'm looking forward to the upcoming release of...
12    txt943  The user community is active and supportive, m...
13    txt688  I've been using the software for a while now, ...
14    txt136  The user interface could u

In [8]:
df['sentence_tokens'] = df['text'].apply(sent_tokenize) 
print(df['sentence_tokens'])

0     [The software had a steep learning curve at fi...
1     [I'm really impressed with the user interface ...
2     [The latest update to the software fixed sever...
3     [I encountered a few glitches while using the ...
4     [I was skeptical about trying the software ini...
5     [The analytics features have provided us with ...
6     [I appreciate the regular updates that the sof...
7     [I attended a training session for the softwar...
8     [The software documentation could be more comp...
9     [I've recommended the software to colleagues d...
10    [The software integration with third-party plu...
11    [I'm looking forward to the upcoming release o...
12    [The user community is active and supportive, ...
13    [I've been using the software for a while now,...
14    [The user interface could use some modernizati...
15    [I went for a run and the software did a good ...
Name: sentence_tokens, dtype: object


### 3. Character level tokenization

Character tokenization is a text transformation technique that divides text into individual characters or groups of characters. Unlike tokenization
that splits text into words or phrases, character tokenization treats each character as a separate token. This technique is useful when working with 
languages that do not use spaces between words or when analyzing text at a more granular level. For example, we use character tokenization in Chinese 
or Japanese to break down text into individual characters, which can help analyze the language’s structure and identify specific 
characters or patterns.
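
The sketch below shows both variants mentioned above: single characters and groups of characters (character n-grams). The helper `char_ngrams` and the sample strings are hypothetical and chosen only for illustration.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Individual characters of a text written without spaces
single_chars = list("你好世界")
print(single_chars)

# Groups of two characters (bigrams)
english_bigrams = char_ngrams("hello", 2)
chinese_bigrams = char_ngrams("你好世界", 2)
print(english_bigrams)
print(chinese_bigrams)
```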


In [7]:
text = "Hello, world!"

# Character-level tokenization
tokens = list(text)

print(tokens)

['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']


In [1]:
import pandas as pd

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
print(df)


   review_id                                               text
0     txt145  The software had a steep learning curve at fir...
1     txt327  I'm really impressed with the user interface o...
2     txt209  The latest update to the software fixed severa...
3     txt825  I encountered a few glitches while using the s...
4     txt878  I was skeptical about trying the software init...
5     txt933  The analytics features have provided us with v...
6     txt718  I appreciate the regular updates that the soft...
7     txt316  I attended a training session for the software...
8     txt247  The software documentation could be more compr...
9     txt515  I've recommended the software to colleagues du...
10    txt913  The software integration with third-party plug...
11    txt341  I'm looking forward to the upcoming release of...
12    txt943  The user community is active and supportive, m...
13    txt688  I've been using the software for a while now, ...
14    txt136  The user interface could u

In [2]:
df['character_tokens'] = df['text'].apply(list)
print(df['character_tokens'])

0     [T, h, e,  , s, o, f, t, w, a, r, e,  , h, a, ...
1     [I, ', m,  , r, e, a, l, l, y,  , i, m, p, r, ...
2     [T, h, e,  , l, a, t, e, s, t,  , u, p, d, a, ...
3     [I,  , e, n, c, o, u, n, t, e, r, e, d,  , a, ...
4     [I,  , w, a, s,  , s, k, e, p, t, i, c, a, l, ...
5     [T, h, e,  , a, n, a, l, y, t, i, c, s,  , f, ...
6     [I,  , a, p, p, r, e, c, i, a, t, e,  , t, h, ...
7     [I,  , a, t, t, e, n, d, e, d,  , a,  , t, r, ...
8     [T, h, e,  , s, o, f, t, w, a, r, e,  , d, o, ...
9     [I, ', v, e,  , r, e, c, o, m, m, e, n, d, e, ...
10    [T, h, e,  , s, o, f, t, w, a, r, e,  , i, n, ...
11    [I, ', m,  , l, o, o, k, i, n, g,  , f, o, r, ...
12    [T, h, e,  , u, s, e, r,  , c, o, m, m, u, n, ...
13    [I, ', v, e,  , b, e, e, n,  , u, s, i, n, g, ...
14    [T, h, e,  , u, s, e, r,  , i, n, t, e, r, f, ...
15    [I,  , w, e, n, t,  , f, o, r,  , a,  , r, u, ...
Name: character_tokens, dtype: object


### 4. Syntactic tokenization

Here, each sentence is split into words, and `nltk.pos_tag` then labels every word with its part-of-speech tag.

In [8]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Tokenize text into sentences
text = "The quick brown fox jumps over the lazy dog."
sentences = nltk.sent_tokenize(text)

# Tokenize sentences into words and part-of-speech tags
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    pos_tags = nltk.pos_tag(words)
    print(pos_tags)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ariji\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ariji\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
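
Once tokens carry part-of-speech tags, they can be grouped or filtered by tag. The sketch below works on the tagged list printed above, copied in by hand so that it runs without NLTK. The variable names are assumptions made for this example.

```python
from collections import defaultdict

# Tagged tokens copied from the output above (Penn Treebank tags).
pos_tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
            ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
            ('dog', 'NN'), ('.', '.')]

# Group words by their tag.
words_by_tag = defaultdict(list)
for word, tag in pos_tags:
    words_by_tag[tag].append(word)
print(dict(words_by_tag))

# Keep only tokens whose tag starts with 'NN' (noun tags).
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print(nouns)
```

In this output the tagger labeled “brown” as a noun, so it appears in the noun list alongside “fox” and “dog”.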


### 5. Tokenize and detokenize using treebank

In [15]:
# example 1

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'''
d = TreebankWordDetokenizer()
t = TreebankWordTokenizer()

tokens = t.tokenize(s)
print('here are my tokens :', tokens)

print(d.detokenize(tokens))

here are my tokens : ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.


In [16]:
# example 2

from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!', 'Help', '!', '!']
d = TreebankWordDetokenizer()
print(d.detokenize(tokens))

tokens = ['hello', ',', 'i', "can't", 'feel', ';', 'my', 'feet', '!', 'Help', '!', '!', 'He', 'said', ':', 'Help', ',', 'help', '?', '!']
print(d.detokenize(tokens))


hello, i can't feel my feet! Help!!
hello, i can't feel; my feet! Help!! He said: Help, help?!


### 6. Tokenize using regexp

In [19]:
# examples

from nltk.tokenize import RegexpTokenizer

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

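# 1: match words, dollar amounts, or any other run of non-whitespace characters
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
print(tokenizer.tokenize(s))

# 2: with gaps=True the pattern describes the separators, so this splits on whitespace
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
print(tokenizer.tokenize(s))

# 3: match an uppercase letter followed by one or more word characters
capword_tokenizer = RegexpTokenizer(r'[A-Z]\w+')
print(capword_tokenizer.tokenize(s))

# 4: match a lowercase letter followed by word characters; there is no word
# boundary, so matches can start inside capitalized words ('ood' from 'Good')
smallword_tokenizer = RegexpTokenizer(r'[a-z]\w+')
print(smallword_tokenizer.tokenize(s))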



['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
['Good', 'New', 'York', 'Please', 'Thanks']
['ood', 'muffins', 'cost', 'in', 'ew', 'ork', 'lease', 'buy', 'me', 'two', 'of', 'them', 'hanks']
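
The last output contains fragments such as 'ood' and 'ew' because the pattern `[a-z]\w+` may start matching in the middle of a capitalized word. The sketch below reproduces this with the standard `re` module and shows one possible fix, a `\b` word boundary. Treating `re.split` and `re.findall` as rough stand-ins for `RegexpTokenizer` is an assumption made for illustration.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# Splitting on whitespace, comparable to the gaps=True tokenizer above.
gap_tokens = re.split(r"\s+", s)
print(gap_tokens)

# Without a word boundary, matches can start inside a capitalized word.
fragments = re.findall(r"[a-z]\w+", s)
print(fragments)

# With \b, only words that begin with a lowercase letter are matched.
lowercase_words = re.findall(r"\b[a-z]\w+", s)
print(lowercase_words)
```

Because `\w+` requires at least one more character after the first letter, single-letter words such as “a” would not be matched by either pattern.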
