# Text Tokenization

Tokenization is the process of dividing a large body of text into smaller units called tokens.

A paragraph is composed of sentences, and each sentence is in turn composed of words.

A paragraph can be segmented at the sentence level (sentence tokenization), and each sentence can then be split into the words it is composed of (word tokenization).
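To make the two levels concrete, here is a minimal sketch that uses only Python's `re` module. The helper names `naive_sentences` and `naive_words` are hypothetical, and the regular expressions are a deliberately simple stand-in, not the algorithm NLTK uses.

```python
import re

def naive_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_words(sentence):
    # A run of word characters is one token; any other non-space character is its own token
    return re.findall(r"\w+|[^\w\s]", sentence)

paragraph = "Tokens are small units. A sentence is made of words!"

# Level 1: paragraph -> sentences; level 2: each sentence -> words
nested = [naive_words(s) for s in naive_sentences(paragraph)]
print(nested)
# [['Tokens', 'are', 'small', 'units', '.'], ['A', 'sentence', 'is', 'made', 'of', 'words', '!']]
```

The result is a list of sentences, each of which is a list of word tokens, which is the same shape the NLTK cells in this notebook produce.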



In [None]:
# Install NLTK if it is not already installed: uncomment the next line and run this cell.
#! pip install nltk

In [2]:
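import nltk
# 'punkt' provides the models used by sent_tokenize and word_tokenize.
# Note: recent NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')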

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Tokenize Words

Word tokenization extracts the individual words, and punctuation marks, from a given text.

In [3]:
# Word Tokenization
from nltk.tokenize import word_tokenize

text = "A quick brown fox jumps over the lazy dogs."
list_of_words = word_tokenize(text)
print(list_of_words)


['A', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs', '.']


## Split Method

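Python's built-in `str.split()` method can also extract individual words from a string. It splits on whitespace only, so any punctuation stays attached to the neighbouring word.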


In [4]:

# split() returns a list, so store it under a name that says so
words = "learn python from me and make your life easy".split()

print("After Split:", words)


After Split: ['learn', 'python', 'from', 'me', 'and', 'make', 'your', 'life', 'easy']
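The difference between `split()` and `word_tokenize` shows up as soon as the text contains punctuation. The sketch below reuses the sentence from the word tokenization cell; the `nltk_tokens` list is copied from that cell's printed output rather than recomputed, so the snippet runs without NLTK.

```python
text = "A quick brown fox jumps over the lazy dogs."

# Whitespace splitting keeps the full stop glued to the last word
split_tokens = text.split()

# Output of word_tokenize(text), copied from the earlier cell
nltk_tokens = ['A', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs', '.']

print(len(split_tokens), len(nltk_tokens))   # 9 10
print(set(split_tokens) - set(nltk_tokens))  # {'dogs.'}
```

For text without punctuation the two approaches give the same tokens, which is why `split()` is often enough for quick experiments.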


## Tokenize Sentence

The sentence tokenizer extracts individual sentences from a given text.

In [5]:
# Sentence Tokenization

from nltk.tokenize import sent_tokenize

text = "Beauty lies in the eyes of the beholder. Do not open your eyes. A thing of beauty is a joy forever."
sentences = sent_tokenize(text)
print(sentences)


['Beauty lies in the eyes of the beholder.', 'Do not open your eyes.', 'A thing of beauty is a joy forever.']
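It may help to see why a dedicated sentence tokenizer is useful. The sketch below splits on sentence-ending punctuation with a plain regular expression; the example text and the pattern are assumptions chosen for illustration. The naive rule breaks after the abbreviation "Dr.", which is the kind of case a trained sentence tokenizer such as the one behind `sent_tokenize` is designed to handle better.

```python
import re

text = "Dr. Smith arrived late. He sat down."

# Naive rule: split wherever '.', '!' or '?' is followed by whitespace
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)
# ['Dr.', 'Smith arrived late.', 'He sat down.']
```

The text contains two sentences, but the naive rule reports three.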


## Tokenize Text Read from a Disk File
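The cells in this section assume that a file named `Test.txt` exists in the working directory. If it does not, the following sketch creates one. The content is copied from the printed output in this section; the one-sentence-per-line layout is inferred from that output and is an assumption about the original file.

```python
from pathlib import Path

# Sample content, copied from the output shown in this section
sample = (
    "This is a NLTK test file.\n"
    "How are you India?\n"
    "Let's go green.\n"
    "Playing in the garden.\n"
    "Roaming in the breeze.\n"
    "Why is the child crying?\n"
)

# Write the file next to the notebook so that open('Test.txt') can find it
Path("Test.txt").write_text(sample, encoding="utf-8")
```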

In [6]:
# Read a local file from disk; the with block closes the file automatically
with open('Test.txt') as f:
    raw = f.read()
print(raw)

This is a NLTK test file.
How are you India?
Let's go green.
Playing in the garden.
Roaming in the breeze.
Why is the child crying?


In [7]:
# Tokenize sentences
sentences = sent_tokenize(raw)
print(sentences)

['This is a NLTK test file.', 'How are you India?', "Let's go green.", 'Playing in the garden.', 'Roaming in the breeze.', 'Why is the child crying?']


In [8]:
# Tokenize the words in each sentence
for sentence in sentences:
    list_of_words = word_tokenize(sentence)
    print(list_of_words)


['This', 'is', 'a', 'NLTK', 'test', 'file', '.']
['How', 'are', 'you', 'India', '?']
['Let', "'s", 'go', 'green', '.']
['Playing', 'in', 'the', 'garden', '.']
['Roaming', 'in', 'the', 'breeze', '.']
['Why', 'is', 'the', 'child', 'crying', '?']


In [9]:
# Read the file line by line; strip() removes the trailing newline
with open('Test.txt', 'r') as f:
    for line in f:
        print(line.strip())

This is a NLTK test file.
How are you India?
Let's go green.
Playing in the garden.
Roaming in the breeze.
Why is the child crying?


## Tokenize a String Read from Keyboard

In [10]:
s = input("Enter some text: ")
# Note: the count includes punctuation tokens such as ',' and '.'
print("You typed", len(nltk.word_tokenize(s)), "words.")

Enter some text: If you miss the train I am on, you will know that I am gone. 
You typed 17 words.


In [11]:
list_of_words = word_tokenize(s)
print(list_of_words)

['If', 'you', 'miss', 'the', 'train', 'I', 'am', 'on', ',', 'you', 'will', 'know', 'that', 'I', 'am', 'gone', '.']
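The token list above contains 17 items, but two of them are punctuation marks. If the goal is to count words only, one simple option is to keep tokens that contain at least one letter or digit. This is a sketch under that assumption; the `tokens` list is copied from the output above so the snippet runs without keyboard input.

```python
# Tokens copied from the output above
tokens = ['If', 'you', 'miss', 'the', 'train', 'I', 'am', 'on', ',',
          'you', 'will', 'know', 'that', 'I', 'am', 'gone', '.']

# Keep a token only if it has at least one alphanumeric character
words_only = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(len(tokens), len(words_only))  # 17 15
```

Checking for any alphanumeric character, rather than calling `isalnum()` on the whole token, keeps tokens such as `'s` that mix letters with an apostrophe.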
