## Tokenization in NLP using Python

### Terms in Tokenization
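- **'punkt'**: Pre-trained Punkt models that NLTK uses to detect sentence boundaries. sent_tokenize uses them directly, and word_tokenize relies on them to split text into sentences before splitting words.
- **'punkt_tab'**: The Punkt data stored in a tab-separated format. Recent NLTK releases load this package instead of the pickled 'punkt' models.
- **sent_tokenize**: Splits text into sentences.
- **word_tokenize**: Splits text into words, with punctuation marks returned as separate tokens.
- **wordpunct_tokenize**: Splits text into runs of word characters and runs of punctuation, so "dog's" becomes 'dog', "'", 's'.
- **TreebankWordTokenizer**: Splits text using Penn Treebank conventions. It expects one sentence at a time, so only a period at the very end of the input is separated.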

In [7]:
import nltk
nltk.download('punkt')      # pre-trained Punkt models for sentence boundary detection
nltk.download('punkt_tab')  # tab-separated version of the Punkt data, used by recent NLTK releases

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Zainab\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Zainab\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [15]:
# paragraph to sentences
from nltk.tokenize import sent_tokenize
# sent_tokenize uses the Punkt model to split text at sentence-ending punctuation such as '.', '!' and '?'
corpus = """
The quick brown fox jumps over the lazy dog.
The cat is eating the mouse.
The dog's is barking at the cat!
The mouse is running away from the dog.
The cat! is chasing the mouse.
"""

sentences = sent_tokenize(corpus)
print(sentences)


['\nThe quick brown fox jumps over the lazy dog.', 'The cat is eating the mouse.', "The dog's is barking at the cat!", 'The mouse is running away from the dog.', 'The cat!', 'is chasing the mouse.']
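The output above shows two details worth noticing: the first sentence keeps the leading newline of the triple-quoted string, and "The cat! is chasing the mouse." is cut in two because the "!" is followed by whitespace. The sketch below imitates that boundary rule with a plain regular expression. It is not the Punkt algorithm, and `naive_sent_split` is a hypothetical helper introduced only for illustration.

```python
import re

corpus = """
The quick brown fox jumps over the lazy dog.
The cat is eating the mouse.
The dog's is barking at the cat!
The mouse is running away from the dog.
The cat! is chasing the mouse.
"""


def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when whitespace follows.

    A rough stand-in for illustration only, not what sent_tokenize does internally.
    """
    # strip() drops the leading and trailing newlines of the triple-quoted string
    return re.split(r"(?<=[.!?])\s+", text.strip())


sentences = naive_sent_split(corpus)
for sentence in sentences:
    print(repr(sentence))
```

On this corpus the naive rule yields the same six pieces, including the stray 'The cat!'. Punkt goes further than this sketch, for example by learning which abbreviations end with a period, so the two approaches will disagree on less regular text.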


In [17]:
# paragraph to words (punctuation marks become separate tokens)
from nltk.tokenize import word_tokenize
word_tokenize(corpus)

['The',
 'quick',
 'brown',
 'fox',
 'jumps',
 'over',
 'the',
 'lazy',
 'dog',
 '.',
 'The',
 'cat',
 'is',
 'eating',
 'the',
 'mouse',
 '.',
 'The',
 'dog',
 "'s",
 'is',
 'barking',
 'at',
 'the',
 'cat',
 '!',
 'The',
 'mouse',
 'is',
 'running',
 'away',
 'from',
 'the',
 'dog',
 '.',
 'The',
 'cat',
 '!',
 'is',
 'chasing',
 'the',
 'mouse',
 '.']

In [18]:
from nltk.tokenize import wordpunct_tokenize
# splits into runs of word characters and runs of punctuation, so "dog's" becomes 'dog', "'", 's'
wordpunct_tokenize(corpus)

['The',
 'quick',
 'brown',
 'fox',
 'jumps',
 'over',
 'the',
 'lazy',
 'dog',
 '.',
 'The',
 'cat',
 'is',
 'eating',
 'the',
 'mouse',
 '.',
 'The',
 'dog',
 "'",
 's',
 'is',
 'barking',
 'at',
 'the',
 'cat',
 '!',
 'The',
 'mouse',
 'is',
 'running',
 'away',
 'from',
 'the',
 'dog',
 '.',
 'The',
 'cat',
 '!',
 'is',
 'chasing',
 'the',
 'mouse',
 '.']
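The split of "dog's" into 'dog', "'", 's' follows from how wordpunct_tokenize works: it is a regular-expression tokenizer. As a minimal sketch, the same behaviour can be reproduced with the standard `re` module using the pattern `\w+|[^\w\s]+`, which is the pattern NLTK's WordPunctTokenizer is built on as far as its source shows. The name `wordpunct_sketch` is a hypothetical helper, not part of NLTK.

```python
import re

# One alternative matches a run of word characters, the other a run of
# characters that are neither word characters nor whitespace (punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_sketch(text):
    """Hypothetical stand-in for wordpunct_tokenize, for illustration only."""
    return WORDPUNCT_PATTERN.findall(text)


tokens = wordpunct_sketch("The dog's is barking at the cat!")
print(tokens)
# ['The', 'dog', "'", 's', 'is', 'barking', 'at', 'the', 'cat', '!']

# Consecutive punctuation characters stay together as one token
print(wordpunct_sketch("Wait...what?!"))
# ['Wait', '...', 'what', '?!']
```

Because the pattern knows nothing about English, it never keeps "'s" together the way word_tokenize does.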

In [19]:
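from nltk.tokenize import TreebankWordTokenizer

# The Treebank tokenizer assumes its input is a single sentence:
# only the final '.' is split off, so 'dog.' and 'mouse.' stay attached mid-text
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)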


['The',
 'quick',
 'brown',
 'fox',
 'jumps',
 'over',
 'the',
 'lazy',
 'dog.',
 'The',
 'cat',
 'is',
 'eating',
 'the',
 'mouse.',
 'The',
 'dog',
 "'s",
 'is',
 'barking',
 'at',
 'the',
 'cat',
 '!',
 'The',
 'mouse',
 'is',
 'running',
 'away',
 'from',
 'the',
 'dog.',
 'The',
 'cat',
 '!',
 'is',
 'chasing',
 'the',
 'mouse',
 '.']
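In the output above, 'dog.' and 'mouse.' keep their periods because TreebankWordTokenizer treats the whole input as one sentence and only detaches a period at the very end. A minimal sketch of one way around this is to tokenize each sentence separately. Here the corpus is split with `splitlines()`, which is an assumption that only works because this corpus has one sentence per line; in general a sentence tokenizer such as sent_tokenize would do that step.

```python
from nltk.tokenize import TreebankWordTokenizer

corpus = """
The quick brown fox jumps over the lazy dog.
The cat is eating the mouse.
The dog's is barking at the cat!
The mouse is running away from the dog.
The cat! is chasing the mouse.
"""

tokenizer = TreebankWordTokenizer()

# Assumption: one sentence per line, so splitlines() stands in for sentence splitting
lines = [line for line in corpus.splitlines() if line.strip()]

# Tokenize line by line so every line-final period is separated
tokens = [token for line in lines for token in tokenizer.tokenize(line)]
print(tokens[:10])
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```

This is also roughly why word_tokenize did not show the problem earlier: it splits the text into sentences first and then applies Treebank-style rules to each one.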