### Tokenization 


Tokenization breaks raw text into smaller units called tokens, typically words or sentences. Tokens are the basic input for most NLP models, and the order in which they appear carries much of the meaning of the text.
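As a rough illustration of the idea, and not of how NLTK works internally, the sketch below tokenizes a string with nothing but the Python standard library. The sample string and the two regular expressions are assumptions chosen for this example; real sentence and word boundaries need more careful rules.

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "Tokens are small chunks. They come from raw text."

# Naive sentence tokens: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", sample)

# Naive word tokens: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", sample)

print(sentences)
print(words)
```

The NLTK functions used in the rest of this section handle many cases that these two patterns miss, such as abbreviations and contractions.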


<br>
<hr>

In [1]:
import nltk 
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import regexp_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
text = "This is my sentence. that will be tokenize."

In [3]:
# word tokenization 
word_tokenize(text)

['This', 'is', 'my', 'sentence', '.', 'that', 'will', 'be', 'tokenize', '.']

In [4]:
# sentence tokenization 
sent_tokenize(text)

['This is my sentence.', 'that will be tokenize.']

In [5]:
len(sent_tokenize(text))

2

In [11]:
# word_tokenize splits a contraction into two tokens
print(word_tokenize("can't"))

['ca', "n't"]


<br>
<hr>
<br>


#### Treebank Tokenization



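The Treebank tokenizer uses regular expressions to tokenize text following the Penn Treebank conventions. word_tokenize() applies a tokenizer of this kind to each sentence after splitting the text with sent_tokenize(), so TreebankWordTokenizer itself assumes that the text has already been segmented into sentences.

This tokenizer performs the following steps:

- split standard contractions, e.g. don't -> do n't and they'll -> they 'll
- treat most punctuation characters as separate tokens
- split off commas and single quotes when followed by whitespace
- separate periods that appear at the end of the line

Because only a period at the end of the input is split off, a period in the middle of unsegmented text stays attached to the preceding word.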
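The contraction step listed above can be sketched with two substitutions. This is a simplified stand-in written for illustration, not the rule set that NLTK actually uses; the function name split_contractions and the list of endings are assumptions.

```python
import re

def split_contractions(sentence):
    """Hypothetical helper: split a few common English contractions."""
    # Insert a space before n't, so "don't" becomes "do n't".
    s = re.sub(r"(?i)(n't)\b", r" \1", sentence)
    # Insert a space before endings such as 'll, 're, 've, 's, 'm and 'd.
    s = re.sub(r"(?i)('(?:ll|re|ve|s|m|d))\b", r" \1", s)
    return s.split()

print(split_contractions("don't"))
print(split_contractions("they'll"))
```

Splitting this way keeps the negation or auxiliary as its own token, which is why word_tokenize("can't") returns 'ca' and "n't" rather than a single word.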

In [14]:
# Treebank tokenizer called directly (no sentence splitting first)
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize('Hello World.'))
print(tokenizer.tokenize(text))

['Hello', 'World', '.']
['This', 'is', 'my', 'sentence.', 'that', 'will', 'be', 'tokenize', '.']


<br>
<hr>
<br>


#### Punctuation tokenization 

WordPunctTokenizer is an alternative word tokenizer that splits all punctuation into separate tokens. It tokenizes a text into a sequence of alphabetic and non-alphabetic character runs, using the regexp \w+|[^\w\s]+.

In [7]:
tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize("Can't is a contraction."))

['Can', "'", 't', 'is', 'a', 'contraction', '.']
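To see what the regexp quoted above does on its own, it can be passed to re.findall from the Python standard library. This is only a sketch for exploring the pattern, assuming that a plain findall is a fair approximation of the tokenizer's behaviour on simple input.

```python
import re

# The pattern quoted in the text: a run of word characters,
# or a run of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_PUNCT.findall("Can't is a contraction.")
print(tokens)
```

The apostrophe is neither a word character nor whitespace, so it becomes a token of its own and the contraction is cut into three pieces.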


<br>
<hr>
<br>

#### Regex Tokenization 

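Regular expressions give direct control over how text is tokenized. NLTK provides the RegexpTokenizer class for this, along with the regexp_tokenize() helper function. The pattern can describe either the tokens themselves or, with gaps=True, the separators between tokens.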

In [8]:
# Regex tokenization 
tokenizer = RegexpTokenizer(r"[\w']+")
print(tokenizer.tokenize("Can't is a contraction."))

["Can't", 'is', 'a', 'contraction']


In [9]:
print(regexp_tokenize("Can't is a contraction.", r"[\w']+"))

["Can't", 'is', 'a', 'contraction']


In [10]:
# gaps=True: the pattern matches the separators, not the tokens
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
print(tokenizer.tokenize("Can't is a contraction."))

["Can't", 'is', 'a', 'contraction.']
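The difference between matching tokens and matching gaps can be reproduced with the standard re module. The sketch below is an approximation for illustration, not NLTK's implementation; the variable names are assumptions.

```python
import re

sentence = "Can't is a contraction."

# Pattern describes the tokens: keep every run of word characters and apostrophes.
tokens_by_match = re.findall(r"[\w']+", sentence)

# Pattern describes the gaps: split on whitespace and drop empty strings.
tokens_by_gap = [t for t in re.split(r"\s+", sentence) if t]

print(tokens_by_match)
print(tokens_by_gap)
```

In the first case the final period is dropped because it does not match the token pattern. In the second it stays attached to the last word because only whitespace counts as a separator.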
