### Tokenization:
Tokenization is the process of splitting an input document into meaningful chunks; each chunk is called a token. A token is a useful unit for further semantic processing and can be a word, a sentence, a paragraph, or anything else. Let's look at a few examples.

Several popular libraries, such as NLTK and spaCy, provide tokenizers.
We will use NLTK in the examples that follow.
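
To make the idea of token granularity concrete, here is a minimal sketch that uses only the Python standard library. The sample text and the sentence-splitting pattern are assumptions chosen for illustration; real sentence splitting has many more edge cases (abbreviations, quotations, and so on).

```python
import re

# Illustrative sample text (an assumption, not from any dataset)
text = "NLP is fun. It is useful.\n\nTokenization comes first."

# Paragraph-level tokens: split on blank lines
paragraphs = text.split("\n\n")

# Sentence-level tokens: split after ., ! or ? when followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", paragraphs[0])

# Word-level tokens: split on whitespace
words = sentences[0].split()

print(paragraphs)
print(sentences)
print(words)
```

The same text yields different tokens depending on which unit suits the next processing step.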

In [1]:
# from nltk.tokenize import WhitespaceTokenizer
# from nltk.tokenize import WordPunctTokenizer
# from nltk.tokenize import TreebankWordTokenizer
# from nltk.tokenize import RegexpTokenizer

### WhitespaceTokenizer: This tokenizer splits the document on whitespace, so punctuation stays attached to the neighboring word.

In [2]:
from nltk.tokenize import WhitespaceTokenizer

In [3]:
doc1 = "Learning NLP is good, isn't it?"
tokenizer = WhitespaceTokenizer()
tokenizer.tokenize(doc1)

['Learning', 'NLP', 'is', 'good,', "isn't", 'it?']
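
For a short single-line string like this one, the same tokens can be reproduced with Python's built-in `str.split()`, which also splits on runs of whitespace. This is only a sketch for comparison and does not use NLTK:

```python
doc = "Learning NLP is good, isn't it?"

# str.split() with no arguments splits on any run of whitespace
whitespace_tokens = doc.split()
print(whitespace_tokens)
```

Note that punctuation such as the comma in `'good,'` remains part of the token.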

### WordPunctTokenizer: This tokenizer splits the document into runs of alphanumeric characters and runs of punctuation characters.

In [4]:
from nltk.tokenize import WordPunctTokenizer

In [5]:
doc2 = "Learning NLP is good, isn't it?"
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(doc2)

['Learning', 'NLP', 'is', 'good', ',', 'isn', "'", 't', 'it', '?']
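
The NLTK documentation describes `WordPunctTokenizer` as using the pattern `\w+|[^\w\s]+`. As a rough sketch of what that pattern does, assuming plain ASCII input, `re.findall` from the standard library gives the same tokens for this sentence:

```python
import re

doc = "Learning NLP is good, isn't it?"

# \w+       matches a run of letters, digits, or underscores
# [^\w\s]+  matches a run of characters that are neither word characters nor whitespace
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", doc)
print(word_punct_tokens)
```

This explains why `isn't` is broken into `isn`, `'`, and `t`: the apostrophe is punctuation, so it forms its own token.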

### TreebankWordTokenizer: This tokenizer follows the Penn Treebank conventions, separating punctuation and splitting contractions such as "isn't" into "is" and "n't".

In [6]:
from nltk.tokenize import TreebankWordTokenizer

In [7]:
doc3 = "Learning NLP is good, isn't it?"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(doc3)

['Learning', 'NLP', 'is', 'good', ',', 'is', "n't", 'it', '?']
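
The following toy function sketches two of the ideas behind this output: splitting `n't` off a contraction and separating punctuation. The name `toy_treebank_tokenize` and its two rules are assumptions made for illustration; the real `TreebankWordTokenizer` applies a much longer list of rules.

```python
import re

def toy_treebank_tokenize(text):
    """Greatly simplified illustration, not NLTK's actual algorithm."""
    # Split "n't" off the preceding word: "isn't" -> "is n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Surround a few common punctuation marks with spaces
    text = re.sub(r"([,?!.;:])", r" \1 ", text)
    # Finally split on whitespace
    return text.split()

toy_tokens = toy_treebank_tokenize("Learning NLP is good, isn't it?")
print(toy_tokens)
```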

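#### Note: Of the three tokenizers above, TreebankWordTokenizer keeps linguistically meaningful pieces together (for example, it splits "isn't" into "is" and "n't"), which usually makes its output the most suitable for further processing.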

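### RegexpTokenizer: This tokenizer uses a custom regular expression to split the document into tokens. With gaps=True, the pattern matches the separators between tokens rather than the tokens themselves.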

In [8]:
from nltk.tokenize import RegexpTokenizer

In [9]:
# comma_reg_exp = r","
# period_reg_exp = r"\."
# question_mark_reg_exp = r"\?"

In [10]:
comma_reg_exp = r","  # raw string; a comma needs no escaping in a regular expression
doc4 = "Learning NLP is good, very good, very nice."
tokenizer = RegexpTokenizer(comma_reg_exp, gaps=True)
tokenizer.tokenize(doc4)

['Learning NLP is good', ' very good', ' very nice.']

In [11]:
# doc4 is split into three tokens, using the comma as a separator
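
When the pattern is treated as a separator, the behavior is close to `re.split` from the standard library with empty pieces discarded. The sketch below is an approximation for illustration, not NLTK's implementation:

```python
import re

doc = "Learning NLP is good, very good, very nice."

# Split on each comma and drop any empty strings
comma_pieces = [piece for piece in re.split(r",", doc) if piece]
print(comma_pieces)
```

The leading spaces in the second and third pieces are kept because only the comma itself is treated as the separator.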

In [12]:
period_reg_exp = r"\."  # raw string avoids an invalid-escape warning
doc5 = "I am learning NLP. I am studying NLP. You should learn NLP."
tokenizer = RegexpTokenizer(period_reg_exp, gaps=True)
tokenizer.tokenize(doc5)

['I am learning NLP', ' I am studying NLP', ' You should learn NLP']

In [13]:
# doc5 is split into three tokens, using the period as a separator

In [14]:
question_mark_reg_exp = r"\?"  # raw string avoids an invalid-escape warning
doc6 = "Is learning NLP good? Yes, it is good."
tokenizer = RegexpTokenizer(question_mark_reg_exp, gaps=True)
tokenizer.tokenize(doc6)

['Is learning NLP good', ' Yes, it is good.']

In [15]:
# doc6 is split into two tokens, using the question mark as a separator
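
The tokens above keep the space that followed each separator. One way to avoid that is to let the separator pattern absorb trailing whitespace. The sketch below shows the idea with `re.split`; the pattern `\?\s*` is an assumption introduced here, and passing it to `RegexpTokenizer` with `gaps=True` should behave similarly.

```python
import re

doc = "Is learning NLP good? Yes, it is good."

# "\?\s*" matches a question mark plus any whitespace that follows it
clean_pieces = [piece for piece in re.split(r"\?\s*", doc) if piece]
print(clean_pieces)
```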

### Let's take a document corpus to perform tokenization

In [16]:
def get_tokens(corpus, tokenizer):
    corpus_tokens = []
    for doc in corpus:
        doc = doc.lower()
        tokens = tokenizer.tokenize(doc)
        corpus_tokens.append(tokens)
    return corpus_tokens

In [17]:
from nltk.tokenize import TreebankWordTokenizer

In [18]:
corpus = ['This movie is good', 'That movie was bad', 'Not a good movie at all']
tokenizer = TreebankWordTokenizer()

In [19]:
tokens = get_tokens(corpus, tokenizer)

In [20]:
print(tokens)

[['this', 'movie', 'is', 'good'], ['that', 'movie', 'was', 'bad'], ['not', 'a', 'good', 'movie', 'at', 'all']]


In [21]:
# The loop in get_tokens can also be written as a list comprehension
tokens_lc = [tokenizer.tokenize(doc.lower()) for doc in corpus]

In [22]:
print(tokens_lc)

[['this', 'movie', 'is', 'good'], ['that', 'movie', 'was', 'bad'], ['not', 'a', 'good', 'movie', 'at', 'all']]


#### Output Analysis: The output above contains one list of tokens per document. After tokenization, these tokens can be passed to later steps of the analysis pipeline, such as stemming, lemmatization, or stopword removal.
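
As a small illustration of one such later step, the sketch below removes stopwords from the tokenized corpus. The `STOPWORDS` set is a tiny hand-written assumption for this example; in practice a fuller list (for instance the one shipped with NLTK) would normally be used.

```python
# Tokenized corpus, as produced in the cells above
corpus_tokens = [
    ['this', 'movie', 'is', 'good'],
    ['that', 'movie', 'was', 'bad'],
    ['not', 'a', 'good', 'movie', 'at', 'all'],
]

# Hypothetical, deliberately small stopword set for illustration
STOPWORDS = {'this', 'that', 'is', 'was', 'a', 'at', 'all'}

# Keep only the tokens that are not stopwords, document by document
filtered_tokens = [
    [token for token in doc_tokens if token not in STOPWORDS]
    for doc_tokens in corpus_tokens
]
print(filtered_tokens)
```

The word `not` is deliberately left out of the stopword set here, since removing it would change the meaning of the third review.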