# Tokenization
Tokenization is the process of dividing a large body of text into smaller units called tokens.

The Natural Language Toolkit (NLTK) provides the **tokenize** module, which includes two functions used here:

    word_tokenize (word tokenization)
    sent_tokenize (sentence tokenization)

## Tokenization of Words
We use the function **word_tokenize()** to split a sentence into words. The output can be converted to a DataFrame for easier inspection in machine learning applications, or passed on to further text-cleaning steps such as punctuation removal, numeric character removal, or stemming. Machine learning models need numeric data for training and prediction, so word tokenization is a crucial first step in converting text (strings) to numeric data.

In [2]:
# Before tokenizing with NLTK in Jupyter, download 'punkt' (the Punkt sentence tokenizer models)
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\IT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [4]:
from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))
# The text is passed to word_tokenize and the result is printed.
# Each word becomes a token, and each punctuation mark is returned as a separate token, as the output shows.

['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
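
As a minimal sketch of the cleaning step mentioned above, the token list can be filtered so that only alphabetic tokens remain. The tokens are copied from the output above so the snippet runs without NLTK; the name `clean_tokens` and the choice to lowercase are illustrative assumptions, not part of the original notebook.

```python
# Tokens copied from the word_tokenize output above
tokens = ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']

# Keep purely alphabetic tokens (this drops punctuation and numbers)
# and lowercase them so 'Great' and 'great' count as the same word
clean_tokens = [tok.lower() for tok in tokens if tok.isalpha()]

print(clean_tokens)
```

Whether to lowercase depends on the task; for example, case can matter when proper nouns need to be recognised later.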


## Tokenization of Sentences
The corresponding function for sentences is **sent_tokenize()**.
To compute the average number of words per sentence, for example, you need both sentence tokenization and word tokenization to form the ratio. Because the result is numeric, it can serve directly as a feature for machine learning.

In [7]:
from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))
# sent_tokenize splits the text at sentence boundaries and returns one string per sentence, as the output shows.

['God is Great!', 'I won a lottery.']


In [8]:
# For the same input we get 9 word tokens (7 words plus 2 punctuation marks) and 2 sentences.
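
The average-words-per-sentence feature described above could be computed roughly as follows. This is a sketch that uses only the standard library: the sentences are copied from the `sent_tokenize` output, and a regular expression stands in for `word_tokenize` plus punctuation filtering (an assumption made so the snippet runs without NLTK data).

```python
import re

# Sentences copied from the sent_tokenize output above
sentences = ['God is Great!', 'I won a lottery.']

# Count runs of word characters so punctuation is not counted as a word
words_per_sentence = [len(re.findall(r"\w+", s)) for s in sentences]

# The ratio of total words to number of sentences
average = sum(words_per_sentence) / len(sentences)

print(words_per_sentence, average)
```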

# POS (Part-Of-Speech) Tagging & Chunking with NLTK

## POS Tagging
Part-of-speech (POS) tagging reads text in a language and assigns a part-of-speech tag to each word.

    Input: Everything to permit us.

    Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
    
Every tag above has a meaning that can be looked up in the tag table (the Penn Treebank tagset) used by NLTK. 
In the example above:

    NN means noun, 
    TO is the infinitive marker (to), 
    VB means verb (base form), 
    PRP means personal pronoun (hers, herself, him, himself)

In [10]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\IT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [26]:
from nltk import pos_tag
from nltk import RegexpParser
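# Split the sentence on whitespace to get a list of words
text = "learn python from GL and make study easy".split()
print("After Split:",text)
# pos_tag returns a list of (word, tag) pairs
tokens_tag = pos_tag(text)
print("After Token:",tokens_tag)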

After Split: ['learn', 'python', 'from', 'GL', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('python', 'NN'), ('from', 'IN'), ('GL', 'NNP'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]


## Chunking
Chunking adds more structure to a sentence by grouping words on top of part-of-speech (POS) tagging. It is also known as shallow parsing. The resulting groups of words are called "chunks."

The primary use of chunking is to group words into "noun phrases." Chunk rules are written by combining part-of-speech tags with regular expressions.
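
A small sketch of noun-phrase chunking is shown here. The sentence is tagged by hand (an assumption, so that no tagger model has to be downloaded), and the grammar name `NP` and the variable names are illustrative.

```python
from nltk import RegexpParser

# Hand-tagged tokens, so no tagger model needs to be downloaded
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# NP: an optional determiner, any number of adjectives, then a noun
np_chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = np_chunker.parse(tagged)

# Collect the words of every NP subtree in the parsed tree
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]

print(noun_phrases)
```

Words that match no rule, such as "barked" and "at" here, stay directly under the root of the tree rather than inside a chunk.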


For example, suppose you want to chunk nouns, past-tense verbs, adjectives, and coordinating conjunctions in a sentence. You can use a rule like the one below:

    chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
Here:

    "." means any character except a newline
    "*" means match 0 or more repetitions
    "?" means match 0 or 1 repetitions
    

In [27]:
# EARLIER OUTPUT:
#[('learn', 'JJ'), ('python', 'NN'), ('from', 'IN'), ('GL', 'NNP'), 
#('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
patterns= """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:",chunker)
output = chunker.parse(tokens_tag)
print("After Chunking",output)

After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
  (mychunk learn/JJ)
  (mychunk python/NN)
  from/IN
  (mychunk GL/NNP and/CC)
  make/VB
  (mychunk study/NN easy/JJ))


In [19]:
# Conclusion from the above example: "make" is tagged VB, which the rule does not include (it only covers VBD, past-tense verbs), so it is not placed in a mychunk.
# If "make" were changed to "made" and the tagger labelled it VBD, it would be included in a mychunk.

In [31]:
output.draw() 
# Draws the chunk tree graphically (opens a separate window), showing each mychunk as a subtree

# Stemming and Lemmatization with Python NLTK

Stemming is a kind of normalization for words. Normalization is a technique in which different forms of a word are reduced to a common form so that they can be looked up and compared as one. Words that share a meaning but vary with context or sentence are normalized to the same form.

For example, the root word is "eat" and its variations include "eats", "eating", "eaten", and so on.

Example: 

    He was riding.
    He was taking the ride.

In the above two sentences, the meaning is the same. For a machine, however, the two sentences are different, so it is hard to map them to the same data row.

If the same meaning is not represented by the same data, the model may fail to predict correctly. It is therefore necessary to reduce each word to a common form when preparing a dataset for machine learning. Stemming does this by grouping related words under their root word.

In [33]:
from nltk.stem import PorterStemmer
e_words= ["wait", "waiting", "waited", "waits"]
ps =PorterStemmer()  # An object is created which belongs to class nltk.stem.porter.PorterStemmer.
for w in e_words:
    rootWord=ps.stem(w)
    print(rootWord)

wait
wait
wait
wait


We can conclude that stemming is an important preprocessing step because it removes redundancy in the data by collapsing variations of the same word. The resulting data is cleaner, which helps when training a model.
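
To make the idea of collapsing variations concrete, the sketch below groups words by their Porter stem. The word list and the name `groups` are illustrative assumptions; `PorterStemmer` needs no extra data download.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

words = ["wait", "waiting", "waited", "waits",
         "studies", "studying", "cries", "cry"]
ps = PorterStemmer()

# Map each stem to the list of original words that reduce to it
groups = defaultdict(list)
for w in words:
    groups[ps.stem(w)].append(w)

print(dict(groups))
```

Eight surface forms collapse into three groups, and this is the redundancy reduction that stemming provides. Note that a stem such as "studi" is not necessarily a dictionary word.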

In [34]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
sentence="Hello GLguru, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)
ps = PorterStemmer()
for w in words:
    rootWord=ps.stem(w)
    print(rootWord)

hello
glguru
,
you
have
to
build
a
veri
good
site
and
I
love
visit
your
site
.


Stemming is a data-preprocessing step. The English language has many variations of a single word, and these variations create ambiguity during machine learning training and prediction. To build a successful model, it is vital to reduce such words to a common form using stemming. This removal of redundant variation from a set of sentences is also known as normalization.

# Lemmatization

Lemmatization is the algorithmic process of finding the lemma of a word based on its meaning. The lemma is the base or dictionary form of the word.
Text preprocessing can include both stemming and lemmatization, but the two are not the same. Lemmatization is often preferred over stemming for the reason below.

## Difference between Stemming and Lemmatization
A stemming algorithm works by cutting the suffix from the word. In a broader sense, it cuts off either the beginning or the end of the word.

Lemmatization, by contrast, is a more powerful operation that takes the morphological analysis of words into consideration. It returns the lemma, which is the base form shared by all of a word's inflectional forms.

In [36]:
#Stemming code

import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer  = PorterStemmer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Stemming for {} is {}".format(w,porter_stemmer.stem(w)))  

Stemming for studies is studi
Stemming for studying is studi
Stemming for cries is cri
Stemming for cry is cri


In [39]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\IT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [40]:
#Lemmatization code

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))  

Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry


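Comparing the two outputs, the stemmer gives the same result (studi) for both studies and studying, whereas the lemmatizer returns study for studies and studying for studying. Note that lemmatize() treats each token as a noun unless a part of speech is passed through its pos argument, which is why studying is left unchanged here; with pos='v' it is reduced to study. Because lemmas are real dictionary words, lemmatization is often the better choice when building a feature set for training.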

# Removing Stop Words

In [42]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\IT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [45]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []

for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)

['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
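
In the output above, 'and', 'no' and 'a' are removed but 'All' survives, presumably because the stop list holds lowercase entries and the comparison is case-sensitive. The sketch below compares the lowercased token instead. The tiny `stop_words` set is a hypothetical stand-in for `stopwords.words('english')`, and the tokens are typed in by hand so the snippet runs without NLTK data.

```python
# Tokens of the first sentence, as word_tokenize would return them
tokens = ['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', '.']

# Hypothetical miniature stop list standing in for stopwords.words('english')
stop_words = {'all', 'and', 'no', 'a'}

# Compare the lowercased token so 'All' is filtered just like 'all'
filtered = [tok for tok in tokens if tok.lower() not in stop_words]

print(filtered)
```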


# Bag of Words
Bag of Words (BOW) is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.

At a high level, BOW involves the following steps:

    Clean Text
    Tokenize
    Build Vocab 
    Generate vectors
Generated vectors can be input to your machine learning algorithm.
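
The four steps can be sketched end to end on two toy documents. This is an illustration under stated assumptions: the documents are invented, whitespace splitting stands in for a real tokenizer, and only the standard library is used.

```python
import re

docs = ["The cat sat.", "The dog sat down!"]

# 1. Clean text: lowercase and replace non-word characters with spaces
cleaned = [re.sub(r"\W+", " ", d.lower()).strip() for d in docs]

# 2. Tokenize: split on whitespace (a simple stand-in for word_tokenize)
tokenized = [d.split() for d in cleaned]

# 3. Build vocab: unique words in first-seen order
vocab = list(dict.fromkeys(w for doc in tokenized for w in doc))

# 4. Generate vectors: 1 if the vocab word occurs in the document, else 0
vectors = []
for doc in tokenized:
    present = set(doc)
    vectors.append([int(w in present) for w in vocab])

print(vocab)
print(vectors)
```

Each document becomes a fixed-length vector with one position per vocabulary word, regardless of the original word order.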

In [47]:
#Sample Paragraph
text = """Beans. I was trying to explain to somebody as we were flying in, that’s corn. That’s beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn’t speak at the commencement."""

In [48]:
text

'Beans. I was trying to explain to somebody as we were flying in, that’s corn. That’s beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn’t speak at the commen

### Step #1 : We will first preprocess the data, in order to:

    Convert text to lower case.
    Replace all non-word characters, including punctuation, with spaces.
    Collapse repeated whitespace into a single space.

In [49]:
import nltk 
import re 
import numpy as np 
  
dataset = nltk.sent_tokenize(text) 
for i in range(len(dataset)): 
    dataset[i] = dataset[i].lower() 
    dataset[i] = re.sub(r'\W', ' ', dataset[i])    # replace non-word characters (including punctuation) with spaces
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])   # collapse runs of whitespace into a single space

In [50]:
dataset

['beans ',
 'i was trying to explain to somebody as we were flying in that s corn ',
 'that s beans ',
 'and they were very impressed at my agricultural knowledge ',
 'please give it up for amaury once again for that outstanding introduction ',
 'i have a bunch of good friends here today including somebody who i served with who is one of the finest senators in the country and we re lucky to have him your senator dick durbin is here ',
 'i also noticed by the way former governor edgar here who i haven t seen in a long time and somehow he has not aged and i have ',
 'and it s great to see you governor ',
 'i want to thank president killeen and everybody at the u of i system for making it possible for me to be here today ',
 'and i am deeply honored at the paul douglas award that is being given to me ',
 'he is somebody who set the path for so much outstanding public service here in illinois ',
 'now i want to start by addressing the elephant in the room ',
 'i know people are still wonde
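
To see what each preprocessing call does, here is the same sequence applied to a single sentence from the sample text. The intermediate variable names are illustrative, and the final `strip()` is an optional extra not used in the cell above.

```python
import re

sentence = "That’s beans."

lowered = sentence.lower()
# The apostrophe and the full stop are non-word characters, so each becomes a space
no_nonword = re.sub(r'\W', ' ', lowered)
# Any run of whitespace is collapsed into a single space
collapsed = re.sub(r'\s+', ' ', no_nonword)

print(repr(collapsed))          # 'that s beans '
print(repr(collapsed.strip()))  # 'that s beans'
```

This explains the stray 's' and 't' tokens and the trailing spaces in the output above: contractions are split at the apostrophe, and the final punctuation mark turns into a space.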

### Step #2 : Obtaining most frequent words in our text.

We will apply the following steps to generate our model.

    We declare a dictionary to hold our bag of words.
    Next we tokenize each sentence to words.
    Now for each word in sentence, we check if the word exists in our dictionary.
    If it does, then we increment its count by 1. If it doesn’t, we add it to our dictionary and set its count as 1.

In [51]:
# Counting how often each word occurs (the basis of the Bag of Words model)
word2count = {} 
for data in dataset: 
    words = nltk.word_tokenize(data) 
    for word in words: 
        # Start unseen words at 0, then add 1 for this occurrence
        word2count[word] = word2count.get(word, 0) + 1

In [52]:
word2count

{'beans': 2,
 'i': 12,
 'was': 1,
 'trying': 1,
 'to': 8,
 'explain': 1,
 'somebody': 3,
 'as': 1,
 'we': 2,
 'were': 2,
 'flying': 1,
 'in': 5,
 'that': 4,
 's': 3,
 'corn': 1,
 'and': 7,
 'they': 1,
 'very': 1,
 'impressed': 1,
 'at': 4,
 'my': 1,
 'agricultural': 1,
 'knowledge': 1,
 'please': 1,
 'give': 1,
 'it': 3,
 'up': 1,
 'for': 5,
 'amaury': 1,
 'once': 1,
 'again': 1,
 'outstanding': 2,
 'introduction': 1,
 'have': 3,
 'a': 2,
 'bunch': 1,
 'of': 3,
 'good': 1,
 'friends': 1,
 'here': 5,
 'today': 2,
 'including': 1,
 'who': 4,
 'served': 1,
 'with': 1,
 'is': 4,
 'one': 1,
 'the': 9,
 'finest': 1,
 'senators': 1,
 'country': 1,
 're': 1,
 'lucky': 1,
 'him': 1,
 'your': 1,
 'senator': 1,
 'dick': 1,
 'durbin': 1,
 'also': 1,
 'noticed': 1,
 'by': 2,
 'way': 1,
 'former': 1,
 'governor': 2,
 'edgar': 1,
 'haven': 1,
 't': 2,
 'seen': 1,
 'long': 1,
 'time': 1,
 'somehow': 1,
 'he': 2,
 'has': 1,
 'not': 1,
 'aged': 1,
 'great': 1,
 'see': 1,
 'you': 1,
 'want': 2,
 'thank':

In our model, we have a total of 118 words. However, when processing large texts, the vocabulary can grow to millions of words, and we do not need all of them. Hence, we keep only a fixed number of the most frequent words. To implement this we use:

In [55]:
import heapq 
freq_words = heapq.nlargest(100, word2count, key=word2count.get) 
# 100 is the number of most frequent words to keep; for a larger text, use a larger number.

In [56]:
len(freq_words)

100
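
As an alternative sketch, `collections.Counter` from the standard library can handle both the counting and the top-N selection. The three short sentences and the whitespace split are assumptions made to keep the example self-contained.

```python
from collections import Counter
import heapq

sentences = ["i want to thank you", "i want to start", "i know"]

# Counter does the "add or increment" bookkeeping in one step
word2count = Counter(word for s in sentences for word in s.split())

# Two ways to get the three most frequent words
top3_counter = [word for word, count in word2count.most_common(3)]
top3_heapq = heapq.nlargest(3, word2count, key=word2count.get)

print(top3_counter)
print(top3_heapq)
```

Both approaches select the same words; `most_common` also returns the counts, which can be handy for inspection.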

### Step #3 : Building the Bag of Words model

In this step we construct one vector per sentence that records which of the frequent words the sentence contains. For each frequent word, the corresponding entry is 1 if the word occurs in the sentence and 0 otherwise.
This can be implemented with the following code:

In [57]:
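X = [] 
for data in dataset: 
    # Tokenize each sentence once and keep its distinct words for fast lookup
    sentence_words = set(nltk.word_tokenize(data))
    # 1 if the frequent word occurs in this sentence, otherwise 0
    vector = [1 if word in sentence_words else 0 for word in freq_words]
    X.append(vector) 
X = np.asarray(X)   # shape: (number of sentences, number of frequent words)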

In [60]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0]])