## ***Tokenization***

---

In [1]:
text = "APIs for manipulating documents loaded into the browser. The most obvious example is the DOM (Document Object Model) API, which allows you to manipulate HTML and CSS — creating, removing and changing HTML, dynamically applying new styles to your page, etc. Every time you see a popup window appear on a page or some new content displayed, for example, that's the DOM in action. Find out more about these types of API in Manipulating documents."

> **Using the built-in split() method**

In [2]:
# Using the built-in split method

words = text.split()
print(words)
print('Number of Words :', len(words),'\n')

sentences = text.split('.')
print(sentences)
print('Number of Sentences : ',len(sentences),'\n')

['APIs', 'for', 'manipulating', 'documents', 'loaded', 'into', 'the', 'browser.', 'The', 'most', 'obvious', 'example', 'is', 'the', 'DOM', '(Document', 'Object', 'Model)', 'API,', 'which', 'allows', 'you', 'to', 'manipulate', 'HTML', 'and', 'CSS', '—', 'creating,', 'removing', 'and', 'changing', 'HTML,', 'dynamically', 'applying', 'new', 'styles', 'to', 'your', 'page,', 'etc.', 'Every', 'time', 'you', 'see', 'a', 'popup', 'window', 'appear', 'on', 'a', 'page', 'or', 'some', 'new', 'content', 'displayed,', 'for', 'example,', "that's", 'the', 'DOM', 'in', 'action.', 'Find', 'out', 'more', 'about', 'these', 'types', 'of', 'API', 'in', 'Manipulating', 'documents.']
Number of Words : 75 

['APIs for manipulating documents loaded into the browser', ' The most obvious example is the DOM (Document Object Model) API, which allows you to manipulate HTML and CSS — creating, removing and changing HTML, dynamically applying new styles to your page, etc', " Every time you see a popup window appear o
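The output above shows two side effects of `text.split('.')`: every piece after the first keeps its leading space, and because the text ends with a period the last list element is an empty string, which inflates the sentence count by one. A minimal sketch of one way to tidy this up, using a short made-up `sample` string (an assumption for illustration, not the original `text`):

```python
# Hypothetical sample text, used only for illustration
sample = "Tokenization splits text. It has several variants. Each one differs."

raw_pieces = sample.split('.')

# Strip surrounding whitespace and drop the empty string left after the final period
clean_sentences = [piece.strip() for piece in raw_pieces if piece.strip()]

print(raw_pieces)
print('Raw pieces :', len(raw_pieces))
print('Clean sentences :', len(clean_sentences))
```

Splitting on every period would still cut a sentence after a mid-sentence abbreviation, so this remains a rough approach.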

> **Using RegEx module**

In [3]:
# Using RegEx module

import re

# Raw strings keep the regex escape \w intact; the apostrophe in the class keeps contractions such as "that's" whole
words = re.findall(r"[\w']+", text)
print(words)
print('Number of Words :',len(words),'\n')

sentences = re.compile('[$.?!] ').split(text)
print(sentences)
print('Number of Sentences :',len(sentences),'\n')

['APIs', 'for', 'manipulating', 'documents', 'loaded', 'into', 'the', 'browser', 'The', 'most', 'obvious', 'example', 'is', 'the', 'DOM', 'Document', 'Object', 'Model', 'API', 'which', 'allows', 'you', 'to', 'manipulate', 'HTML', 'and', 'CSS', 'creating', 'removing', 'and', 'changing', 'HTML', 'dynamically', 'applying', 'new', 'styles', 'to', 'your', 'page', 'etc', 'Every', 'time', 'you', 'see', 'a', 'popup', 'window', 'appear', 'on', 'a', 'page', 'or', 'some', 'new', 'content', 'displayed', 'for', 'example', "that's", 'the', 'DOM', 'in', 'action', 'Find', 'out', 'more', 'about', 'these', 'types', 'of', 'API', 'in', 'Manipulating', 'documents']
Number of Words : 74 

['APIs for manipulating documents loaded into the browser', 'The most obvious example is the DOM (Document Object Model) API, which allows you to manipulate HTML and CSS — creating, removing and changing HTML, dynamically applying new styles to your page, etc', "Every time you see a popup window appear on a page or some ne
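The pattern `[\w']+` includes the apostrophe in its character class, which is why `that's` survives as a single token in the word list above. A small sketch comparing it with plain `\w+`, run on a made-up `sample` sentence (an assumption for illustration):

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "That's the DOM in action, isn't it?"

# Apostrophe allowed inside a token: contractions stay whole
with_apostrophe = re.findall(r"[\w']+", sample)

# Word characters only: contractions are cut at the apostrophe
without_apostrophe = re.findall(r"\w+", sample)

print(with_apostrophe)
print(without_apostrophe)
```

Neither pattern keeps punctuation such as the dash or the parentheses, which explains why this word count is lower than the one from `split()`.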

> **Using NLTK module**

In [4]:
import nltk

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
# Using NLTK module

from nltk.tokenize import word_tokenize, sent_tokenize
words = word_tokenize(text)
print(words)
print('Number of Words :',len(words),'\n')

sentences = sent_tokenize(text)
print(sentences)
print('Number of Sentences :',len(sentences),'\n')

['APIs', 'for', 'manipulating', 'documents', 'loaded', 'into', 'the', 'browser', '.', 'The', 'most', 'obvious', 'example', 'is', 'the', 'DOM', '(', 'Document', 'Object', 'Model', ')', 'API', ',', 'which', 'allows', 'you', 'to', 'manipulate', 'HTML', 'and', 'CSS', '—', 'creating', ',', 'removing', 'and', 'changing', 'HTML', ',', 'dynamically', 'applying', 'new', 'styles', 'to', 'your', 'page', ',', 'etc', '.', 'Every', 'time', 'you', 'see', 'a', 'popup', 'window', 'appear', 'on', 'a', 'page', 'or', 'some', 'new', 'content', 'displayed', ',', 'for', 'example', ',', 'that', "'s", 'the', 'DOM', 'in', 'action', '.', 'Find', 'out', 'more', 'about', 'these', 'types', 'of', 'API', 'in', 'Manipulating', 'documents', '.']
Number of Words : 88 

['APIs for manipulating documents loaded into the browser.', 'The most obvious example is the DOM (Document Object Model) API, which allows you to manipulate HTML and CSS — creating, removing and changing HTML, dynamically applying new styles to your page

## ***Stemming***


***



In [7]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

porter = PorterStemmer()

def stemmer(para):
    stemmed_lst=[]
    tokenLst = word_tokenize(para)
    for token in tokenLst:
        stemmed_lst.append(porter.stem(token))
    return ' '.join(stemmed_lst)

stemmed_sentence = stemmer(text)

# Count is the character length of each string (len of a str), not the number of tokens
print('Original Text: \n{0}\nCount:{1}\n'.format(text,len(text)))
print('Stemmed Sentence: \n{0}\nCount:{1}\n'.format(stemmed_sentence,len(stemmed_sentence)))

Original Text: 
APIs for manipulating documents loaded into the browser. The most obvious example is the DOM (Document Object Model) API, which allows you to manipulate HTML and CSS — creating, removing and changing HTML, dynamically applying new styles to your page, etc. Every time you see a popup window appear on a page or some new content displayed, for example, that's the DOM in action. Find out more about these types of API in Manipulating documents.
Count:443

Stemmed Sentence: 
api for manipul document load into the browser . the most obviou exampl is the dom ( document object model ) api , which allow you to manipul html and css — creat , remov and chang html , dynam appli new style to your page , etc . everi time you see a popup window appear on a page or some new content display , for exampl , that 's the dom in action . find out more about these type of api in manipul document .
Count:412
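The stemmed output shows that Porter stems are truncated forms and need not be dictionary words (`obviou`, `manipul`), and that the stemmer lowercases its input. A minimal sketch that stems a few words from the text one at a time; it relies only on `PorterStemmer`, so it should need the `nltk` package but no downloaded resources:

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Words taken from the original text
sample_words = ["manipulating", "documents", "obvious", "APIs"]

# Map each word to its Porter stem
stems = {word: porter.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(word, '->', stem)
```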



## ***Lemmatization***


***

In [8]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [11]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [12]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

wordnet_lemmat = WordNetLemmatizer()

def lemmatizer(para):
    lemmatLst = []
    tokenLst = word_tokenize(para)
    for token in tokenLst:
        lemmatLst.append(wordnet_lemmat.lemmatize(token))
    return ' '.join(lemmatLst)

lemmatized_sentence = lemmatizer(text)

# Count is the character length of each string (len of a str), not the number of tokens
print('Original Text: \n{0}\nCount:{1}\n'.format(text,len(text)))
print('Lemmatized Sentence : \n{0}\nCount:{1}\n'.format(lemmatized_sentence,len(lemmatized_sentence)))

Original Text: 
APIs for manipulating documents loaded into the browser. The most obvious example is the DOM (Document Object Model) API, which allows you to manipulate HTML and CSS — creating, removing and changing HTML, dynamically applying new styles to your page, etc. Every time you see a popup window appear on a page or some new content displayed, for example, that's the DOM in action. Find out more about these types of API in Manipulating documents.
Count:443

Lemmatized Sentence : 
APIs for manipulating document loaded into the browser . The most obvious example is the DOM ( Document Object Model ) API , which allows you to manipulate HTML and CSS — creating , removing and changing HTML , dynamically applying new style to your page , etc . Every time you see a popup window appear on a page or some new content displayed , for example , that 's the DOM in action . Find out more about these type of API in Manipulating document .
Count:452



In [14]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

wordnet_lemmat = WordNetLemmatizer()

def lemmatizer(para):
    # Print only the lemmas that differ from their original token
    for token in word_tokenize(para):
        lemma = wordnet_lemmat.lemmatize(token)
        if token != lemma:
            print(lemma)

lemmatizer(text)

document
style
type
document
