<a href="https://colab.research.google.com/github/amazingashis/Natural-Language-Processing/blob/main/Practical_Assignment_II_Ashish_Adhikari.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tokenization**
Tokenization is one of the first steps in any text-processing pipeline. It is a
segmentation technique that breaks a larger piece of text into smaller,
meaningful units called tokens. Tokens are usually words and numbers, but they
can also include punctuation marks, symbols, and, at times, emoticons. The
examples below use Python's `str.split()`, which splits on whitespace only, so
contractions such as `we'll` and trailing punctuation such as `place!!!` stay
attached to the neighbouring word.

In [1]:
sentence = "The capital of China is Beijing"
sentence.split()

['The', 'capital', 'of', 'China', 'is', 'Beijing']

In [2]:
sentence = "China's capital is Beijing"
sentence.split()

["China's", 'capital', 'is', 'Beijing']

In [3]:
sentence = "Beijing is where we'll go"
sentence.split()

['Beijing', 'is', 'where', "we'll", 'go']

In [4]:
sentence = "I'm going to travel to Beijing"
sentence.split()

["I'm", 'going', 'to', 'travel', 'to', 'Beijing']

In [5]:
sentence = "A friend is pursuing his M.S from Beijing"
sentence.split()

['A', 'friend', 'is', 'pursuing', 'his', 'M.S', 'from', 'Beijing']

In [6]:
sentence = "Beijing is a cool place!!! :-P <3 #Awesome"
sentence.split()

['Beijing', 'is', 'a', 'cool', 'place!!!', ':-P', '<3', '#Awesome']

# **Different types of tokenizers**

### **Regular expressions-based tokenizers**
The nltk package in Python provides a regular-expression-based tokenizer,
`RegexpTokenizer`. It tokenizes a string by returning the substrings that match a supplied regular expression.

In [7]:
from nltk.tokenize import RegexpTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$3000.0',
 '-',
 '$8000.0',
 'in',
 'USA',
 '.']
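
To see what the pattern itself is doing, the same alternation can be run through the standard library. This is a minimal sketch, assuming that `RegexpTokenizer` in its default mode returns every non-overlapping match of the pattern, which is what `re.findall` does; it is not NLTK's implementation.

```python
import re

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."

# Alternatives are tried left to right at each position:
#   \w+        a run of word characters
#   \$[\d.]+   a dollar sign followed by digits and dots (a price)
#   \S+        any other run of non-whitespace characters
pattern = r"\w+|\$[\d.]+|\S+"

tokens = re.findall(pattern, s)
print(tokens)
```

Because `\w+` cannot start at `$`, the price alternative gets a chance to match `$3000.0` as one token, and the final `\S+` picks up leftovers such as `-` and the closing period.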

### **Blankline Tokenizer**
Other tokenizers are built on top of `RegexpTokenizer`. One of them is
`BlanklineTokenizer`, which splits a string on blank lines, where a blank line
is one that contains no characters other than spaces or tabs.


In [8]:
from nltk.tokenize import BlanklineTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n\n I want a book as well"
tokenizer = BlanklineTokenizer()
tokenizer.tokenize(s)

['A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.',
 'I want a book as well']
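
The same split can be approximated with the standard library. This is a sketch only: the delimiter pattern below is an assumption about what counts as a blank line (two newlines with optional whitespace around them), chosen to match the behaviour seen above.

```python
import re

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n\n I want a book as well"

# Split on a blank line: newline, optional whitespace, newline,
# plus any whitespace surrounding that pair. Empty pieces are dropped.
blank_line = r"\s*\n\s*\n\s*"
paragraphs = [piece for piece in re.split(blank_line, s) if piece]
print(paragraphs)
```

The leading space before `I want` disappears because the trailing `\s*` in the pattern consumes it as part of the delimiter.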

### **Treebank Tokenizer**
The Treebank tokenizer also uses regular expressions, tokenizing text
according to the Penn Treebank conventions. It splits contractions, so
`doesn't` becomes `does` and `n't`. It also splits off periods that appear at
the end of a line into separate tokens, and it splits off punctuation such as
commas when they are followed by whitespace.


In [9]:
from nltk.tokenize import TreebankWordTokenizer
s = "I'm going to buy a Rolex watch which doesn't cost more than $3000.0"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)

['I',
 "'m",
 'going',
 'to',
 'buy',
 'a',
 'Rolex',
 'watch',
 'which',
 'does',
 "n't",
 'cost',
 'more',
 'than',
 '$',
 '3000.0']

### **Tweet Tokenizer**
The rise of social media has produced an informal style of writing in which
people tag each other with their handles and use emoticons, hashtags, and
abbreviated text to express themselves. Tokenizing such text calls for a
tokenizer that keeps these elements intact. `TweetTokenizer` is designed for
this use case: it keeps handles, hashtags, and emoticons as single tokens, and
its `strip_handles` and `reduce_len` options remove handles and shorten runs of
repeated characters.

In [10]:
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer()
tokenizer.tokenize(s)

['@amankedia',
 "I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxxxxxxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

In [11]:
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokenizer.tokenize(s)

["I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']
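
The effect of `reduce_len` on `Rolexxxxxxxx` can be imitated with a single regular expression. This is a hedged sketch, assuming the option caps any run of three or more identical characters at three, as the output above suggests; `shorten_repeats` is a hypothetical helper, not part of NLTK.

```python
import re

def shorten_repeats(text):
    """Cap any run of three or more identical characters at three."""
    # (.) captures one character; \1{2,} requires at least two more copies of it.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

print(shorten_repeats("Rolexxxxxxxx"))
print(shorten_repeats("watch!!!"))
```

A run of exactly three characters, such as `!!!`, is left unchanged, which is why the three exclamation marks still appear in the tokenized output.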

# **Understanding Word Normalization**

### **Stemming**

In [12]:
from nltk.stem.snowball import SnowballStemmer
print(SnowballStemmer.languages)


('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [13]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating',
           'siezing', 'itemization', 'traditional', 'reference', 'colonizer', 'plotted', 'having', 'generously']

In [14]:
from nltk.stem.porter import PorterStemmer 
stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have gener


In [15]:
stemmer2 = SnowballStemmer(language='english')
singles = [stemmer2.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have generous


## **Lemmatization**

#### **WordNet lemmatizer**

In [16]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer 

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [17]:
lemmatizer = WordNetLemmatizer()
s = "We are putting in efforts to enhance our understanding of Lemmatization"
token_list = s.split()
print("The tokens are: ", token_list)
lemmatized_output = ' '.join([lemmatizer.lemmatize(token) for token in token_list])
print("The lemmatized output is: ", lemmatized_output)

The tokens are:  ['We', 'are', 'putting', 'in', 'efforts', 'to', 'enhance', 'our', 'understanding', 'of', 'Lemmatization']
The lemmatized output is:  We are putting in effort to enhance our understanding of Lemmatization


#### **POS Tagging**

In [18]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(token_list)
pos_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('We', 'PRP'),
 ('are', 'VBP'),
 ('putting', 'VBG'),
 ('in', 'IN'),
 ('efforts', 'NNS'),
 ('to', 'TO'),
 ('enhance', 'VB'),
 ('our', 'PRP$'),
 ('understanding', 'NN'),
 ('of', 'IN'),
 ('Lemmatization', 'NN')]

In [19]:
from nltk.corpus import wordnet

# A helper commonly used in the NLP community to turn Penn Treebank tags
# into the POS constants that the WordNet lemmatizer understands.

def get_part_of_speech_tags(token):

    """Map a token's POS tag to the POS argument that lemmatize() accepts.
    Only verbs, nouns, adjectives and adverbs are mapped; anything else
    falls back to noun."""

    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # Tag the token on its own and keep the first letter of its tag (e.g. 'VBG' -> 'V')
    tag = nltk.pos_tag([token])[0][1][0].upper()

    return tag_dict.get(tag, wordnet.NOUN)

In [20]:
lemmatized_output_with_POS_information = [lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list]
print(' '.join(lemmatized_output_with_POS_information))

We be put in effort to enhance our understand of Lemmatization


#### **Comparison**

In [21]:
stemmer2 = SnowballStemmer(language='english')
stemmed_sentence = [stemmer2.stem(token) for token in token_list]
print(' '.join(stemmed_sentence))

we are put in effort to enhanc our understand of lemmat


#### **spaCy lemmatizer**

In [22]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("We are putting in efforts to enhance our understanding of Lemmatization")
" ".join([token.lemma_ for token in doc])

'-PRON- be put in effort to enhance -PRON- understanding of lemmatization'

#### **Stop Word Removal**

In [23]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
", ".join(stop)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


"her, should've, is, after, ma, during, again, don, will, himself, same, don't, didn, which, he, up, wouldn't, other, myself, s, t, o, ve, i, most, yours, we, had, any, won, few, mustn't, hers, him, herself, this, very, on, between, itself, doesn, ourselves, who, has, wouldn, hasn't, hadn, doesn't, through, my, haven, re, hadn't, because, aren, more, there, than, can, needn't, couldn, ours, in, am, it, ain, that, weren't, couldn't, off, shan, needn, being, be, for, not, that'll, them, but, she's, it's, while, only, such, to, below, of, down, did, those, their, both, ll, you'll, its, by, aren't, no, does, d, doing, over, from, so, or, just, been, she, your, under, our, you're, a, won't, shouldn't, as, you've, isn, too, theirs, how, at, until, mightn, and, before, nor, shouldn, whom, wasn, were, mustn, above, you, why, all, shan't, yourselves, once, me, some, yourself, own, was, haven't, with, wasn't, having, should, you'd, if, then, didn't, do, the, here, now, what, his, weren, they, y,

In [24]:
wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']

stop = set(stopwords.words('english'))

sentence = "how are we putting in efforts to enhance our understanding of Lemmatization"

for word in wh_words:
    stop.remove(word)

sentence_after_stopword_removal = [token for token in sentence.split() if token not in stop]
" ".join(sentence_after_stopword_removal)

'how putting efforts enhance understanding Lemmatization'
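
The stop words listed above are all lowercase, so a capitalized token such as `We` would pass a case-sensitive membership test. A minimal sketch of a case-insensitive check follows; the `stop_words` set is a tiny hypothetical list used only for illustration, not NLTK's English list.

```python
# Hypothetical stop list for illustration only; not NLTK's list.
stop_words = {"we", "are", "in", "to", "our", "of"}

sentence = "We are putting in efforts to enhance our understanding of Lemmatization"

# Compare the lowercased token against the stop list,
# but keep the token's original casing in the output.
kept = [token for token in sentence.split() if token.lower() not in stop_words]
print(" ".join(kept))
```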

#### **Case Folding**

In [25]:
s = "We are putting in efforts to enhance our understanding of Lemmatization"
s = s.lower()
s

'we are putting in efforts to enhance our understanding of lemmatization'
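
For text that is not plain English, Python also offers `str.casefold()`, which is more aggressive than `str.lower()`. The short sketch below uses German spellings of "street" as an illustrative example that is not from the original notebook.

```python
words = ["Straße", "STRASSE", "strasse"]

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)  # 'ß' survives lower(), so the first entry still differs
print(folded)   # casefold() maps 'ß' to 'ss', so all three entries match
```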

### **N-Grams**

In [26]:
from nltk.util import ngrams
s = "Natural Language Processing is the way to go"
tokens = s.split()
bigrams = list(ngrams(tokens, 2))
[" ".join(token) for token in bigrams]

['Natural Language',
 'Language Processing',
 'Processing is',
 'is the',
 'the way',
 'way to',
 'to go']

In [27]:
s = "Natural Language Processing is the way to go"
tokens = s.split()
trigrams = list(ngrams(tokens, 3))
[" ".join(token) for token in trigrams]

['Natural Language Processing',
 'Language Processing is',
 'Processing is the',
 'is the way',
 'the way to',
 'way to go']
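
An n-gram is just a window of n consecutive tokens sliding over the list. The sketch below reproduces the idea in plain Python; `make_ngrams` is a hypothetical helper written for illustration and is not how `nltk.util.ngrams` is implemented.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped together, so the
    # i-th tuple holds tokens i .. i+n-1; zip stops at the shortest slice.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Natural Language Processing is the way to go".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print([" ".join(gram) for gram in bigrams])
print([" ".join(gram) for gram in trigrams])
```

A list of k tokens yields k - n + 1 n-grams, which is why the eight-token sentence gives seven bigrams and six trigrams.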

### **Removing HTML Tags**

In [28]:
html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly instead of letting bs4 guess
text = soup.get_text()
print(text)

My First HeadingMy first paragraph.
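
In the output above the heading and the paragraph run together because no separator is placed between the text nodes. The sketch below shows one way to keep them apart using only the standard library; `TextExtractor` is a hypothetical helper, and it makes no attempt to skip `<script>` or `<style>` content.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text found between tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())

html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"

parser = TextExtractor()
parser.feed(html)
parser.close()

# Join the collected pieces with a space so adjacent elements stay readable.
text = " ".join(parser.chunks)
print(text)
```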
