### <b> Problem Statement </b>
1. Take a sample document and apply the following preprocessing methods:<br>
Tokenization, POS tagging, stop word removal, stemming, and lemmatization.<br><br>
2. Create a representation of the document by calculating Term Frequency (TF) and Inverse Document Frequency (IDF).
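
For part 2, a minimal sketch using only the standard library. It assumes the common definitions tf(t, d) = count(t, d) / len(d) and idf(t) = log(N / df(t)); other variants exist (for example, smoothed IDF). The toy corpus `docs` and the helper names `compute_tf` and `compute_idf` are illustrative assumptions, not part of the original exercise.

```python
import math
from collections import Counter

def compute_tf(tokens):
    """Term frequency: occurrences of each term divided by document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def compute_idf(docs):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Toy corpus (assumption): each document is already tokenized and lowercased.
docs = [
    ["today", "is", "my", "practical", "exam"],
    ["this", "is", "sentence", "tokenization"],
    ["my", "name", "is", "siddhant"],
]

idf = compute_idf(docs)
tfidf = [
    {term: tf * idf[term] for term, tf in compute_tf(doc).items()}
    for doc in docs
]

for doc, scores in zip(docs, tfidf):
    print(doc, {term: round(score, 3) for term, score in scores.items()})
```

A term that occurs in every document (here "is") gets an IDF of 0 under this definition, so it contributes nothing to the representation.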


Importing Libraries

### <b> 1. Tokenization: splitting text into sentences </b>

In [1]:
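# One-time setup: download the NLTK resources used in this notebook.
# import nltk
# nltk.download('punkt')                        # sentence and word tokenizers
# nltk.download('averaged_perceptron_tagger')   # POS tagger
# nltk.download('stopwords')                    # stop word lists
# nltk.download('wordnet')                      # WordNet data for lemmatization
# OR run nltk.download() with no arguments to choose packages interactively.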

In [2]:
from nltk.tokenize import sent_tokenize

text = "Hello everyone. My name is Siddhant. Today is my DSBDA practical exam. This is sentence tokenization"

tokenized_sentence = sent_tokenize(text) 

print(tokenized_sentence)

['Hello everyone.', 'My name is Siddhant.', 'Today is my DSBDA practical exam.', 'This is sentence tokenization']


### Splitting sentences into words

In [3]:
from nltk.tokenize import word_tokenize
  
text = "Hello everyone. My name is Siddhant. Today is my DSBDA practical exam. This is sentence tokenization"

tokenized_word = word_tokenize(text)

print(tokenized_word)

['Hello', 'everyone', '.', 'My', 'name', 'is', 'Siddhant', '.', 'Today', 'is', 'my', 'DSBDA', 'practical', 'exam', '.', 'This', 'is', 'sentence', 'tokenization']


### <b> 2. POS Tagging </b>

Part-of-speech tagging for the same sample text:

In [4]:
import nltk

text = "Hello everyone. My name is Siddhant. Today is my DSBDA practical exam. This is sentence tokenization"

# pos_tag expects a list of tokens, so tokenize the text first
tokens = word_tokenize(text)

nltk.pos_tag(tokens)

[('Hello', 'NNP'),
 ('everyone', 'NN'),
 ('.', '.'),
 ('My', 'PRP$'),
 ('name', 'NN'),
 ('is', 'VBZ'),
 ('Siddhant', 'NNP'),
 ('.', '.'),
 ('Today', 'NN'),
 ('is', 'VBZ'),
 ('my', 'PRP$'),
 ('DSBDA', 'NNP'),
 ('practical', 'JJ'),
 ('exam', 'NN'),
 ('.', '.'),
 ('This', 'DT'),
 ('is', 'VBZ'),
 ('sentence', 'JJ'),
 ('tokenization', 'NN')]

### <b> 3. Stop Word Removal </b>

Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. 

The following are the English stop words provided by NLTK:

In [5]:
import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

print(stop_words)

{'should', 'ma', 'hadn', 'hasn', 'than', 'herself', "wouldn't", "you're", 'theirs', 'are', 'won', "it's", 'below', 'were', 'above', 'why', 'no', 'will', 'her', 'itself', 'shouldn', 'if', 'you', 'himself', 'between', 'mustn', 'does', 'your', 'needn', 'aren', 'y', 'was', 'off', 'own', 'such', 'is', 'ain', 'those', 'a', 'once', 'am', 'other', 'yourselves', 'she', "should've", 'he', 'by', 'more', 'over', 'as', "don't", 'the', 'because', "she's", 'they', 'from', 'there', 'i', 'me', 'doesn', 'now', 'couldn', 'some', 'been', 'most', "weren't", 'my', 'whom', 'just', 'ourselves', 'did', 'haven', 'be', 'during', 'wouldn', "mightn't", 'our', 'themselves', 'him', 'these', 'which', 'where', 've', 'all', "isn't", 'them', "you'd", "shan't", 's', 'an', 'for', 'isn', 'or', 'too', 'what', 'myself', 'but', 'few', 're', 'before', 'll', 'yourself', 'being', 'it', 'has', 'wasn', 'not', "shouldn't", 'on', "hasn't", "doesn't", "haven't", 'here', 'can', 'after', 'had', "hadn't", 'out', 'in', 'hers', "you've", 

In [6]:
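# Keep only the tokens that are not in the stop word set.
# The comparison is case-sensitive and the NLTK list is lowercase,
# so capitalized tokens such as 'My' and 'This' are not removed here.
filtered_sentence = []
for w in tokenized_word:
    if w not in stop_words:
        filtered_sentence.append(w)

print("Tokenized Sentence: ", tokenized_word)
print("Filtered Sentence: ", filtered_sentence)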

Tokenized Sentence:  ['Hello', 'everyone', '.', 'My', 'name', 'is', 'Siddhant', '.', 'Today', 'is', 'my', 'DSBDA', 'practical', 'exam', '.', 'This', 'is', 'sentence', 'tokenization']
Filtered Sentence:  ['Hello', 'everyone', '.', 'My', 'name', 'Siddhant', '.', 'Today', 'DSBDA', 'practical', 'exam', '.', 'This', 'sentence', 'tokenization']
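
In the output above, 'My' and 'This' survive the filter because the membership test is case-sensitive while the NLTK stop words are all lowercase. A minimal sketch of a case-insensitive variant follows. The helper `remove_stop_words` and the tiny `demo_stop_words` set are assumptions made so the example runs without any NLTK data; in the notebook, `stop_words` from the cell above would be passed instead.

```python
def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is a stop word, keeping original casing."""
    return [w for w in tokens if w.lower() not in stop_words]

# Small hand-picked subset for illustration only (assumption);
# in the notebook this would be set(stopwords.words("english")).
demo_stop_words = {"my", "is", "this"}

tokens = ['Hello', 'everyone', '.', 'My', 'name', 'is', 'Siddhant', '.',
          'Today', 'is', 'my', 'DSBDA', 'practical', 'exam', '.',
          'This', 'is', 'sentence', 'tokenization']

filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

Lowercasing only for the comparison keeps the original casing of the remaining tokens, which can matter for later steps such as POS tagging.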


### <b> 4. Stemming </b>

Stemming reduces inflected words to a root form (the stem) by stripping suffixes; the stem need not be a dictionary word.

In [14]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

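# Note: ps.stem() expects a single word. Passing a whole sentence treats it as
# one long "word": the string is lowercased and only its ending is changed.
example = "My name is siddhangt. Send sending sent"
output = ps.stem(example)
print("Example: ", example)
print("output: ", output)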


stemmed_words = []

for w in filtered_sentence:
    stemmed_words.append(ps.stem(w))

print("Filtered Sentence: ", filtered_sentence)
print("Stemmed Sentence: ", stemmed_words)

Example:  My name is siddhangt. Send sending sent
output:  my name is siddhangt. send sending s
Filtered Sentence:  ['Hello', 'everyone', '.', 'My', 'name', 'Siddhant', '.', 'Today', 'DSBDA', 'practical', 'exam', '.', 'This', 'sentence', 'tokenization']
Stemmed Sentence:  ['hello', 'everyon', '.', 'my', 'name', 'siddhant', '.', 'today', 'dsbda', 'practic', 'exam', '.', 'thi', 'sentenc', 'token']
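
`ps.stem()` works on one word at a time, which is why the `example` string above comes back almost unchanged. A hedged sketch of stemming word by word instead; it uses plain `str.split()` as a stand-in for `word_tokenize` so that no NLTK data download is needed (an assumption made to keep the example self-contained).

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

example = "Send sending sent"

# Stem each word separately; str.split() is a stand-in for word_tokenize here.
stems = [ps.stem(word) for word in example.split()]
print(stems)
```

The Porter stemmer only applies suffix rules, so an irregular form such as "sent" is left as it is; lemmatization is the usual tool for such forms.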


### <b> 5. Lemmatization </b>

Lemmatization maps a word to its dictionary form (lemma), optionally using its part of speech.

In [11]:
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
 
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
print("disfunctional :", lemmatizer.lemmatize("disfunctional"))
 
# pos="a" tells the lemmatizer to treat the word as an adjective (the default is noun)
print("better :", lemmatizer.lemmatize("better", pos="a"))

rocks : rock
corpora : corpus
disfunctional : disfunctional
better : good


In [16]:
import nltk

text = "My name is John"

postag = word_tokenize(text)

nltk.pos_tag(postag)

[('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('John', 'NNP')]
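
The `pos` argument of `lemmatize` takes WordNet letters ('n', 'v', 'a', 'r'), whereas `nltk.pos_tag` returns Penn Treebank tags such as 'VBZ'. A minimal sketch of a bridge between the two; `penn_to_wordnet` is a hypothetical helper, and falling back to noun for unmapped tags is an assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun if unmapped)."""
    if tag.startswith("JJ"):
        return "a"  # adjective, same value as nltk.corpus.wordnet.ADJ
    if tag.startswith("VB"):
        return "v"  # verb, same value as nltk.corpus.wordnet.VERB
    if tag.startswith("RB"):
        return "r"  # adverb, same value as nltk.corpus.wordnet.ADV
    return "n"      # noun, same value as nltk.corpus.wordnet.NOUN

# Tagged tokens copied from the output above.
tagged = [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('John', 'NNP')]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)

# With the WordNetLemmatizer created earlier, the pairs could be used as:
# lemmas = [lemmatizer.lemmatize(word, pos=pos) for word, pos in pairs]
```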