## **Day 3**

### Basics of Text Processing using NLTK (Contd.)

#### Stemming

Since we are trying to extract as many insights as we can from unstructured text, it's important to have the text in its simplest form. Stemming is a way to normalize words: it reduces the various inflected forms of a word, used in various contexts, to a common stem. The result is not always a grammatically correct or dictionary word. For instance, the stem of both driving and drives is drive. There are multiple algorithms for stemming. Here I am only trying the Porter stemmer, since it is widely used and is often reported to have a lower error rate than many of the alternatives.
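To make the driving/drives example concrete, here is a minimal sketch using NLTK's `PorterStemmer`. The word list is my own illustration and is not taken from the text file used later; "studies" is included to show a stem that is not a dictionary word.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words (an assumption for this sketch, not part of text.txt)
examples = ["driving", "drives", "studies"]

# Map each word to its Porter stem
stems = {w: stemmer.stem(w) for w in examples}
print(stems)
# {'driving': 'drive', 'drives': 'drive', 'studies': 'studi'}
```

Both inflected forms of "drive" collapse to the same stem, while "studies" becomes "studi", which is a stem but not a real word.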


#### Lemmatization

Lemmatization is a similar process that reduces words to a base form. However, it doesn't just truncate the words; it returns a dictionary word (the lemma). When it is applied without context, accuracy can suffer: NLTK's `WordNetLemmatizer` accepts a part-of-speech value and treats every word as a noun when none is given. Lemmatization is also slower than stemming.



In [57]:
# Load the necessary packages

import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [58]:
# Passing encoding = "ISO-8859-1" as I was having trouble loading the txt file.

# Using a context manager so the file is closed after reading
with open("text.txt", encoding="ISO-8859-1") as file:
    text = file.read()

print(text)
print(len(text))

Do play they miss give so up. Words to up style of since world. We leaf to snug on no need. Way own uncommonly travelling now acceptance bed compliment solicitude. Dissimilar admiration so terminated no in contrasted it. Advantages entreaties mr he apartments do. Limits far yet turned highly repair parish talked six. Draw fond rank form nor the day eat. 

Necessary ye contented newspaper zealously breakfast he prevailed. Melancholy middletons yet understood decisively boy law she. Answer him easily are its barton little. Oh no though mother be things simple itself. Dashwood horrible he strictly on as. Home fine in so am good body this hope. 

Wrong do point avoid by fruit learn or in death. So passage however besides invited comfort elderly be me. Walls began of child civil am heard hoped my. Satisfied pretended mr on do determine by. Old post took and ask seen fact rich. Man entrance settling believed eat joy. Money as drift begin on to. Comparison up insipidity especially discovered 
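If the encoding of a file is unknown, one option is to try UTF-8 first and fall back to ISO-8859-1 only when decoding fails. The sketch below is an assumption about how that could be done, not what this notebook does; `read_text` is a hypothetical helper name, and the temporary file only exists to make the demo self-contained.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8", "ISO-8859-1")):
    """Return the file contents decoded with the first encoding that works."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo: a byte sequence that is valid ISO-8859-1 but invalid UTF-8
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"caf\xe9")

demo = read_text(tmp.name)
os.remove(tmp.name)
print(demo)  # café
```

ISO-8859-1 maps every byte to a character, so it never raises a decoding error. That makes it a safe last resort, but it can silently produce wrong characters if the file was actually written in some other encoding.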

In [65]:
# Using word_tokenize to generate the list of words from the above text.

word = word_tokenize(text)
word[:10]

['Do', 'play', 'they', 'miss', 'give', 'so', 'up', '.', 'Words', 'to']

In [66]:
# Using Porter Stemmer for stemming

stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in word]

stemmed[:10]

['Do', 'play', 'they', 'miss', 'give', 'so', 'up', '.', 'word', 'to']

In [70]:
# Using WordNetLemmatizer for lemmatizing the words

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in word]

lemmatized[0:10]

['Do', 'play', 'they', 'miss', 'give', 'so', 'up', '.', 'Words', 'to']

In [71]:
# Part-of-speech (PoS) tags for all the words

tag = nltk.pos_tag(word)
tag[:10]

[('Do', 'VBP'),
 ('play', 'VB'),
 ('they', 'PRP'),
 ('miss', 'VBP'),
 ('give', 'VBP'),
 ('so', 'RB'),
 ('up', 'RB'),
 ('.', '.'),
 ('Words', 'NNS'),
 ('to', 'TO')]
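The Penn Treebank tags above can be fed back into the lemmatizer, since `WordNetLemmatizer.lemmatize` takes a `pos` argument (`"n"`, `"v"`, `"a"` or `"r"`). The sketch below shows one possible mapping; `penn_to_wordnet` is a hypothetical helper name, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet PoS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun (assumption)


# A few (word, tag) pairs copied from the output above
tagged = [("Do", "VBP"), ("play", "VB"), ("so", "RB"), ("Words", "NNS")]
pos_letters = [(w, penn_to_wordnet(t)) for w, t in tagged]
print(pos_letters)
# [('Do', 'v'), ('play', 'v'), ('so', 'r'), ('Words', 'n')]
```

With the `lemmatizer` and `tag` objects from the cells above, this could be applied as `[lemmatizer.lemmatize(w.lower(), pos=penn_to_wordnet(t)) for w, t in tag]`. Lowercasing first is an extra assumption on my part: in the lemmatizer output above, 'Words' came back unchanged, most likely because the WordNet lookup is case-sensitive.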

In [72]:

# First, tokenizing the sentences from the above text
sentence = sent_tokenize(text)

# Creating an object for tfidf vectorizing later
vectorizer = TfidfVectorizer(norm = None)

# Generating output for TF IDF
tfidf = vectorizer.fit_transform(sentence).toarray()

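# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
list(vectorizer.get_feature_names_out()[:10])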



# To be continued ... 

['acceptance',
 'add',
 'admiration',
 'advantages',
 'affronting',
 'all',
 'am',
 'an',
 'and',
 'answer']
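To see what the numbers in the `tfidf` array mean, here is a plain-Python sketch of the weighting that, as far as I understand the scikit-learn defaults, `TfidfVectorizer(norm = None)` applies: raw term count times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, with lowercased tokens of at least two characters. The function name `tfidf_unnormalized` and the two toy sentences are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tfidf_unnormalized(docs):
    """Return one {term: weight} dict per document, with weight = tf * (ln((1 + n) / (1 + df)) + 1)."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


docs = ["the cat sat", "the dog sat down"]  # made-up toy corpus
weights = tfidf_unnormalized(docs)
print(weights[0])
```

A term that appears in every document ("the", "sat") gets an idf of exactly 1, so its weight is just its count, while a term found in only one of the two documents ("cat") gets the larger weight `ln(1.5) + 1`. Because `norm` is `None`, no row normalization is applied afterwards.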