# Text Mining Demo 

The volume of unstructured text on the internet has exploded in the age of social media. This data can help answer important business questions, for example how customers feel about a product (sentiment analysis) or how to improve recommendations on an e-commerce platform. <br>
Each of these problems calls for different methods, as we will see in the Text Mining course. NLTK (the Natural Language Toolkit, imported as `nltk`) is a widely used Python toolkit for working with unstructured natural language data. <br>
This short demo covers some of its most basic functionality. The cells below assume that the NLTK data packages they rely on (tokenizer models, the stopword list, WordNet and the POS tagger model) have already been downloaded, for example with `nltk.download()`. <br>
To read more about NLTK, see: https://www.nltk.org/

In [3]:
import nltk

## 1. Tokenization

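Tokenization is the first step in text analytics. It is the process of breaking a piece of text into smaller units, called tokens, such as sentences or words. A token is a single unit that serves as a building block for a sentence or paragraph. Below, `sent_tokenize` splits the text into sentences and `word_tokenize` splits it into words and punctuation marks.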

In [4]:
from nltk.tokenize import sent_tokenize
text="Hello, how are you doing today? The sky is pinkish-blue. The weather is great, and city is awesome. Hope you have a great day."
tokenized_text=sent_tokenize(text)
print(tokenized_text)

['Hello, how are you doing today?', 'The sky is pinkish-blue.', 'The weather is great, and city is awesome.', 'Hope you have a great day.']


In [13]:
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)

['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'sky', 'is', 'pinkish-blue', '.', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'Hope', 'you', 'have', 'a', 'great', 'day', '.']
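
To make the idea concrete, word-level tokenization can be approximated with a single regular expression. This is only a rough sketch, not how `word_tokenize` works internally: the pattern and the helper name `simple_word_tokenize` are assumptions made for illustration, and NLTK's tokenizer handles contractions, abbreviations and other cases that this pattern ignores.

```python
import re

# Hypothetical helper for illustration; not part of NLTK.
# A token is either a run of word characters (optionally joined by hyphens)
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens with a regex."""
    return TOKEN_PATTERN.findall(text)

sample = "The sky is pinkish-blue. The weather is great, and city is awesome."
print(simple_word_tokenize(sample))
```

On the sentences used in this demo the sketch yields the same tokens as `word_tokenize`, but the two diverge on inputs such as contractions (for example "don't").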


## 2. Stopwords 
Stopwords are words that occur with very high frequency in a corpus but carry little meaning on their own. For example, 'the', 'a' and 'I' are extremely common in English. <br> 
Hence, during pre-processing for some algorithms, we might want to remove these words from our text, as shown below. Note that NLTK's stopword list is lowercase and the membership test is case-sensitive, so a capitalized token such as 'The' is not removed unless the tokens are lowercased first.

In [17]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

{'himself', 'aren', 'these', 'hasn', 'can', 'by', 've', 'not', "you're", 'themselves', "shan't", 'been', 'yourself', 'it', 'against', 'doesn', 'after', 'him', 'has', 'wouldn', "shouldn't", 'down', "you'll", 'other', 'more', "aren't", 'couldn', 'hadn', "haven't", 'ours', 'very', 'some', 'for', 'haven', 'o', 'mightn', 'nor', 'didn', 'is', 'through', 's', 'were', 'same', "didn't", 'needn', 'no', "mustn't", 'myself', "couldn't", "wasn't", 'if', 'just', "don't", 'yourselves', 'each', 'how', 'that', 'our', 'we', 'again', 'out', 'most', 'an', 'but', 'me', 'wasn', 'up', 'to', "hasn't", 'at', 'when', 'off', 'them', 'should', 'few', 'below', 'doing', 'own', 'was', 'those', 'then', 'had', 'over', 'a', 'does', 'don', 'mustn', 'will', 'now', "weren't", 'i', 'ourselves', 'shan', 'any', 'ain', "she's", 'yours', 'too', 'he', 'this', 'because', 'than', 'who', 'further', 'its', 'are', 'won', 'here', 'weren', 'do', "should've", 'such', 'during', 'as', 'only', 'their', 'whom', "that'll", 'so', 'y', 'ma', 

In [19]:
# Keep only the tokens that are not in the stopword set.
filtered_text=[]
for w in tokenized_word:
    if w not in stop_words:
        filtered_text.append(w)
print("Filtered Text:",filtered_text)

Filtered Text: ['Hello', ',', 'today', '?', 'The', 'sky', 'pinkish-blue', '.', 'The', 'weather', 'great', ',', 'city', 'awesome', '.', 'Hope', 'great', 'day', '.']
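
The filter above compares tokens exactly as they appear, so capitalized stopwords and punctuation survive. A common variation, sketched below, lowercases each token before the lookup and drops tokens that contain no letters or digits. The small `demo_stop_words` set is a hand-picked subset used only to keep the example self-contained, and `remove_stopwords` is a hypothetical helper; in the notebook you would pass the `stop_words` set loaded from NLTK instead.

```python
# Tokens copied from the word_tokenize output earlier in this demo.
tokens = ['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?',
          'The', 'sky', 'is', 'pinkish-blue', '.', 'The', 'weather', 'is',
          'great', ',', 'and', 'city', 'is', 'awesome', '.', 'Hope', 'you',
          'have', 'a', 'great', 'day', '.']

# Assumption: a tiny hand-picked stopword subset, for illustration only.
demo_stop_words = {"how", "are", "you", "doing", "the", "is", "and", "have", "a"}

def remove_stopwords(tokens, stop_words):
    """Drop stopwords (case-insensitively) and punctuation-only tokens."""
    return [
        w for w in tokens
        if w.lower() not in stop_words and any(ch.isalnum() for ch in w)
    ]

print(remove_stopwords(tokens, demo_stop_words))
```

Whether to lowercase or strip punctuation depends on the downstream task, so treat this as one possible pre-processing choice rather than a required step.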


## 3. Stemming 

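Stemming is a form of text normalization that removes another type of noise: variation in word endings. It reduces inflected and derivationally related forms of a word to a common stem, typically by stripping suffixes according to rules. For example, 'connecting', 'connected' and 'connection' all reduce to the stem 'connect'. The stem need not be a dictionary word.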

In [26]:
related_words=['connecting','connected','connection']
from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_words=[]
for w in related_words:
    stemmed_words.append(ps.stem(w))

print("PorterStemmer: Stemmed words:",stemmed_words)

PorterStemmer: Stemmed words: ['connect', 'connect', 'connect']
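
To show the basic idea of rule-based suffix stripping, here is a deliberately naive sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list, the `min_stem_len` threshold and the function name `naive_stem` are all assumptions chosen for this illustration.

```python
# Assumption: a tiny suffix list chosen only for this illustration.
SUFFIXES = ("ing", "ion", "ed")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

related_words = ['connecting', 'connected', 'connection']
print([naive_stem(w) for w in related_words])
print(naive_stem('nation'))  # crude rules also cut words they should not
```

Rules this crude over-stem easily, which is one reason to rely on an established stemmer such as `PorterStemmer` in practice.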


## 4. Lemmatization

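Lemmatization reduces a word to its lemma, the linguistically correct base (dictionary) form. It does so using a vocabulary and morphological analysis, which usually makes it more sophisticated than stemming. `WordNetLemmatizer.lemmatize` treats the word as a noun unless a part-of-speech argument is given, which is why 'flying' is only reduced to 'fly' when it is passed with "v" (verb) below.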

In [29]:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

word = "flying"
print("Lemmatized Word:",lem.lemmatize(word))
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",ps.stem(word))

Lemmatized Word: flying
Lemmatized Word: fly
Stemmed Word: fli
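
The role of the part-of-speech argument can be illustrated with a toy dictionary lookup. This is only a sketch of the idea: the `TOY_LEMMAS` table and the `toy_lemmatize` function are hypothetical, and WordNet's real lookup combines a much larger vocabulary with morphological rules.

```python
# Hypothetical mini-vocabulary mapping (word, part of speech) to a lemma.
TOY_LEMMAS = {
    ("flying", "v"): "fly",
    ("flies", "v"): "fly",
    ("flies", "n"): "fly",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("flying"))       # no noun entry, so the word is returned as is
print(toy_lemmatize("flying", "v"))  # verb entry found
```

Because the lookup is keyed on both the word and its part of speech, the same surface form can map to different lemmas, or to none, depending on how it is used.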


## 5. Frequency distribution

A frequency distribution counts how many times each distinct token occurs in a text. Here it is computed over the word tokens from the tokenization step.

In [14]:
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_word)
print(fdist)
print(fdist.most_common(5))
# samples indicates the number of unique elements in the text
# outcomes indicates the total number of elements in the text

<FreqDist with 22 samples and 30 outcomes>
[('is', 3), ('.', 3), (',', 2), ('you', 2), ('The', 2)]
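
`FreqDist` is built on Python's `collections.Counter`, so the same counts can be reproduced with the standard library alone. The sketch below hard-codes the token list from the `word_tokenize` output earlier in this demo and recomputes the number of samples and outcomes; the variable names are illustrative.

```python
from collections import Counter

# Tokens copied from the word_tokenize output earlier in this demo.
tokens = ['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?',
          'The', 'sky', 'is', 'pinkish-blue', '.', 'The', 'weather', 'is',
          'great', ',', 'and', 'city', 'is', 'awesome', '.', 'Hope', 'you',
          'have', 'a', 'great', 'day', '.']

counts = Counter(tokens)

n_samples = len(counts)            # number of unique tokens
n_outcomes = sum(counts.values())  # total number of tokens

print(n_samples, "samples and", n_outcomes, "outcomes")
print(counts["is"], counts["great"])
```

Note that 'The' and 'the' would be counted separately, and punctuation counts as tokens, so lowercasing or filtering before counting changes the result.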


## 6. POS tagging
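The goal of POS tagging is to label each word in the text with its part of speech, such as noun, pronoun, adjective, verb or adverb. `nltk.pos_tag` returns Penn Treebank tags by default, for example NNP (proper noun, singular), VBZ (verb, third-person singular present), DT (determiner), RB (adverb), JJ (adjective) and NN (noun, singular).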

In [30]:
sent="NLP is a rapidly developing field."
tokens=nltk.word_tokenize(sent)
print(tokens)

['NLP', 'is', 'a', 'rapidly', 'developing', 'field', '.']


In [31]:
nltk.pos_tag(tokens)

[('NLP', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('rapidly', 'RB'),
 ('developing', 'JJ'),
 ('field', 'NN'),
 ('.', '.')]
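
The Penn Treebank tags in the output can be grouped into the coarse categories mentioned above by looking at their prefixes. The mapping below is a partial sketch covering only a few common tag families, and the names `PREFIX_TO_COARSE` and `coarse_pos` are assumptions made for this example.

```python
# Tagged tokens copied from the nltk.pos_tag output above.
tagged = [('NLP', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('rapidly', 'RB'),
          ('developing', 'JJ'), ('field', 'NN'), ('.', '.')]

# Partial, illustrative mapping from Penn Treebank tag prefixes to coarse classes.
PREFIX_TO_COARSE = [
    ("NN", "NOUN"),
    ("VB", "VERB"),
    ("JJ", "ADJECTIVE"),
    ("RB", "ADVERB"),
    ("PRP", "PRONOUN"),
    ("DT", "DETERMINER"),
]

def coarse_pos(tag):
    """Map a Penn Treebank tag to a coarse class, or 'OTHER' if unmapped."""
    for prefix, coarse in PREFIX_TO_COARSE:
        if tag.startswith(prefix):
            return coarse
    return "OTHER"

for word, tag in tagged:
    print(word, tag, coarse_pos(tag))
```

Tags outside these families, such as punctuation or wh-words, fall through to "OTHER" in this sketch.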