# Text Mining

Text mining means extracting patterns, insights, and other information from textual data. It falls under NLP (Natural Language Processing).

Applications
1. Speech recognition
2. Spam filtering
3. E-commerce personalization
4. Sentiment Analysis

Text -> Text preprocessing -> Text transformation -> Attribute selection (to reduce dimensionality; PCA/LDA will not work here) / Pattern detection -> Data mining -> Interpretation/Evaluation
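
The first stages of this pipeline can be sketched in plain Python. This is only an illustration, not the NLTK workflow used in the cells below: the helper names (`preprocess`, `to_bag_of_words`, `select_attributes`), the tiny stop-word set, and the `min_docs` threshold are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of"}

def preprocess(text):
    """Text preprocessing: lowercase, split into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_bag_of_words(docs):
    """Text transformation: one word-count vector (a Counter) per document."""
    return [Counter(preprocess(doc)) for doc in docs]

def select_attributes(bags, min_docs=2):
    """Attribute selection: keep only words found in at least min_docs documents."""
    doc_freq = Counter(word for bag in bags for word in bag)
    keep = {w for w, n in doc_freq.items() if n >= min_docs}
    return [{w: c for w, c in bag.items() if w in keep} for bag in bags]

docs = [
    "The king asked the guards to release the thieves.",
    "The thieves hid behind the bushes.",
    "The king heard rustling in the bushes.",
]
bags = to_bag_of_words(docs)
reduced = select_attributes(bags)
print(reduced)
```

Dropping words that appear in only one document is just one simple way to shrink the vocabulary; the right selection rule depends on the data and the mining task.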

NLTK (Natural Language Toolkit) is a set of Python modules for working with human language data.

__1. Tokenization__ is the process of breaking a stream of text into words (word tokenizer) or sentences (sentence tokenizer) using spaces and punctuation (full stop, !, etc.).

__2. Stop word removal__ - removing common words such as a, an, and, the, etc.
There is a stopwords package in nltk.corpus. You need to initialize it with the language you want to use.

__3. Get the root words__
Stemming and Lemmatization are ways to get the root word.

__Stemming__ reduces a word to its stem or base (root) form by removing affixes, e.g. helps/helping/helped/help/helper (forms) -> help (stem), where the suffixes are -s, -ing, -ed, -, -er.
The PorterStemmer in nltk.stem is used for stemming.
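
To make suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm, which applies ordered rules with conditions on the remaining stem; `naive_stem` and its suffix list are hypothetical and chosen only to reproduce the help example above.

```python
SUFFIXES = ("ing", "ed", "er", "s")  # assumed suffix list for this toy example

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, if enough of the word is left over."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

forms = ["helps", "helping", "helped", "help", "helper"]
stems = [naive_stem(w) for w in forms]
print(stems)
```

The length guard keeps short words such as 'king' intact. A real stemmer needs many more rules than this, which is why a library implementation is used in the cells below.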

One disadvantage of stemming is that some words become meaningless when stemmed, e.g. beautiful becomes beauti, hence the need for lemmatization.

__Lemmatization__ uses a vocabulary list (built-in dictionary) and morphological analysis (the POS of a word) to get the root word. Context is taken into account through the POS tag. The result is a valid dictionary word: lemmatizing 'beautiful' leaves it as 'beautiful' rather than the truncated 'beauti'.
The WordNetLemmatizer class in nltk.stem is used. Without a POS tag, the lemmatizer assumes everything is a noun. For example, 'saw' can be a verb or a noun: in 'I saw a bird' it is a verb whose root is 'see', while in 'the saw cut the wood' it is a noun and stays 'saw'. It is meant for English words.


__4. POS Tagging__ (POS-Parts of Speech)
POS tagging marks each word in the corpus with its corresponding part of speech (e.g. verb, noun), based on its context and definition.

POS tags are useful for lemmatization, for building NERs (Named Entity Recognition), and for extracting relations between words.

__Named Entity Recognition__ (NER)
NER is done __after__ POS tagging. It seeks to extract real-world entities from the text and sort them into pre-defined categories such as the names of persons, organizations, locations, etc.

There are 7 types of NER tags available
- Person
- Organization 
- Percent
- Date
- Geo-political entity
- Location
- Currency

__5. Information retrieval__ Extracting relevant information from source.

# 1. Import nltk

In [1]:
import nltk
# nltk.download()

In [2]:
text = "Once when King Krishnadevaraya had gone to survey the jail, two burglars who were prisoners there, asked for his mercy. They told him that they were experts at burglary and could help the king in catching other thieves.The king being a kind ruler asked his guards to release them but with a condition. He told the burglars that he would release them and appoint them as his spies only if they could break into his advisor Tenali Raman’s house and steal valuables from there. The thieves agreed for the challenge.That same night the two thieves went to Tenali Raman’s house and hid behind some bushes. After dinner, when Tenali Raman came out for a stroll, he heard some rustling in the bushes. He at once perceived the existence of thieves in his garden."

text

'Once when King Krishnadevaraya had gone to survey the jail, two burglars who were prisoners there, asked for his mercy. They told him that they were experts at burglary and could help the king in catching other thieves.The king being a kind ruler asked his guards to release them but with a condition. He told the burglars that he would release them and appoint them as his spies only if they could break into his advisor Tenali Raman’s house and steal valuables from there. The thieves agreed for the challenge.That same night the two thieves went to Tenali Raman’s house and hid behind some bushes. After dinner, when Tenali Raman came out for a stroll, he heard some rustling in the bushes. He at once perceived the existence of thieves in his garden.'

# 2. Simple Preprocessing

__Convert the text to lowercase__ so that the same word is not treated as two different words because of capitalization, e.g. King vs king.

In [3]:
text = text.lower()
text

'once when king krishnadevaraya had gone to survey the jail, two burglars who were prisoners there, asked for his mercy. they told him that they were experts at burglary and could help the king in catching other thieves.the king being a kind ruler asked his guards to release them but with a condition. he told the burglars that he would release them and appoint them as his spies only if they could break into his advisor tenali raman’s house and steal valuables from there. the thieves agreed for the challenge.that same night the two thieves went to tenali raman’s house and hid behind some bushes. after dinner, when tenali raman came out for a stroll, he heard some rustling in the bushes. he at once perceived the existence of thieves in his garden.'

# 3. Sentence tokenization

In [4]:
from nltk import sent_tokenize

# split the text into a list of sentences
sent_text = sent_tokenize(text)

print(sent_text)

['once when king krishnadevaraya had gone to survey the jail, two burglars who were prisoners there, asked for his mercy.', 'they told him that they were experts at burglary and could help the king in catching other thieves.the king being a kind ruler asked his guards to release them but with a condition.', 'he told the burglars that he would release them and appoint them as his spies only if they could break into his advisor tenali raman’s house and steal valuables from there.', 'the thieves agreed for the challenge.that same night the two thieves went to tenali raman’s house and hid behind some bushes.', 'after dinner, when tenali raman came out for a stroll, he heard some rustling in the bushes.', 'he at once perceived the existence of thieves in his garden.']
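
Notice that the second and fourth items above each contain two sentences: the source text has no space after the period in `thieves.The` and `challenge.That`, so the tokenizer does not see a boundary there. One possible fix, sketched here with the standard `re` module, is to insert the missing space before tokenizing. The pattern is an assumption that suits this text; it would also split strings such as `example.com`, so check it against your own data.

```python
import re

def add_space_after_period(text):
    """Insert a space after a period that is directly followed by a letter."""
    return re.sub(r"\.(?=[A-Za-z])", ". ", text)

sample = "could help the king in catching other thieves.the king being a kind ruler"
fixed = add_space_after_period(sample)
print(fixed)
```

Running the cleaned text through `sent_tokenize` and `word_tokenize` should then avoid merged tokens such as `thieves.the`.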


# 4. Word tokenization

In [5]:
from nltk import word_tokenize

word_text = []

for sent in sent_text:
    word_text.append(word_tokenize(sent))
    
print(word_text)

[['once', 'when', 'king', 'krishnadevaraya', 'had', 'gone', 'to', 'survey', 'the', 'jail', ',', 'two', 'burglars', 'who', 'were', 'prisoners', 'there', ',', 'asked', 'for', 'his', 'mercy', '.'], ['they', 'told', 'him', 'that', 'they', 'were', 'experts', 'at', 'burglary', 'and', 'could', 'help', 'the', 'king', 'in', 'catching', 'other', 'thieves.the', 'king', 'being', 'a', 'kind', 'ruler', 'asked', 'his', 'guards', 'to', 'release', 'them', 'but', 'with', 'a', 'condition', '.'], ['he', 'told', 'the', 'burglars', 'that', 'he', 'would', 'release', 'them', 'and', 'appoint', 'them', 'as', 'his', 'spies', 'only', 'if', 'they', 'could', 'break', 'into', 'his', 'advisor', 'tenali', 'raman', '’', 's', 'house', 'and', 'steal', 'valuables', 'from', 'there', '.'], ['the', 'thieves', 'agreed', 'for', 'the', 'challenge.that', 'same', 'night', 'the', 'two', 'thieves', 'went', 'to', 'tenali', 'raman', '’', 's', 'house', 'and', 'hid', 'behind', 'some', 'bushes', '.'], ['after', 'dinner', ',', 'when', 't
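
`word_text` is a list of lists, one token list per sentence. When a single flat list is more convenient (for example, to filter stop words in one pass), the nested list can be flattened with `itertools.chain`. The short `nested` list below is a hand-copied excerpt of the output above, used so the snippet runs on its own.

```python
from itertools import chain

# Excerpt of word_text: the end of sentence 1 and the start of sentence 2.
nested = [
    ["asked", "for", "his", "mercy", "."],
    ["they", "told", "him"],
]

flat = list(chain.from_iterable(nested))
print(flat)
```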

# 5. Stop word removal

In [6]:
from nltk.corpus import stopwords

stopwords_en = set(stopwords.words('english'))

print(stopwords_en)

       

{'out', 'to', 'couldn', 'each', 'because', 'ours', 'through', 'about', 'some', 'ain', 'herself', 'down', 'him', 'now', 'here', 'have', 'while', 'how', 'any', 'her', 'of', 'll', 'myself', "doesn't", 'm', 't', 'yourselves', 'she', 'does', 'yourself', 're', 'my', 'haven', 'but', "she's", "mightn't", 'so', 'our', 'was', 'very', 'for', 'both', 'an', 'during', 'hasn', 'no', 'shouldn', 'by', "didn't", "that'll", 'had', 'aren', 'same', 'most', 'ourselves', 'wasn', 'there', 'they', 'needn', 'just', "mustn't", 'theirs', 'are', 'with', 'what', "aren't", 'i', 'itself', 'he', 'this', 'and', 'can', "should've", 'such', 'hadn', "you've", 'up', 'mightn', 'shan', 'y', "shan't", "isn't", "couldn't", 'not', 'being', 's', 'am', 'has', 'after', 'above', 'be', 'over', 'those', 'in', 'we', 'hers', 'nor', 'should', 'doesn', 'will', 'won', 'than', 'until', 'ma', 'these', "don't", 'when', 'further', 'his', 'which', 'their', 'from', 'your', 'all', 'before', 'below', 'once', "weren't", "it's", 'doing', "won't", '

# 6. Eliminate stop words

In [8]:
# equivalent list comprehension (over the flat token list, because word_text is a list of lists):
# word_text_filtered = [word for word in word_tokenize(text) if word not in stopwords_en]   # store only words not in stopwords_en

word_text_filtered = []
for w in word_tokenize(text):
    if w not in stopwords_en:
        word_text_filtered.append(w)
        
print(word_text_filtered)

# to customize your own stop words, initialize a list and use it to filter

['king', 'krishnadevaraya', 'gone', 'survey', 'jail', ',', 'two', 'burglars', 'prisoners', ',', 'asked', 'mercy', '.', 'told', 'experts', 'burglary', 'could', 'help', 'king', 'catching', 'thieves.the', 'king', 'kind', 'ruler', 'asked', 'guards', 'release', 'condition', '.', 'told', 'burglars', 'would', 'release', 'appoint', 'spies', 'could', 'break', 'advisor', 'tenali', 'raman', '’', 'house', 'steal', 'valuables', '.', 'thieves', 'agreed', 'challenge.that', 'night', 'two', 'thieves', 'went', 'tenali', 'raman', '’', 'house', 'hid', 'behind', 'bushes', '.', 'dinner', ',', 'tenali', 'raman', 'came', 'stroll', ',', 'heard', 'rustling', 'bushes', '.', 'perceived', 'existence', 'thieves', 'garden', '.']
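
The filtered list still contains punctuation tokens and the stray `’`. As the comment in the cell suggests, a custom stop list can handle this. A minimal sketch, assuming a hand-picked `custom_stopwords` set and a short excerpt of the tokens above (both are illustrative, not part of NLTK):

```python
import string

# Excerpt of word_text_filtered, copied by hand from the output above.
tokens = ["tenali", "raman", "’", "house", "steal", "valuables", ".",
          "dinner", ",", "could", "would"]

# Assumed extras: every ASCII punctuation mark, the curly apostrophe,
# and two modal verbs that survived the stop list in the output above.
custom_stopwords = set(string.punctuation) | {"’", "could", "would"}

cleaned = [t for t in tokens if t not in custom_stopwords]
print(cleaned)
```

The same set can be combined with `stopwords_en` using `|` so that one membership test covers both lists.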


# Stop words and filtering combined code
from nltk.corpus import stopwords
stopwords_en = set(stopwords.words('english'))
print(stopwords_en)
word_text_filtered = []
for w in word_tokenize(text):
    if w not in stopwords_en:
        word_text_filtered.append(w)
print(word_text_filtered)

# 7. Stemming

The stemmer works on one word at a time, so every word has to be passed to it.

In [11]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in word_text_filtered:
    print(stemmer.stem(word))

king
krishnadevaraya
gone
survey
jail
,
two
burglar
prison
,
ask
merci
.
told
expert
burglari
could
help
king
catch
thieves.th
king
kind
ruler
ask
guard
releas
condit
.
told
burglar
would
releas
appoint
spi
could
break
advisor
tenali
raman
’
hous
steal
valuabl
.
thiev
agre
challenge.that
night
two
thiev
went
tenali
raman
’
hous
hid
behind
bush
.
dinner
,
tenali
raman
came
stroll
,
heard
rustl
bush
.
perceiv
exist
thiev
garden
.


# 8. Lemmatization

In [12]:
# if lemmatization is not working, run nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# instantiate lemmatizer
lemmatizer = WordNetLemmatizer()

for word in word_text_filtered:
    print(lemmatizer.lemmatize(word))

king
krishnadevaraya
gone
survey
jail
,
two
burglar
prisoner
,
asked
mercy
.
told
expert
burglary
could
help
king
catching
thieves.the
king
kind
ruler
asked
guard
release
condition
.
told
burglar
would
release
appoint
spy
could
break
advisor
tenali
raman
’
house
steal
valuable
.
thief
agreed
challenge.that
night
two
thief
went
tenali
raman
’
house
hid
behind
bush
.
dinner
,
tenali
raman
came
stroll
,
heard
rustling
bush
.
perceived
existence
thief
garden
.


Notice how lemmatization returns the correct full root, e.g. 'thief', where stemming returns 'thiev'.

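# 9. POS Tagging

It's best to use the lemmatized words, so save the lemmatizer output to a list first. The instructor was having issues with that and used the original text instead, as below. Note that tagging the lowercased text mislabels some words, e.g. 'king' comes out as VBG in the output.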

In [13]:
# nltk.download('averaged_perceptron_tagger')
tagged_text = nltk.pos_tag(word_tokenize(text))

print(tagged_text)

[('once', 'RB'), ('when', 'WRB'), ('king', 'VBG'), ('krishnadevaraya', 'NN'), ('had', 'VBD'), ('gone', 'VBN'), ('to', 'TO'), ('survey', 'NN'), ('the', 'DT'), ('jail', 'NN'), (',', ','), ('two', 'CD'), ('burglars', 'NNS'), ('who', 'WP'), ('were', 'VBD'), ('prisoners', 'NNS'), ('there', 'RB'), (',', ','), ('asked', 'VBD'), ('for', 'IN'), ('his', 'PRP$'), ('mercy', 'NN'), ('.', '.'), ('they', 'PRP'), ('told', 'VBD'), ('him', 'PRP'), ('that', 'IN'), ('they', 'PRP'), ('were', 'VBD'), ('experts', 'NNS'), ('at', 'IN'), ('burglary', 'NN'), ('and', 'CC'), ('could', 'MD'), ('help', 'VB'), ('the', 'DT'), ('king', 'NN'), ('in', 'IN'), ('catching', 'VBG'), ('other', 'JJ'), ('thieves.the', 'NN'), ('king', 'VBG'), ('being', 'VBG'), ('a', 'DT'), ('kind', 'NN'), ('ruler', 'NN'), ('asked', 'VBD'), ('his', 'PRP$'), ('guards', 'NNS'), ('to', 'TO'), ('release', 'VB'), ('them', 'PRP'), ('but', 'CC'), ('with', 'IN'), ('a', 'DT'), ('condition', 'NN'), ('.', '.'), ('he', 'PRP'), ('told', 'VBD'), ('the', 'DT'),
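
The tags in this output follow the Penn Treebank convention (`NN` noun, `VBD` past-tense verb, `JJ` adjective, `RB` adverb, and so on), while `WordNetLemmatizer.lemmatize` takes a one-letter `pos` argument (`'n'`, `'v'`, `'a'`, `'r'`). A small mapping function can bridge the two. `penn_to_wordnet` is a hypothetical helper, not an NLTK function, and falling back to noun for unmapped tags is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter POS code WordNet expects."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun, and the fallback for everything else

# A few (word, tag) pairs copied from the output above.
tagged = [("asked", "VBD"), ("burglars", "NNS"), ("other", "JJ"), ("once", "RB")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, using the `lemmatizer` instance from the lemmatization cell.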

# 10. Named Entity Recognition

In [14]:
text3 = 'Sundar is the CEO of Google which is an American company.'

# tokenize
tokenised = nltk.word_tokenize(text3)

# POS tag
tagged_text3 = nltk.pos_tag(tokenised)

ne_chunked = nltk.ne_chunk(tagged_text3)     # see chunking and chinking
# print(ne_chunked)

# iterate over the tree and collect each named-entity subtree with its label
named_entities = []
for tagged_tree in ne_chunked:
    if hasattr(tagged_tree, 'label'):   # plain (word, tag) tuples have no label
        entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
        entity_type = tagged_tree.label()
        named_entities.append((entity_name, entity_type))
        
print(named_entities)

[('Sundar', 'GPE'), ('CEO of Google', 'ORGANIZATION'), ('American', 'GPE')]


Notice how it inaccurately tagged 'Sundar' as 'GPE' instead of 'PERSON'. To improve the tagging accuracy, use other options besides nltk.

# End of Text processing section

If you have a large text or observe that your texts are not tagged properly, use other solutions such as spaCy instead of nltk, or create chunks.

# Chunking 

Chunking extracts noun phrases (noun/adjective) or verb phrases (pronoun/verb/adverb) before the text is passed to NER, which makes it easier to identify the NER tags.

Example tag: VBG (verb, gerund/present participle), e.g. helping

# Chinking

Chinking is the process of removing a sequence of unwanted tokens from a chunk. If the sequence spans the entire chunk, then the entire chunk is removed.
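
Both ideas can be tried with NLTK's `RegexpParser`, which applies tag patterns to an already tagged sentence. The sketch below is illustrative: the tags are typed in by hand rather than produced by `nltk.pos_tag`, and the two grammars are assumptions chosen for this sentence. In a grammar, `{...}` marks a chunk rule and `}...{` marks a chink rule.

```python
from nltk import RegexpParser

# Hand-tagged for illustration; in practice this comes from nltk.pos_tag.
tagged = [("Sundar", "NNP"), ("is", "VBZ"), ("the", "DT"),
          ("CEO", "NNP"), ("of", "IN"), ("Google", "NNP")]

def noun_phrases(grammar, sentence):
    """Parse the tagged sentence and return the text of every NP chunk."""
    tree = RegexpParser(grammar).parse(sentence)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]

# Chunking: optional determiner, any adjectives, then one or more nouns.
chunk_grammar = "NP: {<DT>?<JJ>*<NN.*>+}"

# Chinking: chunk everything, then cut verbs and prepositions back out.
chink_grammar = r"""
NP:
    {<.*>+}
    }<VBZ|IN>+{
"""

chunked = noun_phrases(chunk_grammar, tagged)
chinked = noun_phrases(chink_grammar, tagged)
print(chunked)
print(chinked)
```

For this sentence both grammars should group `Sundar`, `the CEO`, and `Google`: the first builds the chunks up from a pattern, the second starts from one large chunk and removes what does not belong.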

A chink is the part that has to be removed from a chunk. You need to 