# 11.14 Text Mining - Processing

Text is unstructured data, so it has to be processed before it can be analyzed.

Steps:

- Import the nltk library
- Perform tokenization (breaking the text into sentences and words)
- Perform stemming and lemmatization
- Remove stopwords
- Perform Named Entity Recognition
        
Other terms:

- N-grams: contiguous sequences of n words; n-gram language models use them to assign probabilities to sequences of words and sentences
- Stop word removal: stop words are natural-language words that carry little meaning on their own. You can find a stop word list for each supported language in nltk.corpus
- Stemming: reduce a word to its stem (base or root form). Stemming algorithms: Porter, Lancaster, Snowball
- Lemmatization: group the inflected forms of a word so they can be analyzed as one item. Context matters, e.g. saw (the cutting tool) versus saw (past tense of see)
- POS tagging: label each token with its part of speech
- Information extraction: NER (Named Entity Recognition)
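
To make the n-gram idea concrete, here is a minimal sketch that builds bigrams from a short token list with the standard library only and estimates one conditional probability from their counts. The sample sentence and the helper name `make_ngrams` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, chosen only for illustration.
tokens = "the cat sat on the mat and the cat slept".split()

bigrams = make_ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

# Relative-frequency estimate of P(cat | the):
# count(the, cat) divided by the count of all bigrams that start with "the".
the_total = sum(c for (first, _), c in bigram_counts.items() if first == "the")
p_cat_given_the = bigram_counts[("the", "cat")] / the_total

print(bigrams[:3])
print(p_cat_given_the)
```

This is the counting step behind a simple n-gram language model; real models usually add smoothing so that unseen n-grams do not get probability zero.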


In [1]:
# Let's import the nltk library and read the file
import nltk

In [2]:
with open('brown_corpus_ca10.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')  # remove newlines; note they are dropped, not replaced with a space
    
# strip the '/np' (proper noun) tags so that names print without a tag suffix
# (this also turns '/nps' into 's' and '/np-tl' into '-tl')
data2 = data.replace('/np', '')

In [3]:
for i, line in enumerate(data2.split('\n')):
    if i > 10:
        break
    print(str(i) + ':\t' + line)

0:		Vincent G. Ierulli has/hvz been/ben appointed/vbn temporary/jj assistant/nn district/nn attorney/nn ,/, it/pps was/bedz announced/vbn Monday/nr by/in Charles E. Raymond ,/, District/nn-tl Attorney/nn-tl ./.	Ierulli will/md replace/vb Desmond D. Connall who/wps has/hvz been/ben called/vbn to/in active/jj military/jj service/nn but/cc is/bez expected/vbn back/rb on/in the/at job/nn by/in March 31/cd ./.	Ierulli ,/, 29/cd ,/, has/hvz been/ben practicing/vbg in/in Portland since/in November ,/, 1959/cd ./.He/pps is/bez a/at graduate/nn of/in Portland-tl University/nn-tl and/cc the/at Northwestern/jj-tl College/nn-tl of/in-tl Law/nn-tl ./.He/pps is/bez married/vbn and/cc the/at father/nn of/in three/cd children/nns ./.	Helping/vbg foreign/jj countries/nns to/to build/vb a/at sound/jj political/jj structure/nn is/bez more/ql important/jj than/cs aiding/vbg them/ppo economically/rb ,/, E. M. Martin ,/, assistant/nn secretary/nn of/in state/nn for/in economic/jj affairs/nns told/vbd member
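
The corpus file stores each token as `word/tag`, which is why the tags show up in every later output. As a rough sketch, the tokens can be split into (word, tag) pairs with plain string methods. The helper name `split_tagged_token` is hypothetical, and the sample is a shortened sentence from the output above.

```python
def split_tagged_token(token):
    """Split 'word/tag' on the last '/'; return (token, None) if there is no tag."""
    word, sep, tag = token.rpartition('/')
    if not sep:
        return token, None
    return word, tag

# Shortened sample in the same format as the data printed above.
sample = "Ierulli will/md replace/vb Desmond D. Connall ./."

pairs = [split_tagged_token(tok) for tok in sample.split()]
words = [word for word, _ in pairs]

print(pairs)
print(words)
```

Splitting on the last `/` rather than the first keeps tokens such as `./.` intact: the word is `.` and the tag is `.`.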

### Tokenization

In [4]:
from nltk import sent_tokenize, word_tokenize
sent_tokenize(data2)

['\tVincent G. Ierulli has/hvz been/ben appointed/vbn temporary/jj assistant/nn district/nn attorney/nn ,/, it/pps was/bedz announced/vbn Monday/nr by/in Charles E. Raymond ,/, District/nn-tl Attorney/nn-tl ./.',
 'Ierulli will/md replace/vb Desmond D. Connall who/wps has/hvz been/ben called/vbn to/in active/jj military/jj service/nn but/cc is/bez expected/vbn back/rb on/in the/at job/nn by/in March 31/cd ./.',
 'Ierulli ,/, 29/cd ,/, has/hvz been/ben practicing/vbg in/in Portland since/in November ,/, 1959/cd ./.He/pps is/bez a/at graduate/nn of/in Portland-tl University/nn-tl and/cc the/at Northwestern/jj-tl College/nn-tl of/in-tl Law/nn-tl ./.He/pps is/bez married/vbn and/cc the/at father/nn of/in three/cd children/nns ./.',
 'Helping/vbg foreign/jj countries/nns to/to build/vb a/at sound/jj political/jj structure/nn is/bez more/ql important/jj than/cs aiding/vbg them/ppo economically/rb ,/, E. M. Martin ,/, assistant/nn secretary/nn of/in state/nn for/in economic/jj affairs/nns tol

Similarly, applying the word tokenizer to each sentence yields the individual words/tokens.

In [5]:
for sent in sent_tokenize(data2):
    print(word_tokenize(sent))

['Vincent', 'G.', 'Ierulli', 'has/hvz', 'been/ben', 'appointed/vbn', 'temporary/jj', 'assistant/nn', 'district/nn', 'attorney/nn', ',', '/', ',', 'it/pps', 'was/bedz', 'announced/vbn', 'Monday/nr', 'by/in', 'Charles', 'E.', 'Raymond', ',', '/', ',', 'District/nn-tl', 'Attorney/nn-tl', './', '.']
['Ierulli', 'will/md', 'replace/vb', 'Desmond', 'D.', 'Connall', 'who/wps', 'has/hvz', 'been/ben', 'called/vbn', 'to/in', 'active/jj', 'military/jj', 'service/nn', 'but/cc', 'is/bez', 'expected/vbn', 'back/rb', 'on/in', 'the/at', 'job/nn', 'by/in', 'March', '31/cd', './', '.']
['Ierulli', ',', '/', ',', '29/cd', ',', '/', ',', 'has/hvz', 'been/ben', 'practicing/vbg', 'in/in', 'Portland', 'since/in', 'November', ',', '/', ',', '1959/cd', './.He/pps', 'is/bez', 'a/at', 'graduate/nn', 'of/in', 'Portland-tl', 'University/nn-tl', 'and/cc', 'the/at', 'Northwestern/jj-tl', 'College/nn-tl', 'of/in-tl', 'Law/nn-tl', './.He/pps', 'is/bez', 'married/vbn', 'and/cc', 'the/at', 'father/nn', 'of/in', 'three/c
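
Tokens such as `./.He/pps` in the output above are two tokens glued together. One plausible cause, assuming the source file had a line break at that point, is that newlines were removed outright instead of being replaced by a space. The sketch below contrasts the two approaches on a made-up two-line string.

```python
# Hypothetical two-line excerpt; the line break position is an assumption.
raw = "1959/cd ./.\nHe/pps is/bez a/at graduate/nn"

# Dropping the newline joins the last token of one line to the first of the next.
glued_tokens = raw.replace('\n', '').split()

# Collapsing all whitespace (including newlines) to single spaces keeps them apart.
spaced_tokens = ' '.join(raw.split()).split()

print(glued_tokens)
print(spaced_tokens)
```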

### Stopwords removal

nltk has its own list of stopwords. We can check the list using stopwords.words()

In [6]:
# import the stopwords corpus reader
from nltk.corpus import stopwords
stopwords_en = stopwords.words('english')

Lowercase every word token so that it can be compared with the lowercase stopword list

In [7]:
single_tokenized_lowered = list(map(str.lower, word_tokenize(data2)))
print(single_tokenized_lowered)

['vincent', 'g.', 'ierulli', 'has/hvz', 'been/ben', 'appointed/vbn', 'temporary/jj', 'assistant/nn', 'district/nn', 'attorney/nn', ',', '/', ',', 'it/pps', 'was/bedz', 'announced/vbn', 'monday/nr', 'by/in', 'charles', 'e.', 'raymond', ',', '/', ',', 'district/nn-tl', 'attorney/nn-tl', './', '.', 'ierulli', 'will/md', 'replace/vb', 'desmond', 'd.', 'connall', 'who/wps', 'has/hvz', 'been/ben', 'called/vbn', 'to/in', 'active/jj', 'military/jj', 'service/nn', 'but/cc', 'is/bez', 'expected/vbn', 'back/rb', 'on/in', 'the/at', 'job/nn', 'by/in', 'march', '31/cd', './', '.', 'ierulli', ',', '/', ',', '29/cd', ',', '/', ',', 'has/hvz', 'been/ben', 'practicing/vbg', 'in/in', 'portland', 'since/in', 'november', ',', '/', ',', '1959/cd', './.he/pps', 'is/bez', 'a/at', 'graduate/nn', 'of/in', 'portland-tl', 'university/nn-tl', 'and/cc', 'the/at', 'northwestern/jj-tl', 'college/nn-tl', 'of/in-tl', 'law/nn-tl', './.he/pps', 'is/bez', 'married/vbn', 'and/cc', 'the/at', 'father/nn', 'of/in', 'three/cd'

Let's remove the stopwords using the English stopword list in nltk.

We will use a set because membership tests are faster on a set than on a list in Python.

In [8]:
stopwords_en = set(stopwords.words('english'))

print([word for word in single_tokenized_lowered if word not in stopwords_en])

['vincent', 'g.', 'ierulli', 'has/hvz', 'been/ben', 'appointed/vbn', 'temporary/jj', 'assistant/nn', 'district/nn', 'attorney/nn', ',', '/', ',', 'it/pps', 'was/bedz', 'announced/vbn', 'monday/nr', 'by/in', 'charles', 'e.', 'raymond', ',', '/', ',', 'district/nn-tl', 'attorney/nn-tl', './', '.', 'ierulli', 'will/md', 'replace/vb', 'desmond', 'd.', 'connall', 'who/wps', 'has/hvz', 'been/ben', 'called/vbn', 'to/in', 'active/jj', 'military/jj', 'service/nn', 'but/cc', 'is/bez', 'expected/vbn', 'back/rb', 'on/in', 'the/at', 'job/nn', 'by/in', 'march', '31/cd', './', '.', 'ierulli', ',', '/', ',', '29/cd', ',', '/', ',', 'has/hvz', 'been/ben', 'practicing/vbg', 'in/in', 'portland', 'since/in', 'november', ',', '/', ',', '1959/cd', './.he/pps', 'is/bez', 'a/at', 'graduate/nn', 'of/in', 'portland-tl', 'university/nn-tl', 'and/cc', 'the/at', 'northwestern/jj-tl', 'college/nn-tl', 'of/in-tl', 'law/nn-tl', './.he/pps', 'is/bez', 'married/vbn', 'and/cc', 'the/at', 'father/nn', 'of/in', 'three/cd'

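Removing the stopwords gives the list above. The visible portion is unchanged because almost every token still carries its tag suffix (e.g. 'has/hvz'), so it never matches an entry in the stopword list.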
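
For the stopword filter to have an effect on this data, the tag suffix has to be stripped before the comparison. The sketch below shows the difference on a handful of tokens from the output above. The small `demo_stopwords` set and the `strip_tag` helper are assumptions made for illustration; in the notebook the comparison would use `stopwords_en`.

```python
# Tiny illustrative stop word set (a stand-in for the nltk list).
demo_stopwords = {"has", "been", "it", "was", "by", "the", "is", "a", "of", "and"}

tagged_tokens = ['ierulli', 'has/hvz', 'been/ben', 'appointed/vbn', 'temporary/jj',
                 'assistant/nn', 'district/nn', 'attorney/nn']

def strip_tag(token):
    """Drop a trailing '/tag' suffix; leave untagged tokens and a bare '/' alone."""
    word, sep, _ = token.rpartition('/')
    return word if sep and word else token

# Filtering the raw tokens removes nothing, because 'has/hvz' != 'has'.
kept_raw = [t for t in tagged_tokens if t not in demo_stopwords]

# Filtering after stripping the tags removes 'has' and 'been'.
kept_clean = [w for w in map(strip_tag, tagged_tokens) if w not in demo_stopwords]

print(kept_raw)
print(kept_clean)
```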

Often you also want to remove the punctuation from the documents. Since Python comes with batteries included, the standard library already provides string.punctuation:

In [9]:
from string import punctuation
print('From string.punctuation:', type(punctuation), punctuation)

From string.punctuation: <class 'str'> !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Combining the punctuation with the stopwords from nltk

In [10]:
stopwords_en_withpunc = stopwords_en.union(set(punctuation))
print(stopwords_en_withpunc)

{'at', 'about', "don't", 'ain', 'will', 'isn', "isn't", 'a', "that'll", 'yourself', 'again', "should've", 'yourselves', 'those', 'as', '=', 'own', 'then', 'only', "couldn't", 'was', 'from', 'once', 'where', 'are', '[', 'which', "mustn't", 'do', 'of', 'such', 'above', 'd', '\\', 'did', 'y', 'when', 'him', '<', '>', 'didn', '&', 'does', 'over', ']', 'there', 'its', "aren't", 'no', 'who', 'nor', "'", '#', 'because', 'during', "you'd", 'they', "shan't", 'if', 'i', 'few', '%', 'being', 'we', 'than', "haven't", '^', 'had', '?', 'should', '_', 'doing', '(', 'am', 'hers', 'same', 'by', "hadn't", 'before', 'm', 're', "hasn't", 'shan', '-', 's', 'an', "wasn't", 'won', 'myself', "won't", 'my', 'be', 'on', 'it', 'having', '@', '/', 'don', 'their', 'very', 'aren', 'll', '!', 'his', 'theirs', 'have', 'herself', '|', 'up', 'been', 'not', 'why', 'what', '+', 'were', ':', 'under', "doesn't", 'with', 'our', 'your', "shouldn't", 'me', '$', 'between', 'any', 'now', 'shouldn', 'each', 'out', 'the', 'has', 

Removing stopwords and punctuation

In [11]:
print([word for word in single_tokenized_lowered if word not in stopwords_en_withpunc])

['vincent', 'g.', 'ierulli', 'has/hvz', 'been/ben', 'appointed/vbn', 'temporary/jj', 'assistant/nn', 'district/nn', 'attorney/nn', 'it/pps', 'was/bedz', 'announced/vbn', 'monday/nr', 'by/in', 'charles', 'e.', 'raymond', 'district/nn-tl', 'attorney/nn-tl', './', 'ierulli', 'will/md', 'replace/vb', 'desmond', 'd.', 'connall', 'who/wps', 'has/hvz', 'been/ben', 'called/vbn', 'to/in', 'active/jj', 'military/jj', 'service/nn', 'but/cc', 'is/bez', 'expected/vbn', 'back/rb', 'on/in', 'the/at', 'job/nn', 'by/in', 'march', '31/cd', './', 'ierulli', '29/cd', 'has/hvz', 'been/ben', 'practicing/vbg', 'in/in', 'portland', 'since/in', 'november', '1959/cd', './.he/pps', 'is/bez', 'a/at', 'graduate/nn', 'of/in', 'portland-tl', 'university/nn-tl', 'and/cc', 'the/at', 'northwestern/jj-tl', 'college/nn-tl', 'of/in-tl', 'law/nn-tl', './.he/pps', 'is/bez', 'married/vbn', 'and/cc', 'the/at', 'father/nn', 'of/in', 'three/cd', 'children/nns', './', 'helping/vbg', 'foreign/jj', 'countries/nns', 'to/to', 'build
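
Note that `./` survives in the output above: the combined set holds single punctuation characters, so only tokens that are exactly one such character are removed. A possible refinement, sketched here with a hypothetical helper `is_punctuation`, is to drop any token made up entirely of punctuation characters.

```python
from string import punctuation

punct_chars = set(punctuation)

def is_punctuation(token):
    """True when the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in punct_chars for ch in token)

# A few tokens in the style of the output above.
tokens = ['attorney/nn-tl', './', '.', 'ierulli', ',', '/', ',', '29/cd', '``', "''"]

kept = [t for t in tokens if not is_punctuation(t)]
print(kept)
```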

### Stemming and Lemmatization

We will use stemming and lemmatization to reduce words to their root form.

In [12]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

# printing the stem words
for word in single_tokenized_lowered:
    print(porter.stem(word))

vincent
g.
ierulli
has/hvz
been/ben
appointed/vbn
temporary/jj
assistant/nn
district/nn
attorney/nn
,
/
,
it/pp
was/bedz
announced/vbn
monday/nr
by/in
charl
e.
raymond
,
/
,
district/nn-tl
attorney/nn-tl
./
.
ierulli
will/md
replace/vb
desmond
d.
connal
who/wp
has/hvz
been/ben
called/vbn
to/in
active/jj
military/jj
service/nn
but/cc
is/bez
expected/vbn
back/rb
on/in
the/at
job/nn
by/in
march
31/cd
./
.
ierulli
,
/
,
29/cd
,
/
,
has/hvz
been/ben
practicing/vbg
in/in
portland
since/in
novemb
,
/
,
1959/cd
./.he/pp
is/bez
a/at
graduate/nn
of/in
portland-tl
university/nn-tl
and/cc
the/at
northwestern/jj-tl
college/nn-tl
of/in-tl
law/nn-tl
./.he/pp
is/bez
married/vbn
and/cc
the/at
father/nn
of/in
three/cd
children/nn
./
.
helping/vbg
foreign/jj
countries/nn
to/to
build/vb
a/at
sound/jj
political/jj
structure/nn
is/bez
more/ql
important/jj
than/c
aiding/vbg
them/ppo
economically/rb
,
/
,
e.
m.
martin
,
/
,
assistant/nn
secretary/nn
of/in
state/nn
for/in
economic/jj
affairs/nn
told/vbd
member

fundamentalism/nn
in/in
these/dt
modern/jj
days/nn
and/cc
has/hvz
,
/
,
without/in
compromise/nn
,
/
,
stood/vbd
for/in
the/at
great/jj
truths/nn
of/in
the/at
bibl
for/in
which/wdt
men/nn
in/in
the/at
past/nn
have/hv
been/ben
willing/jj
to/to
give/vb
their/pp
$
lives/nn
``
/
''
./.new/jj-hl
point/nn-hl
added/vbn-hl
many/ap
changes/nn
involved/vbd
minor/jj
editing/nn
and/cc
clarification/nn
;
/
.
;
/.however/wrb
,
/
,
the/at
first/od
belief/nn
stood/vbd
for/in
entire/jj
revision/nn
with/in
a/at
new/jj
third/od
point/nn
added/vbn
to/in
the/at
list/nn
./
.
the/at
first/od
of/in
16/cd
beliefs/nn
of/in
the/at
denomination/nn
,
/
,
now/rb
reads/vbz
:
/
:
``
/
``
the/at
scriptures/nn
,
/
,
both/abx
old/jj-tl
and/cc
new/jj-tl
testament/nn-tl
,
/
,
are/b
verbally/rb
inspired/vbn
of/in
god
and/cc
are/b
the/at
revelation/nn
of/in
god
to/in
man/nn
,
/
,
the/at
infallible/jj
,
/
,
authoritative/jj
rule/nn
of/in
faith/nn
and/cc
conduct/nn
``
/
''
./
.
the/at
third/od
belief/nn
,
/
,
in/in
six/cd
poi
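
Most tokens above pass through the stemmer unchanged because of the attached tags; only untagged words such as 'charles' and 'november' are visibly stemmed. As a small sketch, the three stemmers named earlier can be compared on clean, untagged words. The word list is an illustrative assumption; none of these stemmers needs an extra nltk data download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

# Illustrative words without tag suffixes.
words = ["charles", "november", "running", "countries", "economically"]

# Stem every word with every algorithm so the results can be compared side by side.
stems = {name: [stemmer.stem(w) for w in words] for name, stemmer in stemmers.items()}

for name, result in stems.items():
    print(f"{name:10s} {result}")
```

The stems are not always dictionary words (Porter turns 'november' into 'novemb', as in the output above), which is the main practical difference from lemmatization.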

In [13]:
# import the wordnet lemmatizer from nltk.stem
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [14]:
for word in single_tokenized_lowered:
    print(wnl.lemmatize(word))

vincent
g.
ierulli
has/hvz
been/ben
appointed/vbn
temporary/jj
assistant/nn
district/nn
attorney/nn
,
/
,
it/pps
was/bedz
announced/vbn
monday/nr
by/in
charles
e.
raymond
,
/
,
district/nn-tl
attorney/nn-tl
./
.
ierulli
will/md
replace/vb
desmond
d.
connall
who/wps
has/hvz
been/ben
called/vbn
to/in
active/jj
military/jj
service/nn
but/cc
is/bez
expected/vbn
back/rb
on/in
the/at
job/nn
by/in
march
31/cd
./
.
ierulli
,
/
,
29/cd
,
/
,
has/hvz
been/ben
practicing/vbg
in/in
portland
since/in
november
,
/
,
1959/cd
./.he/pps
is/bez
a/at
graduate/nn
of/in
portland-tl
university/nn-tl
and/cc
the/at
northwestern/jj-tl
college/nn-tl
of/in-tl
law/nn-tl
./.he/pps
is/bez
married/vbn
and/cc
the/at
father/nn
of/in
three/cd
children/nns
./
.
helping/vbg
foreign/jj
countries/nns
to/to
build/vb
a/at
sound/jj
political/jj
structure/nn
is/bez
more/ql
important/jj
than/cs
aiding/vbg
them/ppo
economically/rb
,
/
,
e.
m.
martin
,
/
,
assistant/nn
secretary/nn
of/in
state/nn
for/in
economic/jj
affairs/nns
to
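
`WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, and the tagged tokens above are left unchanged in any case. One way to use the tags that are already in the data is to map each Brown tag to a WordNet part-of-speech letter. The mapping below is a simplified assumption covering only the most common tag prefixes, and `brown_to_wordnet_pos` is a hypothetical helper name.

```python
def brown_to_wordnet_pos(brown_tag):
    """Map a Brown corpus tag to a WordNet POS letter: 'n', 'v', 'a' or 'r'."""
    tag = brown_tag.lower()
    # vb* are verbs; be*, hv* and do* are the forms of be, have and do.
    if tag.startswith(("vb", "be", "hv", "do")):
        return "v"
    if tag.startswith("jj"):
        return "a"  # adjectives
    if tag.startswith("rb"):
        return "r"  # adverbs
    return "n"      # default to noun, like the lemmatizer itself

for tag in ["vbn", "bedz", "jj-tl", "rb", "nns", "cd"]:
    print(tag, "->", brown_to_wordnet_pos(tag))
```

With the WordNet data installed, the result could then be passed along as `wnl.lemmatize(word, pos=brown_to_wordnet_pos(tag))` after splitting each token into its word and tag.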

We also need to determine the POS tag for each token.

### POS tagging

In [15]:
stop_words = set(stopwords.words('english'))
tokenized = sent_tokenize(data2)
for i in tokenized:
    wordlist = nltk.word_tokenize(i)
    # drop the stopwords before tagging
    wordlist = [w for w in wordlist if w not in stop_words]
    # now tag the remaining tokens with their POS tags
    tagged = nltk.pos_tag(wordlist)
    
    print(tagged)

[('Vincent', 'NNP'), ('G.', 'NNP'), ('Ierulli', 'NNP'), ('has/hvz', 'NN'), ('been/ben', 'NN'), ('appointed/vbn', 'NN'), ('temporary/jj', 'NN'), ('assistant/nn', 'NN'), ('district/nn', 'NN'), ('attorney/nn', 'NN'), (',', ','), ('/', 'NNP'), (',', ','), ('it/pps', 'NN'), ('was/bedz', 'NN'), ('announced/vbn', 'NN'), ('Monday/nr', 'NNP'), ('by/in', 'NN'), ('Charles', 'NNP'), ('E.', 'NNP'), ('Raymond', 'NNP'), (',', ','), ('/', 'NNP'), (',', ','), ('District/nn-tl', 'NNP'), ('Attorney/nn-tl', 'NNP'), ('./', 'NNP'), ('.', '.')]
[('Ierulli', 'NNP'), ('will/md', 'NN'), ('replace/vb', 'NN'), ('Desmond', 'NNP'), ('D.', 'NNP'), ('Connall', 'NNP'), ('who/wps', 'VBZ'), ('has/hvz', 'JJ'), ('been/ben', 'NN'), ('called/vbn', 'NN'), ('to/in', 'NN'), ('active/jj', 'NN'), ('military/jj', 'NN'), ('service/nn', 'NN'), ('but/cc', 'NN'), ('is/bez', 'NN'), ('expected/vbn', 'NN'), ('back/rb', 'NN'), ('on/in', 'NN'), ('the/at', 'NN'), ('job/nn', 'NN'), ('by/in', 'IN'), ('March', 'NNP'), ('31/cd', 'CD'), ('./', 

[('Denials/nns', 'NNP'), ('were/bed', 'VBD'), ('of/in', 'JJ'), ('motions/nns', 'NN'), ('of/in', 'NN'), ('dismissal/nn', 'NN'), (',', ','), ('/', 'NNP'), (',', ','), ('continuance/nn', 'NN'), (',', ','), ('/', 'NNP'), (',', ','), ('mistrial/nn', 'NN'), (',', ','), ('/', 'NNP'), (',', ','), ('separate/jj', 'NN'), ('trial/nn', 'NN'), (',', ','), ('/', 'NNP'), (',', ','), ('acquittal/nn', 'NN'), (',', ','), ('/', 'NNP'), (',', ','), ('striking/nn', 'NN'), ('of/in', 'NN'), ('testimony/nn', 'NN'), ('and/cc', 'NN'), ('directed/vbn', 'NN'), ('verdict/nn', 'NN'), ('./', 'NNP'), ('.', '.')]
[('In/in', 'NNP'), ('denying/vbg', 'NN'), ('motions/nns', 'NN'), ('for/in', 'NN'), ('dismissal/nn', 'NN'), (',', ','), ('/', 'NNP'), (',', ','), ('Judge/nn-tl', 'NNP'), ('Powell', 'NNP'), ('stated/vbd', 'VBZ'), ('that/cs', 'JJ'), ('mass/nn', 'NN'), ('trials/nns', 'NN'), ('have/hv', 'NN'), ('been/ben', 'NN'), ('upheld/vbn', 'JJ'), ('as/cs', 'NN'), ('proper/jj', 'NN'), ('in/in', 'NN'), ('other/ap', 'NN'), ('cou

For the final text processing task, we will apply Named Entity Recognition (NER).

In [16]:
sentences = nltk.sent_tokenize(data2)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
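# binary=True labels every named entity chunk simply as 'NE'
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    """Recursively collect the text of every 'NE' chunk in tree t."""
    entity_names = []
    
    # leaves are (word, tag) tuples without a label(); only subtrees have one
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            entity_names.append(' '.join(child[0] for child in t))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names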

entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))
    
print(set(entity_names))    
        

{'Schwab', 'Portland', 'Eugene', 'James Culbertson', 'Frank Lee', 'Martin', 'Roosevelt', 'Charles', 'Sexton', 'Blaine Whipple', 'Richard Mears', 'Miles', 'Donald Huffman', 'James', 'Weinstein', 'Illinois', 'Mr. Brandt', 'U.S.', 'Washington', 'Republican', 'Vincent', 'Soviet', 'Philip Weinstein', 'Salem', 'Mr. Zimmerman', 'Los Angeles', 'Richard Salter', 'Hatfield', 'Dean Bryson', 'Oregon', 'Ierulli', 'Dwight M. Steeves', 'Americanss', 'Jack Lowe', 'Barbara Njust'}


If we set the parameter binary=True, named entities are simply tagged as NE; otherwise the classifier adds more specific labels such as PERSON, ORGANIZATION and GPE.

We defined a function, extract_entity_names, that walks a chunked sentence tree and returns the names of the entities it contains.

We then initialize entity_names as an empty list and fill it by iterating over each tree in chunked_sentences; printing set(entity_names) removes the duplicates.
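
When the chunker is run without binary=True, the chunk labels differ per entity type, so it can be useful to group the names by label. The sketch below does this on a hand-built `nltk.Tree` that stands in for one chunked sentence; the tree contents and the helper name `entities_by_label` are assumptions for illustration, and no nltk data download is needed to run it.

```python
from collections import defaultdict
from nltk.tree import Tree

def entities_by_label(tree):
    """Group entity strings by chunk label, skipping the sentence root 'S'."""
    grouped = defaultdict(list)
    for subtree in tree.subtrees():
        if subtree.label() != "S":
            # leaves() yields (word, tag) tuples; join the words into one name
            name = " ".join(word for word, _ in subtree.leaves())
            grouped[subtree.label()].append(name)
    return dict(grouped)

# Hand-built stand-in for one chunked sentence (hypothetical labels and tags).
demo_tree = Tree("S", [
    Tree("PERSON", [("Blaine", "NNP"), ("Whipple", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("Los", "NNP"), ("Angeles", "NNP")]),
])

result = entities_by_label(demo_tree)
print(result)
```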