# Text mining

In this task we use the `nltk` package to recognize and classify [named entities](https://en.wikipedia.org/wiki/Named-entity_recognition) in a given text, in this case the Wikipedia [article](https://en.wikipedia.org/wiki/American_Revolution) about the American Revolution.

The `nltk.ne_chunk` function handles both recognition and classification of named entities. We will also implement a custom NER function to recognize entities, and a custom function to classify them using their Wikipedia articles.

In [248]:
import nltk
import numpy as np
import wikipedia
import re

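Suppress warnings raised while using the `wikipedia` package. Note that this filter silences all warnings, not only those from `wikipedia`.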

In [187]:
import warnings
warnings.filterwarnings('ignore')

Helper functions to process the output of `nltk.ne_chunk` and to count how often each named entity occurs in a given text.

In [None]:
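def count_entites(entity, text):
    # Entities from nltk.ne_chunk arrive as (text, label) tuples; custom ones are plain strings.
    s = entity[0] if isinstance(entity, tuple) else entity

    # Escape the entity so regex metacharacters in it are matched literally.
    # This is a substring count: 'America' also matches inside 'American'.
    return len(re.findall(re.escape(s), text))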

def get_top_n(entities, text, n):
    a = [ (e, count_entites(e, text)) for e in entities]
    a.sort(key=lambda x: x[1], reverse=True)
    return a[0:n]

# For one element of the tree returned by nltk.ne_chunk:
# returns a (word, tag) tuple unchanged if its tag starts with 'NE', or
# joins the words of a named-entity subtree into a single (text, label) pair;
# returns None for anything else
def get_entity(entity):
    if isinstance(entity, tuple) and entity[1][:2] == 'NE':
        return entity
    if isinstance(entity, nltk.tree.Tree):
        text = ' '.join([word for word, tag in entity.leaves()])
        return (text, entity.label())
    return None
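
`count_entites` hands the entity text to `re.findall`, so the result is a substring count: every `American` also counts toward `America`. The following is a minimal sketch of a stricter count, assuming whole-word matches are what is wanted. `count_substring` and `count_whole_word` are hypothetical helpers written for illustration, not part of the notebook.

```python
import re

def count_substring(entity, text):
    # Literal substring count; re.escape keeps metacharacters from acting as regex syntax.
    return len(re.findall(re.escape(entity), text))

def count_whole_word(entity, text):
    # Hypothetical variant: \b anchors stop 'America' from matching inside 'American'.
    return len(re.findall(r'\b' + re.escape(entity) + r'\b', text))

sample = 'America and the American colonies. A war began in America.'

print(count_substring('America', sample))   # 3
print(count_whole_word('America', sample))  # 2
print(count_substring('A', sample))         # 4
print(count_whole_word('A', sample))        # 1
```

With substring counting, a one-letter entity such as `A` matches every capital A in the text, which inflates its count.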

Since `nltk.ne_chunk` tends to assign the same named entity to several classes (for example 'American' as both 'ORGANIZATION' and 'GPE'), we filter out these duplicates and keep the first label seen for each entity.

In [256]:
# returns list of named entities in a form [(entity_text, entity_label), ...]
def extract_entities(chunk):
    data = []

    for entity in chunk:
        d = get_entity(entity)
        if d is not None and d[0] not in [e[0] for e in data]:
            data.append(d)

    return data
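
The duplicate filter can be illustrated on hand-written pairs. This is a sketch under the assumption that only the first label per entity text matters. `dedupe_by_text` is a hypothetical name, and it uses a set for the membership test instead of rebuilding a list on every iteration.

```python
def dedupe_by_text(pairs):
    # Keep the first (text, label) pair seen for each entity text.
    seen = set()
    unique = []
    for text, label in pairs:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))
    return unique

# Hand-written input, shaped like the (text, label) pairs built above.
pairs = [('American', 'ORGANIZATION'), ('British', 'GPE'), ('American', 'GPE')]
print(dedupe_by_text(pairs))
# [('American', 'ORGANIZATION'), ('British', 'GPE')]
```

Which label survives depends only on the order of appearance in the text, not on which label is more plausible.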

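Our custom NER function. It collects runs of consecutive nouns (POS tags starting with `NN`), allows prepositions (`IN`) inside a run, and keeps only runs that start with a capital letter.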

In [257]:
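def custom_NER(tagged):
    entities = []

    # current run of (word, tag) pairs that may form one entity
    entity = []
    # the trailing sentinel has a non-noun tag, so a run that ends the text is still flushed
    for word in list(tagged) + [('', '')]:
        if word[1][:2] == 'NN' or (entity and word[1][:2] == 'IN'):
            # nouns start or extend a run; prepositions may only extend one
            entity.append(word)
        else:
            # drop prepositions left dangling at the end of the run
            while entity and entity[-1][1].startswith('IN'):
                entity.pop()
            if entity:
                s = ' '.join(e[0] for e in entity)
                # keep each capitalized run once
                if s not in entities and s[0].isupper():
                    entities.append(s)
            entity = []
    return entities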
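
A worked example makes the grouping rule concrete. The sketch below restates the same idea in a compact, standalone form. `noun_runs` is a hypothetical name, and the tags are written by hand as an assumption (in the notebook they come from `nltk.pos_tag`).

```python
def noun_runs(tagged):
    """Group consecutive noun tokens, allowing prepositions inside a run."""
    found, run = [], []
    # The sentinel carries a non-noun tag so a run at the very end is flushed too.
    for word, tag in list(tagged) + [('', 'END')]:
        if tag.startswith('NN') or (run and tag == 'IN'):
            run.append((word, tag))
            continue
        while run and run[-1][1] == 'IN':  # drop trailing prepositions
            run.pop()
        if run:
            phrase = ' '.join(w for w, _ in run)
            if phrase[0].isupper() and phrase not in found:
                found.append(phrase)
        run = []
    return found

# Hand-written tags (an assumption for illustration).
tagged_sample = [('The', 'DT'), ('Declaration', 'NNP'), ('of', 'IN'),
                 ('Independence', 'NNP'), ('was', 'VBD'), ('signed', 'VBN'),
                 ('in', 'IN'), ('Philadelphia', 'NNP')]
print(noun_runs(tagged_sample))
# ['Declaration of Independence', 'Philadelphia']
```

The preposition `of` is kept because it sits between two nouns, while `in` is ignored because no run is open when it appears.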

Load the preprocessed article, approximately 500 sentences. The regex substitution removes reference links (e.g. `[12]`).

In [254]:
text = None
with open('text', 'r') as f:
    text = f.read()
    
text = re.sub(r'\[[0-9]*\]', '', text)
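
To see what the substitution does and does not remove, here is a small sketch on a made-up sentence. `strip_citations` is a hypothetical wrapper around the same pattern.

```python
import re

def strip_citations(s):
    # Same pattern as above: a '[' followed by zero or more digits and a ']'.
    return re.sub(r'\[[0-9]*\]', '', s)

sample = 'The war began in 1775.[12] Peace followed in 1783[3][note 4].'
print(strip_citations(sample))
# The war began in 1775. Peace followed in 1783[note 4].

# Because of '*', empty brackets are removed as well.
print(strip_citations('a[]b'))
# ab
```

Markers that contain anything other than digits, such as `[note 4]`, are left in place. Using `+` instead of `*` would leave empty brackets alone.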

Now we recognize entities with both `nltk.ne_chunk` and our `custom_NER` function and print the 10 most frequent entities from each.

The two lists are fairly similar. The counts are substring matches, which is why the single-letter entity `A` tops the custom list. `nltk.ne_chunk` also adds basic classification [tags](http://www.nltk.org/book/ch07.html#tab-ne-types).

In [258]:
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)

ne_chunked = nltk.ne_chunk(tagged, binary=False)
ex = extract_entities(ne_chunked)
ex_custom = custom_NER(tagged)

top_ex = get_top_n(ex, text, 10)
top_ex_custom = get_top_n(ex_custom, text, 10)
print('ne_chunked:')
for e in top_ex:
    print('{} count: {}'.format(e[0], e[1]))
print()
print('custom NER:')
for e in top_ex_custom:
    print('{} count: {}'.format(e[0], e[1]))

ne_chunked:
('British', 'GPE') count: 154
('America', 'GPE') count: 145
('American', 'GPE') count: 130
('New', 'ORGANIZATION') count: 51
('Loyalist', 'GPE') count: 46
('Americans', 'GPE') count: 44
('Britain', 'GPE') count: 40
('Patriot', 'GPE') count: 38
('Revolution', 'ORGANIZATION') count: 38
('Loyalists', 'ORGANIZATION') count: 37

custom NER:
A count: 277
British count: 154
America count: 145
Loyalist count: 46
Americans count: 44
Britain count: 40
Revolution count: 38
Patriot count: 38
Loyalists count: 37
Congress count: 35


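To classify an entity, we take the first sentence of its Wikipedia summary and keep the descriptive words that follow the first "is", "was", "were", or "are". Ambiguous titles fall back to the label `'Thing'`.

In [245]: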
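def get_noun_phrase(entity, sentence):
    # POS-tag the sentence, then collect descriptive words after the first copula
    t = nltk.pos_tag(nltk.word_tokenize(sentence))
    phrase = ''
    stage = 0  # 0: looking for the copula, 1: collecting the phrase
    for word in t:
        if word[0] in ('is', 'was', 'were', 'are') and stage == 0:
            stage = 1
            continue
        elif stage == 1:
            if word[1] in ('NN', 'JJ', 'VBD', 'CD', 'NNP', 'NNPS', 'RBS'):
                phrase = phrase + ' ' + word[0]
            elif word[1] in ('DT', ',', 'CC', 'IN'):
                # skip determiners, commas, conjunctions and prepositions
                continue
            else:
                # any other tag ends the phrase
                break

    # phrase[1:] drops the leading space; the result is '' if nothing was collected
    return {entity : phrase[1:]}

def get_wiki_desc(entity):
    try:
        page = wikipedia.page(entity)
    except wikipedia.DisambiguationError:
        # ambiguous title: fall back to a generic label
        return {entity : 'Thing'}

    # classify from the first sentence of the article summary
    fs = nltk.sent_tokenize(page.summary)[0]
    return get_noun_phrase(entity, fs)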
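
The scanning rule in `get_noun_phrase` can be traced on pre-tagged tokens without calling the tagger. The sketch below is an illustration only. `phrase_after_copula` is a hypothetical name, and both sentences and their tags are written by hand as assumptions.

```python
COPULAS = ('is', 'was', 'were', 'are')
KEEP = ('NN', 'JJ', 'VBD', 'CD', 'NNP', 'NNPS', 'RBS')
SKIP = ('DT', ',', 'CC', 'IN')

def phrase_after_copula(tagged):
    # Collect KEEP-tagged words after the first copula; stop at the first
    # tag that is in neither KEEP nor SKIP.
    words = []
    seen_copula = False
    for word, tag in tagged:
        if not seen_copula:
            seen_copula = word in COPULAS
        elif tag in KEEP:
            words.append(word)
        elif tag not in SKIP:
            break
    return ' '.join(words)

# Hand-tagged illustrative sentences (assumed tags).
congress = [('A', 'DT'), ('congress', 'NN'), ('is', 'VBZ'), ('a', 'DT'),
            ('formal', 'JJ'), ('meeting', 'NN'), ('of', 'IN'), ('the', 'DT'),
            ('representatives', 'NNS'), ('of', 'IN'), ('different', 'JJ'),
            ('countries', 'NNS')]
loyalists = [('Loyalists', 'NNS'), ('were', 'VBD'), ('colonists', 'NNS'),
             ('loyal', 'JJ'), ('to', 'TO'), ('the', 'DT'), ('Crown', 'NNP')]

print(repr(phrase_after_copula(congress)))   # 'formal meeting'
print(repr(phrase_after_copula(loyalists)))  # ''
```

The second case shows one way an empty description can arise: a plural noun (`NNS`) directly after the verb is in neither tag list, so the scan stops before collecting anything.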

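Classification of the top entities found by `nltk.ne_chunk`. An empty string means that no phrase was extracted from the first sentence.

In [246]: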
for entity in top_ex:
    print(get_wiki_desc(entity[0][0]))

{'British': 'Thing'}
{'America': 'Thing'}
{'American': 'Thing'}
{'American': 'Thing'}
{'New': 'Thing'}
{'Loyalist': ''}
{'Loyalist': ''}
{'Americans': ''}
{'Americans': ''}
{'Britain': 'Thing'}


Classification of the top entities found by `custom_NER`.

In [247]:
for entity in top_ex_custom:
    print(get_wiki_desc(entity[0]))

{'A': 'first letter first vowel ISO basic Latin alphabet'}
{'British': 'Thing'}
{'America': 'Thing'}
{'Loyalist': ''}
{'Americans': ''}
{'Britain': 'Thing'}
{'Revolution': 'fundamental change political power organizational'}
{'Patriot': 'Thing'}
{'Loyalists': ''}
{'Congress': 'formal meeting'}
