# Entity Extraction

In this example, we use NLTK for entity extraction. Set up the environment as follows:
- First, install a Python environment.
- Change the pip source mirror: https://mirrors.bfsu.edu.cn/help/pypi/
- Install NLTK: ``pip install nltk``
- Download the NLTK data distribution. Open a Python terminal, run ``import nltk``, then install the packages with the NLTK downloader: ``nltk.download()``. If ``nltk.download()`` fails, download the data manually from https://github.com/nltk/nltk_data/tree/gh-pages or https://pan.baidu.com/s/1wONWpaa86_wnsIksKda8eQ (code: tfon).
- Unzip the downloaded file into the folder returned by ``nltk.data.find(".")``.
- Unzip each zip file in the ten folders: *chunkers, corpora, grammars, help, misc, models, sentiment, stemmers, taggers, tokenizers*
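
After a manual install it can help to check which resources NLTK actually finds on disk. The following is a minimal sketch: the helper `missing_resources` and the `REQUIRED` table are hypothetical names, and the resource paths are an assumption that matches classic NLTK releases (newer releases may ship some of these under different names).

```python
import nltk

# Assumed resource paths inside nltk_data for tokenizing, tagging and NE chunking.
REQUIRED = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
    "words": "corpora/words",
}

def missing_resources(required=REQUIRED):
    """Return the names of the resources that nltk.data.find cannot locate."""
    missing = []
    for name, path in required.items():
        try:
            nltk.data.find(path)       # raises LookupError when the resource is absent
        except LookupError:
            missing.append(name)
    return missing

print(missing_resources())
```

Any name it reports can then be passed to ``nltk.download(name)``, the non-interactive form of the downloader.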

In [1]:
# import all packages 
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import Tree

In [2]:
# Tokenize sentence:
raw = """John was born in Liverpool, to Julia Lennon and Alfred Lennon"""
tokens = word_tokenize(raw)
tokens

['John',
 'was',
 'born',
 'in',
 'Liverpool',
 ',',
 'to',
 'Julia',
 'Lennon',
 'and',
 'Alfred',
 'Lennon']

In [3]:
# POS-tag the tokens
tagged = nltk.pos_tag(tokens)
print(tagged)

[('John', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Liverpool', 'NNP'), (',', ','), ('to', 'TO'), ('Julia', 'NNP'), ('Lennon', 'NNP'), ('and', 'CC'), ('Alfred', 'NNP'), ('Lennon', 'NNP')]
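
Because the tagger returns a plain list of `(token, tag)` tuples, the result can be filtered with ordinary Python. As a small sketch, the snippet below hard-codes the tagged output shown above (so it runs without the tagger models) and keeps only the proper nouns; the helper name `tokens_with_tag` is hypothetical.

```python
# Tagged output copied from the cell above, so no NLTK models are needed here.
tagged = [('John', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'),
          ('Liverpool', 'NNP'), (',', ','), ('to', 'TO'), ('Julia', 'NNP'),
          ('Lennon', 'NNP'), ('and', 'CC'), ('Alfred', 'NNP'), ('Lennon', 'NNP')]

def tokens_with_tag(tagged_tokens, tag):
    """Return the tokens whose POS tag equals `tag`, in sentence order."""
    return [token for token, pos in tagged_tokens if pos == tag]

proper_nouns = tokens_with_tag(tagged, 'NNP')
print(proper_nouns)
# ['John', 'Liverpool', 'Julia', 'Lennon', 'Alfred', 'Lennon']
```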


To see the detailed description of a tag, use the following statement:

In [4]:
nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


### Chunking:
- Use ``ne_chunk`` provided by NLTK. ``ne_chunk`` needs part-of-speech annotations to add ``NE`` labels to the sentence. The output of ``ne_chunk`` is an ``nltk.Tree`` object.
- ``ne_chunk`` produces 2-level trees:
 - Nodes on Level-1: outside any chunk
 - Nodes on Level-2: inside a chunk (the chunk's label is the label of the subtree)
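
To make the two levels concrete, here is a hand-built tree of the same shape. It is only a sketch: the `Tree` is constructed directly instead of calling ``ne_chunk``, so the labels are written by hand rather than predicted, and the variable name `example` is illustrative.

```python
from nltk import Tree

# Level 1: direct children of the root 'S'.
# Level 2: the (token, tag) leaves inside a labeled subtree.
example = Tree('S', [
    Tree('PERSON', [('John', 'NNP')]),     # a chunk (subtree)
    ('was', 'VBD'),                        # outside any chunk
    ('born', 'VBN'),
    ('in', 'IN'),
    Tree('GPE', [('Liverpool', 'NNP')]),   # another chunk
])

for child in example:
    if isinstance(child, Tree):
        print('chunk  ', child.label(), child.leaves())
    else:
        print('outside', child)
```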


In [5]:
chunks = ne_chunk(pos_tag(word_tokenize(raw)))
print(chunks)
chunks.draw()

(S
  (PERSON John/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Liverpool/NNP)
  ,/,
  to/TO
  (PERSON Julia/NNP Lennon/NNP)
  and/CC
  (PERSON Alfred/NNP Lennon/NNP))
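
A chunk tree can also be flattened into per-token IOB tags, a format some downstream tools expect. The sketch below uses ``nltk.chunk.tree2conlltags`` on a hand-built tree that mirrors part of the output above (hand-built so that it runs without the chunker models); the variable names are illustrative.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

tree = Tree('S', [
    ('to', 'TO'),
    Tree('PERSON', [('Julia', 'NNP'), ('Lennon', 'NNP')]),
    ('and', 'CC'),
    Tree('PERSON', [('Alfred', 'NNP'), ('Lennon', 'NNP')]),
])

# Each token becomes (word, pos, iob):
# B- begins a chunk, I- continues it, O marks a token outside any chunk.
iob = tree2conlltags(tree)
for word, pos, tag in iob:
    print(word, pos, tag)
```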


Traverse the chunked tree to get each chunk and the words inside it:

In [6]:
for i in chunks:
    print(i, type(i))
    # Subtrees are chunks; plain (token, tag) tuples lie outside any chunk.
    if isinstance(i, Tree):
        print('Chunk detect!')
        for token, pos in i.leaves():
            print(token, pos)

(PERSON John/NNP) <class 'nltk.tree.tree.Tree'>
Chunk detect!
John NNP
('was', 'VBD') <class 'tuple'>
('born', 'VBN') <class 'tuple'>
('in', 'IN') <class 'tuple'>
(GPE Liverpool/NNP) <class 'nltk.tree.tree.Tree'>
Chunk detect!
Liverpool NNP
(',', ',') <class 'tuple'>
('to', 'TO') <class 'tuple'>
(PERSON Julia/NNP Lennon/NNP) <class 'nltk.tree.tree.Tree'>
Chunk detect!
Julia NNP
Lennon NNP
('and', 'CC') <class 'tuple'>
(PERSON Alfred/NNP Lennon/NNP) <class 'nltk.tree.tree.Tree'>
Chunk detect!
Alfred NNP
Lennon NNP
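
An alternative to checking the type of each child is ``Tree.subtrees``, which walks the tree and yields only the subtrees that pass a filter. This is a sketch on a hand-built tree (so it runs without the chunker models); the helper name `iter_chunks` and the variable `sample` are hypothetical.

```python
from nltk import Tree

def iter_chunks(tree):
    """Yield (label, tokens) for every labeled chunk below the root."""
    # The root is itself a Tree, so exclude it by identity.
    for subtree in tree.subtrees(filter=lambda t: t is not tree):
        yield subtree.label(), [token for token, pos in subtree.leaves()]

sample = Tree('S', [
    Tree('PERSON', [('John', 'NNP')]),
    ('was', 'VBD'),
    ('born', 'VBN'),
    ('in', 'IN'),
    Tree('GPE', [('Liverpool', 'NNP')]),
])

for label, tokens in iter_chunks(sample):
    print(label, tokens)
```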


### Exercise1
Extract all named entities together with their types/labels

In [7]:
# Exercise1, define a function to extract all named entities together with their labels
def get_labeled_chunks(text):
    # Tokenize, POS-tag, then chunk the sentence into named entities.
    chunks = ne_chunk(pos_tag(word_tokenize(text)), binary=False)
    label_entities = {}
    for i in chunks:
        if isinstance(i, Tree):
            # Join the tokens of the chunk into one phrase.
            # Optional lemmatized variant (needs `from nltk.stem import WordNetLemmatizer`):
            # phrase = " ".join([WordNetLemmatizer().lemmatize(a[0]) for a in i])
            phrase = " ".join(a[0] for a in i)
            # Map the phrase to its entity label.
            label_entities[phrase] = i.label()
    return label_entities
get_labeled_chunks(raw)

#output result:
#{'John': 'PERSON',
# 'Liverpool': 'GPE',
# 'Julia Lennon': 'PERSON',
# 'Alfred Lennon': 'PERSON'}

{'John': 'PERSON',
 'Liverpool': 'GPE',
 'Julia Lennon': 'PERSON',
 'Alfred Lennon': 'PERSON'}
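
One consequence of keying the dictionary by phrase: a phrase that occurs twice is stored once, and a phrase tagged with two different labels keeps only the last one. Where that matters, grouping by label is one option. A minimal sketch, using a hand-built tree in place of ``ne_chunk`` output and a hypothetical helper `group_by_label`:

```python
from collections import defaultdict
from nltk import Tree

def group_by_label(chunks):
    """Map each entity label to the list of phrases carrying it, keeping repeats."""
    grouped = defaultdict(list)
    for node in chunks:
        if isinstance(node, Tree):
            phrase = " ".join(token for token, pos in node.leaves())
            grouped[node.label()].append(phrase)
    return dict(grouped)

# Hand-labeled stand-in for "John left Liverpool, then John returned".
chunked = Tree('S', [
    Tree('PERSON', [('John', 'NNP')]), ('left', 'VBD'),
    Tree('GPE', [('Liverpool', 'NNP')]), (',', ','), ('then', 'RB'),
    Tree('PERSON', [('John', 'NNP')]), ('returned', 'VBD'),
])

print(group_by_label(chunked))
# {'PERSON': ['John', 'John'], 'GPE': ['Liverpool']}
```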

### Exercise2
Extract only *PERSON* entities

In [8]:
# Exercise2, extract all the entities of a specific type
def get_type_chunks(text, label):
    # Tokenize, POS-tag, then chunk the sentence into named entities.
    chunks = ne_chunk(pos_tag(word_tokenize(text)))
    entity = []
    for i in chunks:
        # Keep only the chunks whose label matches the requested type.
        if isinstance(i, Tree) and i.label() == label:
            # Optional lemmatized variant (needs `from nltk.stem import WordNetLemmatizer`):
            # phrase = " ".join([WordNetLemmatizer().lemmatize(a[0]) for a in i])
            phrase = " ".join(a[0] for a in i)
            entity.append(phrase)
    return entity
get_type_chunks(raw,'PERSON')
#output result:
#['John', 'Julia Lennon', 'Alfred Lennon']

['John', 'Julia Lennon', 'Alfred Lennon']

### Exercise3: Noun phrase chunking
Define your own grammar for noun phrase chunking using ``nltk.RegexpParser``

In [9]:
# Only needed for the optional lemmatized variant below.
from nltk.stem import WordNetLemmatizer
def np_chunking(sentence):
    # Grammar for noun phrases: optional adjectives followed by one or more nouns.
    grammar = "NP: {<JJ>*<NN.*>+}\n{<NN.*>+}"  # chunker rule(s), try to think of more rules
    cp = nltk.RegexpParser(grammar)
    # Tokenize and POS-tag the sentence, then chunk it with the grammar.
    chunks = cp.parse(pos_tag(word_tokenize(sentence)))
    entity = []
    for i in chunks:
        if isinstance(i, Tree):
            # Join the tokens of the noun phrase into one string.
            # Optional lemmatized variant:
            # phrase = " ".join([WordNetLemmatizer().lemmatize(a[0]) for a in i])
            phrase = " ".join(a[0] for a in i)
            entity.append(phrase)
    return entity

print(np_chunking("""the little dog barked at the cat"""))
print(np_chunking("""John was born in Liverpool, to Julia Kim and Alfred Lennon"""))

#output result:
#['little dog', 'cat']
#['John', 'Liverpool', 'Julia Kim', 'Alfred Lennon']

['little dog', 'cat']
['John', 'Liverpool', 'Julia Kim', 'Alfred Lennon']
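
As one more rule to try, an optional determiner can be placed in front of the adjectives. The sketch below runs ``nltk.RegexpParser`` on hand-tagged tokens (an assumption made so that it runs without the tagger models); the grammar and the helper name `chunk_phrases` are illustrative, not a reference solution for the exercise.

```python
import nltk
from nltk import Tree

# Optional determiner, any number of adjectives, then one or more nouns.
dt_grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
dt_parser = nltk.RegexpParser(dt_grammar)

# Hand-tagged tokens for "the little dog barked at the cat".
tagged_sentence = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'),
                   ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

def chunk_phrases(parser, tagged_tokens):
    """Parse tagged tokens and return each chunk as a space-joined phrase."""
    tree = parser.parse(tagged_tokens)
    return [" ".join(token for token, pos in node.leaves())
            for node in tree if isinstance(node, Tree)]

print(chunk_phrases(dt_parser, tagged_sentence))
# ['the little dog', 'the cat']
```

Compared with the grammar in the exercise, the determiner now becomes part of the phrase, which may or may not be wanted depending on how the phrases are used afterwards.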


In [10]:
# Reference: https://www.cnblogs.com/chen8023miss/p/11458571.html.
